+++
title = "Upgrade your monitoring setup with Prometheus"
author = ["Elia el Lazkani"]
date = 2021-09-17
lastmod = 2021-09-17
tags = ["prometheus", "metrics", "container"]
categories = ["monitoring"]
draft = false
+++
After running simple monitoring for quite a while, I decided to upgrade my
setup. It is about time to do some real metric gathering and see what's going
on, and to set up some proper monitoring.
There are a lot of options in this field, and I should probably write a blog
post with my views on the topic. For this experiment, however, the
solution is already pre-chosen. We'll be running Prometheus.
<!--more-->
## Prometheus {#prometheus}
To answer the question, _what is Prometheus?_, we'll rip a page out of the
Prometheus [docs](https://prometheus.io/docs/introduction/overview/).
> Prometheus is an open-source systems monitoring and alerting toolkit originally
> built at SoundCloud. Since its inception in 2012, many companies and
> organizations have adopted Prometheus, and the project has a very active
> developer and user community. It is now a standalone open source project and
> maintained independently of any company. To emphasize this, and to clarify the
> project's governance structure, Prometheus joined the Cloud Native Computing
> Foundation in 2016 as the second hosted project, after Kubernetes.
>
> Prometheus collects and stores its metrics as time series data, i.e. metrics
> information is stored with the timestamp at which it was recorded, alongside
> optional key-value pairs called labels.
Let's decipher all this jargon into plain English. In simple terms,
Prometheus is a system that scrapes metrics from your services and applications
and stores them in a time series database, ready to be served back when
queried.
Prometheus also offers a way to create rules on those metrics to alert you when
something goes wrong. Combined with [_Alertmanager_](https://prometheus.io/docs/alerting/latest/alertmanager/), you got yourself a full
monitoring system.
## Configuration {#configuration}
Now that we've briefly touched on a _few_ features of **Prometheus**, and before we
can deploy it, we need to write our configuration.
This is an example of a bare configuration.
<a id="code-snippet--prometheus-scraping-config"></a>
```yaml
scrape_configs:
  - job_name: prometheus
    scrape_interval: 30s
    static_configs:
      - targets:
          - prometheus:9090
```
This will make Prometheus scrape itself every 30 seconds for metrics. At least
you get _some_ metrics to query later. If you want the full experience, I would
suggest you enable _Prometheus metrics_ for your services. Consult each
project's docs to see if and how it can expose metrics for _Prometheus_ to scrape,
then add the scrape endpoint to your configuration as shown above.
Here's an example of a couple more _well known_ projects: [_Alertmanager_](https://prometheus.io/docs/alerting/latest/alertmanager/) and
[_node exporter_](https://github.com/prometheus/node%5Fexporter).
<a id="code-snippet--prometheus-example-scraping-config"></a>
```yaml
  - job_name: alertmanager
    scrape_interval: 30s
    static_configs:
      - targets:
          - alertmanager:9093
  - job_name: node-exporter
    scrape_interval: 30s
    static_configs:
      - targets:
          - node-exporter:9100
```
A wider [list of exporters](https://prometheus.io/docs/instrumenting/exporters/) can be found on the Prometheus docs.
## Deployment {#deployment}
Now that we've got ourselves a configuration, let's deploy **Prometheus**.
Luckily for us, Prometheus comes containerized and ready to deploy. We'll be
using `docker-compose` in this example to make it easier to translate later to
other types of deployments.
<div class="admonition note">
<p class="admonition-title">Note</p>
I'm still running on the `2.x` compose file version. I know I need to upgrade to a newer
version, but that's a bit of networking work. It's a work in progress.
</div>
The `docker-compose` file should look like the following.
```yaml
---
version: '2.3'

services:
  prometheus:
    image: quay.io/prometheus/prometheus:v2.27.0
    container_name: prometheus
    mem_limit: 400m
    mem_reservation: 300m
    restart: unless-stopped
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --web.external-url=http://prometheus.localhost/
    volumes:
      - "./prometheus/:/etc/prometheus/:ro"
    ports:
      - "80:9090"
```
A few things to **note**, especially for the new container crowd. The container
image **version** is explicitly specified; do **not** use `latest` in production.
To make sure I don't overload my host, I set memory limits. I don't mind if it
goes down; this is a PoC (Proof of Concept) for the time being. In your case,
you might want to choose higher limits to give it more room to breathe. When the
memory limit is reached, the container will be killed with an _Out Of Memory_
error.
In the **command** section, I specify the _external url_ for Prometheus to
redirect me correctly. This is what Prometheus thinks its own hostname is. I
also specify the configuration file, previously written, which I mount as
_read-only_ in the **volumes** section.
Finally, we need to forward the container's port `9090` to our host's port `80`, if
possible, to access **Prometheus**. Otherwise, figure out a way to route it properly.
This is a local installation, as suggested by the Prometheus _hostname_.
If you've made it this far, you should be able to run this with no issues.
```bash
docker-compose up -d
```
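As a quick sanity check, and assuming the `80:9090` port mapping from the compose
file above, you can verify that the container is up and that **Prometheus** is
answering.
```bash
# Make sure the Prometheus container is running
docker-compose ps prometheus

# Prometheus exposes a built-in health endpoint we can poke at
curl -s http://localhost/-/healthy
```
If the last command comes back with a healthy response, you're good to go.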
## Prometheus Rules {#prometheus-rules}
**Prometheus** supports **two** types of rules: recording and alerting. Let's expand
a little bit on those two concepts.
### Recording Rules {#recording-rules}
First, let's start off with [recording rules](https://prometheus.io/docs/prometheus/latest/configuration/recording%5Frules/). I don't think I can explain it
better than the **Prometheus** documentation, which says:
> Recording rules allow you to precompute frequently needed or computationally
> expensive expressions and save their result as a new set of time series.
> Querying the precomputed result will then often be much faster than executing
> the original expression every time it is needed. This is especially useful for
> dashboards, which need to query the same expression repeatedly every time they
> refresh.
Sounds pretty simple, right? Well, it is. Unfortunately, I haven't needed to
create recording rules for my setup yet, so I'll forgo this step.
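That said, just to illustrate the shape of one, here's a sketch of what a
recording rule could look like. It precomputes a per-instance CPU rate from the
_node exporter_ metrics; the rule name and expression are purely an example, not
something from my setup.
```yaml
groups:
  - name: node-recording-rules
    rules:
      # Precompute the 5m CPU usage rate, aggregated per instance and mode
      - record: instance_mode:node_cpu_seconds:rate5m
        expr: sum by (instance, mode) (rate(node_cpu_seconds_total[5m]))
```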
### Alerting Rules {#alerting-rules}
As the name suggests, [alerting rules](https://prometheus.io/docs/prometheus/latest/configuration/alerting%5Frules/#alerting-rules) allow you to define conditional expressions
based on metrics which will trigger notifications to alert you.
This is a very simple example of an _alerting rule_ that monitors all the endpoints
scraped by _Prometheus_ to see if any of them are down. If the expression returns
a result, an alert will fire from _Prometheus_.
```yaml
groups:
  - name: Instance down
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
```
To be able to add this alert to **Prometheus**, we need to save it in a
`rules.yml` file and then include it in the **Prometheus** configuration as follows.
<a id="code-snippet--prometheus-rule-files-config"></a>
```yaml
rule_files:
- "rules.yml"
```
This makes the entire configuration look as follows.
```yaml
rule_files:
  - "rules.yml"

scrape_configs:
  - job_name: prometheus
    scrape_interval: 30s
    static_configs:
      - targets:
          - prometheus:9090
  - job_name: alertmanager
    scrape_interval: 30s
    static_configs:
      - targets:
          - alertmanager:9093
  - job_name: node-exporter
    scrape_interval: 30s
    static_configs:
      - targets:
          - node-exporter:9100
```
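One thing this post doesn't cover is wiring the alerts up to _Alertmanager_
itself; the rule above will fire inside **Prometheus**, but nothing gets delivered
anywhere. Assuming the `alertmanager:9093` instance we're already scraping is the
one you want to use, the extra block would roughly look like this.
```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
```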
At this point, make sure everything is mounted into the container properly and
restart **Prometheus**.
## Prometheus UI {#prometheus-ui}
Congratulations if you've made it this far. If you visit <http://localhost/> at this
stage, you should reach Prometheus, where you can query your metrics.
{{< figure src="/ox-hugo/01-prometheus-overview.png" caption="Figure 1: Prometheus overview" target="_blank" link="/ox-hugo/01-prometheus-overview.png" >}}
You can get all sorts of information under the _status_ drop-down menu.
{{< figure src="/ox-hugo/02-prometheus-status-drop-down-menu.png" caption="Figure 2: Prometheus Status drop-down menu" target="_blank" link="/ox-hugo/02-prometheus-status-drop-down-menu.png" >}}
## Conclusion {#conclusion}
As you can see, deploying **Prometheus** is not too hard. If you're running
_Kubernetes_, make sure you use the operator. It will make your life a lot
easier in all sorts of ways.
Take your time to familiarise yourself with **Prometheus** and consult the
documentation as much as possible. It is well written and in most cases your
best friend. Figure out different ways to create rules for recording and
alerting. Most people at this stage deploy **Grafana** to start visualizing their
metrics. Well... not in this blog post we ain't!
I hope you enjoy playing around with **Prometheus** and until the next post.