+++
title = "Building up simple monitoring on Healthchecks"
author = ["Elia el Lazkani"]
date = 2020-02-11
lastmod = 2020-02-11
tags = ["healthchecks", "cron", "curl"]
categories = ["monitoring"]
draft = false
+++
I talked previously in "[Simple cron monitoring with HealthChecks]({{< relref "simple-cron-monitoring-with-healthchecks" >}})" about deploying my own simple monitoring system.
Now that it's up, I'm only using it for my backups. That's a good use, for sure, but I know I can do better.
So I went digging.
<!--more-->
## Introduction {#introduction}
I host a number of services. Some are public, like my blog, while others are private.
These services are not critical; some can be down for short periods of time.
Some might even be down for longer periods without causing any loss in functionality.
That being said, I'm a _DevOps engineer_. That means I need to know.
Yeah, it doesn't mean I'll do something about it right away, but I'd like to be in the know.
Which got me thinking...
## Healthchecks Endpoints {#healthchecks-endpoints}
Watching **borg** use its _healthchecks_ hook opened my eyes to another feature of **Healthchecks**.
It seems that if you ping
```text
https://healthchecks.example.com/ping/84b2a834-02f5-524f-4c27-a2f24562b219/start
```
it will start a timer that measures the time until you ping
```text
https://healthchecks.example.com/ping/84b2a834-02f5-524f-4c27-a2f24562b219
```
This way, you can find out how long a status check takes, or how long a service takes to back up.
It turns out that _healthchecks_ also offers a different endpoint to ping. You can report a failure straight away by pinging
```text
https://healthchecks.example.com/ping/84b2a834-02f5-524f-4c27-a2f24562b219/fail
```
This way, you do not have to wait until the time expires before you get notified of a failure.
With those pieces of knowledge, we can do a lot.
## A lot ? {#a-lot}
Yes, a lot...
Let's put what we have learned so far into action.
```sh
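#!/bin/bash
# Usage: https_healthchecks_monitor.sh <url-to-check> <check-id>
WEB_HOST=$1
CHECK_ID=$2
HEALTHCHECKS_HOST="https://healthchecks.example.com/ping"

# Tell healthchecks the check has started, so it can time the run.
curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}/start" > /dev/null

# -f makes curl exit non-zero on HTTP errors (4xx/5xx), not only on
# connection failures; the page body itself is not needed.
if curl -fsS "${WEB_HOST}" > /dev/null; then
    curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}" > /dev/null
else
    curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}/fail" > /dev/null
fi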
```
We start by defining a few variables: the URL of the website to monitor, the check ID provided by _healthchecks_, and finally the _healthchecks_ base link for the monitors.
Once those are set, we use `curl` with a few flags so that it fails properly when something goes wrong: `-f` returns a non-zero exit code on HTTP errors, `-sS` hides the progress meter while still printing error messages, and `--retry 3` retries transient failures.
We start the _healthchecks_ timer, run the website check, and call either the passing or the failing _healthchecks_ endpoint depending on the outcome.
```text
$ chmod +x https_healthchecks_monitor.sh
$ ./https_healthchecks_monitor.sh https://healthchecks.example.com 84b2a834-02f5-524f-4c27-a2f24562b219
```
Test it out.
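The same start, success, and fail pattern is not tied to website checks. Here is a minimal sketch of how it could wrap any command, a backup for instance. The function names `hc_ping` and `run_with_check` are hypothetical helpers of mine, not something _healthchecks_ ships, and the host is the same placeholder used above.

```sh
HEALTHCHECKS_HOST="https://healthchecks.example.com/ping"

# Hypothetical helper: send one signal for a check.
# The signal is "start", "fail", or empty for a plain success ping.
hc_ping() {
    local check_id=$1 signal=$2
    curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${check_id}${signal:+/${signal}}" > /dev/null
}

# Hypothetical helper: run any command between a start ping and a
# success-or-fail ping, and hand back the command's own exit status.
run_with_check() {
    local check_id=$1 status
    shift
    hc_ping "$check_id" start
    if "$@"; then
        hc_ping "$check_id" ""
    else
        status=$?
        hc_ping "$check_id" fail
        return "$status"
    fi
}
```

With those two functions sourced, something like `run_with_check 84b2a834-02f5-524f-4c27-a2f24562b219 /path/to/backup.sh` would time the backup and report its result, where `/path/to/backup.sh` stands in for whatever you want to watch.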
## Okay, that's nice but now what ! {#okay-that-s-nice-but-now-what}
Now, let's hook it up to our cron.
Start with `crontab -e`, which should open your favorite text editor.
Then create a cron entry (a new line) like the following:
```text
*/15 * * * * /path/to/https_healthchecks_monitor.sh https://healthchecks.example.com 84b2a834-02f5-524f-4c27-a2f24562b219
```
This will run the script every 15 minutes. Make sure that your timeout is 15 minutes for this check, with a grace period of 5 minutes.
With that configuration, you will get notified of any failure 20 minutes after the last successful ping, at worst.
Be aware, I said any failure.
Getting notified does not guarantee that your website is down.
It can only guarantee that _healthchecks_ wasn't pinged on time.
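The 20 minutes mentioned above is simply the timeout plus the grace period. As a tiny sketch of that arithmetic, assuming the alert fires once both have elapsed without a ping (the function name is a hypothetical helper):

```sh
# Worst-case minutes between the last successful ping and the alert,
# assuming the alert fires once timeout + grace has passed with no ping.
worst_case_delay() {
    local timeout_min=$1 grace_min=$2
    echo $(( timeout_min + grace_min ))
}

worst_case_delay 15 5   # prints 20
```

If you change the cron interval, plug the new timeout and grace values in to see how long a silent failure could go unnoticed.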
Getting notified covers a bunch of cases. Some of them are:
- The server running the cron is down
- The cron service is not running
- The server running the cron lost internet access
- Your certificate expired
- Your website is down
You can create checks to cover most of these if you care to make it a full monitoring system.
If you want to go that far, maybe you should invest in a monitoring system with more features.
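Short of a full monitoring system, one small step is to tell some of those cases apart using the exit code of `curl`. The sketch below maps a few exit codes from the "EXIT CODES" section of the `curl` man page to a short reason. The function name `describe_curl_exit` is hypothetical, and sending the reason as the request body of the fail ping assumes your _healthchecks_ version logs request bodies, so check that before relying on it.

```sh
# Hypothetical helper: map a curl exit status to a short reason.
describe_curl_exit() {
    case "$1" in
        0)  echo "ok" ;;
        6)  echo "could not resolve host" ;;
        7)  echo "failed to connect" ;;
        22) echo "HTTP error status (with --fail)" ;;
        28) echo "timed out" ;;
        35) echo "TLS handshake failed" ;;
        60) echo "certificate could not be verified" ;;
        *)  echo "curl exit code $1" ;;
    esac
}

# Possible use inside the monitoring script (an assumption, not tested
# against a real healthchecks instance):
#   curl -fsS "${WEB_HOST}" > /dev/null
#   REASON=$(describe_curl_exit $?)
#   curl -fsS --retry 3 --data-raw "${REASON}" "${HEALTHCHECKS_HOST}/${CHECK_ID}/fail" > /dev/null
```

This only covers failures seen from the machine running the cron. A dead cron server still shows up as a missed ping, nothing more.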
## Conclusion {#conclusion}
Don't judge something by its simplicity. Sometimes, out of simple components tied together, you can make something interesting and useful.
With a little scripting, a couple of commands, and the power of cron, we were able to make _healthchecks_ monitor our websites.