+++
title = "Building up simple monitoring on Healthchecks"
author = ["Elia el Lazkani"]
date = 2020-02-11T21:00:00+01:00
lastmod = 2021-06-28T00:01:20+02:00
tags = ["healthchecks", "cron", "curl"]
categories = ["monitoring"]
draft = false
+++

I talked previously in "[Simple cron monitoring with HealthChecks]({{< relref "simple-cron-monitoring-with-healthchecks" >}})" about deploying my own simple monitoring system.

Now that it's up, I'm only using it for my backups. That's a good use, for sure, but I know I can do better.

So I went digging.

<!--more-->


## Introduction {#introduction}

I host a number of services; some are public, like my blog, while others are private.
These services are not critical; some can be down for short periods of time.
Some services might even be down for longer periods without causing any loss in functionality.

That being said, I'm a _DevOps engineer_. That means I need to know.

Yeah, it doesn't mean I'll do something about it right away, but I'd like to be in the know.

Which got me thinking...


## Healthchecks Endpoints {#healthchecks-endpoints}

Watching **borg** use its _healthchecks_ hook opened my eyes to another feature of **Healthchecks**.

It seems that if you ping

```text
https://healthchecks.example.com/ping/84b2a834-02f5-524f-4c27-a2f24562b219/start
```

it will start a timer that measures the time until you ping

```text
https://healthchecks.example.com/ping/84b2a834-02f5-524f-4c27-a2f24562b219
```

This way, you can find out how long it takes to check on the status of a service. Or maybe, how long a service takes to back up.

It turns out that _healthchecks_ also offers a different endpoint to ping. You can report a failure straight away by pinging

```text
https://healthchecks.example.com/ping/84b2a834-02f5-524f-4c27-a2f24562b219/fail
```

This way, you do not have to wait until the time expires before you get notified of a failure.

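
Since the three endpoints differ only by their suffix, a small helper can build them for you. The following is a minimal sketch; the function name `hc_url` and the event labels (`start`, `success`, `fail`) are my own choices for illustration and not something _healthchecks_ provides.

```sh
#!/bin/bash

# Hypothetical helper: build the ping URL for a given check and event.
# The event labels are made up for this sketch; only the URL suffixes
# come from the endpoints described above.
hc_url() {
    local base_url=$1 check_id=$2 event=${3:-success}

    case "$event" in
        start)   echo "${base_url}/${check_id}/start" ;;
        fail)    echo "${base_url}/${check_id}/fail" ;;
        success) echo "${base_url}/${check_id}" ;;
        *)       echo "unknown event: ${event}" >&2; return 1 ;;
    esac
}
```

Calling `hc_url https://healthchecks.example.com/ping 84b2a834-02f5-524f-4c27-a2f24562b219 start` prints the first URL shown above, ready to be handed to `curl`.
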

With those pieces of knowledge, we can do a lot.


## A lot? {#a-lot}

Yes, a lot...

Let's put what we have learned so far into action.

```sh
#!/bin/bash

# Usage: https_healthchecks_monitor.sh <website-url> <check-id>
WEB_HOST=$1
CHECK_ID=$2

HEALTHCHECKS_HOST="https://healthchecks.example.com/ping"

# Signal the start of the run so healthchecks can time it
curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}/start" > /dev/null

# -f makes curl exit non-zero on HTTP errors (4xx/5xx), not only on connection errors
curl -fsS "${WEB_HOST}" > /dev/null
STATUS=$?

if [[ $STATUS -eq 0 ]]; then
    curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}" > /dev/null
else
    curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}/fail" > /dev/null
fi
```

We start by defining a few variables: the URL of the website to monitor, the check ID provided by _healthchecks_ and finally the _healthchecks_ base link for the monitors.

Once those are set, we simply use `curl` with a couple of special flags to make sure that it fails properly if something goes wrong.

We start the _healthchecks_ timer, run the website check and call either the passing or the failing _healthchecks_ endpoint depending on the outcome.

```text
$ chmod +x https_healthchecks_monitor.sh
$ ./https_healthchecks_monitor.sh https://healthchecks.example.com 84b2a834-02f5-524f-4c27-a2f24562b219
```

Test it out.
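
The same start, check, report pattern is not tied to `curl` against a website; it can wrap any command that exits non-zero on failure. Here is a rough sketch of that idea. The function names `ping_healthchecks` and `monitor` are hypothetical, and the base URL is the same example host used throughout this post.

```sh
#!/bin/bash

HEALTHCHECKS_HOST="https://healthchecks.example.com/ping"

# Send a single ping; kept in its own function so it is easy to swap out or stub.
ping_healthchecks() {
    curl -fsS --retry 3 "$1" > /dev/null
}

# Hypothetical wrapper: monitor <check-id> <command> [args...]
# Pings /start, runs the command, then pings the success or the /fail
# endpoint depending on the command's exit status.
monitor() {
    local check_id=$1
    shift

    ping_healthchecks "${HEALTHCHECKS_HOST}/${check_id}/start"

    if "$@"; then
        ping_healthchecks "${HEALTHCHECKS_HOST}/${check_id}"
    else
        ping_healthchecks "${HEALTHCHECKS_HOST}/${check_id}/fail"
        return 1
    fi
}
```

With this in place, the website check from the script above would become `monitor "$CHECK_ID" curl -fsS -o /dev/null "$WEB_HOST"`, and any other command could take the place of `curl`.
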


## Okay, that's nice but now what! {#okay-that-s-nice-but-now-what}

Now, let's hook it up to our cron.

Start with `crontab -e`, which should open your favorite text editor.

Then create a cron entry (a new line) like the following:

```text
*/15 * * * * /path/to/https_healthchecks_monitor.sh https://healthchecks.example.com 84b2a834-02f5-524f-4c27-a2f24562b219
```

This will run the script every 15 minutes. Make sure that the timeout for this check is 15 minutes, with a grace period of 5 minutes.
With that configuration, you will be notified at most 20 minutes after any failure.
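
To spell out where the 20 minutes come from, here is the arithmetic as a tiny sketch. It assumes that a check is reported as down once the timeout plus the grace period have passed since the last ping, and that a `/fail` ping alerts right away. The variable and function names are made up for the example.

```sh
#!/bin/bash

PERIOD_MIN=15   # how often cron runs the script, also the check's timeout
GRACE_MIN=5     # grace period configured on the check

# Silent failure (no ping at all): the alert fires once timeout + grace
# have passed since the last ping.
silent_failure_delay() {
    echo $(( $1 + $2 ))
}

# Explicit failure (/fail ping): reported by the next cron run,
# so at most one period after the website went down.
explicit_failure_delay() {
    echo "$1"
}

echo "No ping received: alert within $(silent_failure_delay "$PERIOD_MIN" "$GRACE_MIN") minutes"
echo "Website check fails: alert within $(explicit_failure_delay "$PERIOD_MIN") minutes"
```

So the 20 minute worst case is the silent one, where the script never gets to ping at all. A website that goes down while cron keeps running is reported by the next run.
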

Be aware, I said any failure.
Getting notified does not necessarily mean that your website is down.
It only means that _healthchecks_ did not receive a successful ping on time.

Getting notified covers a bunch of cases. Some of them are:

- The server running the cron is down
- The cron service is not running
- The server running the cron lost internet access
- Your certificate expired
- Your website is down
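
When the script does get to run, `curl`'s exit status can help tell some of these cases apart. The following is only a sketch: it maps a handful of `curl`'s documented exit codes to a rough guess at the cause, and the function name `describe_curl_exit` is hypothetical. The first two cases in the list above cannot be detected this way, since nothing runs at all.

```sh
#!/bin/bash

# Map a curl exit status to a rough guess at the cause.
# Only a few of curl's documented exit codes are covered here.
describe_curl_exit() {
    case "$1" in
        0)  echo "ok" ;;
        6)  echo "could not resolve host (DNS or no internet access?)" ;;
        7)  echo "failed to connect (website down?)" ;;
        22) echo "HTTP error returned (requires curl -f)" ;;
        28) echo "timed out" ;;
        60) echo "certificate could not be verified (expired?)" ;;
        *)  echo "curl failed with exit status $1" ;;
    esac
}
```

In the monitoring script, `describe_curl_exit "$STATUS"` could be logged or, assuming your _healthchecks_ version stores the request body of a ping, sent along with the `/fail` ping using `curl --data-raw`.
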

You can create checks to cover most of these if you care to make it a full monitoring system.
If you want to go that far, maybe you should invest in a monitoring system with more features.


## Conclusion {#conclusion}

Don't judge something by its simplicity. Sometimes, out of simple components tied together, you can make something interesting and useful.
With a little scripting, a couple of commands and the power of cron, we were able to make _healthchecks_ monitor our websites.