+++
title = "Building up simple monitoring on Healthchecks"
author = ["Elia el Lazkani"]
date = 2020-02-11T21:00:00+01:00
lastmod = 2021-06-28T00:01:20+02:00
tags = ["healthchecks", "cron", "curl"]
categories = ["monitoring"]
draft = false
+++

I talked previously in "[Simple cron monitoring with HealthChecks]({{< relref "simple-cron-monitoring-with-healthchecks" >}})" about deploying my own simple monitoring system.

Now that it's up, I'm only using it for my backups. That's a good use, for sure, but I know I can do better.

So I went digging.

Introduction

I host a number of services. Some are public, like my blog, while others are private. None of them are critical: some can be down for short periods of time, and some might even be down for longer periods without causing any loss in functionality.

That being said, I'm a DevOps engineer. That means I need to know.

Yeah, it doesn't mean I'll do something about it right away, but I'd like to be in the know.

Which got me thinking...

Healthchecks Endpoints

Watching borg use its healthchecks hook opened my eyes to another feature of Healthchecks.

It seems that if you ping

https://healthchecks.example.com/ping/84b2a834-02f5-524f-4c27-a2f24562b219/start

it will start a timer that measures the time until you ping

https://healthchecks.example.com/ping/84b2a834-02f5-524f-4c27-a2f24562b219

This way, you can find out how long it takes to check on the status of a service. Or maybe how long a service takes to back up.

It turns out that healthchecks also offers a different endpoint to ping. You can report a failure straight away by pinging

https://healthchecks.example.com/ping/84b2a834-02f5-524f-4c27-a2f24562b219/fail

This way, you do not have to wait until the time expires before you get notified of a failure.
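As a rough sketch of how the three endpoints fit together around any command (a backup job, for instance), they can be wrapped in a small helper. The names `hc_url` and `hc_run` are hypothetical, not something Healthchecks provides, and the base URL is the same example value used throughout this post.

```bash
#!/bin/bash

# Assumed base URL of the Healthchecks instance (example value).
HEALTHCHECKS_HOST="${HEALTHCHECKS_HOST:-https://healthchecks.example.com/ping}"

# hc_url CHECK_ID [EVENT]: print the ping URL for an event.
# EVENT is "start", "fail", or empty for a success ping.
hc_url() {
    local check_id="$1" event="${2:-}"
    if [ -n "$event" ]; then
        printf '%s/%s/%s\n' "$HEALTHCHECKS_HOST" "$check_id" "$event"
    else
        printf '%s/%s\n' "$HEALTHCHECKS_HOST" "$check_id"
    fi
}

# hc_run CHECK_ID COMMAND [ARGS...]: ping /start, run the command,
# then ping success or /fail depending on the command's exit status.
hc_run() {
    local check_id="$1" status=0
    shift
    curl -fsS --retry 3 "$(hc_url "$check_id" start)" > /dev/null
    "$@" || status=$?
    if [ "$status" -eq 0 ]; then
        curl -fsS --retry 3 "$(hc_url "$check_id")" > /dev/null
    else
        curl -fsS --retry 3 "$(hc_url "$check_id" fail)" > /dev/null
    fi
    return "$status"
}
```

With this in place, something like `hc_run 84b2a834-02f5-524f-4c27-a2f24562b219 /path/to/some_job.sh` would time the job and report its result, where the job path is only a placeholder.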

With those pieces of knowledge, we can do a lot.

A lot?

Yes, a lot...

Let's put what we have learned so far into action.

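#!/bin/bash

# Usage: https_healthchecks_monitor.sh <website URL> <check ID>
WEB_HOST=$1
CHECK_ID=$2

HEALTHCHECKS_HOST="https://healthchecks.example.com/ping"

# Start the Healthchecks timer
curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}/start" > /dev/null

# Check the website; -f makes curl exit non-zero on HTTP errors (4xx/5xx)
OUTPUT=$(curl -fsS "${WEB_HOST}")
STATUS=$?

# Report success or failure
if [[ $STATUS -eq 0 ]]; then
    curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}" > /dev/null
else
    curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}/fail" > /dev/null
fi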

We start by defining a few variables: the URL of the website to monitor, the check ID provided by Healthchecks and finally the Healthchecks base link for the pings.

Once those are set, we simply use curl with a couple of special flags to make sure that it fails properly if something goes wrong: -f makes it fail on HTTP errors, -sS keeps it quiet while still printing error messages, and --retry 3 retries transient failures.

We start the Healthchecks timer, run the website check and call either the passing or the failing Healthchecks endpoint depending on the outcome.
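One possible refinement, sketched here under a few assumptions: Healthchecks accepts POST pings and keeps the request body, so curl's error message can travel with the failure ping. `check_site` is a hypothetical function name, the 10-second `--max-time` is an arbitrary illustrative value, and how ping bodies are displayed depends on your Healthchecks version.

```bash
#!/bin/bash

# Assumed base URL of the Healthchecks instance (example value).
HEALTHCHECKS_HOST="${HEALTHCHECKS_HOST:-https://healthchecks.example.com/ping}"

# check_site WEB_HOST CHECK_ID
# Same flow as the script above, but the failure ping carries curl's
# error message in its request body.
check_site() {
    local web_host="$1" check_id="$2" error status=0

    curl -fsS --retry 3 --max-time 10 "${HEALTHCHECKS_HOST}/${check_id}/start" > /dev/null

    # Discard the page body and capture only what curl writes to stderr.
    error=$(curl -fsS --max-time 10 "$web_host" 2>&1 > /dev/null) || status=$?

    if [ "$status" -eq 0 ]; then
        curl -fsS --retry 3 --max-time 10 "${HEALTHCHECKS_HOST}/${check_id}" > /dev/null
    else
        # --data-raw turns the ping into a POST with the message as its body.
        curl -fsS --retry 3 --max-time 10 --data-raw "exit ${status}: ${error}" \
            "${HEALTHCHECKS_HOST}/${check_id}/fail" > /dev/null
    fi
    return "$status"
}
```

The `--max-time` flag is there so that a hanging website cannot keep the cron job running until the next one starts.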

 $ chmod +x https_healthchecks_monitor.sh
 $ ./https_healthchecks_monitor.sh https://healthchecks.example.com 84b2a834-02f5-524f-4c27-a2f24562b219

Test it out.

Okay, that's nice, but now what?

Now, let's hook it up to our cron.

Start with crontab -e, which should open your favorite text editor.

Then create a cron entry (a new line) like the following:

 */15 * * * * /path/to/https_healthchecks_monitor.sh https://healthchecks.example.com 84b2a834-02f5-524f-4c27-a2f24562b219

This will run the script every 15 minutes. Make sure that the timeout (period) for this check is 15 minutes, with a grace period of 5 minutes. With that configuration, you will be notified at worst about 20 minutes after any failure.

Be aware, I said any failure. Getting notified does not necessarily mean that your website is down. It only means that Healthchecks received a failure ping or wasn't pinged on time.
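The arithmetic behind those numbers is small enough to sketch. The Healthchecks management API takes the timeout and grace period in seconds, so the helper below converts from minutes. The function names, the check name "blog" and the `HC_API_KEY` variable are assumptions for illustration, and the API path should be verified against your installed version.

```bash
#!/bin/bash

# check_payload NAME PERIOD_MINUTES GRACE_MINUTES
# Print a JSON body with the timeout and grace period in seconds.
# NAME must not contain double quotes; this sketch does no escaping.
check_payload() {
    local name="$1" period_min="$2" grace_min="$3"
    printf '{"name": "%s", "timeout": %d, "grace": %d}\n' \
        "$name" "$((period_min * 60))" "$((grace_min * 60))"
}

# worst_case_minutes PERIOD_MINUTES GRACE_MINUTES
# Worst case: the service breaks right after a successful ping, so the
# alert arrives after one full period plus the grace time.
worst_case_minutes() {
    echo "$(( $1 + $2 ))"
}

# Hypothetical usage (needs a real API key, so it is left commented out):
# curl -fsS --retry 3 \
#     --header "X-Api-Key: ${HC_API_KEY}" \
#     --data "$(check_payload "blog" 15 5)" \
#     "https://healthchecks.example.com/api/v1/checks/"
```

For the cron entry above, `check_payload "blog" 15 5` yields a timeout of 900 seconds and a grace period of 300 seconds.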

Getting notified covers a bunch of cases. Some of them are:

  • The server running the cron is down
  • The cron service is not running
  • The server running the cron lost internet access
  • Your certificate expired
  • Your website is down

You can create checks to cover most of these if you care to make it a full monitoring system. If you want to go that far, maybe you should invest in a monitoring system with more features.
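Some of the cases listed above can be told apart from the machine running the check, because curl reports a different exit status for each kind of failure. The mapping below is a minimal sketch covering only a handful of documented curl exit codes; `describe_curl_exit` is a hypothetical helper name. A server or cron daemon that is down cannot report anything, so those cases still only show up as a missed ping.

```bash
#!/bin/bash

# describe_curl_exit STATUS: print a short reason for a curl exit status.
describe_curl_exit() {
    case "$1" in
        0)  echo "ok" ;;
        6)  echo "could not resolve host" ;;
        7)  echo "could not connect to host" ;;
        22) echo "HTTP error response (requires curl -f)" ;;
        28) echo "timed out" ;;
        35) echo "TLS handshake failed" ;;
        60) echo "certificate could not be verified (for example, expired)" ;;
        *)  echo "curl failed with exit status $1" ;;
    esac
}
```

The resulting text could be logged locally or sent along with the failure ping, so the notification hints at whether the website or its certificate is the problem.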

Conclusion

Don't judge something by its simplicity. Sometimes, out of simple components tied together, you can make something interesting and useful. With a little scripting, a couple of commands and the power of cron, we were able to make Healthchecks monitor our websites.