#+BEGIN_COMMENT
.. title: Building up simple monitoring on Healthchecks
.. date: 2020-02-11
.. slug: building-up-simple-monitoring-on-healthchecks
.. updated: 2020-02-11
.. status: published
.. tags: monitoring, healthchecks, cron, curl
.. category: monitoring
.. authors: Elia el Lazkani
.. description:
.. type: text
#+END_COMMENT

I talked previously in "{{% doc %}}simple-cron-monitoring-with-healthchecks{{% /doc %}}" about deploying my own simple monitoring system.

Now that it's up, I'm only using it for my backups. That's a good use, for sure, but I know I can do better.

So I went digging.

{{{TEASER_END}}}

* Introduction
I host a number of services. Some are public, like my blog, while others are private.
These services are not critical; some can be down for short periods of time.
Some might even be down for longer periods without causing any loss in functionality.

That being said, I'm a /DevOps engineer/. That means I need to know.

Yeah, it doesn't mean I'll do something about it right away, but I'd like to be in the know.

Which got me thinking...

* Healthchecks Endpoints
Watching *borg* use its /healthchecks/ hook opened my eyes to another feature of *Healthchecks*.

It seems that if you ping
#+BEGIN_EXAMPLE
https://healthchecks.example.com/ping/84b2a834-02f5-524f-4c27-a2f24562b219/start
#+END_EXAMPLE

it will start a timer that measures the time until you ping
#+BEGIN_EXAMPLE
https://healthchecks.example.com/ping/84b2a834-02f5-524f-4c27-a2f24562b219
#+END_EXAMPLE

This way, you can find out how long it takes you to check on the status of a service. Or maybe, how long a service takes to back up.
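
To make the timing idea concrete, here is a minimal sketch of a wrapper that times any command. The =timed_run= name is made up for illustration, and the host is the same placeholder used above. If the command fails, no completion ping is sent and the check simply runs out of time.

#+BEGIN_SRC sh :noeval
#!/bin/bash
# Sketch: time any command with a start ping and a completion ping.
HEALTHCHECKS_HOST="https://healthchecks.example.com/ping"

# timed_run is a hypothetical helper: timed_run CHECK_ID COMMAND [ARGS...]
timed_run() {
    local check_id=$1
    shift
    # Start the timer.
    curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${check_id}/start" > /dev/null
    # Run the command; send the completion ping only if it succeeded.
    "$@" && curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${check_id}" > /dev/null
}
#+END_SRC

It could be called as =timed_run 84b2a834-02f5-524f-4c27-a2f24562b219 /path/to/backup.sh=, where the script path is only a placeholder.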

It turns out that /healthchecks/ also offers a different endpoint to ping. You can report a failure straight away by pinging

#+BEGIN_EXAMPLE
https://healthchecks.example.com/ping/84b2a834-02f5-524f-4c27-a2f24562b219/fail
#+END_EXAMPLE

This way, you do not have to wait until the time expires before you get notified of a failure.

With those pieces of knowledge, we can do a lot.

* A lot?
Yes, a lot...

Let's put what we have learned so far into action.

#+BEGIN_SRC sh :noeval
#!/bin/bash

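# URL of the website to check and the check ID provided by healthchecks
WEB_HOST=$1
CHECK_ID=$2

HEALTHCHECKS_HOST="https://healthchecks.example.com/ping"

# Start the healthchecks timer
curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}/start" > /dev/null

# Check the website; -f makes curl fail on HTTP errors such as 404 or 500
curl -fsS "${WEB_HOST}" > /dev/null
STATUS=$?

if [[ $STATUS -eq 0 ]]; then
    # The website answered: report success, which also stops the timer
    curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}" > /dev/null
else
    # The website check failed: report the failure straight away
    curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}/fail" > /dev/null
fi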
#+END_SRC

We start by defining a few variables for the URL of the website to monitor, the check ID provided by /healthchecks/ and finally the /healthchecks/ base link for the monitors.

Once those are set, we simply use =curl= with a few flags to make sure that it fails properly if something goes wrong: =-f= returns an error on HTTP failures, =-sS= hides the progress output but still shows errors, and =--retry 3= retries transient failures.

We start the /healthchecks/ timer, run the website check and call either the passing or the failing /healthchecks/ endpoint depending on the outcome.
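
The exit status of =curl= mostly tells you whether the transfer itself worked. If you would rather decide from the HTTP status code, a rough sketch could look like the following. The function names and the 30 second limit are my own assumptions; =-w '%{http_code}'= prints the status code, and =curl= reports =000= when no response arrived at all.

#+BEGIN_SRC sh :noeval
#!/bin/bash
# Sketch: judge a website by its HTTP status code.

# Print the HTTP status code for a URL (hypothetical helper).
# The body is discarded; 30 seconds is an arbitrary upper limit.
http_status() {
    curl -sS -o /dev/null --max-time 30 -w '%{http_code}' "$1"
}

# Succeed only for 2xx and 3xx status codes (hypothetical helper).
is_healthy_status() {
    case "$1" in
        2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage inside the monitoring script:
# if is_healthy_status "$(http_status "${WEB_HOST}")"; then
#     curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}" > /dev/null
# else
#     curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}/fail" > /dev/null
# fi
#+END_SRC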

#+BEGIN_EXAMPLE
 $ chmod +x https_healthchecks_monitor.sh
 $ ./https_healthchecks_monitor.sh https://healthchecks.example.com 84b2a834-02f5-524f-4c27-a2f24562b219
#+END_EXAMPLE

Test it out.

* Okay, that's nice, but now what?
Now, let's hook it up to our cron.

Start with =crontab -e= which should open your favorite text editor.

Then create a cron entry (a new line) like the following:

#+BEGIN_EXAMPLE
 */15 * * * * /path/to/https_healthchecks_monitor.sh https://healthchecks.example.com 84b2a834-02f5-524f-4c27-a2f24562b219
#+END_EXAMPLE

This will run the script every 15 minutes. Make sure that the timeout (period) of this check is set to 15 minutes, with a grace period of 5 minutes.
That configuration guarantees that you will get notified within 20 minutes of any failure, at worst.
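
The arithmetic is simple enough to sketch. As far as I know, the /healthchecks/ management API takes both values in seconds, under the names =timeout= and =grace=. The helper names below are made up, and the API key in the usage comment is a placeholder you would have to create in your own project.

#+BEGIN_SRC sh :noeval
#!/bin/bash
# Sketch: timing values for a check driven by a 15 minute cron schedule.

# Worst-case delay, in minutes, between a failure and the alert.
worst_case_delay() {
    local period_min=$1
    local grace_min=$2
    echo $(( $period_min + $grace_min ))
}

# JSON body for creating a check; timeout and grace are in seconds.
check_payload() {
    local name=$1
    local period_min=$2
    local grace_min=$3
    printf '{"name": "%s", "timeout": %d, "grace": %d}\n' \
        "$name" $(( $period_min * 60 )) $(( $grace_min * 60 ))
}

# Usage (API_KEY is an assumption, not something from this post):
# curl -fsS --header "X-Api-Key: ${API_KEY}" \
#     --data "$(check_payload blog 15 5)" \
#     "https://healthchecks.example.com/api/v1/checks/"
#+END_SRC

With a period of 15 minutes and a grace time of 5, =worst_case_delay 15 5= prints =20=, and the payload carries 900 and 300 seconds.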

Be aware, I said any failure.
Getting notified does not necessarily mean that your website is down.
It only means that /healthchecks/ wasn't pinged on time, or that the failure endpoint was pinged.

Getting notified covers a bunch of cases. Some of them are:
  - The server running the cron is down
  - The cron service is not running
  - The server running the cron lost internet access
  - Your certificate expired
  - Your website is down

You can create checks to cover most of these if you care to make it a full monitoring system.
If you want to go that far, maybe you should invest in a monitoring system with more features.
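
As one small example of such an extra check, an expiring certificate could get a check of its own. This is only a sketch, assuming =openssl= is installed on the machine running cron and that the site listens on port 443. The =cert_valid_for= name, the 14 day threshold and the =CERT_CHECK_ID= variable are all made up for illustration.

#+BEGIN_SRC sh :noeval
#!/bin/bash
# Sketch: succeed only if a host's certificate stays valid for N more days.
# cert_valid_for is a hypothetical helper: cert_valid_for HOST DAYS
cert_valid_for() {
    local host=$1
    local days=$2
    # Fetch the certificate, then ask openssl whether it expires within
    # the given number of seconds; -checkend exits non-zero if it does.
    # A failed connection also ends up as a non-zero exit status.
    openssl s_client -connect "${host}:443" -servername "${host}" \
        < /dev/null 2> /dev/null \
        | openssl x509 -noout -checkend $(( $days * 86400 )) > /dev/null
}

# Usage, with a separate check ID for the certificate:
# if cert_valid_for example.com 14; then
#     curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CERT_CHECK_ID}" > /dev/null
# else
#     curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CERT_CHECK_ID}/fail" > /dev/null
# fi
#+END_SRC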

* Conclusion
Don't judge something by its simplicity. Sometimes, out of simple components tied together, you can make something interesting and useful.
With a little scripting, a couple of commands and the power of cron, we were able to make /healthchecks/ monitor our websites.