#+BEGIN_COMMENT
.. title: Building up simple monitoring on Healthchecks
.. date: 2020-02-11
.. slug: building-up-simple-monitoring-on-healthchecks
.. updated: 2020-02-11
.. status: published
.. tags: monitoring, healthchecks, cron, curl
.. category: monitoring
.. authors: Elia el Lazkani
.. description:
.. type: text
#+END_COMMENT

I talked previously in "{{% doc %}}simple-cron-monitoring-with-healthchecks{{% /doc %}}" about deploying my own simple monitoring system.

Now that it's up, I'm only using it for my backups. That's a good use, for sure, but I know I can do better.

So I went digging.

{{{TEASER_END}}}

* Introduction
I host a number of services; some are public, like my blog, while others are private.
These services are not critical, and some can be down for short periods of time.
Some might even be down for longer periods without causing any loss in functionality.

That being said, I'm a /DevOps engineer/. That means I need to know.

Yeah, it doesn't mean I'll do something about it right away, but I'd like to be in the know.

Which got me thinking...

* Healthchecks Endpoints
Watching *borg* use its /healthchecks/ hook opened my eyes to another feature of *Healthchecks*.

It seems that if you ping
#+BEGIN_EXAMPLE
https://healthchecks.example.com/ping/84b2a834-02f5-524f-4c27-a2f24562b219/start
#+END_EXAMPLE

it will start a timer that measures the time until you ping
#+BEGIN_EXAMPLE
https://healthchecks.example.com/ping/84b2a834-02f5-524f-4c27-a2f24562b219
#+END_EXAMPLE

This way, you can find out how long a status check on a service takes. Or maybe how long a backup takes to run.

It turns out that /healthchecks/ also offers a different endpoint to ping. You can report a failure straight away by pinging

#+BEGIN_EXAMPLE
https://healthchecks.example.com/ping/84b2a834-02f5-524f-4c27-a2f24562b219/fail
#+END_EXAMPLE

This way, you do not have to wait until the time expires before you get notified of a failure.

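
The three endpoints above can be combined into a small wrapper around any command. This is only a minimal sketch, not something /healthchecks/ ships: =hc_ping= and =hc_wrap= are hypothetical helper names, and the base URL is the placeholder used in the examples above.

#+BEGIN_SRC sh :noeval
# Placeholder base URL, same as in the examples above.
HEALTHCHECKS_HOST="${HEALTHCHECKS_HOST:-https://healthchecks.example.com/ping}"

# hc_ping (hypothetical helper): ping a check, optionally with a
# suffix such as "start" or "fail".
hc_ping() {
    local check_id=$1 suffix=${2:-}
    curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${check_id}${suffix:+/${suffix}}" > /dev/null
}

# hc_wrap (hypothetical helper): start the timer, run the given command,
# then report success or failure depending on its exit status.
hc_wrap() {
    local check_id=$1
    shift
    hc_ping "$check_id" start
    if "$@"; then
        hc_ping "$check_id"
    else
        hc_ping "$check_id" fail
    fi
}
#+END_SRC

Assuming a backup script at the made-up path =/path/to/backup.sh=, it could be used like this:

#+BEGIN_EXAMPLE
$ hc_wrap 84b2a834-02f5-524f-4c27-a2f24562b219 /path/to/backup.sh
#+END_EXAMPLE
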
With those pieces of knowledge, we can do a lot.

* A lot?
Yes, a lot...

Let's put what we have learned so far into action.

#+BEGIN_SRC sh :noeval
#!/bin/bash

# Usage: https_healthchecks_monitor.sh <website URL> <check ID>
WEB_HOST=$1
CHECK_ID=$2

HEALTHCHECKS_HOST="https://healthchecks.example.com/ping"

# Start the healthchecks timer.
curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}/start" > /dev/null

# Check the website; -f makes curl exit non-zero on HTTP errors (4xx/5xx).
curl -fsS "${WEB_HOST}" > /dev/null
STATUS=$?

# Report success or failure depending on the outcome.
if [[ $STATUS -eq 0 ]]; then
    curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}" > /dev/null
else
    curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}/fail" > /dev/null
fi
#+END_SRC

We start by defining a few variables: the URL of the website to monitor, the check ID provided by /healthchecks/, and finally the /healthchecks/ base URL for the pings.

Once those are set, we simply use =curl= with a few flags to make sure that it fails properly if something goes wrong: =-f= makes it return an error on HTTP failures, =-sS= hides the progress output while still printing errors, and =--retry 3= retries on transient failures.

We start the /healthchecks/ timer, run the website check, and call either the passing or the failing /healthchecks/ endpoint depending on the outcome.

#+BEGIN_EXAMPLE
$ chmod +x https_healthchecks_monitor.sh
$ ./https_healthchecks_monitor.sh https://healthchecks.example.com 84b2a834-02f5-524f-4c27-a2f24562b219
#+END_EXAMPLE

Test it out.

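
Since the script takes the website and the check ID as arguments, the same script can cover several services. The following is a rough sketch under a few assumptions of mine: =run_checks= is a hypothetical helper, and the list file format (one "URL CHECK_ID" pair per line, with =#= for comments) is made up for illustration.

#+BEGIN_SRC sh :noeval
# Path to the monitoring script from above; adjust to your setup.
MONITOR="${MONITOR:-/path/to/https_healthchecks_monitor.sh}"

# run_checks (hypothetical helper): read "URL CHECK_ID" pairs, one per
# line, from the file given as $1 and run the monitor on each pair.
run_checks() {
    local checks_file=$1 url check_id
    # The "|| [[ -n ... ]]" part also handles a last line without a newline.
    while read -r url check_id _ || [[ -n "$url" ]]; do
        # Skip blank lines and lines starting with '#'.
        [[ -z "$url" || "$url" == \#* ]] && continue
        # </dev/null keeps the monitor from consuming the rest of the list.
        "$MONITOR" "$url" "$check_id" < /dev/null
    done < "$checks_file"
}
#+END_SRC
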
* Okay, that's nice, but now what!
Now, let's hook it up to our cron.

Start with =crontab -e=, which should open your favorite text editor.

Then create a cron entry (a new line) like the following:

#+BEGIN_EXAMPLE
*/15 * * * * /path/to/https_healthchecks_monitor.sh https://healthchecks.example.com 84b2a834-02f5-524f-4c27-a2f24562b219
#+END_EXAMPLE

This will run the script every 15 minutes. Make sure that your timeout is 15 minutes for this check, with a grace period of 5 minutes.
That configuration guarantees that you will get notified within 20 minutes of any failure, at the worst.

Be aware, I said any failure.
Getting notified does not guarantee that your website is down.
It can only guarantee that /healthchecks/ wasn't pinged on time, or that a failure was reported to it.

Getting notified covers a bunch of cases. Some of them are:
- The server running the cron is down
- The cron service is not running
- The server running the cron lost internet access
- Your certificate expired
- Your website is down

You can create checks to cover most of these if you care to make it a full monitoring system.
If you want to go that far, maybe you should invest in a monitoring system with more features.

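
As an illustration of covering one of those cases separately, here is a minimal sketch of a certificate expiry check. It relies on =openssl x509 -checkend=, which exits with a non-zero status when the certificate expires within the given number of seconds. The helper name =cert_valid_for= and the 7-day default are my own assumptions, not part of the setup above.

#+BEGIN_SRC sh :noeval
# cert_valid_for (hypothetical helper): succeed only if the certificate
# served by HOST on port 443 stays valid for at least DAYS more days.
cert_valid_for() {
    local host=$1 days=${2:-7}
    # Fetch the certificate, then ask openssl whether it expires
    # within the next DAYS days (expressed in seconds).
    openssl s_client -connect "${host}:443" -servername "${host}" < /dev/null 2> /dev/null \
        | openssl x509 -noout -checkend "$(( days * 86400 ))" > /dev/null
}
#+END_SRC

Its exit status can then decide which endpoint to ping, ideally on a separate check with its own ID (the hostname below is a placeholder):

#+BEGIN_EXAMPLE
if cert_valid_for blog.example.com 7; then
    curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}" > /dev/null
else
    curl -fsS --retry 3 "${HEALTHCHECKS_HOST}/${CHECK_ID}/fail" > /dev/null
fi
#+END_EXAMPLE
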
* Conclusion
Don't judge something by its simplicity. Sometimes, out of simple components tied together, you can make something interesting and useful.
With a little scripting, a couple of commands, and the power of cron, we were able to make /healthchecks/ monitor our websites.