Using the watchdog
The watchdog system can be configured from the admin shell. The watchdog maintains a list of checks for your machine which are performed every 5-10 minutes. These checks can do one of the following:
- Ping an IP address associated with your machine.
- Connect to a TCP port on your machine and optionally look for a particular “banner response”.
- Try to fetch a web page from a particular IP address and check that it contains a given piece of text.
Each check has an action associated with it, which can be one of two things:
- Send you an email
- Send you an SMS
The email address and SMS phone number are taken from our contact database, so make sure your details are kept up to date via the control panel for these alerts to work properly. Also note that the reboot function will only work once every half an hour, as a safeguard against persistent problems making your machine completely inaccessible.
Some examples
Checking that SSH is still responding
This is often a more useful test than simply pinging your host to determine whether it is “up”, since the Linux kernel will sometimes keep responding to pings even after crashing quite badly.
watchdog add SSHTest TCPBanner 80.68.89.1 22 "SSH-2.0" action sms
Assuming your IP address is 80.68.89.1, this watchdog rule will send you a text message if your SSH daemon ever stops responding, which is usually a fair sign that your machine is in trouble.
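To make the idea of a banner check concrete, here is a minimal Python sketch of what such a test might do. It is not the watchdog's own code: the function names `banner_matches` and `tcp_banner_check`, the 256-byte read, and the 10-second timeout are all assumptions made for illustration.

```python
import socket


def banner_matches(data: bytes, expected: str) -> bool:
    """Return True if the expected text appears in the bytes the server sent."""
    return expected.encode("ascii") in data


def tcp_banner_check(host: str, port: int, expected: str = "",
                     timeout: float = 10.0) -> bool:
    """Connect to host:port and, if `expected` is given, look for it in the banner.

    Hypothetical helper: a refused connection, a timeout, or a missing
    banner all count as a failed check.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            if not expected:
                return True  # a successful connection is enough
            data = conn.recv(256)  # read the first chunk the server sends
            return banner_matches(data, expected)
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

Running `tcp_banner_check("80.68.89.1", 22, "SSH-2.0")` from another machine is a rough way to see what the SSHTest rule above would observe.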
Making sure your web site is still working
If you’re developing a site for a client who is going to be very sensitive to downtime, whatever the cause, it helps to check that a particular page always fetches correctly. You should pick a page on the site which accesses databases or other external resources, and check for output that confirms that such external data has been read correctly.
watchdog add SiteCheck1 HTTPFetch 80.68.89.1 80 www.example.com \
/product.php?id=37 200 "Widgets" action email
watchdog add SiteCheck2 HTTPFetch 80.68.89.1 80 www.example.com \
/product.php?id=37 200 "Widgets" action sms
This pair of rules will make sure that the URL http://www.example.com/product.php?id=37 always fetches successfully (HTTP status code 200) and always contains the phrase “Widgets”. If it does not, you will receive a text message and an email.
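As a rough illustration of what an HTTP fetch check involves, the Python sketch below connects to a given IP address, asks for a path under a given host name, and then checks the status code and page text. It is only a sketch under stated assumptions: `response_ok`, `http_fetch_check`, the UTF-8 decoding and the timeout are hypothetical choices, not the watchdog's real behaviour.

```python
import http.client


def response_ok(status: int, body: str, want_status: int, want_text: str) -> bool:
    """A fetch passes only if the status matches and the text is present."""
    return status == want_status and want_text in body


def http_fetch_check(ip: str, port: int, host: str, path: str,
                     want_status: int, want_text: str,
                     timeout: float = 10.0) -> bool:
    """Fetch `path` from `ip:port`, sending `host` as the Host header.

    Hypothetical helper: any network or protocol error counts as a failure.
    """
    conn = http.client.HTTPConnection(ip, port, timeout=timeout)
    try:
        # Connect to the IP directly but name the site in the Host header,
        # so the check does not depend on DNS resolving correctly.
        conn.request("GET", path, headers={"Host": host})
        resp = conn.getresponse()
        body = resp.read().decode("utf-8", errors="replace")
        return response_ok(resp.status, body, want_status, want_text)
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()
```

With the values from the rules above, the equivalent call would be `http_fetch_check("80.68.89.1", 80, "www.example.com", "/product.php?id=37", 200, "Widgets")`.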
Checking a tunnelled machine is still accessible
Some customers use extra IPs on their machine to tunnel to machines outside of their network. Because tunnelling setups can be quite brittle in the early stages of configuration, it helps to know when they go down so the reason can be investigated quickly.
watchdog add TunnelUp Ping 80.68.89.2 action email
What do the alerts look like?
In an SMS sent by our system, the first word will be either PROBLEM (indicating that a check has failed) or RECOVERY (indicating that the check has succeeded again). The number in brackets is the number of times the message has been sent: repeat PROBLEM alerts will be sent once every 20-30 minutes until the problem is fixed. The message then lists the name of the test which failed and the DNS name of the host concerned.
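The PROBLEM / RECOVERY behaviour described above can be pictured as a small state machine. The Python sketch below is only an illustration of that logic: the names `AlertState` and `next_alert`, the fixed 20-minute repeat interval, and the count of 1 on a RECOVERY message are all assumptions, not details taken from our system.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed repeat interval in seconds; the text only says "every 20-30 minutes".
REPEAT_INTERVAL = 20 * 60


@dataclass
class AlertState:
    failing: bool = False    # is the check currently in a failed state?
    sent_count: int = 0      # PROBLEM alerts sent during the current failure
    last_sent: float = 0.0   # time (in seconds) the last PROBLEM alert went out


def next_alert(state: AlertState, check_passed: bool,
               now: float) -> Optional[Tuple[str, int]]:
    """Return (kind, count) if an alert should be sent now, else None."""
    if not check_passed:
        # Alert on the first failure, then again once the interval has passed.
        if not state.failing or now - state.last_sent >= REPEAT_INTERVAL:
            state.failing = True
            state.sent_count += 1
            state.last_sent = now
            return ("PROBLEM", state.sent_count)
        return None
    if state.failing:
        # The check has succeeded again after a failure: send one RECOVERY.
        state.failing = False
        state.sent_count = 0
        return ("RECOVERY", 1)  # count of 1 is an assumption
    return None
```

Feeding this function one check result at a time gives a first PROBLEM alert immediately, repeats at the chosen interval with an increasing count, and a single RECOVERY once the check passes again.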
