Using Watchdog to Always Keep a Machine Running

Watchdog is a daemon / subsystem used to monitor the basic health of a machine. If something goes wrong, such as a crashing program overloading the CPU, or no more free memory on the system, watchdog can safely reboot the machine, allowing your service to keep on going rather than wait for someone to come kick the machine.

Or, better yet, it has a repair option to run a script you give it that will attempt to repair the system, and only reboot as a last ditch effort.

How it does it

Watchdog has a list of options for what it can monitor. These include:

  • Ping a host, showing that the network is reachable.
  • Check whether a file has been modified within a set time, such as a log file showing that there is system activity.
  • Verify the load is under a set amount.
  • Check that memory can be allocated. If not, programs won’t start or will crash.

You can set any, all, or none of these options.

Watchdog is careful about the reboot. When it reboots the machine, it does so in a controlled manner. No sudden reboot. It kills processes, syncs the disks, unmounts the filesystems, and otherwise does a full reboot like the system normally would.

To protect against a situation where watchdog dies or otherwise stops running, a device can be created that requires a write within a minute; if no write arrives, the kernel triggers a reboot of the machine. This is optional, but when used, watchdog will write to this device every second, resetting the kernel’s timer on it.

In the below example, we will use the softdog module to make this device, though there are hardware options out there that can do this. The device will be called /dev/watchdog.

Note that watchdog may not help with kernel panics and/or system hangs. For that I would suggest setting actions for these specific cases, found in this tutorial.

Installing

# Redhat/Fedora/Centos
dnf install watchdog

# Ubuntu/Debian
apt-get install watchdog

# Arch
pacman -Sy watchdog

And then disable it. Yep, that’s right. Unless you know what you’re doing, you should disable it until you know it’s working. Distros enable it because in their default config they disable most of the checks. We will be modifying those, so let’s play it safe.

systemctl disable watchdog

If it’s enabled, and you mis-configure it (like setting the max load to “1”), the machine will immediately reboot, and when it comes up it will reboot again, because watchdog is enabled at startup and it sees that things are still “failing”.

Setting it up

The main configuration file is /etc/watchdog.conf. This is where we will set which tests to run, where our repair script is, and to use the watchdog device.

Then there is a file that has our watchdog service settings, such as the options to give the watchdog program. This is usually /etc/default/watchdog or /etc/sysconfig/watchdog depending on your system.

The following options check for certain items, and upon failure, it will run the repair binary and/or reboot.

Checking the load

This is my go to check for watchdog. It just makes sure there isn’t a huge load, and that the system hasn’t crashed. Yes, when a system crashes, it’s probably still running. But the load will skyrocket. How high? I’ve had a system that once hit 2000 for the load! How it was still running is beyond me.

The default values will have it be around 10 or 30 max. I think that’s too low, as I have machines do nightly cron jobs that will cause a short spike above 50. But they are still running just fine, and I don’t see the point of rebooting just for that.

What you set them to will depend on your setup, the jobs the machine is doing, and how high your loads go. Does it sometimes skyrocket? Do you mind if it starts chugging, or do you just want to have it reboot? Look at the load history of your server (you have that, right?).

It’s simply a balance between not wanting to reboot while the machine is doing a lot of work, and not wanting to wait forever when issues are happening.

The first one is max-load-1 which is the 1 minute load, and then max-load-5 for the 5 minute and max-load-15 for the 15 minute.

For small systems I like to have max-load-1 be 150 times the number of CPUs I have, up to 4 CPUs, where I drop it to 100 times. This allows it to spike for a bit if needed.

And then for the others, I drop it to 100 for the 5 minute load, and 50 for the 15 minute. That may still be too high, but they seem to work. You may need to tune them for your own workload.

So for a single CPU system, I would set it to be:

max-load-1		= 150
max-load-5		= 100
max-load-15		= 50

And for 16 cores:

max-load-1		= 1600
max-load-5		= 1000
max-load-15		= 800
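
The rule of thumb for max-load-1 can be turned into a small helper. This is only a sketch: the function name suggest_max_load1 is made up for illustration, and it assumes "up to 4 CPUs" means machines with 4 or fewer CPUs get the 150x multiplier. It only covers the 1 minute value; the 5 and 15 minute values are still a judgment call.

```shell
#!/bin/sh
# Sketch: suggest a max-load-1 value from the CPU count, following the
# rule of thumb above. suggest_max_load1 is a hypothetical helper name.
suggest_max_load1() {
    cpus="$1"
    if [ "$cpus" -le 4 ]; then
        # Assumption: machines with 4 or fewer CPUs get 150 per CPU.
        echo $((cpus * 150))
    else
        # Larger machines drop to 100 per CPU.
        echo $((cpus * 100))
    fi
}

# Example: use the CPU count reported by nproc (GNU coreutils).
echo "max-load-1 = $(suggest_max_load1 "$(nproc)")"
```

Compare the result against the load history of the machine before trusting it.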

Memory

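The min-memory option checks that at least the given amount of memory is still free, and allocatable-memory will attempt to allocate (use) the given amount of memory. Both values are counted in memory pages.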

min-memory		= 1
allocatable-memory	= 1
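
If you would rather think in megabytes, a quick conversion helps. This is a sketch that assumes, as the watchdog.conf man page describes it, that both options are counted in memory pages. The helper name mb_to_pages is made up for illustration.

```shell
#!/bin/sh
# Sketch: convert megabytes to memory pages.
# mb_to_pages is a hypothetical helper name.
mb_to_pages() {
    page_size="$(getconf PAGESIZE)"   # bytes per page, commonly 4096
    echo $(( $1 * 1024 * 1024 / page_size ))
}

# Example: how many pages 64 MB corresponds to on this machine.
echo "64 MB = $(mb_to_pages 64) pages"
```

The printed number is what would go into min-memory or allocatable-memory for that amount.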

Network checks

These are basic pings and monitoring of the network. Half the time I don’t use them.

This will ping a server. You could set it to your router. Just know that if the router goes down, the machine will reboot.

ping			= 172.31.14.1

Monitor this interface for traffic. Note that if this is a slow site that doesn’t see much traffic, it will reboot.

interface		= eth0

Files and pidfiles

Here you can monitor a file for changes. If the file has not been modified within the change interval (given in seconds), it will reboot.

#file			= /var/log/messages
#change			= 1407
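
Before picking a change value, it helps to see how stale the file actually gets on a quiet day. A minimal sketch, assuming GNU stat and assuming watchdog compares the file's modification time in seconds. The helper name file_age_seconds is made up for illustration.

```shell
#!/bin/sh
# Sketch: print the number of seconds since a file was last modified.
# file_age_seconds is a hypothetical helper; stat -c is GNU coreutils syntax.
file_age_seconds() {
    now="$(date +%s)"
    mtime="$(stat -c %Y "$1")" || return 1   # fails if the file is missing
    echo $((now - mtime))
}

# Example (path taken from the config above; use a log that exists on your system):
#   file_age_seconds /var/log/messages
```

Run it a few times over a day and set change comfortably above the largest value you see.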

I don’t use the pidfile option. In my opinion, it’s better to use monit, which can both monitor the pid file and actually test a connection to the service.

Test script

You can run your own test script. Any exit code other than 0 will be seen as an error and cause watchdog to reboot the system.

test-binary  = /path/to/your/check_script
test-timeout = <timeout in seconds>
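
As an illustration of what such a check script could look like, here is a sketch that fails when a filesystem is nearly full. The function name check_disk_free and the thresholds are made up for this example. The only thing watchdog cares about is the exit code.

```shell
#!/bin/sh
# Sketch of a check for test-binary. check_disk_free is a hypothetical helper:
# it succeeds when the filesystem holding PATH has at least MIN percent free.
check_disk_free() {
    path="$1"
    min_free="$2"
    # df -P prints POSIX-format output; column 5 of line 2 is the used
    # capacity, for example "42%".
    used="$(df -P "$path" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')"
    [ -n "$used" ] || return 2          # could not read disk usage
    [ $((100 - used)) -ge "$min_free" ]
}

# In a real script the last line would be something like:
#   check_disk_free / 5
# so the script exits 0 when healthy and non-zero otherwise.
```

Keep the check fast, since it has to finish within test-timeout.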

Repair script

The repair binary is a program you create that tries to fix the machine rather than having to do a full reboot. The script should exit 0 if the system was repaired; any other exit code tells watchdog to go ahead with the reboot.

repair-binary  = /path/to/your/repair_script
# how long to allow the script to run in seconds
repair-timeout = 30
# number of times repair script is allowed to fix the issue before we just reboot
repair-maximum = 3
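
A repair script can be very small. The sketch below assumes, as I understand the watchdog man page, that the error code that triggered the repair is passed as the first argument. REPAIR_CMD and the service name myservice are hypothetical placeholders for whatever actually fixes your machine.

```shell
#!/bin/sh
# Sketch of a repair script. Assumption: watchdog passes the error code
# that triggered the repair as the first argument.
# REPAIR_CMD and "myservice" are hypothetical placeholders.
REPAIR_CMD="${REPAIR_CMD:-systemctl restart myservice}"

attempt_repair() {
    code="$1"
    echo "watchdog error code ${code}; trying: ${REPAIR_CMD}" >&2
    # Unquoted on purpose so the command and its arguments are split.
    if $REPAIR_CMD; then
        return 0    # repaired: watchdog carries on without rebooting
    fi
    return 1        # not repaired: watchdog proceeds with the reboot
}

# In a real script the last line would be:
#   attempt_repair "$1"
```

Make sure the repair command finishes well inside repair-timeout.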

If this machine is part of a cluster/swarm/kubernetes, you can have your repair script fence off the machine and otherwise get it out of the cluster, offloading its containers and jobs to other machines.

Watchdog Device (Optional)

As said before, this device is written to every second by watchdog, and if it’s not written to within 60 seconds, the machine will reboot.

First, we set watchdog to use the watchdog device in /etc/watchdog.conf:

watchdog-device	= /dev/watchdog

Then in /etc/default/watchdog or /etc/sysconfig/watchdog, we set the service to load the softdog module, which will create the watchdog device. This is done with the watchdog_module option.

watchdog_module="softdog"

No Action

It is advisable to set it to testing mode. Or you may get kicked out by a random reboot.

Add the --no-action option to the watchdog_options in /etc/default/watchdog or /etc/sysconfig/watchdog. This will cause watchdog to only say it’s going to reboot, without actually rebooting. We will also add the -v option so we can see the checks that are happening.

watchdog_options="-v --no-action"

You may have to add the watchdog_options line, as it’s usually not there.

As a reminder, make sure you’ve disabled the watchdog service, so it won’t keep on rebooting your machine if something is not configured properly.

Running it

Now that you’ve configured it, go ahead and start it. (Or restart it if it’s running)

systemctl start watchdog

# or
service watchdog start

And watch the logs

journalctl -f

You should see something like

... watchdog[3485]: currently there are 99748 + 1749500 kB of free memory+swap available
... watchdog[3485]: got answer on ping=1 from target w.x.y.z      time=2.805ms
... watchdog[3485]: still alive after 17 interval(s)
... watchdog[3485]: current load is 0 0 0

Check that everything is working properly, and fix any errors or problems.

Check watchdog device

If you enabled the watchdog device, then check that the softdog module was loaded

lsmod | grep softdog

It should list the module in the output.

softdog                16384  2

And check that the watchdog device exists.

ls /dev/watchdog
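
The same check can be scripted. This sketch only tests that the path is a character device and never opens it, since opening /dev/watchdog can start the countdown. The helper name is_char_device is made up for illustration.

```shell
#!/bin/sh
# Sketch: confirm a path is a character device without opening it.
# is_char_device is a hypothetical helper name.
is_char_device() {
    [ -c "$1" ]
}

if is_char_device /dev/watchdog; then
    echo "/dev/watchdog is present"
else
    echo "/dev/watchdog is missing; is the softdog module loaded?"
fi
```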

For real

Once you’ve got it set up the way you want, and there aren’t any issues, we get to run it for real. In /etc/default/watchdog or /etc/sysconfig/watchdog, either comment out or remove the no-action line:

# watchdog_options="-v --no-action"

And restart watchdog

systemctl restart watchdog

Check the logs for any issues. Once you’re set, enable watchdog.

systemctl enable watchdog

And you’re set.

No way out (Optional)

The watchdog device can be configured to always require a write, meaning that its countdown can’t be stopped, even if watchdog is shut down properly. This can prevent someone from shutting down the service, which would close the device and undo all your hard work.

Or just because you want to.

This is accomplished by using the softdog option nowayout, making it so once loaded, the countdown cannot be stopped.

Be careful with this option, as it will reboot your machine if you stop the daemon.

Modify the watchdog_module option in /etc/default/watchdog or /etc/sysconfig/watchdog. Or add it.

watchdog_module="softdog nowayout"

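If you want to get really serious, you could have the module loaded at system startup by adding it to /etc/modules-load.d/softdog.conf. Files in modules-load.d only list module names, so the nowayout option goes in /etc/modprobe.d/softdog.conf instead. However, this means that if the watchdog service doesn’t start, the machine gets rebooted.

# /etc/modules-load.d/softdog.conf
softdog
# /etc/modprobe.d/softdog.conf
options softdog nowayout=1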

Don’t blindly add that last one. Read the warning above it.

Conclusion

Watchdog is a great addition to help keep the system running, along with the sysctl options for kernel panics.

Other resources