Watchdog is a daemon / subsystem used to monitor the basic health of a machine. If something goes wrong, such as a crashing program overloading the CPU, or no more free memory on the system, watchdog can safely reboot the machine, allowing your service to keep on going rather than wait for someone to come kick the machine.
Or, better yet, it has a repair option to run a script you give it that will attempt to repair the system, and only reboot as a last-ditch effort.
How it does it
Watchdog has a list of options for what it can monitor. These include:
- Ping a host, showing that the network is reachable.
- Check whether a file has been modified in the last x minutes, such as a log file, showing that there is system activity.
- Verify that the load is under a set amount.
- Check whether memory can be allocated. If not, programs won’t start / will crash.
You can set any, all, or none of these options.
Watchdog is careful about the reboot. When it reboots the machine, it does so in a controlled manner. No sudden reboot. It kills processes, syncs the disks, unmounts the filesystems, and otherwise does a full reboot like the system normally would.
To protect against a situation where watchdog dies or otherwise stops running, a device can be created that requires a write within a minute; otherwise it will trigger a reboot of the machine. This is optional, but when it is used, watchdog will write to this device every second, resetting the kernel’s timer on it.
In the below example, we will use the softdog module to make this device, though there are hardware options out there that can do this. The device will be called
Note that watchdog may not help with kernel panics and/or system hangs. For that I would suggest setting actions for these specific cases, found in this tutorial.
# Redhat/Fedora/Centos
dnf install watchdog
# Ubuntu/Debian
apt-get install watchdog
# Arch
pacman -Sy watchdog
And then disable it. Yep, that’s right. Unless you know what you’re doing, you should disable it until you know it’s working. Distros enable it because in their default config they disable most of the checks. We will be modifying those, so let’s play it safe.
systemctl disable watchdog
If it’s enabled, and you mis-configure it (like setting the max load to “1”), the machine will immediately reboot, and when it comes up it will reboot again, because watchdog is enabled at startup and it sees that things are still “failing”.
Setting it up
The main configuration file is /etc/watchdog.conf. This is where we will set which tests to run, where our repair script is, and to use the watchdog device.
Then there is a file that has our watchdog service settings, such as the options to give the watchdog program. This is usually /etc/sysconfig/watchdog, depending on your system.
The following options check for certain items, and upon failure, watchdog will run the repair binary and/or reboot.
Checking the load
This is my go-to check for watchdog. It just makes sure there isn’t a huge load, and that the system hasn’t crashed. Yes, when a system crashes, it’s probably still running. But the load will skyrocket. How high? I’ve had a system that once hit 2000 for the load! How it was still running is beyond me.
The default values will have it be around 10 or 30 max. I think that’s too low, as I have machines whose nightly cron jobs cause a short spike above 50. But they are still running just fine, and I don’t see the point of rebooting just for that.
What you set them to will depend on your setup, the jobs the machine is doing, and how high your loads go. Does it sometimes skyrocket? Do you mind if it starts chugging, or do you just want to have it reboot? Look at the load history of your server (you have that, right?).
It’s simply a balance between not wanting to reboot while the machine is doing a lot of work, and not wanting to wait forever when issues are happening.
The first one is max-load-1, which is the 1 minute load, and then max-load-5 for the 5 minute and max-load-15 for the 15 minute.
For small systems I like to have max-load-1 be 150 times the number of CPUs I have, up to 4 CPUs, where I drop it to 100 times. This allows it to spike for a bit if needed.
And then for the others, I drop it to 100 for the 5 minute load, and 50 for the 15 minute. That may still be too high, but they seem to work. You may
So for a single CPU system, I would set it to be:
max-load-1 = 150
max-load-5 = 100
max-load-15 = 50
And for 16 cores:
max-load-1 = 1600
max-load-5 = 1000
max-load-15 = 800
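Before committing to numbers, it can help to compare candidate limits against what the machine is actually doing. The following is a minimal sketch, not watchdog's own logic: load_over_limits is a made-up helper that takes a line in /proc/loadavg format plus three limits, and whether watchdog itself compares with "greater than" or "greater or equal" is an assumption here.

```shell
#!/bin/sh
# load_over_limits LOADAVG_LINE MAX1 MAX5 MAX15
# Succeeds (exit status 0) if any of the three load averages in
# LOADAVG_LINE is above its limit, and fails otherwise.
load_over_limits() {
    echo "$1" | awk -v m1="$2" -v m5="$3" -v m15="$4" '
        { exit !($1 > m1 || $2 > m5 || $3 > m15) }'
}

# Example: check the live values against the single-CPU limits.
# if load_over_limits "$(cat /proc/loadavg)" 150 100 50; then
#     echo "these limits would count the current load as a failure"
# fi
```

Running this from cron for a few days with your candidate limits is one cheap way to see whether a nightly job would have tripped them.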
This test will attempt to allocate (use) the given amount of memory. Both values are given in memory pages, not in bytes or kB.
min-memory = 1
allocatable-memory = 1
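To get a feel for how much room the machine has before picking values, here is a rough sketch that adds up free memory and free swap from /proc/meminfo and converts kB to pages. Both helper names are made up, the page conversion assumes the settings are counted in memory pages (as the watchdog.conf man page describes), and the exact figure watchdog computes internally may differ.

```shell
#!/bin/sh
# free_plus_swap_kb: read /proc/meminfo-style text on stdin and print
# MemFree + SwapFree in kB.
free_plus_swap_kb() {
    awk '$1 == "MemFree:" || $1 == "SwapFree:" { sum += $2 }
         END { print sum + 0 }'
}

# kb_to_pages KB PAGE_SIZE_BYTES: convert kB to whole memory pages.
kb_to_pages() {
    echo $(( $1 * 1024 / $2 ))
}

# Example on a live system:
#   kb_to_pages "$(free_plus_swap_kb < /proc/meminfo)" "$(getconf PAGESIZE)"
```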
These are basic pings and monitoring of the network. Half the time I don’t use them, but
This will ping a server. You could set it to your router. Just know that if the router goes down, the machine will reboot.
ping = 172.31.14.1
Monitor this interface for traffic. Note that if this is a slow site that doesn’t see much traffic, it will reboot.
interface = eth0
Files and pidfiles
Here you can monitor a file for changes. If no changes happen in change minutes, it will reboot.
#file = /var/log/messages
#change = 1407
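If you want to see how often a file really changes before you point watchdog at it, a small staleness check is easy to sketch. file_is_stale is a made-up helper, the age limit in seconds is this sketch's own parameter, treating a missing file as stale is my choice, and stat -c is the GNU form of the command.

```shell
#!/bin/sh
# file_is_stale PATH MAX_AGE_SECONDS
# Succeeds if PATH is missing or was last modified more than
# MAX_AGE_SECONDS ago; fails if the file is fresh.
file_is_stale() {
    [ -e "$1" ] || return 0
    now=$(date +%s)
    mtime=$(stat -c %Y "$1")   # GNU stat; BSD stat needs other flags
    [ $(( now - mtime )) -gt "$2" ]
}

# Example:
# file_is_stale /var/log/messages 600 && echo "no log activity for 10 minutes"
```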
I don’t use the pidfile option. In my opinion, it’s better to use monit, which can both monitor the pid file and actually test a connection to the service.
You can run your own test script. Any exit code other than 0 will be seen as an error and cause watchdog to reboot the system.
test-binary = /path/to/your/check_script
test-timeout = <timeout in seconds>
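As an illustration of the exit code contract, here is a small sketch of what a check script could be built from. The disk usage check and all three function names are my own example, not something watchdog ships with.

```shell
#!/bin/sh
# usage_ok USED_PERCENT LIMIT_PERCENT
# Succeeds if usage is at or below the limit.
usage_ok() {
    [ "$1" -le "$2" ]
}

# root_usage_percent: print how full the root filesystem is, as a bare number.
root_usage_percent() {
    df -P / | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# check_root_disk LIMIT_PERCENT
# Returns 0 when the root filesystem is within the limit, non-zero otherwise.
check_root_disk() {
    usage_ok "$(root_usage_percent)" "$1"
}
```

A check script would end with something like check_root_disk 95 as its last command, so that the function's status becomes the script's exit code. The 95 is an arbitrary example value.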
Repair binary is a program you create that tries to fix the machine rather than having to do a full reboot. The script should exit 0 if the system was repaired, and non-zero if it was not.
repair-binary = /path/to/your/repair_script
# how long to allow the script to run in seconds
repair-timeout = 30
# number of times repair script is allowed to fix the issue before we just reboot
repair-maximum = 3
If this machine is part of a cluster/swarm/kubernetes, you can have your repair script fence off the machine, or otherwise get it out of the cluster, and offload the containers and jobs to other machines.
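The shape of a repair script can be sketched as a simple dispatch on the error code. This assumes watchdog passes the error code as the first argument, as its man page describes. repair_plan and the action names it prints are placeholders, and a real script would run the actual fix instead of echoing a label.

```shell
#!/bin/sh
# repair_plan ERROR_CODE
# Prints the action this sketch would take. Returns 0 if it knows a fix,
# non-zero if the only option left is to let watchdog reboot.
repair_plan() {
    case "$1" in
        100|101)   # ENETDOWN / ENETUNREACH on Linux
            echo "restart-network"
            return 0
            ;;
        *)
            echo "no-known-fix"
            return 1
            ;;
    esac
}

# A real repair script would end with:  repair_plan "$1"
```

Returning non-zero for anything the script does not recognize keeps the reboot as the fallback, which matches the "last-ditch effort" idea.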
Watchdog Device (Optional)
As said before, this device is written to every second by watchdog, and if it’s not written to within 60 seconds, the machine will reboot.
First, we set watchdog to use the watchdog device in
watchdog-device = /dev/watchdog
In /etc/sysconfig/watchdog, we set the service to load the softdog module, which will create the watchdog device. This is with
It is advisable to set it to testing mode first, or you may get kicked out by a random reboot. To do so, add the --no-action option to the watchdog_options in /etc/sysconfig/watchdog. This will cause watchdog to only say it’s going to reboot, without actually rebooting. And we will add the -v option so we can see the checks that are happening.
As a reminder, make sure you’ve disabled the watchdog service, so it won’t keep on rebooting your machine if something is not configured properly.
You may have to add this line, as it’s usually not there.
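If you would rather script the change than edit the file by hand, something like the following sketch would do it. enable_test_mode is a made-up helper; it rewrites any existing watchdog_options line (commented out or not) or appends one, and it relies on GNU sed's -i flag.

```shell
#!/bin/sh
# enable_test_mode FILE
# Make sure FILE contains watchdog_options="-v --no-action", replacing an
# existing watchdog_options line or appending one if none is present.
enable_test_mode() {
    line='watchdog_options="-v --no-action"'
    if grep -q '^[#[:space:]]*watchdog_options=' "$1"; then
        sed -i "s|^[#[:space:]]*watchdog_options=.*|$line|" "$1"
    else
        printf '%s\n' "$line" >> "$1"
    fi
}

# Example:
# enable_test_mode /etc/sysconfig/watchdog
```

Running it twice leaves a single line in place, so it is safe to keep in a provisioning script.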
Now that you’ve configured it, go ahead and start it. (Or restart it if it’s running)
systemctl start watchdog
# or
service watchdog start
And watch the logs
You should see something like
... watchdog: currently there are 99748 + 1749500 kB of free memory+swap available
... watchdog: got answer on ping=1 from target w.x.y.z time=2.805ms
... watchdog: still alive after 17 interval(s)
... watchdog: current load is 0 0 0
Check that everything is working properly, and fix any errors or problems.
Check watchdog device
If you enabled the watchdog device, then check that the softdog module was loaded
lsmod | grep softdog
It should list the module in the output.
softdog 16384 2
And check that the watchdog device exists.
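A plain file test is enough for this, and it avoids opening the device, which matters because opening a watchdog device can start its countdown. A minimal sketch; is_char_device is a made-up helper name, and the path in the example assumes the /dev/watchdog device configured earlier.

```shell
#!/bin/sh
# is_char_device PATH
# Succeeds if PATH exists and is a character device.
# This only inspects the file type; it never opens the device.
is_char_device() {
    [ -c "$1" ]
}

# Example:
# is_char_device /dev/watchdog && echo "watchdog device is present"
```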
Once you’ve got it set up the way you want, and there aren’t any issues, we get to run it for real. In /etc/sysconfig/watchdog, either comment out or remove the no action line:
# watchdog_options="-v --no-action"
And restart watchdog
systemctl restart watchdog
Check the logs for any issues. Once you’re set, enable watchdog.
systemctl enable watchdog
And you’re set.
No way out (Optional)
The watchdog device can be configured to always require a write, meaning that its countdown can’t be stopped, even if watchdog is shut down properly. This can prevent someone from shutting down the service, which will close the device, undoing all your hard work.
Or just because you want to.
This is accomplished by using the softdog option nowayout, making it so once loaded, the countdown cannot be stopped.
Be careful with this option, as it will reboot your machine if you stop the daemon.
Set it in the watchdog_module option in /etc/sysconfig/watchdog, or add the option if it is not there.
If you want to get really serious, you could add it to /etc/modules-load.d/softdog.conf. However, that means that it will be loaded at system startup. If the watchdog service doesn’t start, the machine gets rebooted.
# /etc/modules-load.d/softdog.conf
softdog nowayout
Don’t blindly add that last one. Read the warning above it.
Watchdog is a great addition to help keep the system running, along with the sysctl options for kernel panics.