Kernel Panics and Lockups

Kernel panics and lockups happen for various reasons. From bad hardware to kernel bugs to overheating to cosmic rays, there’s a lot that can cause a machine to stop responding.

Which is really a pain if your machine is far away and you want High Availability. Your machine locks up, and now you have to reboot it manually.

(If your machine has locked up, you can try the Magic SysRq keys to reboot it safely.)

That is, unless you set a few settings in sysctl. We can tell the kernel to reboot in the event of a panic rather than just staying locked up, which is the default.

On systemd systems, sysctl settings go in the /etc/sysctl.d folder, in whatever file you decide, as long as its name ends in .conf

This first setting tells the machine how long to wait after a panic before rebooting (0 = disabled, meaning it never reboots on its own).

# Reboot this many seconds after panic
kernel.panic = 20
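
To see what the running kernel is currently using, the value can be read back from /proc/sys. This is only a minimal sketch; the helper names sysctl_path and sysctl_get are made up for this example and are not part of any standard tool.

```shell
#!/bin/sh
# Hypothetical helpers for reading a sysctl value from /proc/sys.

# Map a sysctl key such as "kernel.panic" to its /proc/sys path.
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Print the live value of a key, or "unavailable" if it cannot be read.
sysctl_get() {
    path=$(sysctl_path "$1")
    if [ -r "$path" ]; then
        cat "$path"
    else
        echo "unavailable"
    fi
}

echo "kernel.panic = $(sysctl_get kernel.panic)"
```

Running sysctl -n kernel.panic should report the same number on systems that have the sysctl command installed.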

There are other things that can cause a machine to lock up or become unstable. Some of them will even leave a machine responsive to pings and network heartbeat monitors, while programs crash and internal systems lock up.

If you want the machine to automatically reboot, make sure you set kernel.panic to something above 0. Otherwise these settings could cause a hung machine that you will have to reboot manually.

This first one controls what the kernel will do if it receives an NMI (non-maskable interrupt) caused by an I/O error. Set this to ‘1’ to enable panicking if that error happens.

kernel.panic_on_io_nmi = 1

And for those pesky hung tasks, this one will panic (and so reboot) the machine if there’s a process that hangs for longer than the given number of seconds.

kernel.hung_task_panic = 1
kernel.hung_task_timeout_secs = 300
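
The hung task check is about tasks stuck in uninterruptible sleep, which ps shows as state D. Before turning the panic on, it can help to see whether the machine has any such tasks right now. A rough sketch, assuming a procps-style ps; filter_d_state is a made-up helper name.

```shell
#!/bin/sh
# Keep only "STATE PID COMMAND" lines whose state starts with D
# (uninterruptible sleep). Reads ps-style lines on stdin.
filter_d_state() {
    awk '$1 ~ /^D/ { print }'
}

# List tasks currently in state D; prints nothing if there are none.
# Errors are ignored on systems whose ps lacks these options.
ps -eo state=,pid=,comm= 2>/dev/null | filter_d_state || true
```

A task that shows up here briefly is normal (disk or NFS waits, for example). Only one that stays in state D past the timeout counts as hung.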

This one is what a lot of people could use. Ever had a machine run out of memory? It’s the worst. It’s slow, or even non-responsive. This setting will cause the machine to panic and reboot rather than just trying to kill processes that might just be respawning.

# 0=no | 1=usually | 2=always
vm.panic_on_oom=2

For more on diagnosing oom errors and fixing them, go here
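
As a quick first check, the kernel log can be searched for past out-of-memory events. This is a sketch only: count_oom_lines is a made-up helper, and the patterns assume the usual "Out of memory" and "invoked oom-killer" wording, which can vary between kernel versions.

```shell
#!/bin/sh
# Count kernel log lines that mention the OOM killer. Reads log text on stdin.
# grep -c exits non-zero when the count is 0, so that case is tolerated.
count_oom_lines() {
    grep -Eci 'out of memory|oom-killer' || true
}

# dmesg may need root on some systems; errors are ignored here.
dmesg 2>/dev/null | count_oom_lines
```

A count above zero means the kernel has already been killing processes since boot, which is worth looking into before deciding whether a reboot is the better response.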

There are also times when memory errors are detected. These are just bad. Ya, you should fix the issue, but the current processes may have become unstable. And what better message that you have an issue than a reboot? Well, maybe there’s a better way (like grepping logs and emailing).

kernel.panic_on_unrecovered_nmi=1

This will cause oops to become panics. While usually only used for development, this can be very useful as oops are errors that don’t cause a crash but can cause a system to become unstable. This page and this page say it’s okay, especially when high availability is on the line, but I’ve had systems that would randomly throw oops but keep on running. Use with caution and research. And fix the issue ASAP.

kernel.panic_on_oops=1

You can put the whole setup in /etc/sysctl.d/panic.conf (or /etc/sysctl.conf if you don’t have a sysctl.d and don’t use systemd):

# Reboot this many seconds after panic
kernel.panic = 20

# Panic if the kernel detects an I/O channel
# check (IOCHK). 0=no | 1=yes
kernel.panic_on_io_nmi = 1

# Panic if a hung task was found. 0=no, 1=yes
kernel.hung_task_panic = 1

# Setup timeout for hung task,
# in seconds (suggested 300)
kernel.hung_task_timeout_secs = 300

# Panic on out of memory.
# 0=no | 1=usually | 2=always
vm.panic_on_oom=2

# Panic when the kernel detects an NMI
# that usually indicates an uncorrectable
# parity or ECC memory error. 0=no | 1=yes
kernel.panic_on_unrecovered_nmi=1

# Panic if the kernel detects a soft-lockup
# error (1). Otherwise it lets the watchdog
# process skip its update (0)
# kernel.softlockup_panic=0

# Panic on oops too. Use with caution.
# 0=no | 1=yes
# kernel.panic_on_oops=1

Edit according to preference, then you can load it:

root@host# sysctl -p /etc/sysctl.d/panic.conf

– or –

root@host# sysctl -p /etc/sysctl.conf

The system will automatically load this when it first boots up.
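
To confirm the settings actually took effect, each value in the file can be compared with what the running kernel reports under /proc/sys. A minimal sketch: parse_sysctl_conf and check_conf are made-up helper names, and the file path is just the example name used above.

```shell
#!/bin/sh
# Print "key value" pairs from a sysctl-style config read on stdin,
# skipping blank lines and comment lines that start with # or ;.
parse_sysctl_conf() {
    awk -F= '
        /^[ \t]*([#;]|$)/ { next }
        {
            key = $1; val = $2
            gsub(/[ \t]/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            print key, val
        }'
}

# Compare each configured value with the live value under /proc/sys.
check_conf() {
    parse_sysctl_conf < "$1" | while read -r key want; do
        path="/proc/sys/$(printf '%s' "$key" | tr '.' '/')"
        if [ -r "$path" ]; then
            have=$(cat "$path")
        else
            have="unavailable"
        fi
        if [ "$have" = "$want" ]; then
            echo "ok       $key = $have"
        else
            echo "MISMATCH $key: wanted $want, got $have"
        fi
    done
}

# Example file name from this post; adjust to wherever you saved yours.
conf=/etc/sysctl.d/panic.conf
if [ -r "$conf" ]; then
    check_conf "$conf"
fi
```

Any MISMATCH line means the file has not been loaded yet, a later file overrides it, or this kernel does not provide that setting.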