Uninterruptible Sleep

One of the curious features of Unix systems (including Linux) is the "uninterruptible sleep" state. This is a state that a process can enter when doing certain system calls. In this state, the process is blocked performing a system call, and the process cannot be interrupted (or killed) until the system call completes. Most of these uninterruptible system calls are effectively instantaneous, meaning that you never observe the uninterruptible nature of the system call. In rare cases (often because of buggy kernel drivers, but possibly for other reasons) the process can get stuck in this uninterruptible state. This is very similar to the zombie process state in the sense that you cannot kill a process in this state, although it's worth noting that the two cases happen for different reasons. Typically when a process is wedged in the uninterruptible sleep state your only recourse is to reboot the system, because there is literally no way to kill the process.
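
On Linux, processes in this state show up with a state of D in tools like ps and top, and the same letter is exposed in /proc/<pid>/stat. As a rough sketch of how you might look for wedged processes yourself, the following reads that field for every process. The function names parse_state and uninterruptible_pids are made up for illustration, and the proc_root parameter exists only so the code can be pointed at a fake directory when testing.

```python
import os


def parse_state(stat_line):
    """Return the one-letter process state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split on the LAST ')' in the line.
    _, _, rest = stat_line.rpartition(")")
    return rest.split()[0]


def uninterruptible_pids(proc_root="/proc"):
    """Return a sorted list of PIDs whose state is 'D' (uninterruptible sleep)."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []  # no procfs here (for example, not a Linux system)
    pids = []
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as f:
                state = parse_state(f.read())
        except (OSError, IndexError):
            continue  # the process exited while we were scanning
        if state == "D":
            pids.append(int(entry))
    return sorted(pids)


if __name__ == "__main__":
    print(uninterruptible_pids())
```

A process that shows up here briefly is normal, since it is usually just waiting on disk I/O. It's the ones that stay in the list across repeated runs that are worth worrying about.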

One infamous example of this has been Linux with NFS. For historical reasons certain local I/O operations are not interruptible. For instance, the mkdir(2) system call is not interruptible, which you can verify from its man page by observing that this system call cannot return EINTR. On a normal system the worst case for mkdir would be a few disk seeks, which isn't exactly fast but isn't the end of the world either. On a networked filesystem like NFS this operation can involve network RPC calls that can block, potentially forever. This means that if you get the right kind of horkage under NFS, a program that calls mkdir(2) can get stuck in the dreaded uninterruptible sleep state forever. When this happens there's no way to kill the process, and the operator has to either live with this zombie-like process or reboot the system. The Linux kernel programmers could "fix" this by making the mkdir(2) system call interruptible so that mkdir(2) could return EINTR. However, historical Unix systems since the dawn of time haven't returned EINTR for this system call, so Linux adopts the same convention.

This was actually a big problem for us at my first job out of college at Yelp. At the time we had just taken the radical step of moving images out of MySQL tables, where the raw image data was stored in a BLOB column, and into NFS served from cheap, unreliable NFS appliances. Under certain situations the NFS servers would lock up and processes accessing NFS would start entering uninterruptible sleep as they did various I/O operations. When this happened, very quickly (e.g. in a minute or two) every single Apache worker would end up servicing a request that did one of these I/O operations, and thus 100% of the Apache workers would become stuck in the uninterruptible sleep state. This would quite literally bring down the entire site until we rebooted everything. We eventually "solved" this problem by dropping the NFS dependency and moving things to S3.

Another fun fact about the uninterruptible sleep state is that occasionally it may not be possible to strace a process in this state. The man page for the ptrace system call notes that under rare circumstances attaching to a process using the ptrace system call can cause the traced process to be interrupted. If the process is in uninterruptible sleep then the process can't be interrupted, which will cause the strace process itself to hang forever. Remarkably, it appears that the ptrace(2) system call is itself uninterruptible, which means that if this happens you may not be able to kill the strace process!

Tonight I learned about a "new" feature in Linux: the TASK_KILLABLE state. This is sort of a compromise between processes in interruptible sleep and processes in uninterruptible sleep. A process in the TASK_KILLABLE state still cannot be interrupted in the usual sense (i.e. you can't force the system call to return EINTR); however, processes in this state can be killed. This means that, for instance, processes doing I/O over NFS can be killed if they get into a wedged state. Not all system calls implement this state, so it's still possible to end up with stuck, unkillable processes for some system calls, but it's certainly an improvement over the previous situation. As usual, LWN has a great article on the subject, including information about the historical semantics of uninterruptible sleep on Linux.
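
From userspace, the practical difference is whether SIGKILL actually makes the process go away. Here is a minimal sketch of how a parent could check that for one of its own children: send SIGKILL and see whether the child can be reaped within a timeout. The helper name kill_and_wait and the five second default are my own choices for illustration, not anything standard.

```python
import subprocess
import sys


def kill_and_wait(proc, timeout=5.0):
    """Send SIGKILL to a child process; return True if it exited within timeout seconds."""
    proc.kill()  # sends SIGKILL on POSIX systems
    try:
        proc.wait(timeout=timeout)  # also reaps the child, so no zombie is left behind
    except subprocess.TimeoutExpired:
        # The signal was delivered but the child has not died yet, which is
        # what you would expect from a process stuck in uninterruptible sleep.
        return False
    return True


if __name__ == "__main__":
    # A child that is merely sleeping is in ordinary interruptible sleep,
    # so it should die right away.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(kill_and_wait(child))
```

A child wedged in plain uninterruptible sleep would leave the SIGKILL pending and this would return False, whereas a child blocked in the TASK_KILLABLE state should exit and get reaped like any other process.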