Unix systems define the following four related system calls:
int mlock(const void *addr, size_t len);
int munlock(const void *addr, size_t len);
int mlockall(int flags);
int munlockall(void);
These calls are frequently misunderstood and abused. The purpose of mlock() is to "lock" one or more pages of memory into RAM. These "locked" pages will not be swapped out to the swap area under any condition. As you might guess, munlock() is analogous and lets you unlock pages that were previously locked. The system calls mlockall() and munlockall() are similar, but instead say "lock all of the pages for my process into RAM, no matter what". The flags argument to mlockall() controls whether or not future pages are locked as well.
The Linux man pages for these system calls actually do a great job of explaining the details of how and why to use them in the "Notes" section, which of course few people read. In this article I'm going to do my best to explain what these calls are for and why you may be using them wrong.
What People Think They're For
Some people think that these system calls are a good way to improve the performance of a high-performance process on a system. A common use case I've seen in the real world is to call mlockall() in a program that's supposed to be running with very high performance. The reasoning is that if the program is paged out to disk, that will reduce performance; therefore mlockall() will improve things.
If you try to actually use mlockall() in this way you might run into some difficulties, because most systems have a very low default ulimit on the number of pages a process can lock. With some twiddling of the default ulimits you can get this working, but perhaps it's worth considering why the default ulimits are so low in the first place.
What They're Actually For
There are two real use cases for mlock():
- programs that need to store passwords or sensitive decrypted data in memory
- programs that need to operate in a real-time environment
Let's consider these separately. The first case is actually the most common legitimate use of mlock() that I've seen.
Consider how the ssh-agent program works. On your computer you have an SSH key which hopefully you've configured to be encrypted with a password (if you haven't done this, go do it now). Ordinarily this would mean that every time you SSH to a host you'd have to type your SSH password to decrypt the key for the SSH program. This would be tedious and annoying. Therefore the OpenSSH client ships with a program called ssh-agent. The agent's job is to store decrypted SSH keys in memory securely, and then automatically pass them to the SSH program when needed. Therefore you only have to decrypt the SSH key once, and from then on the decrypted key will be stored in memory in the ssh-agent process.
Imagine there was someone who really wanted to steal your SSH key. In the scheme I just described, here's one attack that they might do:
- force the machine that has ssh-agent running to start swapping heavily
- wait until the ssh-agent process is paged out to the swap area
- power off your computer
- take out the hard drive, connect it to another computer, and then examine the pages in the swap area until they find ssh-agent's pages
- try to identify which page has the decrypted key (e.g. by testing all possible keys aligned on page boundaries, by looking for characteristics of the struct holding decrypted keys, etc.)
This is definitely a real attack. But think about what it requires:
- the attacker doesn't have root on your machine (because if they did they could trivially inspect the memory of the ssh-agent process, or connect to the ssh-agent local socket)
- the attacker has a way to force your machine to swap heavily
- the attacker can force your machine to be powered off and access the hard drive
In real life, these scenarios are very unlikely. So while it's possible in theory that someone could attack your machine in this way, it's probably not going to happen. If the ssh-agent process calls mlock() on the pages of memory that contain the decrypted keys, then the attack I described is not possible, because the sensitive pages will never be paged out to the swap area. If you were writing a program to do something like this in an interpreted language like Python or Java you'd likely have to use mlockall(), since you generally can't control which pages objects will be allocated in, and in some VMs (e.g. the Oracle JVM) objects can be relocated anyway.
The other use of mlock() is for real-time applications, meaning applications that are hard real-time. For instance, imagine you're using a Linux program to control an industrial welding laser. If some pages of your application get swapped out to disk, then fetching those pages back into memory could cause your industrial welding laser to delay switching on or off, which in some cases could be disastrous. This is absolutely a valid case for mlock() or mlockall().
Here's the catch with the real-time use case. The Linux kernel itself is not real-time. There are a ton of algorithms in the kernel that have time complexity worse than O(1). There is a set of out-of-tree real-time patches for the kernel that is maintained at rt.wiki.kernel.org.
So you can use mlock() or mlockall() for your real-time program if all of the following conditions apply:
- your program is written in a language that has hard RT characteristics, e.g. a language like C or C++ that isn't garbage collected, or in a language that has deterministic GC pause times
- you're running on a version of the Linux kernel that has the out-of-tree RT patches applied
Note that in all of these cases, you can also just turn off the swap area and get the same effect.
Alternatives On Linux
It's reasonable to say: I have a program that's supposed to be running with very high performance, and I'd like it to run fast even if the system is heavily under load or if there is memory pressure.
First, you need to understand that there are actually two different things that can cause your program to run slowly if the system is under load. The system can be scheduling your process too infrequently, or you could be experiencing problems due to swapping. Let's examine these separately.
High Load/Scheduling Problems
If the system is under high load then you'll get poor performance because the kernel has to run a bunch of other programs in addition to yours. Since it can't run them all at once, your program will only get occasional CPU time, and it will therefore run slowly.
The very first thing you should look at in this situation is the "niceness" value for your process. The default niceness is 0. If you decrease the niceness then your program will get priority when the system is under load, which will improve performance.
If you need something more powerful than nice, have you looked at all of the sched(7) options? On Linux there are a bunch of advanced system calls that let you get fine-tuned control over your scheduling priority. Of course, this assumes that you're not using a language with non-deterministic GC pause times in the first place (I'm looking at you, Java).
Swapping/Memory Pressure Issues
The second issue is swapping, which is what you'd be concerned about if you're looking at mlock() or mlockall() in the first place. First, make sure that your production systems aren't regularly swapping.
Additionally you might consider looking at the "swappiness" value in the sysctl vm.swappiness, which can also be accessed via procfs at /proc/sys/vm/swappiness. If you're running a process that is expected to consume nearly all of the memory on the system then you should lower this value, e.g. to 10. If you're running MySQL with the InnoDB buffer pool and other buffers set to use nearly all of the machine's memory, e.g. you expect to be running at 95% or higher memory usage, you want to do this. The same applies if you're running the JVM and you've set the max heap size to be nearly all of the system memory with just a small amount left over for system processes.
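To make that setting persistent you'd drop a sysctl fragment like the following (the file name is just illustrative):

```
# /etc/sysctl.d/99-swappiness.conf (file name is illustrative)
# Prefer reclaiming page cache over swapping out anonymous pages.
vm.swappiness = 10
```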
You can also set the swappiness at a per-process level using cgroups, via the memory.swappiness control file in the cgroup. This file has the same meaning/format as the sysctl. If you use this technique you can make a process like MySQL less likely to swap out than other processes on the host.
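As a sketch, with the cgroup v1 memory controller mounted in the usual place (note that memory.swappiness is a v1 knob; cgroup v2 dropped it), that might look like:

```
# Paths and the MYSQLD_PID variable are illustrative; requires root.
mkdir /sys/fs/cgroup/memory/mysql
echo 1 > /sys/fs/cgroup/memory/mysql/memory.swappiness
echo "$MYSQLD_PID" > /sys/fs/cgroup/memory/mysql/cgroup.procs
```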
If you have proper limits on your application's max heap size (or equivalently, its RSS size) and are running a limited set of other things on the system, then you should be able to set the swappiness so you won't ever swap at all.
Disabling Swapping
This is a bit controversial, but I'm a fan of disabling swapping completely on hosts that have redundancy in production. For instance, on your application servers or database slaves (but maybe not on your database master).
My reasoning here is that in production systems swapping is generally an errant condition that should be treated as a hard failure. For instance, let's say you have an application with a memory leak. If the application is leaking memory then the amount of memory it is trying to use will grow and grow indefinitely. At some point this will cause swapping to occur. More and more pages will be paged out to the swap partition, and things will get slow. The application will keep leaking memory. At some point all of the space in the swap partition will be exhausted, and the kernel OOM killer will decide to start killing processes---likely your application that is leaking memory.
In this situation the process is going to get OOM killed no matter what. If you have swapping enabled then the process will get really slow and then get OOM killed. If you don't have swapping enabled then the process will get OOM killed without getting really slow.
You can also avoid this problem using cgroups and limiting the amount of RSS memory the process can use. But this is a little more work and requires tuning on a per-application basis, whereas just disabling swap will cause errant processes to be quickly killed.
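On a cgroup v2 system, for instance, that cap is a single write (path and limit are illustrative):

```
# cgroup v2: hard memory cap; beyond this the app is reclaimed or OOM-killed
echo 4G > /sys/fs/cgroup/myapp/memory.max
```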
Why Misusing mlock(2) and mlockall(2) Is Really Bad
But besides all this, why is mlockall() bad?
Let's assume you have your production machines configured correctly and you don't normally have any pages swapped out. When you type free you see 0 in the swap used column. Then something bad happens, and you get into the unexpected situation where your prod machine is swapping.
First, the Linux kernel will use an LRU algorithm to choose what pages to swap out. That means that it will start swapping out the pages that have been least recently used first. This is almost always the best algorithm. Under certain pathological page access conditions it can fail (e.g. scanning pages in a huge circular buffer), but this would be really unusual.
This means that if the kernel decides to swap out one of the pages of your allegedly important process, it probably means the page that you think is so important actually isn't important---because you're not using it! There are a ton of reasons that programs might allocate pages that subsequently don't get used at all, so assuming that every page in your program is sacred and cannot be paged out is a bad assumption.
Second, how do you know that your process's pages are really more important than another process's pages? I'll give you an example. Let's say you're running a high-performance server process that's supposed to use nearly all of the system's memory, and you've left a small amount of extra memory remaining for critical system processes like init, cron, syslog, and so forth. If you mlockall() your program, then in the unexpected case where swapping occurs you'll cause these processes to start swapping instead. This can be really bad.
For instance, let's say your use of mlockall() ends up causing syslog to swap. Then all applications logging to syslog will slow down, since syslog will be slow. This can compound performance problems on the rest of the system, since normally logging to syslog is synchronous. In fact, it can be worse if you have a "clever" application that logs to syslog asynchronously from another thread. Such an application will likely use a thread-safe queue to send messages to the logger thread, and the logger thread will read from the queue and then log to syslog. If syslog gets slow then this queue could start growing, which will increase memory pressure even further. In the worst case the queue would be unbounded, which would eventually lead to out-of-memory conditions.
This is clearly a pretty complex issue with a lot of factors to consider. If you're considering using mlockall() for reasons related to throughput and not for security or true real-time applications, I'd encourage you to consider the ramifications of mlockall() and the alternatives I presented above.