Unix systems define the following four related system calls:
int mlock(const void *addr, size_t len);
int munlock(const void *addr, size_t len);
int mlockall(int flags);
int munlockall(void);
These calls are frequently misunderstood and abused.
The purpose of
mlock() is to “lock” one or more pages of memory into RAM.
These “locked” pages will not be swapped out to the swap area under any condition.
As you might guess,
munlock() is analogous and lets you unlock pages that were
previously locked. The system calls
mlockall() and munlockall() are similar but
instead say “lock (or unlock) all of the pages for my process into RAM, no matter what”. The
flags argument to
mlockall() controls whether or not future pages are locked as well.
The Linux man pages for
these system calls actually do a great job of explaining the details of how and
why to use these system calls in the “Notes” section, which of course few people
read. I’m going to do my best in this article to explain what these calls are
for and why you may be using them wrong.
What People Think They’re For
Some people think that these system calls are a good way to improve the
performance of a high-performance process on a system. A common use case I’ve
seen in the real world is to try to call
mlockall() on a program that’s
supposed to be running with very high performance. The reasoning is that if the
program is paged out to disk, that will reduce performance; therefore
mlockall() will improve things.
If you try to actually use
mlockall() in this way you might run into some
difficulties because most systems have a very low default ulimit on the number
of pages a process can lock. With some twiddling of the default ulimits you can
get this working, but perhaps it’s worth considering why the default ulimits are
so low in the first place.
What They’re Actually For
There are two real use cases for using mlock():
- programs that need to store passwords or sensitive decrypted data in memory
- programs that need to operate in a real-time environment
Let’s consider these separately. The first case is actually the most common
legitimate use of
mlock() that I’ve seen.
Consider how the
ssh-agent program works. On your computer you have an SSH key
which hopefully you’ve configured to be encrypted with a password (if you
haven’t done this, go do it now). Ordinarily this would mean that every time you
have to SSH to a host you’d have to type your SSH password to decrypt it for the
SSH program. This would be tedious and annoying. Therefore the OpenSSH client
library ships with a program called
ssh-agent. The agent’s job is to store
decrypted SSH keys in memory securely, and then automatically pass them to the
SSH program when needed. This way you only have to decrypt the SSH key once,
and from then on the decrypted key is stored in memory by the ssh-agent process.
Imagine there was someone who really wanted to steal your SSH key. In the scheme
I just described, here’s one attack that they might do:
- force the machine that has
ssh-agent running to start swapping heavily
- wait until the
ssh-agent process is paged out to the swap area
- power off your computer
- take out the hard drive, connect it to another computer, and then examine the
pages in the swap area until they find ssh-agent’s pages
- try to identify which page has the decrypted key (e.g. by testing all
possible keys aligned on page boundaries, by looking for characteristics of
the struct holding decrypted keys, etc.)
This is definitely a real attack. But think about what it requires:
- the attacker doesn’t have root on your machine (because if they did they
could trivially inspect the memory of the ssh-agent process, or connect to
the ssh-agent local socket)
- the attacker has a way to force your machine to swap heavily
- the attacker can force your machine to be powered off and access the hard drive
In real life, these scenarios are very unlikely. So while it’s possible in
theory that someone could attack your machine in this way, it’s probably not
going to happen in real life. If the
ssh-agent process calls
mlock() on the
pages of memory that contain the decrypted keys then the attack I described is
not possible because the sensitive pages will never be paged out to the swap
area. If you were writing a program to do something like this in an interpreted
language like Python or Java you’d likely have to use
mlockall() since you
generally can’t control what pages objects will be allocated in, and in some VMs
(e.g. the Oracle JVM) objects can be relocated anyway.
The other use of
mlock() is for real-time applications. This means
applications that are hard real-time. For instance, imagine you’re using a Linux
program to control an industrial welding laser. If some pages of your
application get swapped out to disk, then fetching the pages back to memory
could cause your industrial welding laser to delay switching on or off, which in
some cases could be disastrous. This is absolutely a valid case for using mlock().
Here’s the catch with the real-time use case. The Linux kernel itself is not
real-time. There are a ton of algorithms in the kernel that have time complexity
worse than O(1). There is a set of out-of-tree real-time patches for the
kernel, maintained at rt.wiki.kernel.org.
So you can use
mlockall() for your real-time program if all of
the following conditions apply:
- your program is written in a language that has hard RT characteristics, e.g.
a language like C or C++ that isn’t garbage collected, or in a language that
has deterministic GC pause times
- you’re running on a version of the Linux kernel that has the out-of-tree RT patches applied
Note that in all of these cases, you can also just turn off the swap area and
get the same effect.
Alternatives On Linux
It’s reasonable to say: I have a program that’s supposed to be running with very
high performance, and I’d like it to run fast even if the system is heavily
under load or if there is memory pressure.
First, you need to understand that there are actually two different things that
are going to cause your program to run slowly if the system is under load. The
system can be scheduling your process too infrequently, or you could be
experiencing problems due to swapping. Let’s examine these separately.
High Load/Scheduling Problems
If the system is under high load then you’ll get poor performance because the
kernel has to run a bunch of other programs in addition to yours. Since it can’t
run them all at once, your program will only get occasional CPU time, and it
will therefore run slowly.
The very first thing you should look at in this situation is the “niceness”
value for your process. The default niceness is 0. If you decrease the niceness
then your program will get priority when the system is under load, which will help.
If you need something more powerful than nice, have you looked at all of the
sched(7) options? On Linux
there are a bunch of advanced system calls that let you get fine-tuned control
over your scheduling priority. Of course, this assumes that you’re not using a
language with non-deterministic GC pause times in the first place (I’m looking
at you, Java).
Swapping/Memory Pressure Issues
The second issue is swapping, which is what you’d be concerned about if you’re
considering mlockall() in the first place. First, make sure that
your production systems aren’t regularly swapping.
Additionally you might consider looking at the “swappiness” value in the sysctl
vm.swappiness which can also be accessed via procfs at
/proc/sys/vm/swappiness. If you’re running a process that is expected to
consume nearly all of the memory on the system then you should lower this
value, e.g. to 10. If you’re running MySQL with the InnoDB buffer pool and other
buffers set to use nearly all of the machine’s memory, e.g. you expect to be
running at 95% or higher memory usage, you want to do this. The same applies if
you’re running the JVM and you’ve set the max heap size to be nearly all of the
system memory with just a small amount left over for system processes.
You can also set the swappiness on a per-process level using cgroups, using the
memory.swappiness control file in the cgroup. This file has the same
meaning/format as the sysctl. If you use this technique you can make it so a
process like MySQL is less likely to swap out than other processes on the host.
If you have proper limits on your application’s max heap size (or equivalently,
its RSS size) and are running a limited set of other things on the system, then
you should be able to set the swappiness so you won’t ever swap at all.
This is a bit controversial, but I’m a fan of disabling swapping completely on
hosts that have redundancy in production. For instance, on your application
servers or database slaves (but maybe not on your database master).
My reasoning here is that in production systems swapping is generally an errant
condition that should be treated as a hard failure. For instance, let’s say you
have an application with a memory leak. If the application is leaking memory
then the amount of memory it is trying to use will grow and grow indefinitely.
At some point this will cause swapping to occur. More and more pages will be
paged out to the swap partition, and things will get slow. The application will
keep leaking memory. At some point all of the space in the swap partition will
be exhausted, and the kernel OOM killer will decide to start killing
processes—likely your application that is leaking memory.
In this situation the process is going to get OOM killed no matter what. If you
have swapping enabled then the process will get really slow and then get OOM
killed. If you don’t have swapping enabled then the process will get OOM killed
without getting really slow.
You can also avoid this problem using cgroups, by
limiting the amount of RSS memory the process can use. But this is a little more
work and requires tuning on a per-application basis, whereas just disabling swap
will cause errant processes to be quickly killed.
Why Misusing mlock(2) and mlockall(2) Is Really Bad
But besides all this, why is misusing these calls actually harmful?
Let’s assume you have your production machines configured correctly and you
don’t normally have any pages swapped out. When you type
free you see 0 under
the swap pages used. Then something bad happens, and you get into the unexpected
situation where your prod machine is swapping.
First, the Linux kernel will use an LRU algorithm to choose what pages to swap
out. That means that it will start swapping out the pages that have been least
recently used first. This is almost always the best algorithm. Under certain
pathological page access conditions it can fail (e.g. scanning pages in a huge
circular buffer), but this would be really unusual.
This means that if the kernel decides to swap out one of the pages of your
allegedly important process, it probably means the page that you think is so
important actually isn’t important—because you’re not using it! There are a
ton of reasons that programs might allocate pages that subsequently don’t get
used at all, so assuming that every page in your program is sacred and cannot be
paged out is a bad assumption.
Second, how do you know that your process’ pages are really more important
than another process’s pages? I’ll give you an example. Let’s say you’re
running a high-performance server process that’s supposed to use nearly all of
the system’s memory, and you’ve left a small amount of extra memory remaining
for critical system processes like
syslog, and so forth. If you
mlockall() your program, then in the unexpected case where swapping occurs
you’ll cause these processes to start swapping instead. This can be really bad.
For instance, let’s say your use of
mlockall() ends up causing syslog to swap.
Then all applications logging to syslog will slow down, since syslog will be slow.
This can compound performance problems on the rest of the system since normally
logging to syslog is synchronous. In fact, it can be worse if you have a
“clever” application that is asynchronously logging to syslog with another
thread. The way such an application will be written is that it will likely use a
thread-safe queue to send messages to the logger thread, and the logger thread
will read from the queue and then log to syslog. If syslog gets slow then this
queue could start growing which will increase memory pressure even further. In
the worst case the queue would be unbounded, which would eventually lead to memory exhaustion.
This is clearly a pretty complex issue with a lot of factors to consider. If
you’re considering using
mlockall() for reasons related to throughput and not
for security or true real-time applications, I’d encourage you to skip
mlockall() and use the alternatives I presented above.