Unix systems define the following four related system calls:
int mlock(const void *addr, size_t len);
int munlock(const void *addr, size_t len);
int mlockall(int flags);
int munlockall(void);
These calls are frequently misunderstood and abused. The purpose of mlock() is to "lock" one or more pages of memory into RAM. These "locked" pages will not be swapped out to the swap area under any condition. As you might guess, munlock() is analogous and lets you unlock pages that were previously locked. The system calls mlockall() and munlockall() are similar, but instead say "lock all of the pages for my process into RAM, no matter what". The flags argument to mlockall() controls whether or not future pages are locked as well.
The Linux man pages for these system calls actually do a great job of explaining the details of how and why to use them in the "Notes" section, which of course few people read. In this article I'm going to do my best to explain what these calls are for and why you may be using them wrong.
What People Think They're For
Some people think that these system calls are a good way to improve the performance of a high-performance process on a system. A common use case I've seen in the real world is to call mlockall() in a program that's supposed to be running with very high performance. The reasoning is that if the program is paged out to disk, that will reduce performance; therefore mlockall() will improve things.
If you try to actually use mlockall() in this way you might run into some difficulties, because most systems have a very low default ulimit on the number of pages a process can lock. With some twiddling of the default ulimits you can get this working, but perhaps it's worth considering why the default ulimits are so low in the first place.
What They're Actually For
There are two real use cases for mlock():
- programs that need to store passwords or sensitive decrypted data in memory
- programs that need to operate in a real-time environment
Let's consider these separately. The first case is actually the most common legitimate use of mlock() that I've seen.
Consider how the ssh-agent program works. On your computer you have an SSH key which hopefully you've configured to be encrypted with a password (if you haven't done this, go do it now). Ordinarily this would mean that every time you SSH to a host you'd have to type your SSH password to decrypt the key for the SSH program. This would be tedious and annoying. Therefore the OpenSSH client ships with a program called ssh-agent. The agent's job is to store decrypted SSH keys in memory securely, and then automatically pass them to the SSH program when needed. Therefore you only have to decrypt the SSH key once, and from then on the decrypted key will be stored in memory in the ssh-agent process.
Imagine there was someone who really wanted to steal your SSH key. In the scheme I just described, here's one attack that they might do:
- force the machine that has ssh-agent running to start swapping heavily
- wait until the ssh-agent process is paged out to the swap area
- power off your computer
- take out the hard drive, connect it to another computer, and then examine the pages in the swap area until they find ssh-agent's pages
- try to identify which page has the decrypted key (e.g. by testing all possible keys aligned on page boundaries, by looking for characteristics of the struct holding decrypted keys, etc.)
This is definitely a real attack. But think about what it requires:
- the attacker doesn't have root on your machine (because if they did they could trivially inspect the memory of the ssh-agent process, or connect to the ssh-agent local socket)
- the attacker has a way to force your machine to swap heavily
- the attacker can force your machine to be powered off and access the hard drive
In real life, these scenarios are very unlikely. So while it's possible in theory that someone could attack your machine in this way, it's probably not going to happen. If the ssh-agent process calls mlock() on the pages of memory that contain the decrypted keys, then the attack I described is not possible, because the sensitive pages will never be paged out to the swap area. If you were writing a program to do something like this in an interpreted language like Python or Java you'd likely have to use mlockall(), since you generally can't control which pages objects will be allocated in, and in some VMs (e.g. the Oracle JVM) objects can be relocated anyway.
The other use of mlock() is for real-time applications, meaning applications that are hard real-time. For instance, imagine you're using a Linux program to control an industrial welding laser. If some pages of your application get swapped out to disk, then fetching those pages back into memory could cause your industrial welding laser to delay switching on or off, which in some cases could be disastrous. This is absolutely a valid case for mlock() or mlockall().
Here's the catch with the real-time use case. The Linux kernel itself is not real-time. There are a ton of algorithms in the kernel that have time complexity worse than O(1). There is a set of out-of-tree real-time patches for the kernel that is maintained at rt.wiki.kernel.org.
So you can use mlock() or mlockall() for your real-time program if all of the following conditions apply:
- your program is written in a language that has hard RT characteristics, e.g. a language like C or C++ that isn't garbage collected, or in a language that has deterministic GC pause times
- you're running on a version of the Linux kernel that has the out-of-tree RT patches applied
Note that in all of these cases, you can also just turn off the swap area and get the same effect.
Alternatives On Linux
It's reasonable to say: I have a program that's supposed to be running with very high performance, and I'd like it to run fast even if the system is heavily under load or if there is memory pressure.
First, you need to understand that there are actually two different things that can cause your program to run slowly if the system is under load. The system can be scheduling your process too infrequently, or you could be experiencing problems due to swapping. Let's examine these separately.
High Load/Scheduling Problems
If the system is under high load then you'll get poor performance because the kernel has to run a bunch of other programs in addition to yours. Since it can't run them all at once, your program will only get occasional CPU time, and it will therefore run slowly.
The very first thing you should look at in this situation is the "niceness" value for your process. The default niceness is 0. If you decrease the niceness then your program will get priority when the system is under load, which will improve performance.
If you need something more powerful than nice, have you looked at all of the sched(7) options? On Linux there are a bunch of advanced system calls that let you get fine-tuned control over your scheduling priority. Of course, this assumes that you're not using a language with non-deterministic GC pause times in the first place (I'm looking at you, Java).
Swapping/Memory Pressure Issues
The second issue is swapping, which is what you'd be concerned about if you're looking at mlock() or mlockall() in the first place. First, make sure that your production systems aren't regularly swapping.
Additionally you might consider looking at the "swappiness" value in the sysctl vm.swappiness, which can also be accessed via procfs at /proc/sys/vm/swappiness. If you're running a process that is expected to consume nearly all of the memory on the system then you should lower this value, e.g. to 10. If you're running MySQL with the InnoDB buffer pool and other buffers set to use nearly all of the machine's memory, e.g. you expect to be running at 95% or higher memory usage, you want to do this. The same applies if you're running the JVM and you've set the max heap size to be nearly all of the system memory with just a small amount left over for system processes.
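To make that setting persistent you'd drop a sysctl fragment like the following (the file name is just illustrative):

```
# /etc/sysctl.d/99-swappiness.conf (file name is illustrative)
# Prefer reclaiming page cache over swapping out anonymous pages.
vm.swappiness = 10
```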
You can also set the swappiness at a per-process level using cgroups, via the memory.swappiness control file in the cgroup. This file has the same meaning/format as the sysctl. If you use this technique you can make a process like MySQL less likely to swap out than other processes on the host.
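As a sketch, with the cgroup v1 memory controller mounted in the usual place (note that memory.swappiness is a v1 knob; cgroup v2 dropped it), that might look like:

```
# Paths and the MYSQLD_PID variable are illustrative; requires root.
mkdir /sys/fs/cgroup/memory/mysql
echo 1 > /sys/fs/cgroup/memory/mysql/memory.swappiness
echo "$MYSQLD_PID" > /sys/fs/cgroup/memory/mysql/cgroup.procs
```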
If you have proper limits on your application's max heap size (or equivalently, its RSS size) and are running a limited set of other things on the system, then you should be able to set the swappiness so you won't ever swap at all.
Disabling Swapping
This is a bit controversial, but I'm a fan of disabling swapping completely on hosts that have redundancy in production. For instance, on your application servers or database slaves (but maybe not on your database master).
My reasoning here is that in production systems swapping is generally an errant condition that should be treated as a hard failure. For instance, let's say you have an application with a memory leak. If the application is leaking memory then the amount of memory it is trying to use will grow and grow indefinitely. At some point this will cause swapping to occur. More and more pages will be paged out to the swap partition, and things will get slow. The application will keep leaking memory. At some point all of the space in the swap partition will be exhausted, and the kernel OOM killer will decide to start killing processes---likely your application that is leaking memory.
In this situation the process is going to get OOM killed no matter what. If you have swapping enabled then the process will get really slow and then get OOM killed. If you don't have swapping enabled then the process will get OOM killed without getting really slow.
You can also avoid this problem using cgroups and limiting the amount of RSS memory the process can use. But this is a little more work and requires tuning on a per-application basis, whereas just disabling swap will cause errant processes to be quickly killed.
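On a cgroup v2 system, for instance, that cap is a single write (path and limit are illustrative):

```
# cgroup v2: hard memory cap; beyond this the app is reclaimed or OOM-killed
echo 4G > /sys/fs/cgroup/myapp/memory.max
```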
Why Misusing mlock(2) and mlockall(2) Is Really Bad
But besides all this, why is mlockall() bad?
Let's assume you have your production machines configured correctly and you don't normally have any pages swapped out. When you type free you see 0 in the swap used column. Then something bad happens, and you get into the unexpected situation where your prod machine is swapping.
First, the Linux kernel will use an LRU algorithm to choose what pages to swap out. That means that it will start swapping out the pages that have been least recently used first. This is almost always the best algorithm. Under certain pathological page access conditions it can fail (e.g. scanning pages in a huge circular buffer), but this would be really unusual.
This means that if the kernel decides to swap out one of the pages of your allegedly important process, it probably means the page that you think is so important actually isn't important---because you're not using it! There are a ton of reasons that programs might allocate pages that subsequently don't get used at all, so assuming that every page in your program is sacred and cannot be paged out is a bad assumption.
Second, how do you know that your process's pages are really more important than another process's pages? I'll give you an example. Let's say you're running a high-performance server process that's supposed to use nearly all of the system's memory, and you've left a small amount of extra memory remaining for critical system processes like init, cron, syslog, and so forth. If you mlockall() your program, then in the unexpected case where swapping occurs you'll cause these processes to start swapping instead. This can be really bad.
For instance, let's say your use of mlockall() ends up causing syslog to swap. Then all applications logging to syslog will slow down, since syslog will be slow. This can compound performance problems on the rest of the system, since normally logging to syslog is synchronous. In fact, it can be worse if you have a "clever" application that logs to syslog asynchronously from another thread. Such an application will likely use a thread-safe queue to send messages to the logger thread, and the logger thread will read from the queue and then log to syslog. If syslog gets slow then this queue could start growing, which will increase memory pressure even further. In the worst case the queue would be unbounded, which would eventually lead to out-of-memory conditions.
This is clearly a pretty complex issue with a lot of factors to consider. If you're considering using mlockall() for reasons related to throughput and not for security or true real-time applications, I'd encourage you to consider the ramifications of mlockall() and the alternatives I presented above.