My earlier post about
mlockall() caused me to look a bit
more into the "swappiness" setting on Linux. It turns out that by chance this
was related to a problem I had been looking at at work where we unexpectedly saw
some MySQL processes getting OOM killed even though the various buffer settings
seemed correct. While researching this I read about a recent Linux kernel change
that can affect how aggressively the OOM killer operates. The change
was introduced in the Linux 3.5-rc1 kernel by Satoru Moriya and changes
the behavior when the sysctl
vm.swappiness is set to 0. This is a very short
diff (only three lines changed). This change is mentioned in a Percona
article and on many other websites. The short version is: since this
change was introduced processes are much more likely to be OOM killed when
swappiness is set to 0, even if there is free memory on the system. I spent some
time trying to understand the change, since it's only three lines long. What I
found was pretty interesting.
The kernel divides the memory on your system into a number of "zones" which are
related to the address range and what NUMA node they're associated with. You can
view information about the zones in the file
/proc/zoneinfo. I also found a
post from Chris Siebenmann
that explains how these zones work.
The kernel tries to reserve a certain amount of memory in each zone, which is
explained at this linux-mm page. You can
view the sum of the reserved memory in the file /proc/sys/vm/min_free_kbytes,
but you can also view the per-zone statistics in
/proc/zoneinfo. When the
number of free pages on a system is less than the low water mark, kswapd
will try to reclaim some pages. In some cases it can evict pages from memory, in
some cases it can page data out to the swap file/partition, and in some cases it
will invoke the OOM killer.
Note that it is possible that the OOM killer can be invoked even if there is
free memory on the system, or in some cases even if there is free swap space.
For instance, in the Percona article I listed earlier you can see that the host
where the MySQL process was OOM killed had 4GB allocated to swap, was actually
using 0 bytes of swap, and that there were a lot of pages in the pagecache.
However, you can also see that the "Normal" zone is low because there are only
42436kB free and the low water mark has been calculated on that host as 52892kB.
In this situation
kswapd has to decide how to reclaim pages.
If the kernel tries to reclaim pages without an OOM kill it can get them from four possible places:
- anonymous inactive pages
- anonymous active pages
- file inactive pages
- file active pages
The kernel method
get_scan_count(), defined in mm/vmscan.c, is relevant here; it is the function
changed by Satoru's patch. You can see that this function calculates:
```c
/*
 * With swappiness at 100, anonymous and file have the same priority.
 * This scanning priority is essentially the inverse of IO cost.
 */
anon_prio = swappiness;
file_prio = 200 - anon_prio;
```
These values are then used as scalars to affect the priority of anonymous
pages versus file pages; they are used later like so:
```c
/*
 * The amount of pressure on anon vs file pages is inversely
 * proportional to the fraction of recently scanned pages on
 * each list that were recently referenced and in active use.
 */
ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1);
ap /= reclaim_stat->recent_rotated[0] + 1;

fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
fp /= reclaim_stat->recent_rotated[1] + 1;
```
Prior to Satoru's patch, the lines calculating the initial values for ap and
fp looked like:

```c
ap = (1 + anon_prio) * (reclaim_stat->recent_scanned[0] + 1);
fp = (1 + file_prio) * (reclaim_stat->recent_scanned[1] + 1);
```
So previously if you set swappiness = 0 then ap would end up as a small number
and fp would end up as a large number; but now ap will always be 0.
Later in the method ap and fp are used as the numerators of fractions (over a
common denominator of ap + fp + 1) that determine what proportion of each list
to scan. What ends up happening is that when swappiness is set to 0,
get_scan_count() will say that no anonymous pages should be scanned, only file
pages. Previously it would have still scanned some anonymous pages.
Because these numbers are proportions, even if ap was a low number, if there
were a lot of anonymous pages allocated then the kernel could still decide to
swap out a significant number of those pages. For instance, imagine that there
are 1000x as many anonymous pages as file pages. Then even a small ap could
still cause a significant number of anonymous pages to be paged out. This is
why the change of ap from a low number with the old code to zero with the new
code is significant.
The actual method that shrinks the various LRU vectors is called
shrink_lruvec() and calls
get_scan_count(). This method will scan at least
as many pages as requested by
get_scan_count() but can scan more. What will
happen in this method is that since get_scan_count() now says to scan only
file pages, only those pages will be evicted each time it runs. At some point
all of the file pages will have been evicted and only anonymous pages will
remain. When this happens the main loop is exited, and the
struct scan_control that is being passed around will have
nr_reclaimed < nr_to_reclaim. Therefore the kernel will know that we're in trouble.
As documented in the patch, when swappiness is 0 what will happen in this case
is that the OOM killer will be invoked if the number of file backed pages plus
the number of free pages is less
than the high watermark. This gets back to my previous statement about how the
OOM killer might run even if there is swap space available. When swappiness is
set to 0 the number of file backed pages should be very low, because the
kernel has prioritized evicting file backed pages at all costs. This makes it
very likely for the OOM killer to run.
In the Percona article I linked to earlier you can see in the last line of the zone information the following four values for the "Normal" zone:
- 42436kB free
- 52892kB low water mark
- 63472kB high water mark
- 15616kB mapped
Since 42436kB + 15616kB = 58052kB, and this is less than the high water mark of
63472kB, the kernel will decide not to swap. What's interesting is that in this
case the page cache size is not a factor. Somewhere earlier in the reclaim
path the kernel had calculated an appropriate number of page cache pages to
drop. The
code for dropping page cache entries is in
shrink_slab() which is called by
shrink_zones(). I don't totally understand this code, but basically at some
point in the
shrink_zones() function the kernel decides on an appropriate
number of page cache entries to reclaim. At a certain threshold it's easier to
OOM kill a process than it is to reclaim more memory from the page cache.
There is some interesting documentation about this process online, but it is a
bit out of date. It describes the details of how the process works
on the Linux 2.4 kernel, and has a brief "What's New In 2.6" section. Some of
the methods it refers to are defined if you look at the 2.4.x releases but not
in modern releases. From my
brief tour of the code it looks like the 2.4.x algorithm works at a very high
level like it does in the current release I've been looking at (4.3.3), and some
of the variables and functions are indeed the same, but a lot of the details
have changed. This is something I'd like to spend more time investigating.
I would strongly advise against setting
vm.swappiness to 0. If you want the
old behavior you can set it to 1 and the calculation will work as before. Red
Hat suggests setting it to 10 on database-class hosts.