My earlier post about mlockall()
caused me to look a bit
more into the "swappiness" setting on Linux. By chance, this turned out to be
related to a problem I had been investigating at work, where we unexpectedly saw
some MySQL processes get OOM killed even though the various buffer settings
seemed correct. While researching this I read about a recent Linux kernel change
that can affect how aggressively the OOM killer operates. The change is Linux
kernel commit
fe35004f
which was introduced in the Linux 3.5-rc1 kernel by Satoru Moriya and changes
the behavior when the sysctl vm.swappiness
is set to 0. This is a very short
diff (only three lines changed). This change is mentioned on the
Percona blog,
in the
RHEL documentation,
and on many other websites. The short version: since this
change was introduced processes are much more likely to be OOM killed when
swappiness is set to 0, even if there is free memory on the system. I spent some
time trying to understand the change, since it's only three lines long. What I
found was pretty interesting.
The kernel divides the memory on your system into a number of "zones" which are
related to the address range and what NUMA node they're associated with. You can
view information about the zones in the file /proc/zoneinfo
. I also found a
great
post from Chris Siebenmann
that explains how these zones work.
The kernel tries to reserve a certain amount of memory in each zone, which is
explained at this linux-mm page. You can
view the sum of the reserved memory in the file /proc/sys/vm/min_free_kbytes
,
but you can also view the per-zone statistics in /proc/zoneinfo
. When the
number of free pages in a zone falls below the low water mark, kswapd
will try to reclaim some pages. In some cases it can evict pages from memory, in
some cases it can page data out to the swap file/partition, and in some cases it
will invoke the OOM killer.
Note that it is possible that the OOM killer can be invoked even if there is
free memory on the system, or in some cases even if there is free swap space.
For instance, in the Percona article I listed earlier you can see that the host
where the MySQL process was OOM killed had 4GB allocated to swap, was actually
using 0 bytes of swap, and that there were a lot of pages in the pagecache.
However, you can also see that the "Normal" zone is low because there are only
42436kB free and the low water mark has been calculated on that host as 52892kB.
In this situation kswapd
has to decide how to reclaim pages.
If the kernel tries to reclaim pages without an OOM kill it can get them from four possible places:
- anonymous inactive pages
- anonymous active pages
- file inactive pages
- file active pages
The kernel method get_scan_count()
defined in
mm/vmscan.c
is relevant here, and is the function changed by Satoru's patch. You can see in
this function it calculates:
/*
* With swappiness at 100, anonymous and file have the same priority.
* This scanning priority is essentially the inverse of IO cost.
*/
anon_prio = swappiness;
file_prio = 200 - anon_prio;
These values are then used as weights that set the relative priority of anonymous pages versus file pages; they are used later like so:
/*
* The amount of pressure on anon vs file pages is inversely
* proportional to the fraction of recently scanned pages on
* each list that were recently referenced and in active use.
*/
ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1);
ap /= reclaim_stat->recent_rotated[0] + 1;
fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
fp /= reclaim_stat->recent_rotated[1] + 1;
Prior to Satoru's patch, the lines calculating the initial values for ap
and
fp
looked like:
ap = (1 + anon_prio) * (reclaim_stat->recent_scanned[0] + 1);
fp = (1 + file_prio) * (reclaim_stat->recent_scanned[1] + 1);
So previously if you set swappiness = 0
then ap
would end up as a small
number and fp
would end up as a large number; but now ap
will always be 0.
Later in the method these values are used as part of a fraction where ap
or
fp
is the numerator in the fraction. What ends up happening is that when
swappiness is set to 0, get_scan_count()
will now report that no anonymous
pages should be scanned, only file pages. Previously it would
have still scanned some anonymous pages.
Because these numbers are proportions, even if ap
was a low number if there
were a lot of anonymous pages allocated then the kernel can still decide to
start swapping a significant number of these pages out. For instance, imagine
that there are 1000x as many anonymous pages as file pages. Then the small
proportion for ap
could still have a significant effect on the total number of
anonymous pages that can be paged out. This is why the change of ap
from a low
number with the old code to zero with the new code is significant.
The actual method that shrinks the various LRU vectors is called
shrink_lruvec()
and calls get_scan_count()
. This method will scan at least
as many pages as requested by get_scan_count()
but can scan more. What will
happen in this method is that since get_scan_count()
now reports that only file pages should be scanned, only
file pages will be evicted each time it runs. At some point
obviously all of the file pages will have been evicted and there are only
anonymous pages remaining. When this happens the main loop is exited, and the
struct scan_control
that is being passed around will have nr_reclaimed < nr_to_reclaim
. Therefore the kernel will know that we're in trouble.
As documented in
sysctl/vm.txt, if
swappiness
is 0 what will happen in this case is the OOM killer will be
invoked if the number of file backed pages plus the number of free pages is less
than the high watermark. This gets back to my previous statement about how the
OOM killer might run even if there is swap space available. When swappiness is
set to 0, the number of file-backed pages should already be very low, because the
kernel has prioritized evicting file-backed pages above all else. This
makes it very likely for the OOM killer to run.
In the Percona article I linked to earlier you can see in the last line of the zone information the following four values for the "Normal" zone:
- 42436kB free
- 52892kB low water mark
- 63472kB high water mark
- 15616kB mapped
Since 42436kB + 15616kB = 58052kB, and this is less than the high water mark of
63472kB, the kernel will decide not to swap and the OOM killer can be invoked instead. What's interesting is that in this
case the page cache size is not a factor. Somewhere earlier in the kswapd
code
the kernel had calculated an appropriate number of page cache pages to drop. The
code for dropping page cache entries is in shrink_slab()
which is called by
shrink_zones()
. I don't totally understand this code, but basically at some
point in the shrink_zones()
function the kernel decides on an appropriate
number of page cache entries to reclaim. At a certain threshold it's easier to
OOM kill a process than it is to reclaim more memory from the page cache.
There is some interesting documentation about this process on
kernel.org,
but it is a bit out of date. It describes the details of how the process works
on the Linux 2.4 kernel, and has a brief "What's New In 2.6" section. Some of
the methods it refers to such as shrink_cache()
and shrink_caches()
are
defined if you look at the 2.4.x releases but not in modern releases. From my
brief tour of the code it looks like the 2.4.x algorithm works at a very high
level like it does in the current release I've been looking at (4.3.3), and some
of the variables and functions are indeed the same, but a lot of the details
have changed. This is something I'd like to spend more time investigating.
I would strongly advise against setting vm.swappiness
to 0. If you want the
old behavior you can set it to 1, and the calculation will work essentially as before. Red
Hat suggests setting it to 10 on database-class hosts.