Linux Swappiness

My earlier post about mlockall() led me to look more closely at the "swappiness" setting on Linux. By chance this turned out to be related to a problem I had been investigating at work, where we unexpectedly saw some MySQL processes getting OOM killed even though the various buffer settings seemed correct. While researching this I read about a recent Linux kernel change that can affect how aggressively the OOM killer operates. The change is Linux kernel commit fe35004f, introduced by Satoru Moriya in the 3.5-rc1 kernel, and it changes the behavior when the sysctl vm.swappiness is set to 0. The change is mentioned on the Percona blog, in the RHEL documentation, and on many other websites. The short version: since this change was introduced, processes are much more likely to be OOM killed when swappiness is set to 0, even if there is free memory on the system. Since the diff is only three lines long, I spent some time trying to understand it. What I found was pretty interesting.

The kernel divides the memory on your system into a number of "zones", which correspond to physical address ranges and the NUMA node they belong to. You can view information about the zones in the file /proc/zoneinfo. I also found a great post from Chris Siebenmann that explains how these zones work.

The kernel tries to reserve a certain amount of free memory in each zone, which is explained at this linux-mm page. You can view the sum of the reserved memory in the file /proc/sys/vm/min_free_kbytes, and the per-zone statistics in /proc/zoneinfo. When the number of free pages in a zone drops below the low water mark, kswapd will try to reclaim some pages. In some cases it can evict pages from memory, in some cases it can page data out to the swap file/partition, and in some cases it will invoke the OOM killer.
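
If you want to see these watermarks yourself, here is a minimal C sketch that pulls the free/min/low/high page counts out of /proc/zoneinfo. It assumes the usual layout of that file on Linux and skips everything it doesn't recognize; it is just a reading aid, not anything the kernel does:

/* zoneinfo_watermarks.c: print per-zone free/min/low/high page counts.
 * Minimal sketch; assumes the usual /proc/zoneinfo layout on Linux.
 * Values are in pages (typically 4kB each). */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/zoneinfo", "r");
    if (!f) { perror("fopen"); return 1; }
    char line[256], node[64], zone[64];
    long val;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "Node %63[^,], zone %63s", node, zone) == 2)
            printf("Node %s, zone %s\n", node, zone);
        else if (sscanf(line, " pages free %ld", &val) == 1)
            printf("  free: %ld pages\n", val);
        else if (sscanf(line, " min %ld", &val) == 1)
            printf("  min:  %ld pages\n", val);
        else if (sscanf(line, " low %ld", &val) == 1)
            printf("  low:  %ld pages\n", val);
        else if (sscanf(line, " high %ld", &val) == 1)
            printf("  high: %ld pages\n", val);
    }
    fclose(f);
    return 0;
}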

Note that the OOM killer can be invoked even if there is free memory on the system, or in some cases even if there is free swap space. For instance, in the Percona article I mentioned earlier you can see that the host where the MySQL process was OOM killed had 4GB allocated to swap, was actually using 0 bytes of swap, and had a lot of pages in the page cache. However, you can also see that the "Normal" zone is low on free memory: it has only 42436kB free, and the low water mark on that host was calculated as 52892kB. In this situation kswapd has to decide how to reclaim pages.

If the kernel tries to reclaim pages without an OOM kill it can get them from four possible places: the inactive anonymous pages, the active anonymous pages, the inactive file-backed pages, and the active file-backed pages.
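
These are the four LRU lists that the reclaim code computes scan targets for. They're defined in include/linux/mmzone.h, alongside a fifth list for unevictable pages (for example mlock()ed memory, as in my earlier post) which is never reclaimed; the definition looks roughly like this:

enum lru_list {
	LRU_INACTIVE_ANON = LRU_BASE,
	LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
	LRU_UNEVICTABLE,
	NR_LRU_LISTS
};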

The kernel method get_scan_count(), defined in mm/vmscan.c, is relevant here; it is the function changed by Satoru's patch. In this function you can see it calculates:

/*
 * With swappiness at 100, anonymous and file have the same priority.
 * This scanning priority is essentially the inverse of IO cost.
 */
anon_prio = swappiness;
file_prio = 200 - anon_prio;

These values are then used as scalars to affect the relative priority of anonymous pages versus file pages; they are used later like so:

/*
 * The amount of pressure on anon vs file pages is inversely
 * proportional to the fraction of recently scanned pages on
 * each list that were recently referenced and in active use.
 */
ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1);
ap /= reclaim_stat->recent_rotated[0] + 1;

fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
fp /= reclaim_stat->recent_rotated[1] + 1;

Prior to Satoru's patch, the lines calculating the initial values for ap and fp looked like:

ap = (1 + anon_prio) * (reclaim_stat->recent_scanned[0] + 1);
fp = (1 + file_prio) * (reclaim_stat->recent_scanned[1] + 1);

So previously if you set swappiness = 0, ap would end up as a small number and fp as a large number; now ap will always be 0. Later in the method these values are used as parts of a fraction, with ap or fp as the numerator. What ends up happening is that when swappiness is set to 0, get_scan_count() will now say that no anonymous pages should be scanned, only file pages. Previously it would have still scanned some anonymous pages.

Because these numbers are proportions, even if ap is a small number, if there are a lot of anonymous pages allocated the kernel can still decide to swap out a significant number of them. For instance, imagine that there are 1000x as many anonymous pages as file pages. Then even the small proportion for ap could still have a significant effect on the total number of anonymous pages that get paged out. This is why the change of ap from a small number with the old code to zero with the new code is significant.
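
To make this concrete, here is a small standalone C sketch that runs the ap/fp arithmetic under both the old and the new formula. The recent_scanned and recent_rotated values are made up for illustration (a host whose memory is mostly anonymous); they are not real kernel statistics:

/* scan_balance.c: toy model of the ap/fp calculation in get_scan_count().
 * The recent_scanned/recent_rotated numbers are invented for illustration. */
#include <stdio.h>

static void balance(int swappiness, int old_formula) {
    unsigned long anon_prio = swappiness;
    unsigned long file_prio = 200 - anon_prio;

    /* Pretend we recently scanned far more anon pages than file pages,
     * as on a host whose memory is mostly anonymous (e.g. a big mysqld). */
    unsigned long recent_scanned[2] = { 100000, 100 };  /* [0]=anon, [1]=file */
    unsigned long recent_rotated[2] = { 1000, 10 };

    unsigned long ap, fp;
    if (old_formula) {
        ap = (1 + anon_prio) * (recent_scanned[0] + 1);
        fp = (1 + file_prio) * (recent_scanned[1] + 1);
    } else {
        ap = anon_prio * (recent_scanned[0] + 1);
        fp = file_prio * (recent_scanned[1] + 1);
    }
    ap /= recent_rotated[0] + 1;
    fp /= recent_rotated[1] + 1;

    printf("%s formula, swappiness=%3d: ap=%lu fp=%lu -> anon share %.1f%%\n",
           old_formula ? "old" : "new", swappiness, ap, fp,
           100.0 * ap / (ap + fp));
}

int main(void) {
    balance(0, 1);   /* old kernel, swappiness=0: ap small but non-zero */
    balance(0, 0);   /* new kernel, swappiness=0: ap is exactly 0 */
    balance(1, 0);   /* new kernel, swappiness=1: close to old swappiness=0 */
    balance(60, 0);  /* default swappiness */
    return 0;
}

With these inputs the old formula at swappiness=0 still directs about 5% of scanning at anonymous pages, while the new formula directs exactly none.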

The method that actually shrinks the various LRU lists is shrink_lruvec(), which calls get_scan_count(). This method will scan at least as many pages as get_scan_count() requested, but can scan more. Since get_scan_count() now says to evict only file pages, each run of this method will evict only file pages. At some point all of the file pages will have been evicted and only anonymous pages remain. When this happens the main loop is exited, and the struct scan_control that is being passed around will have nr_reclaimed < nr_to_reclaim. Therefore the kernel will know that we're in trouble.
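
A toy model of what this loop degenerates to when the anon scan target is zero (not kernel code; the page counts and batch size are invented):

/* toy model of the shrink_lruvec() loop with swappiness == 0:
 * only file pages are eligible, so once they run out reclaim stalls. */
#include <stdio.h>

int main(void) {
    unsigned long file_pages = 3904;     /* ~15616kB at 4kB per page */
    unsigned long nr_to_reclaim = 8000;  /* invented reclaim target */
    unsigned long nr_reclaimed = 0;

    while (nr_reclaimed < nr_to_reclaim && file_pages > 0) {
        /* evict a batch of file pages; the anon scan target is 0 */
        unsigned long batch = file_pages < 32 ? file_pages : 32;
        file_pages -= batch;
        nr_reclaimed += batch;
    }
    if (nr_reclaimed < nr_to_reclaim)
        printf("reclaim fell short (%lu < %lu): we're in trouble\n",
               nr_reclaimed, nr_to_reclaim);
    return 0;
}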

As documented in Documentation/sysctl/vm.txt, when swappiness is 0 the OOM killer will be invoked if the number of file-backed pages plus the number of free pages is less than the high watermark. This gets back to my earlier statement about how the OOM killer might run even if there is swap space available: when swappiness is set to 0 the number of file-backed pages should already be very low, because the kernel has prioritized evicting file-backed pages above all else. This makes it very likely for the OOM killer to run.

In the Percona article I linked to earlier you can see in the last line of the zone information the following four values for the "Normal" zone: 42436kB free, 15616kB of file-backed pages, a low water mark of 52892kB, and a high water mark of 63472kB.

Since 42436kB + 15616kB = 58052kB, and this is less than the high water mark of 63472kB, the kernel will decide not to swap. What's interesting is that in this case the page cache size is not a factor. Somewhere earlier in the kswapd code the kernel had calculated an appropriate number of page cache pages to drop. The code for dropping cached entries is in shrink_slab(), which is called from shrink_zones() (strictly speaking, shrink_slab() reclaims kernel caches like dentries and inodes, while page cache pages come off the file LRU). I don't totally understand this code, but basically at some point in the shrink_zones() path the kernel decides on an appropriate number of page cache entries to reclaim. At a certain threshold it's easier to OOM kill a process than it is to reclaim more memory from the page cache.
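
Expressed as a check, using the "Normal" zone numbers above (this is just the arithmetic from the paragraph above, not actual kernel code):

/* oom_check.c: the swappiness==0 heuristic applied to the Percona example. */
#include <stdio.h>

int main(void) {
    unsigned long free_kb = 42436;   /* free memory in the Normal zone */
    unsigned long file_kb = 15616;   /* file-backed pages */
    unsigned long high_kb = 63472;   /* high watermark */

    if (free_kb + file_kb < high_kb)
        printf("%lukB + %lukB = %lukB < %lukB: won't swap; "
               "reclaim file pages or OOM kill\n",
               free_kb, file_kb, free_kb + file_kb, high_kb);
    else
        printf("free + file is above the high watermark: no OOM pressure\n");
    return 0;
}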

There is some interesting documentation about this process on kernel.org, but it is a bit out of date: it describes how the process works in the Linux 2.4 kernel, with a brief "What's New In 2.6" section. Some of the functions it refers to, such as shrink_cache() and shrink_caches(), are defined in the 2.4.x releases but not in modern ones. From my brief tour of the code it looks like, at a very high level, the 2.4.x algorithm works like it does in the current release I've been looking at (4.3.3), and some of the variables and functions are indeed the same, but a lot of the details have changed. This is something I'd like to spend more time investigating.

I would strongly advise against setting vm.swappiness to 0. If you want the old behavior you can set it to 1, and the calculation will work as before. Red Hat suggests setting it to 10 on database-class hosts.
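
If you want to change the setting at runtime, you can write to /proc/sys/vm/swappiness, which is what sysctl -w vm.swappiness=1 does; to persist it across reboots, add a vm.swappiness line to /etc/sysctl.conf. A minimal C equivalent of the sysctl command:

/* set_swappiness.c: write a new vm.swappiness value; needs root. */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/sys/vm/swappiness", "w");
    if (!f) { perror("fopen"); return 1; }
    fprintf(f, "1\n");  /* 1 restores the pre-3.5 swappiness=0 behavior */
    fclose(f);
    return 0;
}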