In response to my last post about
dd, a friend of mine
noticed that GNU
cp always uses a 128 KB buffer size when copying a regular
file; this is also the buffer size used by GNU
cat. If you use
strace to watch what happens when copying a file, you should see a lot of 128 KB
reads and writes:
$ strace -s 8 -xx cp /dev/urandom /dev/null
...
read(3, "\x61\xca\xf8\xff\x1a\xd6\x83\x8b"..., 131072) = 131072
write(4, "\x61\xca\xf8\xff\x1a\xd6\x83\x8b"..., 131072) = 131072
read(3, "\xd7\x47\x8f\x09\xb2\x3d\x47\x9f"..., 131072) = 131072
write(4, "\xd7\x47\x8f\x09\xb2\x3d\x47\x9f"..., 131072) = 131072
read(3, "\x12\x67\x90\x66\xb7\xed\x0a\xf5"..., 131072) = 131072
write(4, "\x12\x67\x90\x66\xb7\xed\x0a\xf5"..., 131072) = 131072
read(3, "\x9e\x35\x34\x4f\x9d\x71\x19\x6d"..., 131072) = 131072
write(4, "\x9e\x35\x34\x4f\x9d\x71\x19\x6d"..., 131072) = 131072
...
As you can see, each copy is operating on buffers 131072 bytes in size, which is
128 KB. GNU
cp is part of the GNU coreutils project, and if you go diving into
the coreutils source code you'll find this buffer size is defined in the
header file ioblksize.h.
The comments in this file are really fascinating. The author of the code in this
file (Jim Meyering) did a benchmark using
dd if=/dev/zero of=/dev/null with
different values of the block size parameter,
bs. On a wide variety of
systems, including older Intel CPUs, modern high-end Intel CPUs, and even an IBM
POWER7 CPU, a 128 KB buffer size is fastest. I used gnuplot to graph these
results, shown below. Higher transfer rates are better, and the different
symbols represent different system configurations.
Most of the systems get faster transfer rates as the buffer size approaches 128 KB. After that, performance generally degrades slightly.
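That benchmark is easy to approximate. Here's a quick sketch of my own (not Meyering's actual harness) that copies from /dev/zero to /dev/null with varying buffer sizes, mirroring dd's bs parameter:

```python
import os
import time

def throughput(buf_size, total=64 * 1024 * 1024):
    """Copy `total` bytes from /dev/zero to /dev/null in buf_size chunks,
    returning the transfer rate in MB/s. This is a rough analogue of
    `dd if=/dev/zero of=/dev/null bs=buf_size`."""
    src = os.open("/dev/zero", os.O_RDONLY)
    dst = os.open("/dev/null", os.O_WRONLY)
    copied = 0
    start = time.perf_counter()
    while copied < total:
        buf = os.read(src, buf_size)   # one read(2) per chunk
        copied += os.write(dst, buf)   # one write(2) per chunk
    elapsed = time.perf_counter() - start
    os.close(src)
    os.close(dst)
    return copied / elapsed / 1e6

# Sweep buffer sizes from 4 KB to 1 MB, like the bs column in the benchmark.
for exp in range(12, 21):
    bs = 1 << exp
    print(f"{bs:>8} bytes: {throughput(bs):8.1f} MB/s")
```

Note that /dev/zero to /dev/null involves no real disk or readahead, so this mostly measures per-syscall overhead; the exact numbers vary a lot from machine to machine.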
The file includes a cryptic, but interesting, explanation of why 128 KB is the best buffer size. Normally with these system calls it's more efficient to use larger buffer sizes. This is because the larger the buffer size used, the fewer system calls need to be made. So why the drop off in performance when a buffer larger than 128 KB is used?
When copying a file, GNU
cp will first call posix_fadvise(2) on the source file with
POSIX_FADV_SEQUENTIAL as the "advice" flag. As the name
implies, this gives a hint to the kernel that
cp plans to scan the source file
sequentially. This causes the Linux kernel to use "readahead" for the file. On
Linux you can also initiate readahead
using madvise(2). There's
also a system call actually named readahead(2),
but it has a slightly different use case.
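Here's a sketch of such a copy loop in Python, whose os module wraps the same system calls. This is the spirit of what cp does, not its actual code, and the function name is mine:

```python
import os

BUF_SIZE = 128 * 1024  # the same 128 KB buffer size used by GNU cp

def copy_sequential(src_path, dst_path):
    """Copy a file, first hinting to the kernel that the source will be
    read sequentially, as GNU cp does with POSIX_FADV_SEQUENTIAL."""
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # offset=0, length=0 means "the whole file"; this triggers readahead.
        os.posix_fadvise(src, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            buf = os.read(src, BUF_SIZE)
            if not buf:
                break
            os.write(dst, buf)
    finally:
        os.close(src)
        os.close(dst)
```

os.posix_fadvise is only available on Unix systems; on Linux it's a thin wrapper over the posix_fadvise(2) system call.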
When you read(2) data from a regular file, if you're lucky some or all of the
data you plan to read will already be in the kernel's page cache. The page cache
is a cache of disk pages stored in kernel memory. Normally this works on an LRU
basis, so when you read a page from disk the kernel first checks the page cache,
and if the page isn't in the cache it reads it from disk and copies it into the
page cache (possibly evicting an older page from the cache). This means the
first access to a disk page actually requires going to disk, but subsequent
accesses can simply copy the data from main memory if the disk page is still in
the page cache.
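One way to see the page cache at work (a rough experiment of my own, not something cp does) is to time two sequential reads of the same file; the second pass is usually served from the page cache:

```python
import time

def timed_read(path, buf_size=128 * 1024):
    """Read the whole file sequentially in buf_size chunks,
    returning the elapsed time in seconds."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(buf_size):
            pass
    return time.perf_counter() - start

# Timing the same file twice: the first read may have to go to disk, while
# the second usually hits the page cache. (A file that was just written is
# often already cached, so for a clean experiment you'd drop the caches
# first, e.g. by writing to /proc/sys/vm/drop_caches as root.)
```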
When the kernel initiates readahead, it makes a best effort to prefetch pages that it thinks will be needed imminently. In particular, when accessing a file sequentially, the kernel will attempt to prefetch upcoming parts of the file as the file is read. When everything is working correctly, one can get a high cache hit rate even if the file contents weren't already in the page cache when the file was initially opened. In fact, if the file is actually accessed sequentially, there's a good chance of getting a 100% hit rate from the page cache when the kernel is doing readahead.
There's a trade-off here, because if the kernel prefetches pages more
aggressively there will be a higher cache hit rate; but if the kernel is too
aggressive, it may wastefully prefetch pages that aren't actually going to be
read. What actually happens is the kernel has a readahead buffer size configured
for each block device, and the readahead kernel thread will prefetch at most
that much data for files on that block device. You can see the readahead buffer
size using the blockdev command:
# Get the readahead size for /dev/sda
$ blockdev --getra /dev/sda
256
The units returned by
blockdev are in terms of 512 byte "sectors" (even though
my Intel SSD doesn't actually have
true disk sectors). Thus a return
value of 256 actually corresponds to a 128 KB buffer size. You can see how this
is actually implemented by the kernel in the file mm/readahead.c,
in particular in the function
ondemand_readahead() which calls
get_init_ra_size(). From my non-expert reading of the code, it appears that
the code tries to look at the number of pages in the file, and for large files a
maximum value of 128 KB is used. Note that this is highly specific to Linux:
other Unix kernels may or may not implement readahead, and if they do there's no
guarantee that they'll use the same readahead buffer size.
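The sector arithmetic above is easy to check. The helpers below (my own names, not part of any standard tool) convert blockdev's sector count to bytes, and read the same setting from sysfs, which reports it in kilobytes:

```python
SECTOR_SIZE = 512  # blockdev(8) reports readahead in 512-byte "sectors"

def readahead_bytes(sectors):
    """Convert `blockdev --getra` output into a byte count."""
    return sectors * SECTOR_SIZE

def readahead_kb(device="sda"):
    """Read the equivalent setting from sysfs, in kilobytes.
    Linux-only; "sda" is just an example device name."""
    with open(f"/sys/block/{device}/queue/read_ahead_kb") as f:
        return int(f.read())
```

For example, readahead_bytes(256) returns 131072, i.e. 128 KB, matching the blockdev output above.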
So how is this related to disk transfer rates? As noted earlier, typically one
wants to minimize the number of system calls made, as each system call has
overhead. In this case that means we want to use as large a buffer size as
possible. On the other hand, performance will be best when the page cache hit
rate is high. A buffer size of 128 KB fits both of these constraints---it's the
maximum buffer size that can be used before readahead will stop being effective.
If a larger buffer size is used,
read(2) calls will block while the kernel waits
for the disk to actually return new data.
In the real world a lot of other things will be happening on the host, so there's no guarantee that the stars will align perfectly. If the disk is very fast, the effect of readahead is diminished, so the penalty for using a larger buffer size might not be as bad. It's also possible to race the kernel here: a userspace program could try to read a file faster than the kernel can prefetch pages, which will make readahead less effective. But on the whole, we expect a 128 KB buffer size to be most effective, and that's exactly what the benchmark above demonstrates.