In response to my last post about
dd, a friend of mine
noticed that GNU
cp always uses a 128 KB buffer size when copying a regular
file; this is also the buffer size used by GNU
cat. If you use
strace to watch what happens when copying a file, you should see a lot of 128 KB
reads and writes:
$ strace -s 8 -xx cp /dev/urandom /dev/null
...
read(3, "\x61\xca\xf8\xff\x1a\xd6\x83\x8b"..., 131072) = 131072
write(4, "\x61\xca\xf8\xff\x1a\xd6\x83\x8b"..., 131072) = 131072
read(3, "\xd7\x47\x8f\x09\xb2\x3d\x47\x9f"..., 131072) = 131072
write(4, "\xd7\x47\x8f\x09\xb2\x3d\x47\x9f"..., 131072) = 131072
read(3, "\x12\x67\x90\x66\xb7\xed\x0a\xf5"..., 131072) = 131072
write(4, "\x12\x67\x90\x66\xb7\xed\x0a\xf5"..., 131072) = 131072
read(3, "\x9e\x35\x34\x4f\x9d\x71\x19\x6d"..., 131072) = 131072
write(4, "\x9e\x35\x34\x4f\x9d\x71\x19\x6d"..., 131072) = 131072
...
As you can see, each copy is operating on buffers 131072 bytes in size, which is
128 KB. GNU
cp is part of the GNU coreutils project, and if you go diving into
the coreutils source code you'll find this buffer size is defined in the
header file ioblksize.h.
The comments in this file are really fascinating. The author of the code in this
file (Jim Meyering) did a benchmark using
dd if=/dev/zero of=/dev/null with
different values of the block size parameter,
bs. On a wide variety of
systems, including older Intel CPUs, modern high-end Intel CPUs, and even an IBM
POWER7 CPU, a 128 KB buffer size is fastest. I used gnuplot to graph these
results, shown below. Higher transfer rates are better, and the different
symbols represent different system configurations.
Most of the systems get faster transfer rates as the buffer size approaches 128 KB. After that, performance generally degrades slightly.
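That benchmark is easy to approximate. Here's a quick sketch of my own (not Meyering's actual harness) that copies from /dev/zero to /dev/null with varying buffer sizes, mirroring dd's bs parameter:

```python
import os
import time

def throughput(buf_size, total=64 * 1024 * 1024):
    """Copy `total` bytes from /dev/zero to /dev/null in buf_size chunks,
    returning the transfer rate in MB/s. This is a rough analogue of
    `dd if=/dev/zero of=/dev/null bs=buf_size`."""
    src = os.open("/dev/zero", os.O_RDONLY)
    dst = os.open("/dev/null", os.O_WRONLY)
    copied = 0
    start = time.perf_counter()
    while copied < total:
        buf = os.read(src, buf_size)   # one read(2) per chunk
        copied += os.write(dst, buf)   # one write(2) per chunk
    elapsed = time.perf_counter() - start
    os.close(src)
    os.close(dst)
    return copied / elapsed / 1e6

# Sweep buffer sizes from 4 KB to 1 MB, like the bs column in the benchmark.
for exp in range(12, 21):
    bs = 1 << exp
    print(f"{bs:>8} bytes: {throughput(bs):8.1f} MB/s")
```

Note that /dev/zero to /dev/null involves no real disk or readahead, so this mostly measures per-syscall overhead; the exact numbers vary a lot from machine to machine.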
The file includes a cryptic, but interesting, explanation of why 128 KB is the best buffer size. Normally with these system calls it's more efficient to use larger buffer sizes. This is because the larger the buffer size used, the fewer system calls need to be made. So why the drop off in performance when a buffer larger than 128 KB is used?
When copying a file, GNU
cp will first call posix_fadvise(2) on the source file with
POSIX_FADV_SEQUENTIAL as the "advice" flag. As the name
implies, this gives a hint to the kernel that
cp plans to scan the source file
sequentially. This causes the Linux kernel to use "readahead" for the file. On
Linux you can also initiate readahead
using madvise(2). There's
also a system call actually named readahead(2),
but it has a slightly different use case.
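Here's a sketch of such a copy loop in Python, whose os module wraps the same system calls. This is the spirit of what cp does, not its actual code, and the function name is mine:

```python
import os

BUF_SIZE = 128 * 1024  # the same 128 KB buffer size used by GNU cp

def copy_sequential(src_path, dst_path):
    """Copy a file, first hinting to the kernel that the source will be
    read sequentially, as GNU cp does with POSIX_FADV_SEQUENTIAL."""
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # offset=0, length=0 means "the whole file"; this triggers readahead.
        os.posix_fadvise(src, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            buf = os.read(src, BUF_SIZE)
            if not buf:
                break
            os.write(dst, buf)
    finally:
        os.close(src)
        os.close(dst)
```

os.posix_fadvise is only available on Unix systems; on Linux it's a thin wrapper over the posix_fadvise(2) system call.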
When you read(2) data from a regular file, if you're lucky some or all of the
data you plan to read will already be in the kernel's page cache. The page cache
is a cache of disk pages stored in kernel memory. Normally this works on an LRU
basis, so when you read a page from disk the kernel first checks the page cache,
and if the page isn't in the cache it reads it from disk and copies it into the
page cache (possibly evicting an older page from the cache). This means the
first access to a disk page actually requires going to disk, but subsequent
accesses can simply copy the data from main memory if the disk page is still in
the page cache.
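One way to see the page cache at work (a rough experiment of my own, not something cp does) is to time two sequential reads of the same file; the second pass is usually served from the page cache:

```python
import time

def timed_read(path, buf_size=128 * 1024):
    """Read the whole file sequentially in buf_size chunks,
    returning the elapsed time in seconds."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(buf_size):
            pass
    return time.perf_counter() - start

# Timing the same file twice: the first read may have to go to disk, while
# the second usually hits the page cache. (A file that was just written is
# often already cached, so for a clean experiment you'd drop the caches
# first, e.g. by writing to /proc/sys/vm/drop_caches as root.)
```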
When the kernel initiates readahead, it makes a best effort to prefetch pages that it thinks will be needed imminently. In particular, when accessing a file sequentially, the kernel will attempt to prefetch upcoming parts of the file as the file is read. When everything is working correctly, one can get a high cache hit rate even if the file contents weren't already in the page cache when the file was initially opened. In fact, if the file is actually accessed sequentially, there's a good chance of getting a 100% hit rate from the page cache when the kernel is doing readahead.
There's a trade-off here, because if the kernel prefetches pages more
aggressively there will be a higher cache hit rate; but if the kernel is too
aggressive, it may wastefully prefetch pages that aren't actually going to be
read. What actually happens is the kernel has a readahead buffer size configured
for each block device, and the readahead kernel thread will prefetch at most
that much data for files on that block device. You can see the readahead buffer
size using the blockdev command:
# Get the readahead size for /dev/sda
$ blockdev --getra /dev/sda
256
The units returned by
blockdev are in terms of 512 byte "sectors" (even though
my Intel SSD doesn't actually have
true disk sectors). Thus a return
value of 256 actually corresponds to a 128 KB buffer size. You can see how this
is actually implemented by the kernel in the file mm/readahead.c,
in particular in the function
ondemand_readahead() which calls
get_init_ra_size(). From my non-expert reading of the code, it appears that
the code tries to look at the number of pages in the file, and for large files a
maximum value of 128 KB is used. Note that this is highly specific to Linux:
other Unix kernels may or may not implement readahead, and if they do there's no
guarantee that they'll use the same readahead buffer size.
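The sector arithmetic above is easy to check. The helpers below (my own names, not part of any standard tool) convert blockdev's sector count to bytes, and read the same setting from sysfs, which reports it in kilobytes:

```python
SECTOR_SIZE = 512  # blockdev(8) reports readahead in 512-byte "sectors"

def readahead_bytes(sectors):
    """Convert `blockdev --getra` output into a byte count."""
    return sectors * SECTOR_SIZE

def readahead_kb(device="sda"):
    """Read the equivalent setting from sysfs, in kilobytes.
    Linux-only; "sda" is just an example device name."""
    with open(f"/sys/block/{device}/queue/read_ahead_kb") as f:
        return int(f.read())
```

For example, readahead_bytes(256) returns 131072, i.e. 128 KB, matching the blockdev output above.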
So how is this related to disk transfer rates? As noted earlier, typically one
wants to minimize the number of system calls made, as each system call has
overhead. In this case that means we want to use as large a buffer size as
possible. On the other hand, performance will be best when the page cache hit
rate is high. A buffer size of 128 KB fits both of these constraints---it's the
maximum buffer size that can be used before readahead will stop being effective.
If a larger buffer size is used,
read(2) calls will block while the kernel waits
for the disk to actually return new data.
In the real world a lot of other things will be happening on the host, so there's no guarantee that the stars will align perfectly. If the disk is very fast, the effect of readahead is diminished, so the penalty for using a larger buffer size might not be as bad. It's also possible to race the kernel here: a userspace program could try to read a file faster than the kernel can prefetch pages, which will make readahead less effective. But on the whole, we expect a 128 KB buffer size to be most effective, and that's exactly what the benchmark above demonstrates.