One common mistake I see in socket programming is incorrect usage of TCP_NODELAY (the socket option that disables Nagle's algorithm). There's a ton of advice on the internet promoting this socket option as a panacea for network latency. Often the posts I've seen suggesting that TCP_NODELAY will be helpful are technically correct, but they miss the larger picture: the context in which it's appropriate to use this option, and more generally how to write high performance socket programs.
This came up today at work. Someone was trying to load test their service with a program written using the Tornado HTTP client. This library comes in two varieties: a high performance libcurl binding and a pure Python version that just uses the low level Tornado IOStream stuff. In both cases the benchmark was getting poor latency numbers even though the benchmark was set up with a concurrency of ten. The engineer writing the load test refactored it to use the Python multiprocessing module with the regular blocking urllib2 library and got a huge speedup (again, with concurrency ten), in line with the expectation for the service.
A bunch of theories were thrown back and forth about why this was happening. One of the engineers looking at this had read Julia Evans' recent blog post about TCP_NODELAY and suggested that this may be the culprit. The engineer even found a related Tornado GitHub issue where Tornado wasn't setting TCP_NODELAY in earlier versions. There was some confusion about which Tornado version we were using in the benchmarks, whether urllib2 was using TCP_NODELAY, and so on.
I was kind of skeptical that TCP_NODELAY was related, for a few reasons. First, in my experience most applications that use this flag are not written correctly in the first place. Indeed, in the blog post Julia references discussing the issue with Ruby's Net::HTTP module, you can see right off the bat that the Ruby library is not written well. To wit:
> Ruby’s Net::HTTP splits POST requests across two TCP packets - one for the headers, and another for the body. curl, by contrast, combines the two if they’ll fit in a single packet.
Unnecessarily sending data out in multiple packets here is bad programming practice in the first place (more on this in a second). The benchmark that came up at work was making HTTP GET requests, which don't have bodies and should fit in a single packet even if the library was written sloppily. Additionally, one variation of the Tornado load tester was using the libcurl bindings, and as noted in that post, libcurl properly buffers its writes to send full packets.
In this post I'm going to discuss:
- How the TCP stack in your kernel works, as well as how the protocol itself works. Too often people understand how the protocol works without understanding the kernel layer, which leads to sloppy code (like the Ruby Net::HTTP example).
- What to do before turning on TCP_NODELAY.
- When it is appropriate to use TCP_NODELAY.
- A fun and related socket option on Linux, TCP_CORK.
A Brief Overview of TCP and the Kernel
The programming interface exposed to you by the kernel for TCP sockets (actually, all SOCK_STREAM sockets) is a file descriptor that you can write(2) and read(2) on. As you might guess, the write(2) system call sends data, and the read(2) system call reads data.
In reality how this works is that the kernel has two buffers allocated for the file descriptor in kernel memory: a write buffer and a read buffer. The write(2) system call doesn't actually cause TCP packets to get sent; what it actually does is copy data into the kernel's write buffer for the socket. Likewise, read(2) fetches data out of the read buffer (or blocks if the read buffer is empty). There are some further details about blocking and non-blocking sockets and other advanced socket options, but those details aren't relevant for this discussion.
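As a rough illustration, here's a minimal Python sketch (the host is arbitrary) of what these buffers look like from the application's point of view: a successful send() only means the bytes were copied into the kernel's write buffer, and recv() only drains whatever is already sitting in the read buffer.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("example.com", 80))

# The kernel reports the size of the write buffer it allocated for this
# socket (SO_SNDBUF); the read buffer is SO_RCVBUF.
print("write buffer:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
print("read buffer: ", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))

# send() returns as soon as the bytes are copied into the write buffer;
# the actual TCP packet may go out (or be retransmitted) much later.
n = sock.send(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
print("copied", n, "bytes into the kernel's write buffer")

# recv() pulls data out of the read buffer, blocking while it's empty.
data = sock.recv(4096)
print("read", len(data), "bytes from the kernel's read buffer")
sock.close()
```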
When there is unsent data in the write buffer the kernel tries to send it out over the network. This takes the form of a data payload wrapped in a TCP packet, which itself is wrapped in an IP packet, which itself is wrapped in a link layer packet (e.g. an Ethernet frame). The time when this TCP packet is sent is potentially much later than when you actually issued the write(2) call that copied data into the outgoing write buffer. You can't use the write(2) call to know whether the data has been sent (and even if you could, sent data can be dropped or lost and may need to be retransmitted anyway).
The TCP protocol itself has a rule for when the kernel is allowed to send TCP packets. There's a window size for the TCP connection that limits how much unacked data can be in flight at any time. When this window size is hit, data written by the application will be copied into the kernel write buffer but not sent immediately. There are two cases when an application is likely to run into this: (1) new TCP connections start with a tiny window size, and therefore write delays are common, and (2) applications that are actually saturating the network bandwidth will eventually run up against the window size.
The situation relevant here is (1), the behavior of a new TCP connection. In the original RFC for TCP the window size started at one packet. This means that the first write(2) call on a new TCP stream will cause a packet to be sent immediately, but a quick second write(2) will not cause data to be sent if the first packet hasn't been acked. This has been somewhat fixed; RFC 3390 increases the initial window size to up to four packets, and there have been efforts to increase it even further.
Besides the TCP rules for when it is permissible to send packets, there are some recommendations for how TCP stacks should work. In some cases your operating system may choose to delay sending TCP packets even when the current window size would allow a packet to be sent. There are two relevant mechanisms here: Nagle's algorithm (which can be disabled by TCP_NODELAY) and TCP delayed acknowledgement (which on Linux can sort of be controlled by TCP_QUICKACK, with some caveats).
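Both knobs are set with setsockopt(2). Here's a minimal Python sketch; the TCP_QUICKACK part is Linux-specific and the constant isn't exposed on every platform, hence the guard.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Disable Nagle's algorithm: small writes are sent immediately instead of
# being held back while earlier data is still unacked.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Linux only, with caveats: TCP_QUICKACK turns off delayed acks, but the
# kernel can silently re-enable them, so it typically has to be set again
# after reads.
if hasattr(socket, "TCP_QUICKACK"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
```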
Nagle's algorithm observes that a userspace application might be making lots of small calls to write(2), and that it may be beneficial to have the kernel try to coalesce the writes. If the kernel naively translated all of these writes into TCP packets then the kernel would end up sending a lot of small packets. For instance, consider the statsd protocol. In this protocol a stat is sent as the stat name, a value, and optionally a "format". A typical stat sent in this protocol is very small; say, 50 bytes. An application that is trying to send 10 stats might therefore issue 10 calls to write(2), each sending 50 bytes. It's easy to imagine how this would be the case: you structure your library so there's a send_stat() function which writes the data, and then call that function 10 times. What Nagle's algorithm says is: if there is unacked data in flight, and if the write buffer in the kernel is smaller than the MTU, then wait a little to see if the application writes more data. If the write buffer reaches the MTU size then the data will be transmitted. If the in-flight data is acked then the data will also be transmitted, even if it is smaller than the MTU. In the example I just gave, this would mean that the kernel will automatically coalesce the ten stats into a single 500-byte packet. As you can see, this is a really good thing, and generally not something you want to disable.
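To make the send_stat() pattern concrete, here's a hypothetical sketch of such a library (the statsd host and stat names are made up). With Nagle's algorithm on, these ten small writes will usually be coalesced by the kernel rather than hitting the network as ten tiny packets.

```python
import socket

def send_stat(sock, name, value):
    # One small write per stat; each line is roughly 50 bytes.
    sock.sendall(f"{name}:{value}|c\n".encode())

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("statsd.example.com", 8125))  # hypothetical statsd relay

# Ten separate ~50-byte writes. While the first packet is unacked the
# kernel keeps buffering, so these typically go out as one ~500-byte
# packet instead of ten tiny ones.
for i in range(10):
    send_stat(sock, f"myapp.requests.endpoint_{i:02d}", 1)
```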
The TCP delayed ack feature is again an attempt to minimize the number of small packets sent. The way it works is a TCP packet can ack multiple data packets at once. Therefore a TCP stack implementing the delayed ack feature may wait up to some amount of time before acking packets in the hope that it will be able to ack more packets at once. On Linux this can cause up to a 40 ms delay when acking packets. Again, this is usually a good thing since it decreases the number of packets that have to be sent (which is usually the limiting factor in network performance).
The Problem
The problem happens when Nagle's algorithm causes an outgoing packet to be delayed and the sending program needs that packet to be sent before it can read data on the connection. This situation is actually problematic even with Nagle's algorithm alone, but delayed acks amplify the problem.
If a program writes a single packet and then reads, the packet will be transmitted immediately (since there is no unacked data, the rule about partially full buffers doesn't apply). The opposite end will process the packet and then send its response (which will include an ack flag). Thus, there is no unnecessary delay.
If a program issues multiple writes and doesn't have to read anything then there is no problem. The worst case here is that delayed ack is turned on, which means that at the end of the write sequence there is at most a 40 ms artificial delay before the sender knows the data is acked. However, if the sender isn't blocked while it waits for a response (e.g. because there are no responses, or because the communication is fully bidirectional) then this won't be noticed.
But consider what happens when a program issues multiple writes followed by a read that is dependent on the write data. The final write may queue due to Nagle's algorithm. The other end won't process the request (and therefore will not start sending a response) because it's waiting for the final packet of data. The other end also won't ack the data it got until the delayed ack timer fires. When this finally does happen the sending program gets its acks and is finally able to actually transmit its last packet. This more or less guarantees the 40 ms delay (unless the write payload just happened to exactly be a multiple of the MTU, which is unlikely), a delay which will be noticed in many applications.
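Here's the problematic write-write-read pattern in sketch form (the endpoint and payload are hypothetical). The second write can sit in the kernel's buffer until the peer's delayed ack fires, and the read can't complete until the peer has the whole request.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("service.example.com", 8080))  # hypothetical service

headers = (b"POST /ingest HTTP/1.1\r\n"
           b"Host: service.example.com\r\n"
           b"Content-Length: 13\r\n\r\n")
body = b'{"value": 1}\n'

sock.sendall(headers)       # goes out immediately (no unacked data yet)
sock.sendall(body)          # small, with unacked data in flight: Nagle delays it
response = sock.recv(4096)  # blocks until the server sees the full request
```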
When Not To Use TCP_NODELAY
Off the bat, you should try to send as much data to the kernel as possible in each write(2) system call. If that's difficult to do because of how requests are crafted, then you should investigate the writev(2) system call, which implements "vectorized I/O". This makes it easier to write applications that generate the data to be sent into multiple small buffers and then want to efficiently copy that data into the kernel.
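For instance, the two-write request from the earlier sketch can be handed to the kernel in one syscall using Python's os.writev wrapper (same hypothetical endpoint as before):

```python
import os
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("service.example.com", 8080))  # hypothetical service

# The request is still built as separate buffers...
buffers = [
    b"POST /ingest HTTP/1.1\r\n",
    b"Host: service.example.com\r\n",
    b"Content-Length: 13\r\n\r\n",
    b'{"value": 1}\n',
]

# ...but they reach the kernel in a single writev(2) call, so the whole
# request can go out as one packet without concatenating in userspace.
os.writev(sock.fileno(), buffers)
response = sock.recv(4096)
```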
In the Ruby example, if the Net::HTTP module issued a single write(2) or writev(2) call then the POST request would likely have fit into a single packet, which would have avoided the need to use TCP_NODELAY. It's still possible that even then the POST body could be sufficiently large to cause the request to be split. However, this shouldn't be too common for most web applications. A good rule of thumb is that you can typically fit at least 1400 bytes of payload before the packet will be split.
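If you want the exact number for a given connection rather than the rule of thumb, the kernel will tell you the maximum segment size it negotiated. A small sketch (arbitrary host):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("example.com", 80))

# On a standard 1500-byte Ethernet MTU this is typically
# 1500 - 20 (IP header) - 20 (TCP header) = 1460 bytes, and a bit less
# when TCP options such as timestamps are in use -- hence ~1400 bytes as
# a safe rule of thumb.
mss = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
print("negotiated maximum segment size:", mss, "bytes")
```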
When And How to Use TCP_NODELAY
An application that is very latency sensitive, particularly if it doesn't transmit a lot of data, can safely use TCP_NODELAY. A good example here is an SSH session. In an SSH session it's common that the user will be typing input to the remote server, and the user will certainly be typing at a slow pace compared to the bandwidth capabilities of the network. If Nagle's algorithm is on then keystrokes will frequently end up being delayed. This will make a lot of applications difficult to use---imagine using vi or emacs with Nagle's algorithm on.
If you understand this and you want to enable TCP_NODELAY anyway, then you definitely can, as long as you follow one rule: you must make your best effort to send large writes to the kernel. That is, it is incumbent on you to ensure that small amounts of data in the send buffer are there because there's genuinely no more data to send, not because your application is doing something hacky and is issuing lots of small write(2) system calls. Again, I'd like to encourage the use of vectorized I/O here. For instance, if you have a line oriented protocol you may find it easiest to structure your program by creating multiple buffers, each storing a single line, and then transmitting all of the data at once in a single syscall (without unnecessary memory copies!) using writev(2).
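Here's a rough sketch of that structure (hypothetical server and protocol): TCP_NODELAY is on, but every flush hands the kernel all pending lines in one writev(2) call, so the application never dribbles out tiny packets.

```python
import os
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("lineserver.example.com", 9000))  # hypothetical server
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

pending = []  # one buffer per protocol line

def queue_line(line):
    pending.append(line.encode() + b"\n")

def flush():
    # All queued lines go to the kernel in a single syscall, with no
    # userspace concatenation or extra copies. (A production version
    # would also handle partial writes.)
    if pending:
        os.writev(sock.fileno(), pending)
        pending.clear()

queue_line("HELLO client-1")
queue_line("SET foo 42")
queue_line("GET foo")
flush()
```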
Multi-threaded programs that share sockets between threads without locks are particularly susceptible to problems related to Nagle's algorithm. This is because even if the application is structured so that a single thread doesn't do the write-write-read pattern, if there is one writer thread and another reader thread then the pattern can still manifest. This is one good argument for not sharing sockets between threads without mutexes; usually a better option is to use an event loop.
Note that if you are using a proxy like HAProxy then all bets are off. Even if you disable Nagle's algorithm, the proxy could have it on.
TCP_CORK, or, Reverse TCP_NODELAY
Linux implements an interesting option called TCP_CORK. You can think of it as an even stronger variation of Nagle's algorithm. It says: if there is data in the write buffer, and that data is less than the MTU size, and if there is no unacked data, continue to delay sending the write buffer data for up to 200ms in the hope that more data will enter the write buffer. In other words: it tells the kernel to try extra hard to delay sending small packets, allowing the kernel to delay sending such packets by up to 200ms.
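Corking and uncorking is done with setsockopt(2). A Linux-only Python sketch (hypothetical host) might look like this:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("statsd.example.com", 8125))  # hypothetical statsd relay

# Cork the socket: partial packets are held back while the cork is in.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)

for i in range(10):
    sock.sendall(f"myapp.requests.endpoint_{i:02d}:1|c\n".encode())

# Pull the cork: whatever is still sitting in the write buffer is flushed.
# Corked data also goes out on its own once it fills an MTU-sized packet
# or after the ~200ms timeout described above.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)
```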
This is something that I added to statsrelay, and it significantly increased the throughput of that program. Prior to using this option many programs would issue tons of small writes to statsrelay. This would cause lots of small packets to be sent over the network (even though TCP_NODELAY was not on). In some cases we had network interfaces that were reaching their max packets-per-second due to all of these small packets, even though the total bandwidth was well below what the interface could handle. After turning on TCP_CORK the number of packets per second the statsd boxes received dropped dramatically (up to 10x, if I recall correctly). This comes at the cost of some latency, but the underlying data store (Carbon) only supported second resolution for data anyway, so in practice this was not a problem.
A Final Note
If you made it this far, you may wonder what the original problem was with the Tornado load testing application. It ended up actually being an issue with how the futures were being created and collected. All of the possible requests the application could make were created as futures at once, each with a timestamp of when the future was created. However, only ten futures at a time would actually be active. This meant that the last futures being collected were showing long response times even though the actual requests were much faster.