How TCP Sockets Work

In this post I'm going to explain at a high level how the TCP/IP stack works on Linux. In particular, I'll explore how the socket system calls interact with kernel data structures, and how the kernel interacts with the actual network. Part of the motivation for this post is to explain how listen queue overflows work, as it's related to a problem I've been working on at work.

How Established Connections Work

This explanation will be from the top down, so we'll start with how already established connections work. Later I'll explain how newly established connections work.

For each TCP file descriptor tracked by the kernel there is a struct tracking some TCP-specific info (e.g. sequence numbers, the current window size, and so on), as well as a receive buffer (or "queue") and a write buffer (or "queue"). I'll use the terms buffer and queue interchangeably. If you're curious about more details, you can see the implementation of socket structs in the Linux kernel's net/sock.h.
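Conceptually, the parts of that state relevant to this post boil down to something like the following. This is a heavily simplified sketch for illustration, not the real kernel definitions (the real structs are struct sock and struct tcp_sock, and contain many more fields):

#include <stdint.h>

/* Heavily simplified sketch of the per-connection state the kernel keeps.
 * The real definitions live in net/sock.h and include/linux/tcp.h. */
struct sk_buff_list { void *head, *tail; };  /* stand-in for the kernel's queue type */

struct tcp_sock_sketch {
    struct sk_buff_list receive_queue;  /* data waiting to be read(2) by userspace */
    struct sk_buff_list write_queue;    /* data waiting to be handed to the NIC */
    int      rcvbuf;                    /* max bytes allowed in the receive queue */
    int      sndbuf;                    /* max bytes allowed in the write queue */
    uint32_t snd_nxt;                   /* next sequence number we will send */
    uint32_t rcv_nxt;                   /* next sequence number we expect to receive */
    uint32_t rcv_wnd;                   /* receive window we advertise to the peer */
};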

When a new data packet comes in on the network interface card (NIC), the kernel is notified either by being interrupted by the NIC, or by polling the NIC for data. Typically whether the kernel is interrupt driven or in polling mode depends on how much network traffic is happening; when the NIC is very busy it's more efficient for the kernel to poll, but if the NIC is not busy, CPU cycles and power can be saved by using interrupts. Linux calls this technique NAPI, literally "New API".

When the kernel gets a packet from the NIC it decodes the packet and figures out what TCP connection the packet is associated with based on the source IP, source port, destination IP, and destination port. This information is used to look up the struct sock in memory associated with that connection. Assuming the packet is in sequence, the data payload is then copied into the socket's receive buffer. At this point the kernel will wake up any processes doing a blocking read(2), or that are using an I/O multiplexing system call like select(2) or epoll_wait(2) to wait on the socket.
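From userspace, that wake-up step looks like this: a process parked in epoll_wait(2) (or select(2), or a blocking read(2)) returns once the kernel has copied data into the socket's receive buffer. A minimal sketch of waiting for readability on an already-connected socket, with error handling mostly omitted:

#include <sys/epoll.h>
#include <unistd.h>

/* Block until the kernel signals that `connfd` has data in its receive
 * buffer (EPOLLIN), then read some of it. Sketch only. */
ssize_t wait_and_read(int connfd, char *buf, size_t len)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = connfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &ev);

    struct epoll_event ready;
    epoll_wait(epfd, &ready, 1, -1);   /* sleeps until the socket is readable */

    ssize_t n = read(connfd, buf, len);
    close(epfd);
    return n;
}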

When the userspace process actually calls read(2) on the file descriptor it causes the kernel to remove the data from its receive buffer, and to copy that data into a buffer supplied to the read(2) system call.

Sending data works similarly. When the application calls write(2) it copies data from the user-supplied buffer into the kernel write queue. Subsequently the kernel will copy the data from the write queue into the NIC and actually send the data. The actual transmission of the data to the NIC could be somewhat delayed from when the user actually calls write(2) if the network is busy, if the TCP send window is full, if there are traffic shaping policies in effect, etc.

One consequence of this design is that the kernel's receive and write queues can fill up if the application is reading too slowly, or writing too quickly. Therefore the kernel sets a maximum size for the read and write queues. This ensures that poorly behaved applications use a bounded amount of memory. For instance, the kernel might cap each of the receive and write queues at 100 KB. Then the maximum amount of kernel memory each TCP socket could use would be approximately 200 KB (as the size of the other TCP data structures is negligible compared to the size of the queues).
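These limits are visible from userspace: you can read or adjust the per-socket caps with getsockopt(2) and setsockopt(2) using the SO_RCVBUF and SO_SNDBUF options. A quick sketch (the 100 KB figure is just the example value from above; note that Linux doubles the value passed to SO_RCVBUF to leave room for bookkeeping overhead, so the value read back will be larger than what was requested):

#include <stdio.h>
#include <sys/socket.h>

/* Print and then shrink the kernel's receive-buffer cap for a socket. */
void show_and_set_rcvbuf(int sockfd)
{
    int rcvbuf = 0;
    socklen_t len = sizeof(rcvbuf);

    getsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
    printf("receive buffer cap: %d bytes\n", rcvbuf);

    int wanted = 100 * 1024;   /* e.g. cap the receive queue at ~100 KB */
    setsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &wanted, sizeof(wanted));
}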

Read Semantics

If the receive buffer is empty and the user calls read(2), the system call will block until data is available (assuming the socket is in the default blocking mode).

If the receive buffer is nonempty and the user calls read(2), the system call will immediately return with whatever data is available. A partial read can happen if the amount of data ready in the read queue is less than the size of the user-supplied buffer. The caller can detect this by checking the return value of read(2).
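Because of partial reads, code that needs an exact number of bytes has to loop, and a return value of 0 means the other end closed the connection. A typical sketch:

#include <errno.h>
#include <unistd.h>

/* Read exactly `len` bytes, looping over partial reads.
 * Returns 0 on success, -1 on error or if the peer closed early. */
int read_exactly(int fd, char *buf, size_t len)
{
    size_t total = 0;
    while (total < len) {
        ssize_t n = read(fd, buf + total, len - total);
        if (n == 0)
            return -1;              /* peer closed the connection */
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal, retry */
            return -1;              /* real error */
        }
        total += n;                 /* partial read: keep going */
    }
    return 0;
}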

If the receive buffer is full and the other end of the TCP connection tries to send additional data, the kernel will refuse to ACK the packets (the advertised receive window has shrunk to zero, so data sent beyond it is dropped). This is just regular TCP flow control.

Write Semantics

If the write queue is not full and the user calls write(2), the system call will succeed. All of the data will be copied if the write queue has sufficient space. If the write queue only has space for some of the data then a partial write will happen and only some of the data will be copied to the buffer. The caller can detect this by checking the return value of write(2).

If the write queue is full and the user calls write(2), the system call will block (or, if the socket has been marked non-blocking with O_NONBLOCK, fail immediately with EAGAIN).
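As with reads, robust code loops until the whole buffer has been copied into the write queue. A sketch:

#include <errno.h>
#include <unistd.h>

/* Write all `len` bytes, looping over partial writes.
 * Returns 0 on success, -1 on error. */
int write_all(int fd, const char *buf, size_t len)
{
    size_t total = 0;
    while (total < len) {
        ssize_t n = write(fd, buf + total, len - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal, retry */
            return -1;              /* real error (or EAGAIN on a non-blocking socket) */
        }
        total += n;                 /* partial write: keep going */
    }
    return 0;
}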

How Newly Established Connections Work

In the previous section we saw how established connections use receive and write queues to limit the amount of kernel memory allocated for each connection. A similar technique is used to limit the amount of kernel memory reserved for new connections.

From a userspace point of view, newly established TCP connections are created by calling accept(2) on a listen socket. A listen socket is one that has been designated as such using the listen(2) system call.

The prototype for accept(2) takes a socket and two output parameters that the kernel fills in with information about the other end of the connection. The value returned by accept(2) is an integer representing the file descriptor for a new, established connection:

int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen);

The prototype for listen(2) takes a socket file descriptor and a backlog parameter:

int listen(int sockfd, int backlog);

The backlog parameter controls how much memory the kernel will reserve for new connections when the user isn't calling accept(2) fast enough.
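Putting listen(2) and accept(2) together, the skeleton of a TCP server looks roughly like the following (error handling is omitted, and the port number 8080 is just an arbitrary example):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int listenfd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);              /* arbitrary example port */
    bind(listenfd, (struct sockaddr *)&addr, sizeof(addr));

    listen(listenfd, SOMAXCONN);              /* backlog: how many not-yet-accepted
                                                 connections the kernel will queue */
    for (;;) {
        struct sockaddr_in peer;
        socklen_t peerlen = sizeof(peer);
        int connfd = accept(listenfd, (struct sockaddr *)&peer, &peerlen);

        /* ... handle the connection; while we're busy here, new connections
           pile up in the listen queue ... */
        close(connfd);
    }
}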

For instance, suppose you have a blocking, single-threaded HTTP server, and each HTTP request takes about 100 ms. In this scenario the HTTP server will spend 100 ms processing each request before it is able to call accept(2) again. This means that at up to 10 requests per second there will be no queuing. If more than 10 requests per second come in, the kernel has two choices.

The first choice the kernel has is to not accept the connection at all. For instance, the kernel can just refuse to ACK an incoming SYN packet. More commonly what will happen is the kernel will complete the TCP three-way handshake, and then terminate the connection with a RST. Either way, the result is the same: no receive or write buffers need to be allocated if the connection is rejected. The argument for doing this is that if the userspace process isn't accepting connections fast enough, the correct thing to do is to fail new requests. The argument against doing this is that it's very aggressive, especially if new connections are "bursty" over time.

The second choice the kernel has is to accept the connection and allocate a socket structure for it (including receive/write buffers), and then queue the socket object for use later. The next time the user calls accept(2), instead of blocking, the system call will immediately return the already-allocated socket.

The argument for the second behavior is that it's more forgiving when the processing rate or connection rate tends to burst. For instance, in the server we just described, imagine that 10 new connections come in all at once, and then no more connections come in for the rest of the second. If the kernel queues new connections then all of the requests will be processed over the course of the second. If the kernel had been rejecting new connections then only one of the connections would have succeeded, even though the process was able to keep up with the aggregate request rate.

There are two arguments against queueing. The first is that excessive queueing can cause a lot of kernel memory to be allocated. If the kernel is allocating thousands of sockets with large receive buffers then memory usage can grow quickly, and the userspace process might not even be able to process all of those requests anyway. The other argument against queueing is that it makes the application appear slow to the other side of the connection, the client. The client will see that it can establish new TCP connections, but when it tries to use them it will appear that the server is very slow to respond. The argument is that in this situation it would be better to just fail the new connections, since that provides more obvious feedback that the server is not healthy. Additionally, if the server is aggressively failing new connections the client can know to back off; this is another form of congestion control.

Listen Queues & Overflows

As you might suspect, the kernel actually combines these two approaches. The kernel will queue new connections, but only a certain number of them. The number of connections the kernel will queue is controlled by the backlog parameter to listen(2). Typically this is set to a relatively small value. On Linux, the socket.h header sets the value of SOMAXCONN to 128, and before kernel 2.4.25 this was the maximum value allowed. Nowadays the maximum value is specified in /proc/sys/net/core/somaxconn, but commonly you'll find programs using SOMAXCONN (or a smaller hard-coded value) anyway.

When the listen queue fills up, new connections will be rejected. This is called a listen queue overflow. You can observe when this is happening by reading /proc/net/netstat and checking the value of ListenOverflows. This is a global counter for the whole kernel. As far as I know, you can't get listen overflow stats per listen socket.
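The file is slightly awkward to read because the counter names and their values live on separate lines. Here's a small sketch of pulling out ListenOverflows programmatically (it assumes the usual format of a "TcpExt:" header line followed by a "TcpExt:" value line):

#include <stdio.h>
#include <string.h>

/* Print the global ListenOverflows counter from /proc/net/netstat.
 * The file stores a "TcpExt:" line of counter names followed by a second
 * "TcpExt:" line of values; this sketch matches them up by position. */
int main(void)
{
    FILE *f = fopen("/proc/net/netstat", "r");
    if (!f)
        return 1;

    char names[4096], values[4096];
    while (fgets(names, sizeof(names), f) && fgets(values, sizeof(values), f)) {
        if (strncmp(names, "TcpExt:", 7) != 0)
            continue;

        char *save_n, *save_v;
        char *n = strtok_r(names, " \n", &save_n);
        char *v = strtok_r(values, " \n", &save_v);
        while (n && v) {
            if (strcmp(n, "ListenOverflows") == 0)
                printf("ListenOverflows: %s\n", v);
            n = strtok_r(NULL, " \n", &save_n);
            v = strtok_r(NULL, " \n", &save_v);
        }
    }
    fclose(f);
    return 0;
}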

Monitoring for listen overflows is important when writing network servers, because listen overflows don't trigger any user-visible behavior from the server's perspective. The server will happily accept(2) connections all day without returning any indication that connections are being dropped. For example, suppose you are using Nginx as a proxy in front of a Python application. If the Python application is too slow then it can cause the Nginx listen socket to overflow. When this happens you won't see any indication of this in the Nginx logs---you'll keep seeing 200 status codes and so forth as usual. Thus if you're just monitoring the HTTP status codes for your application you'll fail to see that TCP errors are preventing requests from being forwarded to the application.