Blocking I/O, Nonblocking I/O, And Epoll

January 10, 2017

In this post I want to explain exactly what happens when you use nonblocking I/O. In particular, I want to explain:

The semantics of setting O_NONBLOCK on a file descriptor using fcntl
How nonblocking I/O is different from asynchronous I/O
Why nonblocking I/O is frequently used in conjunction with I/O multiplexers like select, epoll, and kqueue
How nonblocking mode interacts with edge-triggered polling with epoll

Blocking Mode

By default, all file descriptors on Unix systems start out in "blocking mode". That means that I/O system calls like read, write, or connect can block. A really easy way to understand this is to think about what happens when you read data on stdin from a regular TTY-based program. If you call read on stdin then your program will block until data is actually available, such as when the user actually physically types characters on their keyboard. Specifically, the kernel will put the process into the "sleeping" state until data is available on stdin. This is also the case for other types of file descriptors. For instance, if you try to read from a TCP socket then the read call will block until the other side of the connection actually sends data.

Blocking is a problem for programs that should operate concurrently, since blocked processes are suspended. There are two different, complementary ways to solve this problem. They are:

Nonblocking mode
I/O multiplexing system calls, such as select and epoll

These two solutions are often used together, but they are independent strategies to solving this problem, and often both are used. In a moment we'll see the difference and why they're commonly both used.

Nonblocking Mode (O_NONBLOCK)

A file descriptor is put into "nonblocking mode" by adding O_NONBLOCK to the set of fcntl flags on the file descriptor:

/* set O_NONBLOCK on fd */
int flags = fcntl(fd, F_GETFL, 0);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);

From this point forward the file descriptor is considered nonblocking. When this happens I/O system calls like read and write that would block will return -1, and errno will be set to EWOULDBLOCK.

This is interesting, but on its own is actually not that useful. With just this primitive there's no efficient way to do I/O on multiple file descriptors. For instance, suppose we have two file descriptors and want to read both of them at once. This could be accomplished by having a loop that checks each file descriptor for data, and then sleeps momentarily before checking again:

struct timespec sleep_interval{.tv_sec = 0, .tv_nsec = 1000};
ssize_t nbytes;
for (;;) {
    /* try fd1 */
    if ((nbytes = read(fd1, buf, sizeof(buf))) < 0) {
        if (errno != EWOULDBLOCK) {
            perror("read/fd1");
        }
    } else {
        handle_data(buf, nbytes);
    }

    /* try fd2 */
    if ((nbytes = read(fd2, buf, sizeof(buf))) < 0) {
        if (errno != EWOULDBLOCK) {
            perror("read/fd2");
        }
    } else {
        handle_data(buf, nbytes);
    }

    /* sleep for a bit; real version needs error checking! */
    nanosleep(sleep_interval, NULL);
}

This works, but has a lot of drawbacks:

When data is coming in very slowly the program will wake up frequently and unnecessarily, which wastes CPU resources.
When data does come in, the program may not read it immediately if it's sleeping, so the latency of the program will be poor.
Handling a large number of file descriptors with this pattern would become cumbersome.

To fix these problems we need an I/O multiplexer.

I/O Multiplexing (select, epoll, kqueue, etc.)

There's a few I/O multiplexing system calls. Examples of I/O multiplexing calls include select (defined by POSIX), the epoll family on Linux, and the kqueue family on BSD. These all work fundamentally the same way: they let the kernel know what events (typically read events and write events) are of interest on a set of file descriptors, and then they block until something of interest happens. For instance, you might tell the kernel you are interested in just read events on file descriptor X, both read and write events on file descriptor Y, and just write events on file descriptor Z.

These I/O multiplexing system calls typically do not care if the file descriptors are in blocking mode or nonblocking mode. You can leave all of your file descriptors in blocking mode and they'll work just fine with select or epoll. If you only call read and write on file descriptors returned by select or epoll the calls won't block, even if those file descriptors are in blocking mode. There's one important exception! The blocking or nonblocking status of a file descriptor is significant for edge-triggered polling, as explained further below.

The multiplexing approach to concurrency is what I call "asynchronous I/O". Sometimes people will call this same approach "nonblocking I/O", which I believe comes from a confusion about what "nonblocking" means at the systems programming level. I suggest reserving the term "nonblocking" for referring to whether or not file descriptors are actually in nonblocking mode or not.

How O_NONBLOCK Interacts With I/O Multiplexing

Let's say we're writing a simple socket server using select with blocking file descriptors. For simplicity, in this example we just have file descriptors we want to read from, which are in read_fds. The core part of the event loop will call select and then invoke read once for each file descriptor with data:

ssize_t nbytes;
for (;;) {
    /* select call happens here */
    if (select(FD_SETSIZE, &read_fds, NULL, NULL, NULL) < 0) {
        perror("select");
        exit(EXIT_FAILURE);
    }
    for (int i = 0; i < FD_SETSIZE; i++) {
        if (FD_ISSET(i, &read_fds)) {
            /* read call happens here */
            if ((nbytes = read(i, buf, sizeof(buf))) >= 0) {
                handle_read(nbytes, buf);
            } else {
                /* real version needs to handle EINTR correctly */
                perror("read");
                exit(EXIT_FAILURE);
            }
        }
    }
}

This works and it's perfectly fine. But, what happens if buf is small, and a lot of data comes down the line? To be concrete, suppose that buf is a 1024-byte buffer but 64KB of data comes in all at once. To handle this request we'll invoke select followed by read 64 times. That's 128 total system calls, which is a lot.

If the buffer size is too small read will have to be called a lot of times, there's no avoiding that. But perhaps we could reduce the number of times we call select? Ideally in this example we would call select only one time.

In fact, this is possible, and it's accomplished by putting the file descriptors into nonblocking mode. The basic idea is that you keep calling read in a loop until it returns EWOULDBLOCK. That looks like this:

ssize_t nbytes;
for (;;) {
    /* select call happens here */
    if (select(FD_SETSIZE, &read_fds, NULL, NULL, NULL) < 0) {
        perror("select");
        exit(EXIT_FAILURE);
    }
    for (int i = 0; i < FD_SETSIZE; i++) {
        if (FD_ISSET(i, &read_fds)) {
            /* NEW: loop until EWOULDBLOCK is encountered */
            for (;;) {
                /* read call happens here */
                nbytes = read(i, buf, sizeof(buf));
                if (nbytes >= 0) {
                    handle_read(nbytes, buf);
                } else {
                    if (errno != EWOULDBLOCK) {
                        /* real version needs to handle EINTR correctly */
                        perror("read");
                        exit(EXIT_FAILURE);
                    }
                    break;
                }
            }
        }
    }
}

In this example (1024-byte buffer with 64KB of data incoming) we'll do 66 system calls: select will be called one time, read will be called without error 64 times, and read will be called and return EWOULDBLOCK one time. This is much better! This is nearly half the number from the previous example, which will improve performance and scalability considerably.

The downside of this approach is that due to the new loop, there's at least one extra read that happens since it's called until it returns EWOULDBLOCK. Let's say that typically the read buffer is large enough to read all of the incoming data in one read call. Then in the usual case through the loop there will be three system calls rather than just two: select to wait for the data, read to actually read the data, and then read again to get EWOULDBLOCK.

Edge-Triggered Polling

There's one more important use of nonblocking I/O: with edge-triggered polling in the epoll system call. This system call has two modes: level-triggered polling, and edge-triggered polling. Level-triggered polling is a simpler programming model that's similar to the classic select system call. To explain the difference we need to understand how epoll works within the kernel.

Suppose you tell the kernel you're interested in using epoll to monitor read events on some file descriptor. The kernel maintains a list of these interests for each file descriptor. When data comes in on the file descriptor the kernel traverses the interests list and wakes up each process that was blocked in epoll_wait with that file descriptor in the event list.

What I outlined above happens regardless of what triggering mode epoll is in. The difference between level-triggered and edge-triggered polling is what happens in the kernel when you call epoll_wait. In level-triggered mode the kernel will traverse each file descriptor in the interest list to see if it already matches the interest condition. For instance, if you registered a read event on file descriptor 8, when calling epoll_wait the kernel will first check: does file descriptor 8 already have data ready for reading? If any of the file descriptors match the interest then epoll_wait can return without blocking.

By contrast, in edge-triggered mode the kernel skips this check and immediately puts the process to sleep when it calls epoll_wait. This puts all of the responsibility on you, the programmer, to do the Right Thing and fully read and write all data for each file descriptor before waiting on this.

This edge-triggered mode is what makes epoll an O(1) I/O multiplexer: the epoll_wait call will suspend immediately, and since a list is maintained for each file descriptor ahead of time, when new data comes in the kernel immediately knows what processes must be woken up in O(1) time.

Here's a more worked out example of the difference between edge-triggered and level-triggered modes. Suppose your read buffer is 100 bytes, and 200 bytes of data comes in for that file descriptor. Suppose then you call read exactly one time and then call epoll_wait again. There's still 100 bytes of data already ready to read. In level-triggered mode the kernel would notice this and notify the process that it should call read again. By contrast, in edge-triggered mode the kernel would immediately go to sleep. If the other side is expecting a response (e.g. the data it sent is some kind of RPC) then the two sides will "deadlock", as the server will be waiting for the client to send more data, but the client will be waiting for the server to send a response.

To use edge-triggered polling you must put the file descriptors into nonblocking mode. Then you must call read or write until they return EWOULDBLOCK every time. If you fail to meet these conditions you will miss notifications from the kernel. But there's a big upside of doing this: each call to epoll_wait will be more efficient, which can be very important on programs with extremely high levels of concurrency. If you want to learn more about the details I strongly encourage you to read the epoll(7) man page.

Update: I've put up a complete example of using edge-triggered epoll on GitHub: https://github.com/eklitzke/epollet