In this post I want to explain exactly what happens when you use nonblocking I/O. In particular, I want to explain:
- The semantics of setting `O_NONBLOCK` on a file descriptor using `fcntl`
- How nonblocking I/O is different from asynchronous I/O
- Why nonblocking I/O is frequently used in conjunction with I/O multiplexers like `select`, `epoll`, and `kqueue`
- How nonblocking mode interacts with edge-triggered polling with `epoll`
Blocking Mode
By default, all file descriptors on Unix systems start out in "blocking mode".
That means that I/O system calls like `read`, `write`, or `connect` can block.
A really easy way to understand this is to think about what happens when you
read data on stdin from a regular TTY-based program. If you call `read` on
stdin then your program will block until data is actually available, such as
when the user actually physically types characters on their keyboard.
Specifically, the kernel will put the process into the "sleeping" state until
data is available on stdin. This is also the case for other types of file
descriptors. For instance, if you try to read from a TCP socket then the
`read` call will block until the other side of the connection actually sends
data.
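As a tiny illustration, here is what a blocking read from stdin looks like (a
minimal sketch; the buffer size is arbitrary):

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[128];
    /* Blocks here until the user types a line (or stdin hits EOF). */
    ssize_t nbytes = read(STDIN_FILENO, buf, sizeof(buf));
    printf("read returned %zd bytes\n", nbytes);
    return 0;
}
```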
Blocking is a problem for programs that should operate concurrently, since blocked processes are suspended. There are two different, complementary ways to solve this problem. They are:
- Nonblocking mode
- I/O multiplexing system calls, such as `select` and `epoll`
These two solutions are independent strategies for solving this problem, but in practice they are often used together. In a moment we'll see the difference between them and why they're commonly combined.
Nonblocking Mode (O_NONBLOCK)
A file descriptor is put into "nonblocking mode" by adding `O_NONBLOCK` to the
set of `fcntl` flags on the file descriptor:

```c
/* set O_NONBLOCK on fd */
int flags = fcntl(fd, F_GETFL, 0);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);
```
From this point forward the file descriptor is considered nonblocking. When
this happens, I/O system calls like `read` and `write` that would block will
return -1, and `errno` will be set to `EWOULDBLOCK`.
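For example, a single nonblocking `read` might be handled like this (a sketch
only, reusing the hypothetical `handle_data` helper from the loop below; note
that POSIX permits `EAGAIN` and `EWOULDBLOCK` to be distinct values, although
on Linux they are the same, so portable code usually checks both):

```c
ssize_t nbytes = read(fd, buf, sizeof(buf));
if (nbytes < 0) {
    if (errno == EWOULDBLOCK || errno == EAGAIN) {
        /* no data available right now; try again later */
    } else {
        perror("read"); /* a real error */
    }
} else {
    /* nbytes is 0 at EOF, otherwise the number of bytes read */
    handle_data(buf, nbytes);
}
```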
This is interesting, but on its own is actually not that useful. With just this primitive there's no efficient way to do I/O on multiple file descriptors. For instance, suppose we have two file descriptors and want to read both of them at once. This could be accomplished by having a loop that checks each file descriptor for data, and then sleeps momentarily before checking again:
```c
struct timespec sleep_interval = {.tv_sec = 0, .tv_nsec = 1000};
ssize_t nbytes;
for (;;) {
    /* try fd1 */
    if ((nbytes = read(fd1, buf, sizeof(buf))) < 0) {
        if (errno != EWOULDBLOCK) {
            perror("read/fd1");
        }
    } else {
        handle_data(buf, nbytes);
    }
    /* try fd2 */
    if ((nbytes = read(fd2, buf, sizeof(buf))) < 0) {
        if (errno != EWOULDBLOCK) {
            perror("read/fd2");
        }
    } else {
        handle_data(buf, nbytes);
    }
    /* sleep for a bit; real version needs error checking! */
    nanosleep(&sleep_interval, NULL);
}
```
This works, but has a lot of drawbacks:
- When data is coming in very slowly the program will wake up frequently and unnecessarily, which wastes CPU resources.
- When data does come in, the program may not read it immediately if it's sleeping, so the latency of the program will be poor.
- Handling a large number of file descriptors with this pattern would become cumbersome.
To fix these problems we need an I/O multiplexer.
I/O Multiplexing (select, epoll, kqueue, etc.)
There are a few I/O multiplexing system calls. Examples include `select`
(defined by POSIX), the `epoll` family on Linux, and the `kqueue` family on
BSD. These all work fundamentally the same way: they let the kernel know what
events (typically read events and write events) are of interest on a set of
file descriptors, and then they block until something of interest happens. For
instance, you might tell the kernel you are interested in just read events on
file descriptor X, both read and write events on file descriptor Y, and just
write events on file descriptor Z.
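To make the interest list concrete, here is a minimal sketch using the Linux
`epoll` API (the function name and the three descriptors are hypothetical, and
error handling is omitted):

```c
#include <sys/epoll.h>

/* Sketch: ask for read events on fd_x, read and write events on fd_y,
 * and write events on fd_z, then block until something is ready. */
int wait_for_events(int fd_x, int fd_y, int fd_z) {
    int epfd = epoll_create1(0);
    struct epoll_event ev;

    ev.events = EPOLLIN;            /* read events only */
    ev.data.fd = fd_x;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd_x, &ev);

    ev.events = EPOLLIN | EPOLLOUT; /* read and write events */
    ev.data.fd = fd_y;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd_y, &ev);

    ev.events = EPOLLOUT;           /* write events only */
    ev.data.fd = fd_z;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd_z, &ev);

    /* Blocks until at least one registered event is ready. */
    struct epoll_event ready[8];
    return epoll_wait(epfd, ready, 8, -1);
}
```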
These I/O multiplexing system calls typically do not care if the file
descriptors are in blocking mode or nonblocking mode. You can leave all of
your file descriptors in blocking mode and they'll work just fine with
`select` or `epoll`. If you only call `read` and `write` on file descriptors
returned by `select` or `epoll`, the calls won't block, even if those file
descriptors are in blocking mode. There's one important exception! The
blocking or nonblocking status of a file descriptor is significant for
edge-triggered polling, as explained further below.
The multiplexing approach to concurrency is what I call "asynchronous I/O". Sometimes people will call this same approach "nonblocking I/O", which I believe comes from a confusion about what "nonblocking" means at the systems programming level. I suggest reserving the term "nonblocking" for describing whether a file descriptor is actually in nonblocking mode.
How O_NONBLOCK Interacts With I/O Multiplexing
Let's say we're writing a simple socket server using `select` with blocking
file descriptors. For simplicity, in this example we just have file
descriptors we want to read from, which are in `read_fds`. The core part of
the event loop will call `select` and then invoke `read` once for each file
descriptor with data:
```c
ssize_t nbytes;
for (;;) {
    /* select call happens here; a real version must also re-initialize
       read_fds before each call, since select modifies it in place */
    if (select(FD_SETSIZE, &read_fds, NULL, NULL, NULL) < 0) {
        perror("select");
        exit(EXIT_FAILURE);
    }
    for (int i = 0; i < FD_SETSIZE; i++) {
        if (FD_ISSET(i, &read_fds)) {
            /* read call happens here */
            if ((nbytes = read(i, buf, sizeof(buf))) >= 0) {
                handle_read(nbytes, buf);
            } else {
                /* real version needs to handle EINTR correctly */
                perror("read");
                exit(EXIT_FAILURE);
            }
        }
    }
}
```
This works and it's perfectly fine. But what happens if `buf` is small, and a
lot of data comes down the line? To be concrete, suppose that `buf` is a
1024-byte buffer but 64KB of data comes in all at once. To handle this request
we'll invoke `select` followed by `read` 64 times. That's 128 total system
calls, which is a lot.
If the buffer size is too small then `read` will have to be called a lot of
times; there's no avoiding that. But perhaps we could reduce the number of
times we call `select`? Ideally in this example we would call `select` only
one time.

In fact, this is possible, and it's accomplished by putting the file
descriptors into nonblocking mode. The basic idea is that you keep calling
`read` in a loop until it returns `EWOULDBLOCK`. That looks like this:
```c
ssize_t nbytes;
for (;;) {
    /* select call happens here */
    if (select(FD_SETSIZE, &read_fds, NULL, NULL, NULL) < 0) {
        perror("select");
        exit(EXIT_FAILURE);
    }
    for (int i = 0; i < FD_SETSIZE; i++) {
        if (FD_ISSET(i, &read_fds)) {
            /* NEW: loop until EWOULDBLOCK is encountered */
            for (;;) {
                /* read call happens here */
                nbytes = read(i, buf, sizeof(buf));
                if (nbytes > 0) {
                    handle_read(nbytes, buf);
                } else if (nbytes == 0) {
                    /* EOF: stop reading this file descriptor */
                    break;
                } else {
                    if (errno != EWOULDBLOCK) {
                        /* real version needs to handle EINTR correctly */
                        perror("read");
                        exit(EXIT_FAILURE);
                    }
                    break;
                }
            }
        }
    }
}
```
In this example (1024-byte buffer with 64KB of data incoming) we'll do 66
system calls: `select` will be called one time, `read` will be called without
error 64 times, and `read` will be called and return `EWOULDBLOCK` one time.
This is much better: it's nearly half the number of system calls from the
previous example, which will improve performance and scalability considerably.
The downside of this approach is that due to the new loop, there's at least
one extra `read` that happens, since it's called until it returns
`EWOULDBLOCK`. Let's say that typically the read buffer is large enough to
read all of the incoming data in one `read` call. Then in the usual case
through the loop there will be three system calls rather than just two:
`select` to wait for the data, `read` to actually read the data, and then
`read` again to get `EWOULDBLOCK`.
Edge-Triggered Polling
There's one more important use of nonblocking I/O: with edge-triggered polling
in the `epoll` system call. This system call has two modes: level-triggered
polling, and edge-triggered polling. Level-triggered polling is a simpler
programming model that's similar to the classic `select` system call. To
explain the difference we need to understand how `epoll` works within the
kernel.
Suppose you tell the kernel you're interested in using `epoll` to monitor read
events on some file descriptor. The kernel maintains a list of these interests
for each file descriptor. When data comes in on the file descriptor the kernel
traverses the interest list and wakes up each process that was blocked in
`epoll_wait` with that file descriptor in the event list.
What I outlined above happens regardless of what triggering mode `epoll` is
in. The difference between level-triggered and edge-triggered polling is what
happens in the kernel when you call `epoll_wait`. In level-triggered mode the
kernel will traverse each file descriptor in the interest list to see if it
already matches the interest condition. For instance, if you registered a read
event on file descriptor 8, when calling `epoll_wait` the kernel will first
check: does file descriptor 8 already have data ready for reading? If any of
the file descriptors match the interest then `epoll_wait` can return without
blocking.
By contrast, in edge-triggered mode the kernel skips this check and
immediately puts the process to sleep when it calls `epoll_wait`. This puts
all of the responsibility on you, the programmer, to do the Right Thing and
fully read and write all data for each file descriptor before calling
`epoll_wait` again.
This edge-triggered mode is what makes `epoll` an O(1) I/O multiplexer: the
`epoll_wait` call will suspend immediately, and since a list is maintained for
each file descriptor ahead of time, when new data comes in the kernel
immediately knows what processes must be woken up in O(1) time.
Here's a more worked-out example of the difference between edge-triggered and
level-triggered modes. Suppose your read buffer is 100 bytes, and 200 bytes of
data comes in for that file descriptor. Suppose then that you call `read`
exactly one time and then call `epoll_wait` again. There are still 100 bytes
of data ready to read. In level-triggered mode the kernel would notice this
and notify the process that it should call `read` again. By contrast, in
edge-triggered mode the kernel would immediately put the process back to
sleep. If the other side is expecting a response (e.g. the data it sent is
some kind of RPC) then the two sides will "deadlock", as the server will be
waiting for the client to send more data, but the client will be waiting for
the server to send a response.
To use edge-triggered polling you must put the file descriptors into
nonblocking mode. Then you must call `read` or `write` until they return
`EWOULDBLOCK` every time. If you fail to meet these conditions you will miss
notifications from the kernel. But there's a big upside of doing this: each
call to `epoll_wait` will be more efficient, which can be very important in
programs with extremely high levels of concurrency. If you want to learn more
about the details I strongly encourage you to read the epoll(7) man page.
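To tie the pieces together, here is a rough sketch of an edge-triggered read
loop: the descriptor is put into nonblocking mode, registered with `EPOLLET`,
and then drained until `read` reports `EWOULDBLOCK`. This is only a sketch
under the assumptions used throughout this post (`fd` is an already-open
socket and `handle_read` is the same hypothetical handler as before), with
most error handling omitted:

```c
/* put fd into nonblocking mode, then register it for edge-triggered reads */
int flags = fcntl(fd, F_GETFL, 0);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);

int epfd = epoll_create1(0);
struct epoll_event ev;
ev.events = EPOLLIN | EPOLLET; /* EPOLLET selects edge-triggered mode */
ev.data.fd = fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

char buf[1024];
for (;;) {
    struct epoll_event events[16];
    int n = epoll_wait(epfd, events, 16, -1);
    for (int i = 0; i < n; i++) {
        int ready_fd = events[i].data.fd;
        /* Drain the descriptor completely; if we stop early we may never
         * be notified about the remaining data. */
        for (;;) {
            ssize_t nbytes = read(ready_fd, buf, sizeof(buf));
            if (nbytes > 0) {
                handle_read(nbytes, buf);
            } else if (nbytes == 0) {
                break; /* EOF */
            } else if (errno == EWOULDBLOCK || errno == EAGAIN) {
                break; /* fully drained; wait for the next edge */
            } else {
                perror("read");
                break;
            }
        }
    }
}
```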
Update: I've put up a complete example of using edge-triggered epoll on GitHub: https://github.com/eklitzke/epollet