I am generally a fan of Go’s approach to concurrency: writing code with
goroutines is a lot easier than writing traditional nonblocking network servers
in a language like C or C++. However, while working on a highly concurrent
network proxy I came across an interesting realization about how the Go
concurrency model makes it harder to write programs that do a lot of concurrent
I/O with efficient memory usage.
The program in question is a network proxy akin
to HAProxy or Envoy.
Typically the proxy has a very large number of clients connected, but most of
those clients are actually idle with no outstanding network requests. Each
client connection has a read buffer and a write buffer. Therefore the naive
memory usage of such a program is at least: *#connections * (readbuf_sz +
writebuf_sz)*.
There’s a trick you can do in a C or C++ program of this nature to reduce memory
usage. Suppose that typically 5% of the client connections are actually active,
and the other 95% are idle with no pending reads or writes. In this situation
you can create a pool of buffer objects. When connections are actually active
they acquire buffers to use for reading/writing from the pool, and when the
connections are idle they release the buffers back to the pool. This reduces the
number of allocated buffers to approximately the number of buffers actually
needed by active connections. In this case using this technique will give a 20x
memory reduction, since only 5% as many buffers will be allocated compared to
the naive approach.
The reason this technique works at all is due to how nonblocking reads and
writes work in C. In C you use a system call like select(2) or epoll_wait(2)
to get a notification that a file descriptor is ready to be read/written, and
then after that you explicitly call read(2) or write(2) yourself on that
file descriptor. This gives you the opportunity to acquire a buffer after the
call to select/epoll, but before making the read call. Here’s a simplified
version of the core part of the event loop for a network proxy demonstrating
this technique:
for (;;) {
  // wait for some file descriptors to be ready to read
  int nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
  for (int n = 0; n < nfds; n++) {
    // acquire a buffer
    void *buf = acquire_buffer(pool);
    // read the file descriptor
    int clientfd = events[n].data.fd;
    ssize_t nbytes = read(clientfd, buf, BUFSIZE);
    if (nbytes < 0) {
      // error case
      handle_error(clientfd);
    } else {
      // process the request
      process_read(clientfd, buf, nbytes);
    }
    // return the buffer to the pool
    release_buffer(pool, buf);
  }
}
In a synchronous I/O model, such as in a Go program (or a C/C++ program doing
blocking I/O with threads) there’s no opportunity to get a buffer before doing
the read. Instead you supply a buffer to your read call, your goroutine (or
thread) blocks until data is ready, and then execution resumes:
// code within the read loop for each connection/goroutine
nbytes, err := conn.Read(buf)
This means that a buffer must be allocated for every connection on
which a read may occur, since the buffer is supplied to the read call
itself. Unlike in C, in Go
there is no way to know if a connection is readable, other than to actually try
to read data from it. This means that at the minimum a Go proxy with a large
number of mostly idle connections will churn through a large amount of virtual
memory space, and likely incur a large RSS memory footprint over time as well.
This isn’t the only place where Go’s high-level approach to networking
reduces efficiency. For instance, Go lacks a way to
do vectorized network I/O.
If you have ideas on how to solve these problems let me know via email or
Twitter and I will update this post.
Update 1: According
to this GitHub issue it appears
that writev(2) support will appear in Go 1.8, and readv(2) support may
appear one day (but not in 1.8).
Update 2: It’s been pointed out to me that a trick for working around this
issue in Go is to allocate a one-byte buffer per connection, and then switch to
a larger buffer (possibly from a pool) after the one-byte buffer fills up. This
is a really interesting workaround: it does solve the memory issue, but there’s
a small penalty due to making extra read calls. There’s also at least
one
proposal to expose the low-level runtime event loop, which
would make it possible to write more traditional event loop code (although this
seems to run counter to keeping things as simple as possible).