Blocking I/O, Nonblocking I/O, And Epoll

In this post I want to explain exactly what happens when you use nonblocking I/O. In particular, I want to explain what it means for a file descriptor to be in blocking or nonblocking mode, how nonblocking mode interacts with I/O multiplexers like select and epoll, and why nonblocking file descriptors are required for edge-triggered polling.

Blocking Mode

By default, all file descriptors on Unix systems start out in "blocking mode". That means that I/O system calls like read, write, or connect can block. A really easy way to understand this is to think about what happens when you read data on stdin from a regular TTY-based program. If you call read on stdin then your program will block until data is actually available, such as when the user actually physically types characters on their keyboard. Specifically, the kernel will put the process into the "sleeping" state until data is available on stdin. This is also the case for other types of file descriptors. For instance, if you try to read from a TCP socket then the read call will block until the other side of the connection actually sends data.

Blocking is a problem for programs that should operate concurrently, since blocked processes are suspended. There are two different, complementary ways to solve this problem: putting file descriptors into nonblocking mode, and using an I/O multiplexing system call such as select or epoll.

These are independent strategies for solving the problem, but they are commonly used together. In a moment we'll see the difference and why both are commonly used.

Nonblocking Mode (O_NONBLOCK)

A file descriptor is put into "nonblocking mode" by adding O_NONBLOCK to the set of fcntl flags on the file descriptor:

/* set O_NONBLOCK on fd */
int flags = fcntl(fd, F_GETFL, 0);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);

From this point forward the file descriptor is considered nonblocking. I/O system calls like read and write that would otherwise block will instead return -1, with errno set to EWOULDBLOCK.

This is interesting, but on its own is actually not that useful. With just this primitive there's no efficient way to do I/O on multiple file descriptors. For instance, suppose we have two file descriptors and want to read both of them at once. This could be accomplished by having a loop that checks each file descriptor for data, and then sleeps momentarily before checking again:

struct timespec sleep_interval = {.tv_sec = 0, .tv_nsec = 1000};
ssize_t nbytes;
for (;;) {
    /* try fd1 */
    if ((nbytes = read(fd1, buf, sizeof(buf))) < 0) {
        if (errno != EWOULDBLOCK) {
            perror("read/fd1");
        }
    } else {
        handle_data(buf, nbytes);
    }

    /* try fd2 */
    if ((nbytes = read(fd2, buf, sizeof(buf))) < 0) {
        if (errno != EWOULDBLOCK) {
            perror("read/fd2");
        }
    } else {
        handle_data(buf, nbytes);
    }

    /* sleep for a bit; real version needs error checking! */
    nanosleep(&sleep_interval, NULL);
}

This works, but has a lot of drawbacks: the process wakes up constantly even when there's no data to read, which wastes CPU; data that arrives just after a check sits unread until the next iteration, which adds latency; and the approach scales poorly as the number of file descriptors grows.

To fix these problems we need an I/O multiplexer.

I/O Multiplexing (select, epoll, kqueue, etc.)

There are a few I/O multiplexing system calls: select (defined by POSIX), the epoll family on Linux, and the kqueue family on BSD. These all work fundamentally the same way: they let the kernel know what events (typically read events and write events) are of interest on a set of file descriptors, and then they block until something of interest happens. For instance, you might tell the kernel you are interested in just read events on file descriptor X, both read and write events on file descriptor Y, and just write events on file descriptor Z.

These I/O multiplexing system calls typically do not care if the file descriptors are in blocking mode or nonblocking mode. You can leave all of your file descriptors in blocking mode and they'll work just fine with select or epoll. If you only call read and write on file descriptors returned by select or epoll the calls won't block, even if those file descriptors are in blocking mode. There's one important exception! The blocking or nonblocking status of a file descriptor is significant for edge-triggered polling, as explained further below.

How O_NONBLOCK Interacts With I/O Multiplexing

Let's say we're writing a simple socket server using select with blocking file descriptors. For simplicity, in this example we just have file descriptors we want to read from, which are in read_fds. The core part of the event loop will call select and then invoke read once for each file descriptor with data:

ssize_t nbytes;
for (;;) {
    /* select call happens here */
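    /* NB: a real version must re-initialize read_fds each iteration, since select modifies it in place */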
    if (select(FD_SETSIZE, &read_fds, NULL, NULL, NULL) < 0) {
        perror("select");
        exit(EXIT_FAILURE);
    }
    for (int i = 0; i < FD_SETSIZE; i++) {
        if (FD_ISSET(i, &read_fds)) {
            /* read call happens here */
            if ((nbytes = read(i, buf, sizeof(buf))) >= 0) {
                handle_read(nbytes, buf);
            } else {
                /* real version needs to handle EINTR correctly */
                perror("read");
                exit(EXIT_FAILURE);
            }
        }
    }
}

This works and it's perfectly fine. But, what happens if buf is small, and a lot of data comes down the line? To be concrete, suppose that buf is a 1024-byte buffer but 64KB of data comes in all at once. To handle this request we'll invoke select followed by read 64 times. That's 128 total system calls, which is a lot.

If the buffer size is too small read will have to be called a lot of times, there's no avoiding that. But perhaps we could reduce the number of times we call select? Ideally in this example we would call select only one time.

In fact, this is possible, and it's accomplished by putting the file descriptors into nonblocking mode. The basic idea is that you keep calling read in a loop until it returns EWOULDBLOCK. That looks like this:

ssize_t nbytes;
for (;;) {
    /* select call happens here */
    if (select(FD_SETSIZE, &read_fds, NULL, NULL, NULL) < 0) {
        perror("select");
        exit(EXIT_FAILURE);
    }
    for (int i = 0; i < FD_SETSIZE; i++) {
        if (FD_ISSET(i, &read_fds)) {
            /* NEW: loop until EWOULDBLOCK is encountered */
            for (;;) {
                /* read call happens here */
                nbytes = read(i, buf, sizeof(buf));
                if (nbytes > 0) {
                    handle_read(nbytes, buf);
                } else if (nbytes == 0) {
                    /* EOF: nothing more to read on this fd */
                    break;
                } else {
                    if (errno != EWOULDBLOCK) {
                        /* real version needs to handle EINTR correctly */
                        perror("read");
                        exit(EXIT_FAILURE);
                    }
                    break;
                }
            }
        }
    }
}

In this example (1024-byte buffer with 64KB of data incoming) we'll do 66 system calls: select will be called one time, read will be called without error 64 times, and read will be called and return EWOULDBLOCK one time. That's nearly half the number of system calls from the previous example, which will improve performance and scalability considerably.

The downside of this approach is that due to the new loop, there's at least one extra read that happens since it's called until it returns EWOULDBLOCK. Let's say that typically the read buffer is large enough to read all of the incoming data in one read call. Then in the usual case through the loop there will be three system calls rather than just two: select to wait for the data, read to actually read the data, and then read again to get EWOULDBLOCK.

Edge-Triggered Polling

There's one more important use of nonblocking I/O: edge-triggered polling with the epoll family of system calls. Epoll has two modes: level-triggered polling and edge-triggered polling. Level-triggered polling is a simpler programming model that's similar to the classic select system call. To explain the difference we need to understand how epoll works within the kernel.

Suppose you tell the kernel you're interested in using epoll to monitor read events on some file descriptor. The kernel maintains a list of these interests for each file descriptor. When data comes in on the file descriptor the kernel traverses the interests list and wakes up each process that was blocked in epoll_wait with that file descriptor in the event list.

What I outlined above happens regardless of what triggering mode epoll is in. The difference between level-triggered and edge-triggered polling is what happens in the kernel when you call epoll_wait. In level-triggered mode the kernel will traverse each file descriptor in the interest list to see if it already matches the interest condition. For instance, if you registered a read event on file descriptor 8, when calling epoll_wait the kernel will first check: does file descriptor 8 already have data ready for reading? If any of the file descriptors match the interest then epoll_wait can return without blocking.

By contrast, in edge-triggered mode the kernel skips this check and immediately puts the process to sleep when it calls epoll_wait. This puts all of the responsibility on you, the programmer, to do the Right Thing and fully read and write all data for each file descriptor before calling epoll_wait again.

This edge-triggered mode is what makes epoll an O(1) I/O multiplexer: the epoll_wait call will suspend immediately, and since a list is maintained for each file descriptor ahead of time, when new data comes in the kernel immediately knows what processes must be woken up in O(1) time.

Here's a more worked out example of the difference between edge-triggered and level-triggered modes. Suppose your read buffer is 100 bytes, and 200 bytes of data comes in for that file descriptor. Suppose then you call read exactly one time and then call epoll_wait again. There's still 100 bytes of data already ready to read. In level-triggered mode the kernel would notice this and notify the process that it should call read again. By contrast, in edge-triggered mode the kernel would immediately go to sleep. If the other side is expecting a response (e.g. the data it sent is some kind of RPC) then the two sides will "deadlock", as the server will be waiting for the client to send more data, but the client will be waiting for the server to send a response.

To use edge-triggered polling you must put the file descriptors into nonblocking mode. Then you must call read or write until they return EWOULDBLOCK every time. If you fail to meet these conditions you will miss notifications from the kernel. But there's a big upside of doing this: each call to epoll_wait will be more efficient, which can be very important on programs with extremely high levels of concurrency. If you want to learn more about the details I strongly encourage you to read the epoll(7) man page.
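For reference, here's a minimal sketch of what the registration side of that looks like. Error handling is omitted, and drain_fd is a hypothetical helper that loops on read until it gets EWOULDBLOCK, as described above:

/* sketch: register fd for edge-triggered read events (error handling omitted) */
int epfd = epoll_create1(0);

/* the fd must be nonblocking for edge-triggered mode to work correctly */
int flags = fcntl(fd, F_GETFL, 0);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);

struct epoll_event ev;
ev.events = EPOLLIN | EPOLLET;  /* EPOLLET selects edge-triggered mode */
ev.data.fd = fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

struct epoll_event events[64];
for (;;) {
    int n = epoll_wait(epfd, events, 64, -1);
    for (int i = 0; i < n; i++) {
        /* must read until EWOULDBLOCK, or we may never be notified again */
        drain_fd(events[i].data.fd);
    }
}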

Update: I've put up a complete example of using edge-triggered epoll on GitHub: https://github.com/eklitzke/epollet


Goroutines, Nonblocking I/O, And Memory Usage

I am generally a fan of Go's approach to concurrency: writing code with goroutines is a lot easier than writing traditional nonblocking network servers in a language like C or C++. However, while working on a highly concurrent network proxy I came across an interesting realization about how the Go concurrency model makes it harder to write programs that do a lot of concurrent I/O with efficient memory usage.

The program in question is a network proxy akin to HAProxy or Envoy. Typically the proxy has a very large number of clients connected, but most of those clients are actually idle with no outstanding network requests. Each client connection has a read buffer and a write buffer. Therefore the naive memory usage of such a program is at least: #connections * (readbuf_sz + writebuf_sz).

There's a trick you can do in a C or C++ program of this nature to reduce memory usage. Suppose that typically 5% of the client connections are actually active, and the other 95% are idle with no pending reads or writes. In this situation you can create a pool of buffer objects. When connections are actually active they acquire buffers to use for reading/writing from the pool, and when the connections are idle they release the buffers back to the pool. This reduces the number of allocated buffers to approximately the number of buffers actually needed by active connections. In this case using this technique will give a 20x memory reduction, since only 5% as many buffers will be allocated compared to the naive approach.
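Here's a minimal sketch of what such a pool might look like in C, assuming fixed-size buffers and a single thread; acquire_buffer and release_buffer are the same hypothetical helpers used in the event loop below:

#include <stddef.h>
#include <stdlib.h>

#define BUFSIZE 4096

struct buffer {
    struct buffer *next;
    char data[BUFSIZE];
};

struct pool {
    struct buffer *free_list;  /* singly linked list of idle buffers */
};

void *acquire_buffer(struct pool *p) {
    struct buffer *b = p->free_list;
    if (b != NULL) {
        p->free_list = b->next;
    } else {
        b = malloc(sizeof(*b));  /* grow the pool on demand */
        if (b == NULL) {
            return NULL;  /* caller must handle allocation failure */
        }
    }
    return b->data;
}

void release_buffer(struct pool *p, void *data) {
    /* recover the struct buffer pointer from the data pointer */
    struct buffer *b = (struct buffer *)((char *)data - offsetof(struct buffer, data));
    b->next = p->free_list;
    p->free_list = b;
}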

The reason this technique works at all is due to how nonblocking reads and writes work in C. In C you use a system call like select(2) or epoll_wait(2) to get a notification that a file descriptor is ready to be read/written, and then after that you explicitly call read(2) or write(2) yourself on that file descriptor. This gives you the opportunity to acquire a buffer after the call to select/epoll, but before making the read call. Here's a simplified version of the core part of the event loop for a network proxy demonstrating this technique:

for (;;) {
    // wait for some file descriptors to be ready to read
    int nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);

    for (int n = 0; n < nfds; n++) {
        // acquire a buffer
        void *buf = acquire_buffer(pool);

        // read the file descriptor
        int clientfd = events[n].data.fd;
        ssize_t nbytes = read(clientfd, buf, BUFSIZE);
        if (nbytes < 0) {
            // error case
            handle_error(epollfd, clientfd);
        } else {
            // process the request
            process_read(clientfd, buf, nbytes);
        }

        // return the buffer to the pool
        release_buffer(pool, buf);
     }
}

In a synchronous I/O model, such as in a Go program (or a C/C++ program doing blocking I/O with threads) there's no opportunity to get a buffer before doing the read. Instead you supply a buffer to your read call, your goroutine (or thread) blocks until data is ready, and then execution resumes:

// code within the read loop for each connection/goroutine
nbytes, err := conn.Read(buf)

This means that a buffer must be allocated for every connection that reads may occur on, since the buffer is supplied to the read call itself. Unlike with C, in Go there is no way to know if a connection is readable other than to actually try to read data from it. This means that at a minimum a Go proxy with a large number of mostly idle connections will churn through a large amount of virtual memory space, and will likely incur a large RSS memory footprint over time as well.

This isn't the only place where Go's high-level approach to networking reduces efficiency. For instance, Go lacks a way to do vectorized network I/O. If you have ideas on how to solve these problems let me know via email or Twitter and I will update this post.

Update 1: According to this GitHub issue it appears that writev(2) support will appear in Go 1.8, and readv(2) support may appear one day (but not in 1.8).

Update 2: It's been pointed out to me that a trick for working around this issue in Go is to allocate a one-byte buffer per connection, and then switch to a larger buffer (possibly from a pool) after the one-byte buffer fills up. This is a really interesting workaround: it does solve the memory issue, but there's a small penalty due to making extra read calls. There's also at least one proposal to expose the low-level runtime event loop, which would make it possible to write more traditional event loop code (although this seems to run against Go's goal of keeping things as simple as possible).


Hanging A Process With Select

In my last article I talked about how to stop a Unix process. There's another technique that you can use if you control the source code for the program you're trying to pause.

The prototype for select(2) is like this:

int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);

If you set nfds to zero and supply NULL for all three fd sets then select() will sleep for the amount of time in timeout. If you set timeout to NULL then the call to select() will block forever, like so:

// C: put the process in blocking-wait forever
select(0, NULL, NULL, NULL, NULL);

You can use this technique in other languages too. For instance, in Python you get:

# Python: put the process in blocking-wait forever
import select
select.select([], [], [])

I prefer this to using SIGSTOP when I actually control the source code for the program, since it makes it easier to control exactly where the program pauses. I've frequently found this trick useful when writing programs that create temporary files. By forcing the process to pause itself while the temporary file exists and before it's deleted, it's easier to poke around in another terminal to look at the state of things to figure out what's going on.

One thing to be aware of with this technique is that only the thread that's calling select(2) will be blocked. If you want to stop all threads I would recommend using a debugger like GDB, or sending SIGSTOP to the process group leader.


Stopping A Process

From time to time you may want to pause a process' execution. For an interactive program started on a terminal you can typically do this by typing ^Z (Ctrl-Z) to temporarily suspend the process. But what if the process isn't running in the foreground? For instance, what if you want to stop a daemon process?

There's a special Unix signal called SIGSTOP that can be used for this purpose. SIGSTOP can't be caught, so when this signal is sent the kernel doesn't run any handler in the process; it simply suspends the process. The process will stay in the suspended state until SIGCONT is delivered to it.

From the command line, you can send these signals like this (substituting an actual PID):

kill -STOP PID  # stop the process
kill -CONT PID  # resume the process

One situation where I've found this to be useful is testing the behavior of programs when they interact with another program that hangs. For instance, let's say you're writing a program that interacts with a database, and you want to know how the program will behave if the database becomes unresponsive. You can set up your program to connect to the database and start doing work, and then deliver SIGSTOP to the database process. When this happens all of the existing connections the application has to the database will still be alive (including any TCP sockets), but the database won't respond. This is a reasonable facsimile of what happens when the database is under very high load.

Trivia 1: You might be wondering if SIGSTOP is how ^Z works. Actually, on Linux ^Z sends a related (but distinct) signal called SIGTSTP. This also stops the process, but differs from SIGSTOP in that it can be caught. So you can write a program that handles and ignores SIGTSTP, but you can't write one that ignores SIGSTOP.

Trivia 2: There are only two signals that cannot be caught: SIGSTOP is one, and SIGKILL is the other.
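To see the difference yourself, here's a minimal C sketch that installs a handler for SIGTSTP; trying to do the same for SIGSTOP will fail:

#include <signal.h>
#include <unistd.h>

static void on_tstp(int sig) {
    (void)sig;
    /* only async-signal-safe calls belong in a handler */
    const char msg[] = "caught SIGTSTP, not stopping\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
}

int main(void) {
    signal(SIGTSTP, on_tstp);  /* this works */
    /* signal(SIGSTOP, on_tstp) would return SIG_ERR: SIGSTOP can't be caught */
    for (;;) {
        pause();  /* wait for signals */
    }
}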


Stdout Buffering

Most programming languages offer buffered I/O features by default, since buffering makes generating output much more efficient. These buffered I/O facilities typically "Just Work" out of the box. But sometimes they don't. When we say they "don't work" what we mean is that excess buffering occurs, causing data not to be printed in a timely manner. This is typically fixed by explicitly putting a "flush" call in the code, e.g. with something like sys.stdout.flush() in Python, fflush(3) in C, or std::flush in C++.

Frequently when people are confused about the rules of buffering their code becomes littered with unnecessary flush statements, an example of cargo-cult programming. In this post I'll explain the buffering rules for stdout, so you'll never be confused again.

Why Buffering Exists

As already discussed, the problem with buffering is that it can cause output to be delayed. So why does it exist at all?

At the underlying system call level, data is written to file descriptors using write(2). This method takes a file descriptor and a byte buffer, and writes the data in the byte buffer to the file descriptor.

Most languages have very fast function calls. The overhead for a function call in a compiled language like C or C++ is just a few CPU cycles. In these languages it's common to think of function call overhead as negligible, and only in extreme cases are functions marked as inline. However, a system call is much more expensive. A system call on Linux takes closer to a thousand CPU cycles and implies a context switch. Thus system calls are significantly more expensive than regular userspace function calls.

The main reason why buffering exists is to amortize the cost of these system calls. This is primarily important when the program is doing a lot of these write calls, as the amortization is only effective when the system call overhead is a significant percentage of the program's time.

Let's consider what happens when you use grep to search for a pattern in an input file (or stdin). Suppose you're grepping nginx logs for a pattern—say lines from a particular IP address. A typical line length in these nginx logs might be 100 characters. That means that if buffering wasn't used, for each matching line in the input file that grep needs to print, it would invoke the write(2) system call. This would happen over and over again, and each time the average buffer size would be 100 bytes. If, instead, a 4096-byte buffer size is used then data won't be flushed until the 4096-byte buffer fills up. This means that in this mode the grep command would wait until it had about 40 lines of input before the byte buffer filled up. Then it would flush the buffer by invoking write(2) with a pointer to the 4096-byte buffer. This effectively transforms forty system calls into one, yielding a 40x decrease in system call overhead. Not bad!

If the grep command is sending a lot of data to stdout you won't even notice the buffering delay. And a grep command matching a simple pattern can easily spend more time trying to print data than actually filtering the input data. But suppose instead the grep pattern occurs very infrequently. Suppose it's so uncommon that a matching input line is only found once every 10 seconds. Then we'd have to wait about 400 seconds (more than six minutes!) before seeing any output, even though grep actually found data within the first ten seconds.

Buffering Anti-Patterns

This buffering can be especially insidious in certain shell pipelines. For instance, suppose we want to print the first matching line in a log file. The invocation might be:

# BAD: grep will buffer output before sending it to head
grep RAREPATTERN /var/log/mylog.txt | head -n 1

Going with the previous example, we would like this command to complete within ten seconds, since that's the average amount of time it will take grep to find the input pattern in this file. But if buffering is enabled then the pipeline will instead take many minutes to run. In other words, in this example buffering makes the program strictly slower, not faster!

Even in cases where the output isn't being limited by a command like head(1), if output is very infrequent then buffering can be extremely annoying and provide essentially zero performance improvement.

When Programs Buffer, And When They Don't

There are typically three modes for buffering: unbuffered, where output is written immediately; line-buffered, where output is flushed each time a newline is written; and fully-buffered, where output is flushed only when the buffer fills up.

GNU libc (glibc) uses the following rules for buffering:

Stream               Type     Behavior
stdin                input    line-buffered
stdout (TTY)         output   line-buffered
stdout (not a TTY)   output   fully-buffered
stderr               output   unbuffered

As you can see, the behavior for stdout is a bit unusual: the exact behavior for stdout depends on whether or not it appears to be a TTY. The rationale here is that when stdout is a TTY it means a user is likely watching the command run and waiting for output, and therefore printing data in a timely manner is most important. On the other hand, if the output isn't a TTY the assumption is that the data is being processed or saved for later use, and therefore efficiency is more important.
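The check glibc makes is the same one you can make yourself with isatty(3). Here's a minimal sketch, printing the diagnosis to stderr so the message itself isn't subject to stdout buffering:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    if (isatty(STDOUT_FILENO)) {
        fprintf(stderr, "stdout is a TTY: line-buffered\n");
    } else {
        fprintf(stderr, "stdout is not a TTY: fully-buffered\n");
    }
    return 0;
}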

Most other programming languages have exactly the same rules, either because they implement their output routines on top of libc's buffered functions (such as printf(3)), or because they implement the same logic themselves.

More Grep Examples

Grep is a special case for buffering because a grep command can turn a large amount of input data into a slow and small stream of output data. Therefore grep is particularly susceptible to buffering frustration. Knowing when grep will buffer data is easy: it follows the glibc buffering rules described above.

If the output of grep is a TTY then it will be line-buffered. If the output of grep is sent to a file or a pipe, it will be fully-buffered, as the output destination is not a TTY.

Examples

This grep command will be line-buffered, since stdout is a TTY:

# line-buffered
grep RAREPATTERN /var/log/mylog.txt

If stdout is redirected to a file then stdout is no longer a TTY, and output will be fully-buffered. This is usually fine:

# fully-buffered
grep RAREPATTERN /var/log/mylog.txt >output.txt

One situation where the previous example isn't ideal is if you have another terminal open where you're trying to tail -f the output file.

Suppose we want to search the file backwards by piping tac(1) to grep. This will be line-buffered, as grep is still the last command in the pipeline and thus stdout is still a TTY:

# line-buffered
tac /var/log/mylog.txt | grep RAREPATTERN

But what if we want to filter the output of grep? If we use a shell pipeline this will cause the grep output to become buffered. For instance, consider the following:

# fully-buffered
grep RAREPATTERN /var/log/mylog.txt | cut -f1

The issue here is that when we put a pipe after the grep command, grep's stdout is now the file descriptor for a pipe. Pipes are not TTYs, and thus grep will go into fully-buffered mode.

For the grep command the solution is to use the --line-buffered option to force line-buffering:

# forced line-buffering
grep --line-buffered RAREPATTERN /var/log/mylog.txt | cut -f1

As noted earlier, you may also want to use this when redirecting grep output to a file and then consuming the file in another session using tail -f.

Setbuf

If you're writing your own C code, you can control the buffering for FILE* streams using setbuf(3). Using this you can force behavior such as always line-buffering stdout. You can also use this for disk-backed files, so you can do things like write a file to disk and have fprintf(3) be automatically line-buffered.
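For example, here's a minimal sketch using setvbuf(3) (the more flexible cousin of setbuf) to force stdout into line-buffered mode; it has to be called before any output is written to the stream:

#include <stdio.h>

int main(void) {
    /* _IOLBF = line-buffered; _IOFBF = fully-buffered; _IONBF = unbuffered */
    setvbuf(stdout, NULL, _IOLBF, BUFSIZ);

    printf("this line is flushed as soon as the newline is written\n");
    return 0;
}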

Stdbuf

GNU coreutils comes with a program called stdbuf that allows you to change the default buffering behavior of programs you don't control. There are a few caveats for target programs: the programs must use C FILE* streams, and the programs can't use the explicit buffer control routines like setbuf(3).
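For example, assuming somecommand stands in for a program you don't control, something like this forces its stdout into line-buffered mode:

# force line-buffered stdout for a program you don't control
stdbuf -oL somecommand | tee output.txt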

C++ IOStreams

There's one further gotcha that typically pops up in C++ programs. Many C++ programmers are accustomed to using std::endl for newlines. For instance,

// Two ways to print output with a newline ending.
std::cout << "Hello, world!\n";
std::cout << "Hello, world!" << std::endl;

These are not the same. The difference is that when std::endl is used it automatically forces the output stream to be flushed, regardless of the output mode of the stream. For instance,

// Subject to normal buffering rules.
std::cout << "Hello, world!\n";

// These are equivalent and are *always* flushed, regardless of buffering mode.
std::cout << "Hello, world!\n" << std::flush;
std::cout << "Hello, world!" << std::endl;

Thus if you're using std::endl a lot then the usual buffering rules don't apply, since std::endl is effectively forcing line-buffering! This can be important in certain performance sensitive programs, since using std::endl can inadvertently disable buffering.

My suggestion is: only use std::endl when you actually want to flush the output stream. If you don't know if the stream should be forcibly flushed then stick to using a regular \n sequence in your code.


How To Nice A Bash Function

I was recently writing a bash script where I wanted to nice a particular function. The function in question was doing multiple expensive operations (e.g. gzip), and I didn't want to have to put a nice invocation in front of every one of these operations.

It turns out that this is kind of tricky. The nice command isn't a builtin, and therefore requires a real command to run with the new niceness value. For instance, the following bash will exit with an error due to foo not being an actual executable:

foo() {
  gzip -d bigfile.gz
  grep "hello world" /dev/urandom
}

# try to lower the priority of foo
nice foo

In this example the printed error message will be something like nice: ‘foo’: No such file or directory.

One idea I had for overcoming this obstacle is to renice the process twice. For instance, I was initially considering the following, where I lower the nice value at function entry and then raise it again upon function exit:

foo() {
  renice 10 $$   # lower the niceness
  gzip -d bigfile.gz
  grep "hello world" /dev/urandom
  renice -10 $$  # raise the niceness again
}

Unfortunately this doesn't work because only the root user can decrease niceness. So if the script is being run as an unprivileged user the first renice command will work, but the second will fail.

The solution I came up with is to use renice in conjunction with bash subshells. The original version I had looked like this:

foo() {
  renice 10 $BASHPID  # note $BASHPID, *not* $$
  gzip -d bigfile.gz
  grep "hello world" /dev/urandom
}

# invoke foo in a subshell
(foo)

This works because the subshell executes as a separate process, which means there's no need to undo the renice command. Note that $BASHPID must be used here instead of the more familiar $$: inside a subshell, $$ still expands to the original shell's PID, whereas $BASHPID always expands to the PID of the current process.

The downside to this approach—which I think is significant—is that if you forget to invoke the function in a subshell then the entire script will end up niced. For instance, given the previous definition for foo we'll have the following correct and incorrect invocations:

(foo)  # CORRECT: uses a subshell
foo    # INCORRECT: now the entire script is niced

This is pretty subtle. To understand what's happening here you need to know what a subshell is, what the syntax is for subshell invocation, and you need to understand why a subshell is needed in the first place.

My friend James pointed out that this can be fixed by invoking the subshell within the function. That looks like this:

foo() {
  (renice 10 $BASHPID
   gzip -d bigfile.gz
   grep "hello world" /dev/urandom)
}

Now foo can be invoked as normal. Nifty!


Crosh/SSH and ChromeOS

I have a Chromebook Pixel, which I like a lot and use regularly. Generally I use my Chromebook for non-coding tasks; for coding I prefer to use my Thinkpad (which has Fedora installed) because it has a "real" coding environment.

However, sometimes I'll code on my Chromebook as well, e.g. when I'm traveling and it's more convenient to bring with me. When I write code on my Chromebook I use a Debian chroot environment via crouton to do development. This is a pretty reasonable facsimile of writing code on a "real" Linux laptop: you can use things like Emacs, GCC, GDB, and all of the other GNU/Linux tools you know and love.

The biggest pain point of doing development this way is the window management in ChromeOS. I like having lots of terminal windows open, and ChromeOS makes it hard to do this. The "normal" and well-documented way to create a new terminal is to type Ctrl+Alt+T to get a new crosh window to appear as a new tab in a browser window. From the crosh tab you can get a terminal by typing shell. But this is pretty inconvenient: the new crosh window starts out as a browser tab, not a new window, and it takes a lot of typing.

There's an extremely poorly documented feature of the SSH app that makes this much easier. From the SSH app you can get a new crosh terminal window by setting the host to be the magic string >crosh and then supplying any username. This way when you type Ctrl+Shift+N from an existing crosh shell, a new window will pop up ready to become a terminal.


x86 Register Encoding

If you've ever looked at the disassembly output for C or C++ code, you'll probably notice that there are a lot of push/pop instructions. And if you pay close enough attention, you'll notice that the compiler prefers to use certain registers over others. In particular, compilers will prefer pushing "old" registers like RBP (i.e. the ones available on 32-bit x86 CPUs) instead of the "new" registers like R15 (which aren't available in 32-bit mode).

The C calling convention on x86 systems specifies that callees need to save certain registers. There are a few different names for these kinds of registers, such as nonvolatile registers, callee-saved registers, and so on.

If a method needs to use some registers, it's best to use the volatile registers, since they don't require additional push/pop instructions. However, there are only a few volatile registers. If a method needs additional registers it will have to dip in to the nonvolatile set. These nonvolatile registers must be pushed on function entry, and popped on function exit. So that explains the first part: why these registers are pushed/popped at all.

But what about the second part: why does the compiler prefer pushing/popping the old registers RBX, RBP, RDI, RSI, and RSP over the new registers R12, R13, R14, and R15?

The answer lies in the historical legacy of how instruction encoding worked on 32-bit systems. First let's look at a chart showing the different general purpose registers, and their characteristics:

Number   Register   Volatile?   Old/New
0        RAX        Yes         Old
1        RCX        Yes         Old
2        RDX        Yes         Old
3        RBX        No          Old
4        RSP        No          Old
5        RBP        No          Old
6        RSI        No          Old
7        RDI        No          Old
8        R8         Yes         New
9        R9         Yes         New
10       R10        Yes         New
11       R11        Yes         New
12       R12        No          New
13       R13        No          New
14       R14        No          New
15       R15        No          New

Since the designers of x86 knew that these registers were going to be pushed/popped all the time, they wanted to try to make the push/pop instructions really compact. So they reserved one-byte instruction encodings to push/pop every register. This is pretty unusual: there aren't too many instructions that can be encoded with a single byte. The one-byte instruction encodings are only used for the most common instructions.

To push a register, you take the number in the chart above and add it to 0x50. So if you want to push RSP, the instruction is 0x54, which is 0x50 + 4.

To pop a register, you take the number in the chart above and add it to 0x58. So if you want to pop RSP, the instruction is 0x5c, which is 0x58 + 4.

As you can see, they only reserved space for eight registers when pushing; the same is true when popping. This makes sense, because at the time there were only eight general purpose registers. However, this is a problem because no space was reserved for the higher numbers.

When they designed the 64-bit version of x86 they came up with a clever solution for this problem, aimed at keeping backwards compatibility with 32-bit systems. They added new prefix bytes (the REX prefixes) to indicate that certain fields refer to the extended registers. The details of how this works are a bit complicated, but for a push or pop instruction the prefix 0x41 means that the register should be considered the extended version, and eight is subtracted from the register number when encoding.

Here's an example. Suppose we want to push R9. The first byte of the instruction is 0x41. The second byte is 0x50 + (9 - 8) = 0x51. Thus the full encoding will be 0x4151.

Suppose we want to pop R14. The encoding will be 0x41 followed by 0x58 + (14 - 8) = 0x5e. Thus the fully encoded instruction will be 0x415e.
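Putting the rule together, here are a few encodings worked out; you can check them by assembling this snippet and disassembling it with objdump:

# "old" registers: one byte each
push %rbx        # 0x53        (0x50 + 3)
pop  %rbx        # 0x5b        (0x58 + 3)

# "new" registers: REX prefix needed, two bytes each
push %r15        # 0x41 0x57   (0x41, then 0x50 + (15 - 8))
pop  %r15        # 0x41 0x5f   (0x41, then 0x58 + (15 - 8))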

As you can see, the original eight registers have more compact encodings: they can each be pushed and popped with a single byte, whereas the new registers require two bytes to push/pop. This applies to certain other instructions too, not just push/pop. The actual time it takes to execute a push/pop is the same either way, so there's not any actual CPU cycles saved. But using smaller instructions means a slightly smaller executable, means less data for the decoding pipeline to process, and means that instructions are more likely to stay in caches. So if you can, it's better to use the old registers: you'll save a byte for each push/pop.


Pyflame Dual Interpreter Mode

I recently implemented "dual interpreter mode" for Pyflame, which allows Pyflame to be compiled to target both Python 2 and Python 3 at the same time, in the same executable. This is extremely unusual, and Pyflame is the only Python project implemented in C/C++ that I am aware of that has this feature. In this post I'll explain how I implemented this feature for Pyflame.

The Problem

In order for Pyflame to work, it has to know a lot of details about the internals of the Python interpreter. The most important thing it must know is the struct offsets for various fields in Python objects. As an example, Pyflame needs to know the offsets for where to find things like a frame's pointer to a "code" object. If Pyflame thinks that the code object is at offset 48, but it's actually at offset 56, then Pyflame will get GIGO when trying to decode the stack.

Fortunately you can get all of these offsets from Python.h, and this is exactly what Pyflame does, and has always done. Unfortunately, these struct offsets differ between Python 2 and Python 3. This means that when you compile Pyflame you either give it the Python.h for Python 2, or the Python.h for Python 3, and the resulting Pyflame executable can only profile that version of Python.

There's another complication, which is that Python 2 and Python 3 declare many of the same symbols. This means that even aside from this struct offset issue, normally you wouldn't be able to compile an executable that links against say libpython2.7.so and libpython3.5m.so at the same time.

However, Pyflame isn't a normal C++ program. It actually only uses the Python headers to get struct offsets, and does not link against libpython. So in principle you could come up with a way to build a "dual" Pyflame executable that can profile both Python 2 and Python 3 processes.

While it's an interesting thought experiment to think about how to build a Pyflame that can support both Python interpreter versions at once, it's not really that useful. Most people are either using Python 2 or Python 3, so just supporting one at compile time is not a big deal. People who need both Python versions can just compile two versions. So I had created an issue to remind myself to look into this, but I had considered it very low priority.

This changed recently when I decided to try to get Pyflame into Fedora, and it occurred to me that if I actually did this crazy dual-interpreter mode it would make my packaging life a lot easier. Instead of maintaining python2-pyflame and python3-pyflame, I'd be able to just add a single package. And since there's no linking dependency, I can support both Python interpreters essentially for free. So off I went.

The Solution

There are two parts to solving this. The first is how the code is refactored to support two Python releases with minimal code duplication. The second is how the build system (autoconf/automake) needed to be changed.

If you'd like to follow along with the changes, please see PR 42.

Code Changes

The code for Python 2 and Python 3 is 95% the same in my estimation. The struct offsets in Python 2 and Python 3 do differ, but other than that the only material change is how strings work in both releases, which is easy to work around with preprocessor macros, and which I had already done.

The solution I came up with here is to define a file called frob.cc which implements all of the Python internals logic. This file includes Python.h as usual, and uses compile-time logic to adapt itself to whichever Python version it's being compiled against.

There are two stub files that include frob.cc: frob2.cc includes it in a way so it's configured to build for Python 2, frob3.cc includes it in a way so it's configured to build for Python 3. The file frob.cc itself is never built into an object, only frob{2,3}.cc are actually compiled and linked.

There's another set of files called pyfrob.{cc,h} whose job is to detect, at runtime, which Python version the process being profiled is using, and then dispatch to the matching frob implementation.

The way I implemented this, Pyflame does all of this runtime logic just once, when it starts up. Then, while it's running, it invokes the interpreter-specific bits through function pointers. This is a pretty small optimization, but it avoids additional runtime branching.

Most of the work here was actually refactoring the existing code to be consolidated into fewer files, and the new logic for detecting the Python version. I ended up touching most of the Pyflame codebase to get this to work. The preprocessor macros are pretty hairy in my opinion, but ended up working out fine.

Automake Changes

There's a lot of compilation logic that needed to change for this to work.

The hardest part of this was figuring out how to compile frob2.cc and frob3.cc with different include paths. I found an automake documentation page called Per-Object Flags Emulation, which is short but does cover how to do this. I actually ended up bringing in libtool (which is a compile-time dependency only) since it provides some convenience methods.

I also had to change a lot of logic in my configure.ac so it would know how to pick between the two. The current solution detects what Python releases are on the system, and enables all of the supported ones. I'm not super happy with this: in the future I'll probably revisit the code to allow building against just one release or another.

Next Steps

Once I get my PR reviewed and landed I'm going to tag a major new version of Pyflame, and then try to base my Fedora package submission on that. The next major feature I'm adding to Pyflame will be from issue 13. This issue describes rewriting the code to not use _PyThreadState_Current, and instead find the global "interpreter" list and use that to find the threads. This will let me get stack traces from idle threads which has a ton of really interesting use cases.


Call Stacks and Link Registers

When your computer executes code, there's an implicit call stack at any given point in time. The call stack holds the necessary information for functions to return to their caller. For instance, let's say we have this C code:

void foo(int n) {
  for (int i = 0; i < n; i++) {
    printf("i = %d\n", i);
  }
}

After each call to printf(), the routine needs to return to a specific place in foo(). With nesting of functions this gets even more complicated:

void bar(int n) {
  if (n % 2) {
    foo(n);
    puts("odd branch");
  } else {
    foo(n / 2);
    puts("even branch");
  }
}

Now printf() needs to remember how to return to foo(), and foo() has to remember how to return to two different places within bar(). And presumably there's some other function that called bar(), and bar() needs to know how to return to it.

The obvious data structure to hold this information is a stack. The way this might work is you'd push a return address onto this stack when making a function call, and then when you wanted to return to your caller you'd pop the return address off the stack.

In fact, this is exactly how pretty much every computer architecture works. The complication, however, is that there isn't actually a dedicated stack for call information. Instead the regular program stack is used to store the return addresses. Thus these return addresses are intermingled with regular program data on the stack.

Specifically here's how it works in x86 assembly. You call a function like this:

call function_or_address           # function call

The call instruction does two things: it pushes the address of the next instruction onto the stack, and then it jumps to the call target. Later, when the called function is ready to return, it executes the return instruction:

ret                                # return to caller

This instruction also does two things: it pops the top of the stack, and then jumps to the address that was popped off.

Link Registers

This is all pretty reasonable so far: since function calls can have arbitrary nesting, we need to use a stack of some kind to store the function call stack data. There's not a whole lot of other options.

Actually, there's a neat trick that basically every modern architecture other than x86 implements. The technique is to use something called a "link register", henceforth called a LR. On some platforms there's a special register for this (e.g. the LR in ARM is actually called LR) while on some architectures it's done by convention (e.g. x1 is used for this purpose on RISC-V). The basic idea is that the return address for a function will be stored in this register, and then a "return" instruction will jump to the location pointed to by that register.

But wait: how does this work with function nesting? Imagine a calls b, and b calls c. In the call to c the LR will be set to some address in b. Then after the call completes the LR will have been clobbered, and b will have forgotten how to return to a.

The answer is that since the LR is indeed clobbered, in the general case you need to save the LR somehow before making another function call. Typically you would save the LR by pushing it onto the stack before the function call, and popping it after—just like we described for the x86 architecture. So we're back to where we started. Or are we?

Even with this restriction, one nice property of the LR system is that if a function calls multiple other functions we only need to save the LR once. For example, imagine this code (in a hypothetical assembly language):

example:
  push %lr                         # save the LR
  call foo                         # call foo()
  call bar                         # call bar()
  pop %lr                          # restore the LR
  ret                              # return to caller

In this example we did two function calls but only had to do one save/restore. So things are already better than x86. Pretty great, right?

Additionally, "leaf" functions that don't call other methods don't need to touch the stack at all, since all of the data they need will have been loaded directly into CPU registers.

The downside to this approach is that the explicit saves/restores of the LR mean that there are more instructions to execute. This is generally true of CISC vs RISC architectures: the CISC design leads to more compact instruction sequences, but (hopefully) the extra simplicity of the RISC architecture offsets that by allowing for faster or more efficient instructions.


Towers of Hanoi in x86 Assembly

One of the things I learned this year is x86 assembly. The way that I learned it wasn't actually by writing it, but by reading a lot of disassembly output. This means that although I'm pretty proficient at following along in assembly listings, I'm not that good at actually writing it from scratch. I decided to try to correct that, and today I wrote my first real x86 assembly program.

I decided to implement Towers of Hanoi for a few reasons: it's a simple, well-understood problem; the iterative solution only needs a couple of small arrays and a handful of helper routines; and it requires almost nothing in the way of I/O or library support.

The last point was especially important because I wanted to write my solution without linking against libc.

Throughout this post I'll have sample listings from my solution. If you want to see the whole thing though (say, you'd like to compile it yourself) you can find it here on GitHub.

This post is a bit long, and you may want to skip towards the end. It walks through the C implementation first, then the x86 assembly version, and wraps up with what I learned and what I plan to do next.

Getting Started: Towers Of Hanoi In C

I decided to start my journey by first implementing Towers of Hanoi in C. I did this taking special care to write it in a way that I thought would translate easily into assembly. In particular, I wanted to minimize the calls I made to libc. I ended up only using two calls: calloc() to allocate my arrays, and putchar() for printing.

I have three towers, each is represented as a struct tower allocated on the stack that's like this:

struct tower {
  int size;
  int *stack;
};

The size field holds the number of rings stored in that tower's stack, and the stack field is a pointer to an array holding the stack. The base of the array is the value of the width of the bottom-most ring, and moving up in memory yields the values for the successive rings.

I define three operations: peek(), pop(), and push(). They have very straightforward implementations:

// return the value of the top ring
int peek(struct tower *tower) {
  if (tower->size) {
    return tower->stack[tower->size - 1];
  }
  return INT_MAX;
}

// remove the top ring and return its value
int pop(struct tower *tower) {
  assert(tower->size);
  int val = tower->stack[tower->size - 1];
  tower->stack[tower->size - 1] = 0;
  tower->size--;
  return val;
}

// push a new ring (without error checking)
void push(struct tower *tower, int val) {
  tower->stack[tower->size] = val;
  tower->size++;
}

I implemented an iterative solution to Towers of Hanoi, which is the most intuitive way to think about the problem. This is a good explanation of the algorithm, with diagrams. The idea is that there are three towers: the source tower, the aux tower, and the dest tower. The source tower is the one that starts out with rings. The other two towers are the aux/dest towers, and which one is which depends on if there is an even or odd number of rings.

The algorithm works by successively picking pairs of towers and then doing whatever is the valid move between those two towers. So given a pair, only one move will be legal, and the algorithm picks the legal move. That logic is implemented here by move():

// make the only valid move between a/b
void move(struct tower *a, struct tower *b) {
  if (peek(a) > peek(b)) {
    push(a, pop(b));
  } else {
    push(b, pop(a));
  }
}

The algorithm itself just cycles through pairs of towers in a fixed order, making the single legal move between each pair: source and dest, then source and aux, then aux and dest, over and over.

The algorithm takes (1<<N) - 1 moves for N rings, so we maintain a counter and stop after enough moves have passed. The full code for the solve loop is like this:

int solve(int n) {
  struct tower s, a, d;
  init_tower(&s, n);
  init_tower(&a, n);
  init_tower(&d, n);

  int i;
  for (i = 0; i < n; i++) {
    push(&s, n - i);
  }

  i = 0;
  int total_moves = (1 << n) - 1;
  for (;;) {
    print_all(&s, &a, &d, n);
    move(&s, &d);
    if (++i == total_moves) {
      break;
    }
    print_all(&s, &a, &d, n);
    move(&s, &a);
    if (++i == total_moves) {
      break;
    }
    print_all(&s, &a, &d, n);
    move(&a, &d);
    if (++i == total_moves) {
      break;
    }
  }
  print_all(&s, &a, &d, n);
  return 0;
}

There's one last thing needed here, the code for print_all(). I'm not going to include that here since it's boring, but the idea is you print the towers in order SDA if N is even, and in order SAD if N is odd.

Getting Started: Towers Of Hanoi In x86 Assembly

For the x86 code I implemented a similar solution. Since I don't have a lot of experience writing x86 I did things really weirdly, and have a calling convention kind of like the C one, but different.

The solve loop starts like this (with the number of rings in %rdi):

solve:
  push %rdi
  mov %rsp, %rbp

  # S tower
  sub $64, %rsp
  mov %rsp, %rdi
  call init_tower

  # A tower
  sub $64, %rsp
  mov %rsp, %rdi
  call init_tower

  # D tower
  sub $64, %rsp
  mov %rsp, %rdi
  call init_tower

So basically the value at (%rbp) will be the ring count, -64(%rbp) is the first tower, -128(%rbp) is the second tower, and -192(%rbp) is the third tower.

The init_tower method zeroes the 64 bytes starting at %rdi:

###  zero the memory in a tower
init_tower:
  xor %rax, %rax                # zero %rax
  movl $64, %ecx
  rep stosb
  ret

Next I need to initialize the rings in the first tower:

  # initialize S tower
  lea -64(%rbp), %rax
  mov (%rbp),%rcx
  mov %rcx, (%rax)              # set the size of the tower
  add $4, %rax
.init_s:
  mov %rcx, (%rax)
  add $4, %rax
  loop .init_s

This copies the ring count into the first 4 bytes of the first tower, and then stores descending values into each 4-byte integer after that. So let's say we're doing Towers Of Hanoi with N=3. Then the first tower will have memory initialized like {3, 3, 2, 1, 0, 0, ...}. The leading value is the number of rings, and each value after it is the width of the next ring going up the stack.

Next I prepare to loop through all of the possible moves. I did this by storing the loop variable in %r15 and the total number of moves in %r14:

  mov (%rbp), %cl               # copy N into %cl
  mov $1, %r14                  # shift operand
  shl %cl, %r14                 # 1 << N
  dec %r14                      # now %r14 holds (1<<N)-1
  xor %r15,%r15                 # the loop variable

The core loop looks the same as it does in the C version. There are three similar blocks that each look like this:

.solve_loop:
  lea -64(%rbp), %rdi           # S tower
  lea -192(%rbp), %rsi          # D tower
  call move_tower

  inc %r15                      # increase loop counter
  cmp %r14, %r15                # compare to loop end
  jge .leave_solve              # leave if done

This one does the iteration checking S and D. The other blocks are the same except the first two lea instructions will be different to refer to S and A or A and D.

The next function is move_tower, which is the same as move() from the C code. It moves a ring from the tower with the smaller ring to the tower with the larger ring:

### rdi and rsi should be towers to move between
move_tower:
  ##  compare the top rings in the two towers
  mov %rdi, %r9
  call peek_tower
  mov %rax, %r10
  mov %rsi, %rdi
  call peek_tower
  mov %r9, %rdi
  cmp %rax, %r10
  jl .less_branch
.greater_branch:
  ## swap rdi and rsi
  mov %rdi, %rax
  mov %rsi, %rdi
  mov %rax, %rsi
.less_branch:
  ## source is rdi, dest is rsi
  call pop_tower
  push %rdi
  push %rsi
  mov %rsi, %rdi
  mov %rax, %rsi
  call push_tower
  pop %rsi
  pop %rdi
  jmp print_all

Here are peek_tower, pop_tower, and push_tower. They're all very straightforward translations from the C code:

###  peek the top of the tower
peek_tower:
  movl (%rdi),%ecx
  cmpl $0, %ecx
  jz .peek_empty
  dec %ecx
  movq 4(%rdi, %rcx, 4), %rax
  ret
.peek_empty:
  movl $100000, %eax
  ret

### like peek tower, but remove the value
pop_tower:
  movl (%rdi),%ebx
  dec %ebx
  movl 4(%rdi, %rbx, 4), %eax   # copy the value to %rax
  movl $0, 4(%rdi, %rbx, 4)     # zero the value
  mov %ebx, (%rdi)              # decrease count
  ret

### the tower in %rdi, the val in %rsi
push_tower:
  movl (%rdi),%ecx
  mov %rsi, 4(%rdi, %rcx, 4)
  inc %ecx
  mov %ecx, (%rdi)
  ret

That's pretty much it. Most of the rest of the code is logic for printing out the towers, and there's too much boilerplate to include here. The core of it is this method:

print_one:
  mov %rdi, %rsi
  movl $1, %eax                 # SYS_write
  movl %eax, %edi               # fd = 1
  movl %eax, %edx               # n = 1
  syscall
  ret

This prints the 1-byte buffer at %rdi. To convert ring values to characters I add the value of '0', same as in the C code. Printing a tower involves calling a loop with print_one pointing to a ring and then print_one pointing to a space. After the loop I call print_one with it pointing to a newline. The same trick of checking if there is an even or odd number of rings is needed in the print_all method, and I do that by checking the value in (%rbp), which is the same as when the program started.

Learnings

GDB Stuff

It actually took me quite a while to get the tower initialization and printing code up, which is the first code that I wrote. Since I didn't have the printing code working the only recourse I had for debugging was using GDB. I ended up staying in GDB for pretty much the whole writing/debugging process and learned a few things.

The first piece of advice I have is to use the layout regs mode in GDB when debugging assembly. It makes it really easy to single-step and figure out what's changing when. I made a few early mistakes by misunderstanding which registers would be clobbered in certain situations, and the layout regs mode made it possible to solve this problem.

Another thing I found useful with GDB is inserting function labels at various places. You can put a function label "inside" another function and the code will simply fall through it when the original function runs. This makes setting breakpoints at specific parts of assembly routines really easy: insert a label with any name, and then set a GDB breakpoint at that label. The only downside I found to this is that it makes disassembly a bit strange, since if you disassemble the original function the disassembly output stops at the label you created.

I got to use the reverse debugging mode in GDB for the first time while writing this, to debug a segfault. The idea is that you start recording at some point before your program crashes; then you can use reverse-stepi and reverse-nexti to backtrack from where the segfault happens. My segfault ended up being due to a misadjusted stack pointer when executing a ret instruction.
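
The rough flow looks something like this (just a sketch of the commands involved, not a transcript of my actual session):

(gdb) break move_tower      # stop somewhere before the crash
(gdb) run
(gdb) record                # start recording execution from this point
(gdb) continue              # run forward until the segfault
(gdb) reverse-stepi         # then walk backwards, one instruction at a time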

Code Lessons

I kind of made up my own calling convention as I wrote the program. It's similar to the regular x86-64 C calling convention, but in some situations I decided to clobber extra registers when I knew the callers didn't need them. That works for a program as small as this one, but it's not ideal, and it won't work at all if you need regular C interop.

In retrospect the program would almost certainly have been simpler had I just followed the C calling convention exactly.

I also neglected to use frame pointers, and I think using them would have made debugging simpler too. Certainly the GDB output would have been more understandable.

I found the 32-bit/64-bit stuff really difficult to get right. A number of times I made mistakes where I used 64-bit values in 32-bit contexts and ran into problems. I still don't fully understand the rules on when I can use the 32-bit forms and when I have to use the 64-bit forms. This is something I need to come back to.

Next Steps

I plan to compare the code that GCC generates at optimization levels -O0 and -O2 to my own code to see where I could improve. My initial thought is "probably everywhere", since even as I was writing this code I recognized a bunch of areas where I was doing redundant loads. I'm also going to start thinking of another program to write in x86, hopefully something more complicated with more memory management.


Go Compiles Infinite Loops Very Efficiently

The Go compiler will compile this code:

func init() {
start:
    goto start
}

into:

0000000000401000 <main.init.1>:
  401000:       eb fe                   jmp    401000 <main.init.1>

Declare Your Bash Variables

Everyone knows that "regular" Bash variables are just strings. Bash doesn't really know about integers or other numeric types. Right?

Not really. You can "declare" an integer variable like this:

declare -i foo

This makes foo into an integer type variable! Here's a more concrete example, showing what an integer type actually does:

# declare our variable as an integer type
declare -i foo

# assign 2 to foo and print it
foo=2
echo $foo              # prints 2

# what happens when we assign a string?
foo="hello, world"
echo $foo              # prints 0!

As you can see, once foo is declared as an integer variable, anything you assign to it is evaluated arithmetically instead of being stored as a string; the string assignment above ends up as 0. There are other nifty things you can do with declare, such as declare -r for "readonly" variables. More options are in the manual!

Bash scripting is notoriously difficult to get right. If you take the time to use the declare feature it can help you keep your Bash scripts readable and correct.


Localhost

As software engineers we all know about localhost, the hostname for the local machine. As we all know this is the same as 127.0.0.1, right?

Not so fast.

First, 127.0.0.1 does mean localhost, but it actually means something more specific: IPv4 localhost. Of course there's another localhost for IPv6, and that's ::1. So we know that 127.0.0.1 means localhost, and we know that ::1 means localhost. But what does localhost mean? Well, it could refer to either the IPv4 loopback address or the IPv6 loopback address.

Here's what's in /etc/hosts on my machine (Fedora 24):

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

As you can see, localhost matches both 127.0.0.1 and ::1. So what are the rules for what glibc will do in this situation when resolving localhost? What are the rules for what Python will do when connecting? Will getaddrinfo(3) prefer the IPv4 address or the IPv6 address, and will it even be consistent? There's a lot of ambiguity here.
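
Here's a small C sketch that prints every address getaddrinfo(3) returns for localhost; with the /etc/hosts above it should print both loopback addresses, in whatever order the resolver prefers:

#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void) {
  struct addrinfo hints, *res, *rp;
  char buf[INET6_ADDRSTRLEN];

  memset(&hints, 0, sizeof(hints));
  hints.ai_socktype = SOCK_STREAM;  /* one result per address, not per socktype */
  if (getaddrinfo("localhost", NULL, &hints, &res) != 0)
    return 1;
  for (rp = res; rp != NULL; rp = rp->ai_next) {
    if (rp->ai_family == AF_INET) {
      struct sockaddr_in *sin = (struct sockaddr_in *)rp->ai_addr;
      printf("IPv4: %s\n", inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf)));
    } else if (rp->ai_family == AF_INET6) {
      struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)rp->ai_addr;
      printf("IPv6: %s\n", inet_ntop(AF_INET6, &sin6->sin6_addr, buf, sizeof(buf)));
    }
  }
  freeaddrinfo(res);
  return 0;
}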

Ideally all of your software works seamlessly with both IPv4 and IPv6 and this isn't an issue. But if you rely on localhost, know that there is ambiguity and that you can't count on getting one protocol or the other.

The other issue to be wary of is how software represents remote addresses. Most software programs will happily resolve localhost when making outgoing connections, but typically they won't try to map an incoming address of 127.0.0.1 or ::1 back to a human readable name like localhost. So if you're used to using localhost there's going to be an asymmetry, where your logs are filled with IP addresses but your configs use "localhost". This can lead to nasty surprises in unexpected situations, such as having a Varnish rule that matches against the regex ^localhost fail because Varnish represents the remote peer name as 127.0.0.1.

I recommend that you pick either 127.0.0.1 or ::1 and stick to it consistently. By being explicit about the protocol and using the same format that the machine uses when displaying socket addresses you'll run into a lot fewer surprises.


Gocd

I've been writing a lot of Go recently at work, and I made an interesting tool for navigating my Go projects. The problem I ran into is that Go enforces a rigid, deeply nested directory structure. I have my GitHub projects in ~/go/src/github.com/ORG/REPO, my work projects in ~/go/src/my.work.place/WORKORG/REPO, and various other more complicated directory layouts for projects cloned from other places. Remembering where I had cloned things was a challenge, and typing out the paths was tedious.

I wanted to be able to type something like gocd foo and then automatically be taken to the right project directory for foo, wherever that was. My first attempt at this was in Bash and looked like this:

gocd() {
    local bestdir=$(find "${GOPATH}/src" -type d -name "$1" | awk '{print length, $0}' | sort -n | head -n 1 | cut -d' ' -f2-)
    if [ -n "$bestdir" ]; then
        cd "$bestdir"
    else
        return 1
    fi
}

This finds the shortest matching directory in my $GOPATH, and it worked great. The problem is that it also explores .git/ and vendor/ directories. That isn't an issue when my filesystem caches are warm, but when they're cold (e.g. right after booting) the command runs slowly if there are a lot of repos checked out.

I had the idea to rewrite this in C++, with the main design constraints being to skip directories like .git/ and vendor/ entirely and to do as little disk I/O as possible.

I coded this all up and it's indeed very fast. I can find a project in under 20 milliseconds with my filesystem caches completely cold, compared to more than one second before. When the filesystem caches are warm the command completes in one or two milliseconds.

The current implementation uses a std::vector<std::string> of directory candidates, where the vector is sorted by string size. This allows finding the best candidate efficiently (it's at the back of the vector). For the same reason each candidate can be removed from the search list efficiently, since it just requires popping the back element from the vector. In theory this could be made even more efficient by either using my own custom merge sort (since I have one sorted vector and one unsorted vector), or by using a min-heap keyed on string size. In practice I'm not sure that either of these would actually improve things since my current solution correctly optimizes for minimizing disk I/O which is by far the most expensive part of the search.

To actually use this I install the C++ executable as _smartcd and then have a Bash wrapper called gocd() to tie things together:

gocd() {
    if [ -z "${GOPATH}" ]; then
        return 1;
    fi
    local best=$(_smartcd "${GOPATH}/src" "$1")
    if [ -n "$best" ]; then
        cd "$best"
    else
        return 1
    fi
}

If you find this type of thing useful, the code is on my GitHub at github.com/eklitzke/tools. There are some other CLI tools in there, also written in C++, that I use for other things in my Bash prompt. As with all my other projects, the code is licensed under the GPL.


Using ssh-agent and ED25519 Keys on GNOME

The default GNOME keyring (gnome-keyring-daemon) provides an inferior version of ssh-agent. The GNOME version doesn't support ED25519 keys, and it doesn't have proper support for removing keys. I figured out a kind of elegant way to fix this on Fedora 24, one that should keep working in the bright and glorious Wayland future, including with gdm-wayland-session.

First I disabled the SSH component of gnome-keyring-daemon by removing the Autostart lines from /etc/xdg/autostart/gnome-keyring-ssh.desktop. On a clean install of GNOME/Fedora you should see, after doing this, that SSH_AUTH_SOCK is no longer set when you log in.

Next I created a systemd user unit file at ~/.config/systemd/user/ssh-agent.service with the following content:

[Unit]
Description=OpenSSH private key agent
IgnoreOnIsolate=true

[Service]
Type=forking
Environment=SSH_AUTH_SOCK=%t/ssh-agent.socket
ExecStart=/usr/bin/ssh-agent -a $SSH_AUTH_SOCK
ExecStartPost=/usr/bin/systemctl --user set-environment SSH_AUTH_SOCK=${SSH_AUTH_SOCK}

[Install]
WantedBy=default.target

I also enabled this unit with systemctl --user enable ssh-agent.service. After doing this, upon logging in to a GNOME session you should see that SSH_AUTH_SOCK is still not set, but now you will see that an ssh-agent process is started with a command like /usr/bin/ssh-agent -a /run/user/1000/ssh-agent.socket. You'll also see that systemctl --user show-environment has the correct value for SSH_AUTH_SOCK.

Finally I put this in my ~/.bash_profile:

eval $(systemctl --user show-environment | grep SSH_AUTH_SOCK)
export SSH_AUTH_SOCK

This will cause the right value for SSH_AUTH_SOCK to be propagated to your X11 session. This works because gdm-x-session sources ~/.bash_profile when logging in (or at least it does on Fedora).

From what I understand from the GNOME bugzilla, gdm-wayland-session will automatically know how to get SSH_AUTH_SOCK from the systemd user session, I believe starting in GNOME 3.22. This means in the future you won't need these lines in ~/.bash_profile.


Systemd Socket Activation

I recently wrote my first service that has (optional) systemd socket activation, and I have to say: I really like it.

It's easy to use, and it makes writing services with graceful restarts really simple. In the simplest usage you add a small amount of boilerplate to your code:

/* needs <sys/socket.h> and <systemd/sd-daemon.h>; link with -lsystemd */
int listen_sock;
int fd_count = sd_listen_fds(0);
if (fd_count == 1) {
  /* socket activation: systemd already bound and listened for us */
  listen_sock = SD_LISTEN_FDS_START;
} else {
  /* set listen_sock with the usual socket()/bind()/listen() calls here */
}
for (;;) {
  struct sockaddr_storage addr;
  socklen_t addrlen = sizeof(addr);
  int client_sock = accept(listen_sock, (struct sockaddr *)&addr, &addrlen);
  if (client_sock < 0) {
    continue;  /* real code should inspect errno here */
  }
  /* handle the new connection here */
}

The only code here that's different is that we try sd_listen_fds() before doing the usual socket()/bind()/listen() setup. If the process has been started in socket activation mode, it starts with a ready listen socket. There's also some systemd configuration you have to write to use this; namely, you'll have to create a short file named myservice.socket that has the listen address, and then:

systemctl enable myservice.socket
systemctl start myservice.socket
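
For reference, a minimal myservice.socket might look something like this (the unit name and port here are just placeholders):

[Unit]
Description=Listen socket for myservice

[Socket]
ListenStream=8080

[Install]
WantedBy=sockets.target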

Just with this code things are already an improvement over the usual approach, because when you restart the process (say, because you're upgrading the code) the listen socket will be created by systemd as soon as the old process exits, meaning there's a shorter gap during which no listen socket is available to clients.

There's also an easy way to do completely graceful process restarts using sd_notify_with_fds(). Briefly, the way this works is that while your process is exiting you use this API call to hand file descriptors (along with their "names") back to systemd. Then when the new process starts up it can request those same file descriptors back with sd_listen_fds_with_names(). You can transfer just the listen socket this way (so there's no gap whatsoever where the listen socket is closed), and you can also use this mechanism to transfer client sockets: it's up to you. If you want to transfer client sockets you'll have to come up with your own serialization format to send any connection state over. Systemd has some helpers for this too.

There have always been ways to do this: you've been able to pass file descriptors between processes over AF_UNIX sockets for eternity. But systemd makes the process much easier, and does it in a way that's integrated with the init system, which is really nice.


Esc is Ctrl-[

I've been a user of vi basically since I started programming.[1] The problem that every vi user has is that on most keyboards the Esc key is in an inconvenient location. I tried a number of tricks like remapping Esc and Caps Lock, but wasn't super happy with any of them.

Then one day I learned that Ctrl-[ sends the same character as Esc. Since that day I've abandoned the Esc key and rely solely on Ctrl-[. The [ key is already convenient enough for me (and one I have to type a lot anyway). Since I don't need Caps Lock I typically make mine into an extra Ctrl key. This is great because it gives me a whopping three Ctrl keys (be still, my beating heart). The Ctrl-[ sequence is easy to type since it uses both hands.
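
If you want to try the Caps Lock remapping, one simple way to do it on an X11 session (there are plenty of others, including GNOME Tweaks) is:

setxkbmap -option ctrl:nocaps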

I don't know of anyone else in the habit of using Ctrl-[ this way, but I quite like it. If you're a vi user give it a shot, you might find it suits you too.

[1] This is actually a half-truth—I've been using Emacs with evil for a number of years now.


F-Market Line Considered Harmful

Over the last few years the city of San Francisco has done a lot of work to make the city more bicycle friendly, for which I sincerely applaud them. Initiatives like Vision Zero and the good work from the SF Bicycle Coalition have made a huge, positive impact on cycling in San Francisco. However, there's still a lot of work to be done, and in my opinion removal of the train tracks on Market St should be at or near the top of that list.

Light rail train tracks pose a huge problem for cyclists. The tracks are slick metal with grooves just wide enough to fit a bicycle tire. A cyclist crossing the tracks at too shallow an angle will have their wheel caught, which will cause them to crash. This is especially dangerous when the front wheel is caught (the most common case for unwary cyclists), since it typically results in the handlebars suddenly twisting and getting stuck, throwing the cyclist over the front of the bike, frequently head first. I've witnessed a lot of accidents due to train tracks, and have crashed because of them myself.

While light rail tracks can be dangerous to cyclists, light rail is also typically very convenient for commuters. There are two factors that make light rail especially convenient, for instance compared to buses. The first is that in situations where the light rail is underground or elevated, the trains can bypass normal street traffic. The second is that light rail trains can be easily made quite long, which gives them greater carrying capacity than buses. There are also a number of things cities can do to mitigate the danger that light rail poses to cyclists. The best solution is to run trains below ground or on elevated platforms, as this puts the trains completely out of the way of cyclists. This has the added benefit of making the trains run faster, as already mentioned. When this isn't possible the next best thing is to avoid putting train tracks parallel to where cyclists are likely to be, and ideally to create convenient cycling corridors on adjacent streets to encourage cyclists to stay out of the way of these train tracks.

San Francisco's F-Market & Wharves line (henceforth abbreviated "F-Market"), also known as the "Historic Streetcars" line, is a great example of all of the worst factors of light rail convenience and bicycle safety combined. The F-Market runs from 17th & Castro down the length of Market St to the Embarcadero, and then north someways along the Embarcadero towards Fisherman's Wharf. The line operates a "heritage streetcar service", meaning that the trains used are vintage streetcars from light rail systems around the country.

My main criticism of this line is that for nearly its entire route down Market St the train runs parallel to, and isn't separated from, traffic and cyclists. Technically the tracks don't enter the bike lane, but for most of its length the bike lane runs right alongside them. Cyclists regularly need to cross the tracks in routine traffic, for instance to maneuver around stopped vehicles. I have to imagine that these tracks are the single most dangerous hazard facing cyclists in the city, simply because of their placement and because Market St is so heavily trafficked by cyclists. There are other light rail lines in the city that aren't separated from traffic (e.g. the N Judah line), but those lines aren't on major cycling corridors and are therefore significantly less dangerous to cyclists in practice.

Furthermore the F-Market line isn't really convenient to anyone, at least not on the stretch down Market St. There are subway tracks underneath Market that serve all of the "Muni Metro" lines, the other light rail system. That is, the subway under Market St serves Metro lines J, K, L, M, N, and T. The Metro lines are much faster because they are underground and therefore avoid traffic. The Metro trains are also typically of much higher capacity since they use standard train cars instead of repurposed "heritage" cars. If the city were to restrict the F-Market line solely to its route along the Embarcadero (where it is faster and safer since it is separated from other vehicle traffic) very few people would be inconvenienced.

What makes this even more infuriating is that there are a lot of regular Muni bus lines already on Market St. The buses are actually cheaper to operate than light rail anyway. People who don't have convenient access to the Metro system or are traveling to a part of the city not served by one of these lines are already taking the other Muni buses operating along Market St. In my mind there is almost no practical reason for the F-Market line to operate on Market St at all. The only function it serves is for the aesthetic appeal of the historic streetcars. But this comes at a grave cost to cyclist safety.

If San Francisco is serious about improving cycling convenience and safety I implore them to consider restricting the F-Market line to its Embarcadero stretch, and to pave over the tracks on Market St.


Sponge

Joey Hess has a collection of general purpose Unix utilities called moreutils. This is available on pretty much every Linux distribution I know of under the unsurprising package name moreutils. There are a few good things in here, but my favorite by far is the sponge(1) command.

Explaining what sponge is and why it's useful is easiest with an example. Suppose that you've done some complex shell pipeline and redirected the output to a file called file.txt. You realize that you've accidentally included some extra lines in the file, and you want to remove them using grep. Conceptually what you want to do is:

# incorrectly try to remove lines from file.txt
grep -v badpattern file.txt > file.txt

The issue is that this doesn't work—or at least, it doesn't do what you might expect. The problem is that when you use the > operator in bash the shell immediately opens the file for writing, truncating it. This happens before the command itself runs: in this case, before grep is invoked. This means that when grep runs it will see file.txt as an empty file, and consequently this shell invocation will always leave you with an empty file.

One way to fix this is to use an intermediate output file, perhaps like this:

# correct version, using a temporary file
grep -v badpattern file.txt > tmp.txt
mv tmp.txt file.txt

This works because the output is fully written to a new file before replacing the input file. But it's two steps, and if you're like me you're likely to end up with a lot of random semi-processed files that you need to clean up later.

There's a better way to do this using sponge. You do it like this:

# correct version, using sponge
grep -v badpattern file.txt | sponge file.txt

What happens here is that sponge will buffer all of the data it gets on stdin into memory. When it detects the EOF condition on stdin it will then write all of the data it buffered to a file named by its argument. By convention you would use the input file as this argument. The end result is that file.txt won't be truncated until after it's been fully read by the left hand side of the pipe. The only caveat to be aware of is that because the output is first buffered into memory, you may run into problems if the output file is too large (i.e. larger than the amount of free memory you have). However I've very rarely found that to be the case, and I'm a happy regular user of sponge.
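
To make the idea concrete, here's a toy C version of sponge (purely illustrative; it's not how moreutils actually implements it, and it has only minimal error handling):

/* toy sponge: soak up all of stdin, then (and only then) open and
 * truncate the output file */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
  if (argc != 2) {
    fprintf(stderr, "usage: %s FILE\n", argv[0]);
    return 1;
  }
  char *buf = NULL;
  size_t len = 0, cap = 0, n;
  char chunk[4096];
  while ((n = fread(chunk, 1, sizeof(chunk), stdin)) > 0) {
    if (len + n > cap) {
      cap = cap ? cap * 2 : sizeof(chunk);
      if (cap < len + n)
        cap = len + n;
      buf = realloc(buf, cap);  /* a real tool would handle realloc failure better */
      if (buf == NULL)
        return 1;
    }
    memcpy(buf + len, chunk, n);
    len += n;
  }
  /* the output file is only opened (and truncated) after stdin hits EOF */
  FILE *out = fopen(argv[1], "w");
  if (out == NULL)
    return 1;
  fwrite(buf, 1, len, out);
  if (fclose(out) != 0)
    return 1;
  free(buf);
  return 0;
}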


5-HTP Contraindications

I recently came across a really interesting article called "5-HTP efficacy and contraindications" on the NIH website. I found the article because I had been orally supplementing my diet with 200mg 5-HTP capsules, and I wanted to do some more research on what evidence there was for the efficacy of 5-HTP. I found the article to be really surprising, and I want to summarize it here. Of course if you have any questions, or want to know more about this, it's better to just read the article itself rather than my summary here. Disclaimer: I am not a doctor or medical professional.

5-HTP is a precursor to the neurotransmitter serotonin. Serotonin is a critical neurotransmitter that serves a number of functions in the brain, among them contributing to feelings of well-being and happiness. Low serotonin levels have been linked with depression and other mood disorders (although not all cases of depression are caused by low serotonin levels). Serotonin does not cross the blood-brain barrier, which means that serotonin levels in the brain cannot be increased by direct supplementation. However, 5-HTP does cross the blood-brain barrier. In the brain 5-HTP is metabolized by an enzyme called AAAD into serotonin. Therefore oral supplementation of 5-HTP can increase latent serotonin levels in the brain. The pop science marketing of 5-HTP supplements suggests that by taking 5-HTP you can improve mood and happiness, and possibly even counteract the effects of depression, since taking 5-HTP can increase serotonin levels.

What the article I linked to shows is that not only is there poor evidence that 5-HTP supplementation alone can improve mood or combat depression, there's actually a lot of circumstantial evidence to suggest that it can make those conditions worse. That is what is meant by the medical term contraindication—that 5-HTP supplementation can potentially have a deleterious impact on mental health.

The article first summarizes some of the existing studies on 5-HTP supplementation. It points out that studies where 5-HTP was administered alone have not shown significant improvement for treating disorders like depression. Then it points out that in the studies where 5-HTP supplementation was shown to be effective, the researchers were also controlling for dopamine levels with another drug like carbidopa. From what I gathered from the article, the current state of the art in medicine allows doctors to measure serotonin and dopamine levels and to manage them carefully in a clinical setting. However, over-the-counter oral 5-HTP supplementation isn't controlled or monitored, and without the administration of a drug like carbidopa one isn't controlling latent dopamine levels.

The article goes on to point out that there's a lot of evidence to suggest that oral 5-HTP supplementation can actually cause dopamine levels to become imbalanced. The synthesis of 5-HTP into serotonin is catalyzed by the enzyme AAAD, and that same enzyme catalyzes L-DOPA into dopamine. Excess 5-HTP can compete with the L-DOPA reaction and therefore cause dopamine levels to be depressed. There's evidence to show that other monoamine pairs can get out of balance when one monoamine precursor dominates another. Low dopamine levels are associated with a number of disorders, notably Parkinson's disease, and have also been implicated in depression. So there is evidence to suggest that taking 5-HTP to increase serotonin levels can actually lower dopamine levels, and can therefore potentially worsen symptoms of depression.

On the basis of this article it doesn't seem safe to me to take supplementary 5-HTP for extended periods of time, or possibly at all.