
Crosh/SSH and ChromeOS

I have a Chromebook Pixel, which I like a lot and use regularly. Generally I use my Chromebook for non-coding tasks; for coding I prefer to use my Thinkpad (which has Fedora installed) because it has a "real" coding environment.

However, sometimes I'll code on my Chromebook as well, e.g. when I'm traveling and it's more convenient to bring with me. When I write code on my Chromebook I use a Debian chroot environment via crouton to do development. This is a pretty reasonable facsimile of writing code on a "real" Linux laptop: you can use things like Emacs, GCC, GDB, and all of the other GNU/Linux tools you know and love.

The biggest pain point of doing development this way is the window management in ChromeOS. I like having lots of terminal windows open, and ChromeOS makes it hard to do this. The "normal" and well-documented way to create a new terminal is to type Ctrl+Alt+T, which opens a new crosh session as a tab in a browser window. From the crosh tab you can get a terminal by typing shell. But this is pretty inconvenient: the new crosh session starts out as a browser tab, not a new window, and there's a lot of typing.

There's an extremely poorly documented feature of the SSH app that makes this much easier. From the SSH app you can get a new crosh terminal window by setting the host to be the magic string >crosh and then supplying any username. This way when you type Ctrl+Shift+N from an existing crosh shell, a new window will pop up ready to become a terminal.

x86 Register Encoding

If you've ever looked at the disassembly output for C or C++ code, you've probably noticed that there are a lot of push/pop instructions. And if you pay close enough attention, you'll notice that the compiler prefers to use certain registers over others. In particular, compilers will prefer pushing "old" registers like RBP (i.e. the ones available on 32-bit x86 CPUs) instead of "new" registers like R15 (which aren't available in 32-bit mode).

The C calling convention on x86 systems specifies that callees need to save certain registers. There are a few different names for these kinds of registers, such as nonvolatile registers, callee-saved registers, and so on.

If a method needs to use some registers, it's best to use the volatile registers, since they don't require additional push/pop instructions. However, there are only a few volatile registers. If a method needs additional registers it will have to dip into the nonvolatile set. These nonvolatile registers must be pushed on function entry, and popped on function exit. So that explains the first part: why these registers are pushed/popped at all.

But what about the second part: why does the compiler prefer pushing/popping the old registers RBX, RBP, RDI, RSI, and RSP over the new registers R12, R13, R14, and R15?

The answer lies in the historical legacy of how instruction encoding worked on 32-bit systems. First let's look at a chart showing the different general purpose registers, and their characteristics:

Number Register Volatile? Old/New
0 RAX Yes Old
1 RCX Yes Old
2 RDX Yes Old
3 RBX No Old
4 RSP No Old
5 RBP No Old
6 RSI No Old
7 RDI No Old
8 R8 Yes New
9 R9 Yes New
10 R10 Yes New
11 R11 Yes New
12 R12 No New
13 R13 No New
14 R14 No New
15 R15 No New

Since the designers of x86 knew that these registers were going to be pushed/popped all the time, they wanted to try to make the push/pop instructions really compact. So they reserved one-byte instruction encodings to push/pop every register. This is pretty unusual: there aren't too many instructions that can be encoded with a single byte. The one-byte instruction encodings are only used for the most common instructions.

To push a register, you take the number in the chart above and add it to 0x50. So if you want to push RSP, the instruction is 0x54, which is 0x50 + 4.

To pop a register, you take the number in the chart above and add it to 0x58. So if you want to pop RSP, the instruction is 0x5c, which is 0x58 + 4.

As you can see, they only reserved space for eight registers when pushing; the same is true when popping. This makes sense, because at the time there were only eight general purpose registers. However, this is a problem because no space was reserved for the higher numbers.

When they designed the 64-bit version of x86 they came up with a clever solution to this problem, aimed at keeping backwards compatibility with 32-bit systems. They added new prefix bytes to indicate that certain fields should be treated as the extended versions. The details of how this works are a bit complicated, but for a push or pop instruction the prefix 0x41 means that the register should be considered the extended version, and eight is subtracted from the register number when encoding.

Here's an example. Suppose we want to push R9. The first byte of the instruction is 0x41. The second byte is 0x50 + (9 - 8) = 0x51. Thus the full encoding will be 0x4151.

Suppose we want to pop R14. The encoding will be 0x41 followed by 0x58 + (14 - 8) = 0x5e. Thus the fully encoded instruction will be 0x415e.

As you can see, the original eight registers have more compact encodings: they can each be pushed and popped with a single byte, whereas the new registers require two bytes to push/pop. This applies to certain other instructions too, not just push/pop. The actual time it takes to execute a push/pop is the same either way, so no CPU cycles are actually saved. But using smaller instructions means a slightly smaller executable, less data for the decoding pipeline to process, and instructions that are more likely to stay in caches. So if you can, it's better to use the old registers: you'll save a byte for each push/pop.

Pyflame Dual Interpreter Mode

I recently implemented "dual interpreter mode" for Pyflame, which allows Pyflame to be compiled to target both Python 2 and Python 3 at the same time, in the same executable. This is extremely unusual, and Pyflame is the only Python project implemented in C/C++ that I am aware of that has this feature. In this post I'll explain how I implemented this feature for Pyflame.

The Problem

In order for Pyflame to work, it has to know a lot of details about the internals of the Python interpreter. The most important thing it must know is the struct offsets for various fields in Python objects. As an example, Pyflame needs to know the offsets for where to find things like a frame's pointer to a "code" object. If Pyflame thinks that the code object is at offset 48, but it's actually at offset 56, then Pyflame will read garbage when trying to decode the stack.

Fortunately you can get all of these offsets from Python.h, and this is exactly what Pyflame does, and has always done. Unfortunately, these struct offsets differ between Python 2 and Python 3. This means that when you compile Pyflame you either give it the Python.h for Python 2, or the Python.h for Python 3, and the resulting Pyflame executable can only profile that version of Python.

There's another complication, which is that Python 2 and Python 3 declare many of the same symbols. This means that even aside from this struct offset issue, normally you wouldn't be able to compile an executable that links against say libpython2.7.so and libpython3.5m.so at the same time.

However, Pyflame isn't a normal C++ program. It actually only uses the Python headers to get struct offsets, and does not link against libpython. So in principle you could come up with a way to build a "dual" Pyflame executable that can profile both Python 2 and Python 3 processes.

While it's an interesting thought experiment to think about how to build a Pyflame that can support both Python interpreter versions at once, it's not really that useful. Most people are either using Python 2 or Python 3, so just supporting one at compile time is not a big deal. People who need both Python versions can just compile two versions. So I had created an issue to remind myself to look into this, but I had considered it very low priority.

This changed recently when I decided to try to get Pyflame into Fedora, and it occurred to me that if I actually did this crazy dual-interpreter mode it would make my packaging life a lot easier. Instead of maintaining python2-pyflame and python3-pyflame, I'd be able to just add a single package. And since there's no linking dependency, I can support both Python interpreters essentially for free. So off I went.

The Solution

There are two parts to solving this. The first is how the code is refactored to support two Python releases with minimal code duplication. The second is how the build system (autoconf/automake) needed to be changed.

If you'd like to follow along with the changes, please see PR 42.

Code Changes

The code for Python 2 and Python 3 is 95% the same in my estimation. The struct offsets in Python 2 and Python 3 do differ, but other than that the only material change is how strings work in both releases, which is easy to work around with preprocessor macros, and which I had already done.

The solution I came up with here is to define a file called frob.cc which implements all of the Python internals logic. This file includes Python.h as usual, and uses compile-time logic to select the right struct offsets for the Python version being targeted.

There are two stub files that include frob.cc: frob2.cc includes it configured to build for Python 2, and frob3.cc includes it configured to build for Python 3. The file frob.cc itself is never built into an object; only frob{2,3}.cc are actually compiled and linked.

There's another set of files, pyfrob.{cc,h}, that implement the runtime logic: detecting the Python version of the target process and dispatching to the right frob implementation.

The way I implemented this, Pyflame does all of this runtime logic just once, when it starts up. Then while it's running it invokes the interpreter-specific bits through function pointers. This is a pretty small optimization, but it avoids additional runtime branching.

Most of the work here was actually refactoring the existing code to be consolidated into fewer files, and the new logic for detecting the Python version. I ended up touching most of the Pyflame codebase to get this to work. The preprocessor macros are pretty hairy in my opinion, but ended up working out fine.

Automake Changes

There's a lot of compilation logic that needed to change for this to work.

The hardest part of this was figuring out how to compile frob2.cc and frob3.cc with different include paths. I found an automake documentation page called Per-Object Flags Emulation, which is short, but does cover how to do this. I actually ended up bringing in libtool (which is a compile-time dependency only) since it provides some convenience methods.

I also had to change a lot of logic in my configure.ac so it would know how to pick between the two. The current solution detects what Python releases are on the system, and enables all of the supported ones. I'm not super happy with this: in the future I'll probably revisit the code to allow building against just one release or another.

Next Steps

Once I get my PR reviewed and landed I'm going to tag a major new version of Pyflame, and then try to base my Fedora package submission on that. The next major feature I'm adding to Pyflame will be from issue 13. This issue describes rewriting the code to not use _PyThreadState_Current, and instead find the global "interpreter" list and use that to find the threads. This will let me get stack traces from idle threads which has a ton of really interesting use cases.

Call Stacks and Link Registers

When your computer executes code, there's an implicit call stack at any given point in time. The call stack holds the necessary information for functions to return to their caller. For instance, let's say we have this C code:

void foo(int n) {
  for (int i = 0; i < n; i++) {
    printf("i = %d\n", i);
  }
}

After each call to printf(), the routine needs to return to a specific place in foo(). With nesting of functions this gets even more complicated:

void bar(int n) {
  if (n % 2) {
    puts("odd branch");
  } else {
    foo(n / 2);
    puts("even branch");
  }
}

Now printf() needs to remember how to return to foo(), and foo() has to remember how to return to two different places within bar(). And presumably there's some other function that called bar(), and bar() needs to know how to return to that.

The obvious data structure to hold this information is a stack. The way this might work is you'd push a return address onto this stack when making a function call, and then when you wanted to return to your caller you'd pop the return address off the stack.

In fact, this is exactly how pretty much every computer architecture works. The complication, however, is that there isn't actually a dedicated stack for call information. Instead the regular program stack is used to store the return addresses. Thus these return addresses are intermingled with regular program data on the stack.

Specifically here's how it works in x86 assembly. You call a function like this:

call function_or_address           # function call

The call instruction does two things: it pushes the address of the next instruction onto the stack, and then it jumps to the call target. Later, when the called function is ready to return, it executes the return instruction:

ret                                # return to caller

This instruction also does two things: it pops the top of the stack, and then jumps to the address that was popped off.

Link Registers

This is all pretty reasonable so far: since function calls can have arbitrary nesting, we need to use a stack of some kind to store the function call stack data. There's not a whole lot of other options.

Actually, there's a neat trick that basically every modern architecture other than x86 implements. The technique is to use something called a "link register", henceforth called an LR. On some platforms there's a special register for this (e.g. the LR in ARM is actually called LR) while on some architectures it's done by convention (e.g. x1 is used for this purpose on RISC-V). The basic idea is that the return address for a function is stored in this register, and a "return" instruction jumps to the location held in that register.

But wait: how does this work with function nesting? Imagine a calls b, and b calls c. In the call to c the LR will be set to some address in b. Then after the call completes the LR will have been clobbered, and b will have forgotten how to return to a.

The answer is that since the LR is indeed clobbered, in the general case you need to save the LR somehow before making another function call. Typically you would save the LR by pushing it onto the stack before the function call, and popping it after—just like we described for the x86 architecture. So we're back to where we started. Or are we?

Even with this restriction, one nice property of the LR system is that if a function calls multiple other functions we only need to save the LR once. For example, imagine this code (in a hypothetical assembly language):

  push %lr                         # save the LR
  call foo                         # call foo()
  call bar                         # call bar()
  pop %lr                          # restore the LR
  ret                              # return to caller

In this example we did two function calls but only had to do one save/restore. So things are already better than x86. Pretty great, right?

Additionally, "leaf" functions that don't call other functions don't need to touch the stack at all: their return address stays in the LR, and the data they need can live entirely in CPU registers.
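For example, a leaf function in our hypothetical assembly language can compute its result and return without ever saving the LR:

  add %r2, %r3, %r1              # do all the work in registers
  ret                            # jump to the address in %lr

No push, no pop, no memory traffic at all.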

The downside to this approach is that the explicit saves/restores of the LR mean that there are more instructions to execute. This is generally true of CISC vs RISC architectures: the CISC design leads to more compact instruction sequences, but (hopefully) the extra simplicity of the RISC architecture offsets that by allowing for faster or more efficient instructions.

Towers of Hanoi in x86 Assembly

One of the things I learned this year is x86 assembly. The way I learned it wasn't actually by writing it, but by reading a lot of disassembly output. This means that although I'm pretty proficient at following along in assembly listings, I'm not that good at actually writing it from scratch. I decided to try to correct that, and today I wrote my first real x86 assembly program.

I decided to implement Towers of Hanoi for a few reasons.

Especially important was that I wanted to write my solution without linking against libc.

Throughout this post I'll have sample listings from my solution. If you want to see the whole thing though (say, you'd like to compile it yourself) you can find it here on GitHub.

This post is a bit long, and you may want to skip towards the end.

Getting Started: Towers Of Hanoi In C

I decided to start my journey by first implementing Towers of Hanoi in C. I took special care to write it in a way that I thought would translate easily into assembly. In particular, I wanted to minimize the calls I made to libc. I ended up only using two calls: calloc() to allocate my arrays, and putchar() for printing.

I have three towers, each is represented as a struct tower allocated on the stack that's like this:

struct tower {
  int size;
  int *stack;
};

The size field holds the number of rings stored in that tower's stack, and the stack field is a pointer to an array holding the stack. The base of the array holds the width of the bottom-most ring, and moving up in memory yields the widths of the successive rings.

I define three operations: peek(), pop(), and push(). They have very straightforward implementations:

// return the value of the top ring
int peek(struct tower *tower) {
  if (tower->size) {
    return tower->stack[tower->size - 1];
  }
  return INT_MAX;
}

// remove the top ring and return its value
int pop(struct tower *tower) {
  int val = tower->stack[tower->size - 1];
  tower->stack[tower->size - 1] = 0;
  tower->size--;
  return val;
}

// push a new ring (without error checking)
void push(struct tower *tower, int val) {
  tower->stack[tower->size] = val;
  tower->size++;
}

I implemented an iterative solution to Towers of Hanoi, which is not the most intuitive way to think about the problem (the classic solution is recursive). This is a good explanation of the algorithm, with diagrams. The idea is that there are three towers: the source tower, the aux tower, and the dest tower. The source tower is the one that starts out with rings. The other two towers are the aux/dest towers, and which one is which depends on whether there is an even or odd number of rings.

The algorithm works by successively picking pairs of towers and then doing whatever is the valid move between those two towers. So given a pair, only one move will be legal, and the algorithm picks the legal move. That logic is implemented here by move():

// make the only valid move between a/b
void move(struct tower *a, struct tower *b) {
  if (peek(a) > peek(b)) {
    push(a, pop(b));
  } else {
    push(b, pop(a));
  }
}

The algorithm itself just cycles through the tower pairs (S, D), (S, A), and (A, D), making the one legal move for each pair in turn.

The algorithm takes (1<<N) - 1 moves for N rings, so we maintain a counter and stop after enough moves have passed. The full code for the solve loop is like this:

int solve(int n) {
  struct tower s, a, d;
  init_tower(&s, n);
  init_tower(&a, n);
  init_tower(&d, n);

  int i;
  for (i = 0; i < n; i++) {
    push(&s, n - i);
  }

  i = 0;
  int total_moves = (1 << n) - 1;
  for (;;) {
    print_all(&s, &a, &d, n);
    move(&s, &d);
    if (++i == total_moves) {
      break;
    }
    print_all(&s, &a, &d, n);
    move(&s, &a);
    if (++i == total_moves) {
      break;
    }
    print_all(&s, &a, &d, n);
    move(&a, &d);
    if (++i == total_moves) {
      break;
    }
  }
  print_all(&s, &a, &d, n);
  return 0;
}

There's one last thing needed here, the code for print_all(). I'm not going to include that here since it's boring, but the idea is you print the towers in order SDA if N is even, and in order SAD if N is odd.

Getting Started: Towers Of Hanoi In x86 Assembly

For the x86 code I implemented a similar solution. Since I don't have a lot of experience writing x86 I did things really weirdly, and have a calling convention kind of like the C one, but different.

The solve loop starts like this (with the number of rings in %rdi):

  push %rdi
  mov %rsp, %rbp

  # S tower
  sub $64, %rsp
  mov %rsp, %rdi
  call init_tower

  # A tower
  sub $64, %rsp
  mov %rsp, %rdi
  call init_tower

  # D tower
  sub $64, %rsp
  mov %rsp, %rdi
  call init_tower

So basically the value at (%rbp) will be the ring count, -64(%rbp) is the first tower, -128(%rbp) is the second tower, and -192(%rbp) is the third tower.

The init_tower method zeroes the 64 bytes starting at %rdi:

###  zero the memory in a tower
init_tower:
  xor %rax, %rax                # zero %rax
  movl $64, %ecx
  rep stosb
  ret

Next I need to initialize the rings in the first tower:

  # initialize S tower
  lea -64(%rbp), %rax
  mov (%rbp), %rcx
  mov %rcx, (%rax)              # set the size of the tower
  add $4, %rax
.init_s:
  mov %rcx, (%rax)
  add $4, %rax
  loop .init_s

This copies the ring count into the first 4 bytes of the first tower, and then stores the ring widths in descending order in the 4-byte integers that follow. So let's say we're doing Towers Of Hanoi with N=3. Then the first tower will have memory initialized like {3, 3, 2, 1, 0, 0, ...}. The leading value is the number of rings, and each value after that is the width of the next ring.

Next I prepare to loop through all of the possible moves. I did this by storing the loop variable in %r15 and the total number of moves in %r14:

  mov (%rbp), %cl               # copy N into %cl
  mov $1, %r14                  # shift operand
  shl %cl, %r14                 # 1 << N
  dec %r14                      # now %r14 holds (1<<N)-1
  xor %r15,%r15                 # the loop variable

The core loop looks the same as it does in the C version. There's three similar blocks that each look like this:

  lea -64(%rbp), %rdi           # S tower
  lea -192(%rbp), %rsi          # D tower
  call move_tower

  inc %r15                      # increase loop counter
  cmp %r14, %r15                # compare to loop end
  jge .leave_solve              # leave if done

This one does the iteration checking S and D. The other blocks are the same except the first two lea instructions will be different to refer to S and A or A and D.

The next function is move_tower, which is the same as move() from the C code. It moves a ring from the tower with the smaller ring to the tower with the larger ring:

### rdi and rsi should be towers to move between
move_tower:
  ##  compare the top rings in the two towers
  mov %rdi, %r9
  call peek_tower
  mov %rax, %r10
  mov %rsi, %rdi
  call peek_tower
  mov %r9, %rdi
  cmp %rax, %r10
  jl .less_branch
  ## swap rdi and rsi
  mov %rdi, %rax
  mov %rsi, %rdi
  mov %rax, %rsi
.less_branch:
  ## source is rdi, dest is rsi
  call pop_tower
  push %rdi
  push %rsi
  mov %rsi, %rdi
  mov %rax, %rsi
  call push_tower
  pop %rsi
  pop %rdi
  jmp print_all

Here are peek_tower, pop_tower, and push_tower. They're all very straightforward translations from the C code:

###  peek the top of the tower
peek_tower:
  movl (%rdi),%ecx
  cmpl $0, %ecx
  jz .peek_empty
  dec %ecx
  movq 4(%rdi, %rcx, 4), %rax
  ret
.peek_empty:
  movl $100000, %eax
  ret

### like peek tower, but remove the value
pop_tower:
  movl (%rdi),%ebx
  dec %ebx
  movl 4(%rdi, %rbx, 4), %eax   # copy the value to %eax
  movl $0, 4(%rdi, %rbx, 4)     # zero the value
  mov %ebx, (%rdi)              # decrease count
  ret

### the tower in %rdi, the val in %rsi
push_tower:
  movl (%rdi),%ecx
  mov %rsi, 4(%rdi, %rcx, 4)
  inc %ecx
  mov %ecx, (%rdi)
  ret

That's pretty much it. Most of the rest of the code is logic for printing out the towers, and there's too much boilerplate to include here. The core of it is this method:

print_one:
  mov %rdi, %rsi
  movl $1, %eax                 # SYS_write
  movl %eax, %edi               # fd = 1
  movl %eax, %edx               # n = 1
  syscall
  ret

This prints the 1-byte buffer at %rdi. To convert ring values to characters I add the value of '0', same as in the C code. Printing a tower involves calling a loop with print_one pointing to a ring and then print_one pointing to a space. After the loop I call print_one with it pointing to a newline. The same trick of checking if there is an even or odd number of rings is needed in the print_all method, and I do that by checking the value in (%rbp), which is the same as when the program started.


GDB Stuff

It actually took me quite a while to get the tower initialization and printing code up, which is the first code that I wrote. Since I didn't have the printing code working the only recourse I had for debugging was using GDB. I ended up staying in GDB for pretty much the whole writing/debugging process and learned a few things.

The first piece of advice I have is to use the layout regs mode in GDB when debugging assembly. It makes it really easy to single-step and figure out what's changing when. I made a few early mistakes misunderstanding which registers would be clobbered in a few situations, and the layout regs mode made it possible to track those problems down.

Another thing I found useful with GDB is inserting function labels at various places. You can put a function label "inside" another function and the code will fall through when calling the original function. This makes getting break points at specific parts of assembly routines really easy: insert a label with any name, and then set up a GDB break point at that label. The only downside I found to this is that it makes disassembly a bit strange, since if you disassemble the original function the disassembly output stops at the label you created.

I got to use the reverse debugging mode in GDB for the first time while writing this, to debug a segfault. The idea is that you start recording at some point before your program crashes. Then you can use reverse-stepi and reverse-nexti to backtrack from where the segfault happens. My segfault ended up being due to a misadjusted stack pointer when executing a ret instruction.

Code Lessons

I kind of made up my own calling convention as I wrote the program. It's similar to the regular x86 C calling convention but in some situations I decided to clobber extra registers when I knew the callers didn't need them. That works for a program that's really small like this, but it's not ideal. And it won't work at all if you need regular C interop.

In retrospect the program would almost certainly have been simpler had I just followed the C calling convention exactly.

I also neglected to use frame pointers, and I think that would have also made debugging simpler. Certainly the GDB output would have been more understandable.

I found the 32-bit/64-bit stuff really difficult to get right. A number of times I made mistakes where I used 64-bit values in 32-bit contexts and ran into problems. I still don't fully understand the rules on when I can use the 32-bit forms and when I have to use the 64-bit forms. This is something I need to come back to.

Next Steps

I plan to compare the code that GCC generated at optimizations levels -O0 and -O2 to my own code to see where I could improve. My initial thoughts are "probably everywhere" since even as I was writing this code I recognized a bunch of areas where I was doing redundant loads. I'm also going to start thinking of another program to do in x86, hopefully something more complicated with more memory management.

Go Compiles Infinite Loops Very Efficiently

The Go compiler will compile this code:

func init() {
start:
    goto start
}

into this:

0000000000401000 <main.init.1>:
  401000:       eb fe                   jmp    401000 <main.init.1>

Declare Your Bash Variables

Everyone knows that "regular" Bash variables are just strings. Bash doesn't really know about integers or other numeric types. Right?

Not really. You can "declare" an integer variable like this:

declare -i foo

This makes foo into an integer type variable! Here's a more concrete example, showing what an integer type actually does:

# declare our variable as an integer type
declare -i foo

# assign 2 to foo and print it
foo=2
echo $foo              # prints 2

# what happens when we assign a string?
foo="hello, world"
echo $foo              # prints 0!

As you can see, once foo is declared as an integer variable, assigning a non-numeric string to it stores 0 instead of the string. There are other nifty things you can do with declare, such as declare -r for "readonly" variables. And more options are in the manual!

Bash scripting is notoriously difficult to get right. If you take the time to use the declare feature it can help you keep your Bash scripts readable and correct.


localhost Is Ambiguous

As software engineers we all know about localhost, the hostname for the local machine. And as we all know this is the same as, right?

Not so fast.

First, does mean localhost, but it actually means something more specific: IPv4 localhost. Of course there's another localhost for IPv6, and that's ::1. So we know that means localhost, and we know that ::1 means localhost. But what does localhost mean? Well, it could refer to either the IPv4 loopback address or the IPv6 loopback address.

Here's what's in /etc/hosts on my machine (Fedora 24):   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

As you can see localhost matches both and ::1. So what are the rules for what glibc will do in this situation when resolving localhost? What are the rules for what Python will do when connecting? Will getaddrinfo(3) prefer the IPv4 address or the IPv6 address, and will it even be consistent? As you can see there's a lot of ambiguity here.
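If you're curious what your own resolver does, here's a small C sketch using getaddrinfo(3) that prints the address families returned for a name; resolving the numeric forms doesn't depend on /etc/hosts, but the "localhost" result will vary by machine:

```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netdb.h>

/* Print the address family of each result getaddrinfo returns for a name. */
static void show(const char *name) {
  struct addrinfo hints, *res, *it;
  memset(&hints, 0, sizeof(hints));
  hints.ai_family = AF_UNSPEC;    /* accept both IPv4 and IPv6 */
  hints.ai_socktype = SOCK_STREAM;
  if (getaddrinfo(name, NULL, &hints, &res) != 0) {
    printf("%s: resolution failed\n", name);
    return;
  }
  for (it = res; it != NULL; it = it->ai_next)
    printf("%s -> %s\n", name, it->ai_family == AF_INET ? "IPv4" : "IPv6");
  freeaddrinfo(res);
}

int main(void) {
  show("localhost");              /* may print IPv4, IPv6, or both */
  show("");
  show("::1");
  return 0;
}
```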

Ideally all of your software works seamlessly with both IPv4 and IPv6 and this isn't an issue. But if you rely on using localhost know that there is ambiguity, and that you can't rely on using one protocol or the other.

The other issue to be wary of is how software represents remote addresses. Most software programs will happily resolve localhost when making outgoing connections, but typically they won't try to map an incoming address of or ::1 back to a human readable name like localhost. So if you're used to using localhost there's going to be an asymmetry, where your logs are filled with IP addresses but your configs use "localhost". This can lead to nasty surprises in unexpected situations, such as having a Varnish rule that matches against the regex ^localhost fail because Varnish represents the remote peer as

I recommend that you pick either or ::1 and stick to it consistently. By being explicit about the protocol and using the same format that the machine uses when displaying socket addresses, you'll run into a lot fewer surprises.


gocd: Quickly Navigating Go Project Directories

I've been writing a lot of Go recently at work, and I made an interesting tool for navigating my Go projects. The problem I ran into is that Go enforces a rigid directory structure that is very nested. I have my GitHub projects in ~/go/src/github.com/ORG/REPO, my work projects in ~/go/src/my.work.place/WORKORG/REPO, and various other more complicated directory layouts for projects cloned from other places. Remembering where I had cloned things was a challenge and a lot of typing.

I wanted to be able to type something like gocd foo and then automatically be taken to the right project directory for foo, wherever that was. My first attempt at this was in Bash and looked like this:

gocd() {
    local bestdir=$(find "${GOPATH}/src" -type d -name "$1" | awk '{print length, $0}' | sort -n | head -n 1 | cut -d' ' -f2-)
    if [ -n "$bestdir" ]; then
        cd "$bestdir"
    else
        return 1
    fi
}
This finds the shortest matching directory in my $GOPATH and worked great. The problem is that it also explores .git/ and vendor/ directories. This isn't a problem when my filesystem caches are warmed up, but when they're cold (e.g. after booting) this command runs slowly if there are a lot of repos checked out.

I had the idea to rewrite this in C++ with the following design constraints:

- Skip .git/ and vendor/ directories entirely, since matches inside them are never interesting.
- Minimize disk I/O, since that is by far the most expensive part of a cold-cache search.
- Return the shortest matching directory, just like the Bash version.

I coded this all up and it's indeed very fast. I can find a project in under 20 milliseconds with my filesystem caches completely cold, compared to more than one second before. When the filesystem caches are warm the command completes in one or two milliseconds.

The current implementation uses a std::vector<std::string> of directory candidates, where the vector is sorted by string size. This allows finding the best candidate efficiently (it's at the back of the vector). For the same reason each candidate can be removed from the search list efficiently, since it just requires popping the back element from the vector. In theory this could be made even more efficient by either using my own custom merge sort (since I have one sorted vector and one unsorted vector), or by using a min-heap keyed on string size. In practice I'm not sure that either of these would actually improve things since my current solution correctly optimizes for minimizing disk I/O which is by far the most expensive part of the search.

To actually use this I install the C++ executable as _smartcd and then have a Bash wrapper called gocd() to tie things together:

gocd() {
    if [ -z "${GOPATH}" ]; then
        return 1
    fi
    local best=$(_smartcd "${GOPATH}/src" "$1")
    if [ -n "$best" ]; then
        cd "$best"
    else
        return 1
    fi
}

If you find this type of thing useful you can find it on my GitHub at github.com/eklitzke/tools. There are some other CLI tools in here written in C++ that I use for other things in my Bash prompt too. As with all my other projects, all of the code is licensed under the GPL.

Using ssh-agent and ED25519 Keys on GNOME

The default GNOME keyring (gnome-keyring-daemon) provides an inferior version of ssh-agent. The GNOME version doesn't support ED25519 keys, and it doesn't have proper support for removing keys. I figured out a kind of elegant way to fix this in Fedora 24, one that should remain compatible with the bright and glorious Wayland future, including gdm-wayland-session.

First I disabled the SSH component of gnome-keyring-daemon by removing the Autostart lines from /etc/xdg/autostart/gnome-keyring-ssh.desktop. On a clean install of GNOME/Fedora, after doing this you should see that SSH_AUTH_SOCK is no longer set upon login.

Next I created a systemd user unit file at ~/.config/systemd/user/ssh-agent.service with the following content:

[Unit]
Description=OpenSSH private key agent

[Service]
Type=forking
Environment=SSH_AUTH_SOCK=%t/ssh-agent.socket
ExecStart=/usr/bin/ssh-agent -a $SSH_AUTH_SOCK
ExecStartPost=/usr/bin/systemctl --user set-environment SSH_AUTH_SOCK=${SSH_AUTH_SOCK}

[Install]
WantedBy=default.target

I also enabled this unit with systemctl --user enable ssh-agent.service. After doing this, upon logging in to a GNOME session you should see that SSH_AUTH_SOCK is still not set, but now you will see that an ssh-agent process is started with a command like /usr/bin/ssh-agent -a /run/user/1000/ssh-agent.socket. You'll also see that systemctl --user show-environment has the correct value for SSH_AUTH_SOCK.

Finally I put this in my ~/.bash_profile:

eval $(systemctl --user show-environment | grep SSH_AUTH_SOCK)

This will cause the right value for SSH_AUTH_SOCK to be propagated to your X11 session. This works because gdm-x-session sources ~/.bash_profile when logging in (or at least it does on Fedora).

From what I understand from the GNOME bugzilla, gdm-wayland-session will automatically know how to get SSH_AUTH_SOCK from the systemd user session, I believe starting in GNOME 3.22. This means in the future you won't need these lines in ~/.bash_profile.

Systemd Socket Activation

I recently wrote my first service that has (optional) systemd socket activation, and I have to say: I really like it.

It's simple to use, and it makes writing services with graceful restarts easy. In the simplest usage you add a small amount of boilerplate to your code:

int listen_sock;
int fd_count = sd_listen_fds(0);
if (fd_count == 1) {
  listen_sock = SD_LISTEN_FDS_START;
} else {
  // set up listen_sock with the usual socket()/bind()/listen() calls here
}

struct sockaddr_storage addr;
socklen_t addrlen = sizeof(addr);
for (;;) {
  int client_sock = accept(listen_sock, (struct sockaddr *)&addr, &addrlen);
  if (client_sock < 0) {
    break;
  }
  // handle client_sock here
}

The only code here that's different is that we call sd_listen_fds() before doing the usual socket()/bind()/listen() setup. If the process has been started in socket activation mode, it will start with a ready listen socket. There's also some systemd configuration you have to write to use this; namely, you'll have to create a short file named myservice.socket that has the listen address, and then:

systemctl enable myservice.socket
systemctl start myservice.socket
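
For reference, a minimal myservice.socket might look like this (the unit name and port here are hypothetical; your listen address will differ):

```ini
[Socket]
ListenStream=8080

[Install]
WantedBy=sockets.target
```

With a matching myservice.service unit installed, systemd listens on the port itself and starts the service on the first incoming connection, handing it the already-bound socket.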

Just with this code things are already an improvement: when you restart the process (say, because you're upgrading the code) the listen socket will be created by systemd as soon as the old process exits, meaning that there's a shorter gap during which no listen socket is available to clients.

There's also an easy way to do completely graceful process restarts using sd_notify_with_fds(). Briefly, the way this works is that as your process is exiting, you use this API call to hand file descriptors (along with their "names") back to systemd. Then when the new process starts up it can request those same file descriptors back with sd_listen_fds_with_names(). You can transfer just the listen socket this way (so there's no gap whatsoever where the listen socket is closed), and you can also use this mechanism to transfer client sockets: it's up to you. If you want to transfer client sockets you'll have to come up with your own serialization format to send any connection state over. Systemd has some helpers for this too.

There have always been ways to do this: you've been able to transfer file descriptors over AF_UNIX sockets on Unix systems for an eternity. But systemd makes this process much easier, and does it in a way that's integrated with init, which is really nice.

Esc is Ctrl-[

I've been a user of vi basically since I started programming.[1] The problem that every vi user has is that on most keyboards the Esc key is in an inconvenient location. I tried a number of tricks like remapping Esc and Caps Lock, but wasn't super happy with any of them.

Then one day I learned that Esc generates the same key sequence as Ctrl-[. Since that day I've abandoned the Esc key and solely rely on Ctrl-[. The [ key is already convenient enough for me (and one I have to type a lot anyway). Since I don't need Caps Lock I typically make mine into an extra Ctrl key. This is great because it gives me a whopping three Ctrl keys (be still, my beating heart). The Ctrl-[ sequence is easy to type since it uses both hands.

I don't know of anyone else in the habit of using Ctrl-[ this way, but I quite like it. If you're a vi user give it a shot, you might find it suits you too.

[1] This is actually a half-truth—I've been using Emacs with evil for a number of years now.

F-Market Line Considered Harmful

Over the last few years the city of San Francisco has done a lot of work to make the city more bicycle friendly, for which I sincerely applaud them. Initiatives like Vision Zero and the good work from the SF Bicycle Coalition have made a huge, positive impact on cycling in San Francisco. However, there's still a lot of work to be done, and in my opinion removal of the train tracks on Market St should be at or near the top of that list.

Light rail train tracks pose a huge problem to cyclists. The tracks are slick metal with grooves just wide enough to fit bicycle tires. A cyclist crossing the tracks with an insufficient angle will have their wheel caught in the tracks which will cause them to crash. This is especially dangerous when the front wheel is caught in the tracks (the most common case for unwary cyclists) since it will typically result in the handlebars suddenly twisting and getting stuck which will throw the cyclist over the handlebars and the front of the bike, frequently head first. I've witnessed a lot of accidents due to train tracks, and crashed myself due to them.

While light rail tracks can be dangerous to cyclists, light rail is also typically very convenient for commuters. There are two factors that make light rail especially convenient, for instance compared to buses. The first is that in situations where the light rail is underground or elevated, the trains can bypass normal street traffic. The second is that light rail trains can be easily made quite long, which give them greater carrying capacity than buses. There are also a number of things cities can do to mitigate the danger that light rail poses to cyclists. The best solution is to run trains below ground or on elevated platforms, as this puts the trains completely out of the way of cyclists. This has the added benefit of making the trains run faster, as already mentioned. When this isn't possible the next best thing is to avoid putting train tracks parallel to where cyclists are likely to be, and ideally to create convenient cycling corridors on adjacent streets to encourage cyclists to be out of the way of these train tracks.

San Francisco's F-Market & Wharves line (henceforth abbreviated "F-Market"), also known as the "Historic Streetcars" line, is a great example of all of the worst factors of light rail convenience and bicycle safety combined. The F-Market runs from 17th & Castro down the length of Market St to the Embarcadero, and then north someways along the Embarcadero towards Fisherman's Wharf. The line operates a "heritage streetcar service", meaning that the trains used are vintage streetcars from light rail systems around the country.

My main criticism of this line is that for nearly its entire route down Market St the train runs parallel to, and isn't separated from, traffic or cyclists. Technically the tracks don't enter the bike lane, but for most of its length the bike lane runs right alongside them. Cyclists regularly need to cross the tracks in routine traffic, for instance to maneuver around stopped vehicles. I have to imagine that these tracks are the most dangerous hazard facing cyclists in the city, simply because of their placement and the fact that Market St is so heavily trafficked by cyclists. There are other light rail lines in the city that aren't separated from traffic (e.g. the N Judah line), but those lines aren't on major cycling corridors and therefore in practice are significantly less dangerous to cyclists.

Furthermore the F-Market line isn't really convenient to anyone, at least not on the stretch down Market St. There are subway tracks underneath Market that serve all of the "Muni Metro" lines, the other light rail system. That is, the subway under Market St serves Metro lines J, K, L, M, N, and T. The Metro lines are much faster because they are underground and therefore avoid traffic. The Metro trains are also typically of much higher capacity since they use standard train cars instead of repurposed "heritage" cars. If the city were to restrict the F-Market line solely to its route along the Embarcadero (where it is faster and safer since it is separated from other vehicle traffic) very few people would be inconvenienced.

What makes this even more infuriating is that there are a lot of regular Muni bus lines already on Market St. The buses are actually cheaper to operate than light rail anyway. People who don't have convenient access to the Metro system or are traveling to a part of the city not served by one of these lines are already taking the other Muni buses operating along Market St. In my mind there is almost no practical reason for the F-Market line to operate on Market St at all. The only function it serves is for the aesthetic appeal of the historic streetcars. But this comes at a grave cost to cyclist safety.

If San Francisco is serious about improving cycling convenience and safety I implore them to consider restricting the F-Market line to its Embarcadero stretch, and to pave over the tracks on Market St.


Joey Hess has a collection of general purpose Unix utilities called moreutils. This is available on pretty much every Linux distribution I know of under the unsurprising package name moreutils. There are a few good things in here, but my favorite by far is the sponge(1) command.

Explaining what sponge is and why it's useful is easiest with an example. Suppose that you've done some complex shell pipeline and redirected the output to a file called file.txt. You realize that you've accidentally included some extra lines in the file, and you want to remove them using grep. Conceptually what you want to do is:

# incorrectly try to remove lines from file.txt
grep -v badpattern file.txt > file.txt

The issue is that this doesn't work—or at least, it doesn't do what you might expect. When you use the > operator in bash it immediately opens the file for writing, truncating it. This happens before the first part of the pipeline runs: in this case, before the grep command is invoked. This means that when grep runs it will see file.txt as an empty file, and consequently this shell invocation will always leave you with an empty file.
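
You can see the truncation for yourself in a throwaway directory (the filename and pattern here just mirror the example above):

```shell
# set up a file with two lines, then try the naive in-place grep
printf 'keep\nbadpattern\n' > file.txt
grep -v badpattern file.txt > file.txt || true  # > truncates file.txt before grep ever reads it
wc -c < file.txt  # the file is now 0 bytes
```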

One way to fix this is to use an intermediate output file, perhaps like this:

# correct version, using a temporary file
grep -v badpattern file.txt > tmp.txt
mv tmp.txt file.txt

This works because the output is fully written to a new file before replacing the input file. But it's two steps, and if you're like me you're likely to end up with a lot of random semi-processed files that you need to clean up later.

There's a better way to do this using sponge. You do it like this:

# correct version, using sponge
grep -v badpattern file.txt | sponge file.txt

What happens here is that sponge will buffer all of the data it gets on stdin into memory. When it detects the EOF condition on stdin it will then write all of the data it buffered to a file named by its argument. By convention you would use the input file as this argument. The end result is that file.txt won't be truncated until after it's been fully read by the left hand side of the pipe. The only caveat to be aware of is that because the output is first buffered into memory, you may run into problems if the output file is too large (i.e. larger than the amount of free memory you have). However I've very rarely found that to be the case, and I'm a happy regular user of sponge.
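
To make the buffering behavior concrete, here's a rough stand-in for sponge written as a bash function. This is my own sketch, not the moreutils implementation: it buffers to a temporary file rather than to memory, as the real sponge does, but the ordering guarantee is the same.

```shell
# mysponge: a simplified, hypothetical sponge(1) replacement.
# It drains ALL of stdin before touching the output file, so the
# left-hand side of the pipe gets to read the file first.
mysponge() {
    local tmp
    tmp=$(mktemp)
    cat > "$tmp"      # read everything from stdin first
    mv "$tmp" "$1"    # only now replace the target file
}

printf 'keep\nbadpattern\n' > file.txt
grep -v badpattern file.txt | mysponge file.txt
cat file.txt   # the bad line is gone, the good line survives
```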

5-HTP Contraindications

I recently came across a really interesting article called "5-HTP efficacy and contraindications" on the NIH website. I found the article because I had been orally supplementing my diet with 200mg 5-HTP capsules, and I wanted to do some more research on what evidence there was for the efficacy of 5-HTP. I found the article to be really surprising, and I want to summarize it here. Of course if you have any questions, or want to know more about this, it's better to just read the article itself rather than my summary here. Disclaimer: I am not a doctor or medical professional.

5-HTP is a precursor to the neurotransmitter serotonin. Serotonin is a critical neurotransmitter that serves a number of functions in the brain, among them contributing to feelings of well-being and happiness. Low serotonin levels have been linked with depression and other mood disorders (although not all cases of depression are caused by low serotonin levels). Serotonin does not cross the blood-brain barrier, which means that serotonin levels in the brain cannot be increased by direct supplementation. However, 5-HTP does cross the blood-brain barrier. In the brain 5-HTP is metabolized by an enzyme called AAAD into serotonin. Therefore oral supplementation of 5-HTP can increase latent serotonin levels in the brain. The pop-science marketing of 5-HTP supplements suggests that by taking 5-HTP you can improve mood and happiness, and possibly even counteract the effects of depression, since taking 5-HTP can increase serotonin levels.

What the study I linked to shows is that not only is there poor evidence that 5-HTP supplementation alone can improve mood or combat depression, there's actually a lot of circumstantial evidence to suggest that it can worsen those effects. That is what is meant by the medical term contraindication—that 5-HTP supplementation can potentially have a deleterious impact on mental health.

The article first summarizes some of the existing studies on 5-HTP supplementation. It points out that studies where 5-HTP was administered alone have not shown significant improvement in treating disorders like depression. Then it points out that in the studies where 5-HTP supplementation was shown to be effective, those studies were also controlling for dopamine levels with another drug like carbidopa. From what I gathered from the article, the current state of the art in medicine allows doctors to measure serotonin and dopamine levels and to control them carefully in a controlled setting. However, over-the-counter oral 5-HTP supplementation isn't a controlled or monitored setting, and without the administration of a drug like carbidopa one isn't controlling latent dopamine levels.

The article goes on to point out that there's a lot of evidence to suggest that oral 5-HTP supplementation can actually cause dopamine levels to become imbalanced. The synthesis of 5-HTP into serotonin is catalyzed by the enzyme AAAD, and that same enzyme catalyzes L-DOPA into dopamine. Excess 5-HTP can compete with the L-DOPA reaction and therefore cause dopamine levels to be depressed. There's evidence to show that other monoamine pairs can get out of balance when one monoamine precursor dominates another. Low dopamine levels are associated with a number of disorders, notably Parkinson's disease. Low dopamine levels have also been implicated in depression. Therefore there is evidence to suggest that taking 5-HTP to increase serotonin levels can actually cause lowered dopamine levels, and therefore can potentially be implicated in worsening symptoms of depression.

On the basis of this article it doesn't seem safe to me to take supplementary 5-HTP for extended periods of time, or possibly at all.

Documenting Code

I'm going to write about something a little different today. In this article, I want to discuss how I think code should be documented internally in corporate environments. That is, suppose you're an engineer working at Cogswell Cogs. I want to talk about how I think documentation should be done at Cogswell Cogs.

There are three obvious choices for how this can be done:

- a wiki
- generated documentation (using a tool like Sphinx or Doxygen)
- README files checked into the code repository

How should things actually be done?

Wikis

I believe that Wikis are the worst option available for documentation.

The problem inherent to wikis is that they're typically edited in the browser, via an HTML textarea or a WYSIWYG editor. There's nothing wrong with textareas or WYSIWYG editors. What is wrong is that the source of truth for wiki pages is typically not checked into the source code repository. This means that code review is unlikely to include documentation review, and that when grepping the code for occurrences of a string, engineers aren't likely to see occurrences in the documentation.

These reasons are why it's fundamentally so difficult to keep wikis up to date with code. It's hard enough to get engineers to document their work in the first place; it's harder still to get them to do it when documentation isn't a part of the normal development workflow.

Generated Documentation

Documentation generation tools like Sphinx, Godoc, Javadoc, Doxygen, etc. are great tools that produce superb documentation. A lot of them have "autodoc" capabilities, a term that's used to describe documentation that is automatically generated by tools that actually analyze the source code and metadata in the code (typically comments formatted in a certain way). This is a powerful feature. Most of the high quality documentation you see for polished software projects is generated using tools of this category. This is also how big companies like Intel, Apple, and Microsoft generate documentation for their external APIs.

If you have the energy and wherewithal to maintain documentation in this format, I highly recommend it. However, I would also add that a lot of people don't have this energy. It's not uncommon for engineers to start with an initial burst of energy where they write a lot of great documentation, and then the documentation becomes out of date. Of course, the situation here is better than with a wiki, for the reasons I described earlier. But it's still a problem that has to be watched out for.

The main problem I see with these generated documentation tools is that there's a somewhat high learning curve to them. Once you're over the curve they work great, probably better than any other tool. But the curve is still there. The problem I've seen personally is that it's hard to maintain this documentation in a company that's growing quickly. You start with some engineers who are passionate about writing great documentation and doing the right thing, who are willing to overcome this learning curve. A year later, when the team has grown and new engineers (or perhaps engineers from other teams) are contributing to the documented projects, the newcomers may not understand the documentation format and may not keep it up to date. That's why I think if you use one of these tools it's imperative to be rather strict about educating engineers on how to use these tools. If half your engineers don't understand how to use a tool like Sphinx then half the code contributions won't include documentation updates, and this will lead to out of date documentation.

Another pitfall you can run into with these tools is that in some cases the way that documentation and code are mixed can be confusing. If you're using autodoc style documentation (where documentation is generated from code metadata and comments) then the documentation is difficult to grep, since grepping the docs requires grepping code as well. If you're putting the docs outside of the code, in a dedicated directory, then you have the opposite problem: it's easy for engineers to miss that directory. The reason is that if you have your docs in a dedicated directory (say, docs/), that directory is outside the regular code flow and therefore is easily missed by people navigating code in their text editors. For this reason, if you use generated documentation tools I think it's critical to have a good search tool set up internally. Engineers relying on command line tools like grep are going to miss docs either way you configure things, so if you don't have a good internal search engine set up then people are going to have difficulty finding docs.

The last issue here is related to the fact that if you work at a company that maintains a lot of separate services or projects it's likely that some of those services or projects will go undocumented (simply because there's so many of them). This can create a negative cycle where engineers go to the internal documentation portal, fail to find something documented, and then start assuming in the future that other projects will also be undocumented. This causes people to stop using the internal documentation portal—even in cases where there is documentation! In other words, if there's any lack of consistency here then it can become a big trust issue. This is not an insurmountable problem, but it's one to be aware of. Again, good internal search tools can help here, since a good search tool will quickly become a source of truth for people.

README Files

The last and most primitive method you can use is a file checked into the top level of a project with a name like README, README.md, or README.rst. While this method is primitive, I'm actually a huge fan of it, and I think it's underappreciated.

The README file convention has been around since at least the early days of the internet; you'll see it, for instance, in old source code tarballs from the 1980s. However, in my mind it's really been popularized by GitHub, which also allows you to check in Markdown or reStructuredText files. On GitHub these Markdown or reStructuredText files are automatically turned into good looking HTML when you browse a project page. This same convention has been copied elsewhere. For instance, if you use Phabricator it will automatically generate HTML documentation for projects based on a checked in README.md or README.rst file at the root of a directory.

This convention used by GitHub and Phabricator makes it dead easy for engineers to get good looking docs. There's literally no configuration necessary—just drop a file with the appropriate name into a directory. It's really easy to get engineers to create these files, because the semantic overhead is so low. There's almost nothing to learn; certainly less, in any case, than learning a tool like Sphinx. Because this method is so simple, it's a lot easier to get people to use it.

Because the README file convention is to put the file at the root of directories (typically one at the project root, and occasionally in subdirectories as well) it's impossible to not notice this file. Engineers will have to see it as they navigate the source code.

Typically the formatting and linking capabilities of a README file are not as extensive as what you'd have with a tool like Sphinx or Doxygen, but you can do a pretty good job. GitHub and Phabricator support features such as syntax highlighting, tables, inline images, and intra-document links, which means that you can actually do quite a lot with this simple documentation format.

If you use README files you don't really need to have a documentation search solution (although it doesn't hurt if you do have one). The reason is that engineers will already be in the habit of looking at code repositories for code they're using, and therefore will have the expectation that they will find documentation alongside the code, in a predictable location.

The prevailing argument I have here is that worse is better. README files are dead simple and rather limited—but that same simplicity makes these files much easier to get engineers to actually use and keep up to date.

Conclusions

Don't use wikis to document code. Wikis can work well for other things, but if you ask engineers to keep code documented on a wiki you'll find that the wiki quickly becomes out of date and misleading.

Documentation generation tools like Sphinx can produce beautiful, high quality documents. But beware of the steep learning curve: it can cause some engineers to not document their code! If you do use a documentation generation tool, make sure that you have strong internal training practices to get new engineers up to speed on how to use the documentation tools. Likewise, make sure you've thought of how engineers will search the documentation, and make sure the search works well.

README files can be a good, practical alternative to generated documentation. Because they're so simple everyone will understand how they work, where they live, and how to edit them. Modern formats like Markdown and reStructuredText mean that you can use README files and still get beautiful generated HTML.

Fedora, QEMU, and 9p Filesystem Passthrough

Previously I wrote an article about configuring NFS and autofs for use with a VM on Linux. The premise of that article is that you want to run a local VM, and you want to share files between the VM and the host machine. I described a mechanism for doing this by setting up an NFS server on the guest VM, and then using autofs on the host to automatically mount the guest filesystem as necessary. This does work, but I've since learned that it's not at all the best way to configure things.

First, I've realized that it's better to have the host OS also be the host for the filesystem. That is, the VM should mount the host filesystem, not the other way around. Among other things, this means the files remain available on the host even when the VM isn't running, and you don't need to run an NFS server inside the guest.

The other thing I learned is that there's this thing called 9p filesystem passthrough, which allows you to "directly" expose the host filesystem to the guest. I put "directly" in quotes here because the filesystem still needs to be mounted using a non-local driver, in this case using the 9p protocol. The difference, compared to NFS, is that on Linux 9p has a native virtio driver, which means that all of the filesystem operations are directly virtualized by the hypervisor. This means that everything is fast and there are no weird network protocols (like NFS): everything is exposed by the kernel directly from the host to the guest.

Setting Up 9p Passthrough On Fedora

There are already a bunch of guides for configuring 9p passthrough on Linux, and I found that they all work except for one step on Fedora, which is why I'm writing this post. The short version is that using virt-manager you expose a "filesystem" device: for the "source path" you point it at the directory on the host you want to share, and for the "target path" you set up a unique tag like shared that will be the name of the share. Then on the guest you mount the filesystem using the 9p mount type and the options -o trans=virtio,version=9p2000.L,rw. I found this illustrated guide to be helpful; it even has pictures! If you don't want to use virt-manager, just Googling "qemu 9p passthrough" should yield plenty of results explaining the relevant qemu command-line options.
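
For reference, the equivalent /etc/fstab entry on the guest would look something like this (assuming shared as the tag and /mnt/host as the mount point, both of which are made-up names here):

```
shared  /mnt/host  9p  trans=virtio,version=9p2000.L,rw  0  0
```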

Now once you do all of this, on Fedora you'll run into a problem. If you did everything correctly you'll find that while the guest can read files, it won't be able to write to them, regardless of what fsdev security model you use. In other words, the mount will act like a read-only mount, even if you have it mounted read-write and have the right user ids and directory permissions and whatnot. More concretely, any system calls that open files for writing will fail with an EPERM error.

After a lot of head scratching I found this virt-users mailing list email about the issue. The issue is that by default on Fedora libvirt will arrange for the QEMU process to be run as a special, unprivileged user: the qemu user. The problem is that if you're trying to mount files in your home directory, the qemu user won't have the correct permissions to access those files. What you want, instead, is for the QEMU process to run as your own user id. This is easy to set up. Edit the file /etc/libvirt/qemu.conf and find the commented out lines for user and group, and make them look something like this:

user = "YOUR_USERNAME_HERE"
group = "YOUR_USERNAME_HERE"

Of course, the actual user and group on your system will probably be different (unless your Unix login happens to be YOUR_USERNAME_HERE).

After you make this change you'll have to restart any active QEMU processes, e.g. by restarting any guests you have in virt-manager. Then confirm that in the process listing the QEMU process is running as the expected user id, e.g. by examining ps -ef | grep qemu. If that looks OK, you should be able to log into the guest and write to files!

I hope this article helps save a fellow Fedora user some time, since it took me quite a while to hunt this down.

Inline C/Assembly In Bash

Recently I had the idea of trying to combine two of my favorite things: bash and assembly. My original idea was to find a way to inline assembly code into bash scripts. While I was researching this project I discovered Tavis Ormandy's ctypes.sh. This is a really clever project that makes use of the nearly undocumented enable -f feature in bash to add new shell builtins exposing the functionality provided by dlopen(3). The project is extremely ingenious, and I would recommend browsing the code to anyone who is interested in these kinds of low-level things.

At first I wanted to extend bash in the same way that Tavis did, providing new builtins for inlining assembly code. It occurred to me, however, that this isn't actually necessary as long as the assembly can be built into a shared object. Furthermore I realized that because of the lack of a real type system in both assembly and bash, actually inlining assembly into bash would be really difficult for all but the most trivial functions. This is because bash primarily interacts with string data, and doing string manipulation and memory allocation in x86 assembly is not fun. Tavis' code simplifies a lot of this already by implementing a type marshaling system. Plus, it's already possible to inline assembly into C via the asm keyword, so by simply coming up with a way to inline C code I'd be able to inline assembly as well.

Fortunately bash already has a syntax feature that makes writing "inline" code feasible: here documents. Here documents are a useful way to write multi-line literal statements without having to do lots of escaping. Using this feature we can add "inline" C code to our bash script and then generate a DSO with a C compiler (both GCC and Clang should work). The DSO can be loaded by ctypes.sh, and then our code will run.

I have an example demonstrating this here: github.com/eklitzke/c.sh. In this example I have written a "hello world" function in C and a trivial implementation of popcnt (which counts the bits in a word) in inline assembly. Hopefully the code should be pretty straightforward to follow.

I have some more ambitious ideas for this project. I'd like to try to embed the Python interpreter into a bash process. This would allow one to write "inline" Python code in a bash script, and then the Python functions defined would be executable from the context of the bash script.

A Funny Thing About C and C++

I recently read an interesting paper about the difference between the ISO C standard and C as implemented by compilers, and it got me thinking more about how people learn C and C++. Most languages are very complex, and for most questions that developers have it's impractical to try to find answers in the actual language specification. To start, it's frequently impossible to answer questions using language standards because finding the relevant passage requires knowing the correct terminology. Even when one does know the terminology to look for, many language standards are so verbose that consulting them is impractical for a non-expert.

Instead, what most developers do when they have a question about a language corner case is to write a program that exercises the corner case and then observe what the program actually does. Most languages have a de facto reference implementation, and typically the reference implementation mirrors the specification almost exactly. For instance, if you have a question about how an obscure aspect of Python works, you can almost always write a small test program and run it under CPython to see the intended behavior. There are a couple of obscure things that are implementation-specific (e.g. what is allowed with respect to garbage collection), but in the vast majority of cases this approach works fine. This approach also works well with other languages like Ruby, PHP, Javascript, and Go.

C and C++ are very different. There is no de facto reference implementation of C or C++. There are a lot of different C/C++ compilers out there, and unlike many other languages the C and C++ standards are frequently finalized before any compilers actually fully implement the new semantics (this is particularly true for C++). Additionally there is a huge amount of "undefined behavior" allowed by the language specifications. Therefore when you write a test program in C or C++, you can't be sure if the observed behavior you get is actually part of the language specification or simply the behavior of a particular compiler. The problem is compounded by the sheer complexity of the language specifications. The last time I looked, the ISO C standard was 700 pages long. That's just C! I can't even fathom how many pages the C++ standard must be, if it even exists in a single document.

Another interesting thing about C and C++ is that for real-world programs to execute you must link them. Most of the details of linking are not specified by the language standards. This is extremely evident if you look at complex C or C++ code that attempts to be cross-platform compatible between Windows and Unix—generally there will be a plethora of preprocessor macros near function definitions to ensure that methods have similar linkage semantics on the two platforms. In a number of cases there are linking behaviors available on Windows but not on Unix, and vice versa.

What most people learn when they write C or C++ is how a particular compiler works, not how the languages are actually specified. This is why writing portable C or C++ code is so challenging. It's no wonder there are so few true experts out there in the C and C++ communities.

CRCs vs Hash Functions

In this post I am going to address some misconceptions on the difference and utility of cyclic redundancy checks (CRCs) and generalized hash functions, including cryptographic hash functions.

We'll start with an explanation of what the two are. Both CRCs and hash functions share the property that they take an input value and reduce it to a (usually shorter) output value. Typically inputs can be arbitrarily many bits in length, and the output is a small value, frequently in the range of 32 to 512 bits. By convention the output value for a CRC is called a "checksum", and the output value for a hash function is called a "digest".

CRCs are a type of error-detecting code used to implement checksums. CRCs are specifically designed to satisfy the property that they can detect transmission errors in data. The idea is that you're transmitting a signal over a potentially lossy medium (e.g. Ethernet or radio waves) and you want to detect any data loss or data corruption. The CRC is computed over the data payload and sent as part of the data transmission. The receiver can recompute the CRC and compare it to the received value to detect errors. No property of the CRC matters other than whether it differs when data corruption occurs. It isn't important if the CRC is biased in any way as long as transmission errors result in differing CRCs.

In general a CRC can be n bits in length, but most commonly a 32-bit CRC is used. An n-bit CRC has the property that any burst error of up to n bits is mathematically guaranteed to produce a distinct checksum, a consequence of the generator polynomial used to compute the CRC. In particular, this means that an error in a single bit (or a single byte) is guaranteed to change the checksum, regardless of any other property of the transmitted data, including its length. This is a powerful property, since it guards against the most common types of transmission errors.
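You can see the single-bit guarantee from the shell: POSIX cksum computes a 32-bit CRC, and a one-bit change to the input always produces a different checksum. Changing the final 'd' (0x64) to 'e' (0x65) flips exactly one bit:

```shell
#!/bin/sh
set -eu

# cksum prints a 32-bit CRC checksum followed by the byte count.
a=$(printf 'hello world' | cksum)
b=$(printf 'hello worle' | cksum)  # 'd' -> 'e' is a one-bit change

echo "$a"
echo "$b"
test "$a" != "$b" && echo "one-bit error detected"
```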

General purpose hashes, including cryptographic hashes, are a little different. Again, they take an input value and reduce it to a (typically smaller) digest value. Hash functions optimize for a different property, however. Hashes try to optimize for the case of reducing bias in the output digest value. That is, hashes should attempt to have unbiased output values even when the input is biased. Hash functions also try to optimize to reduce hash collisions for differing input values, but there are usually no guarantees made on what conditions can lead to a hash collision other than probabilistic ones.

It's inappropriate to use a CRC in place of a general purpose hash function because CRCs usually have biased output. It's equally inappropriate to use a general purpose hash function in place of a CRC because general purpose hash functions usually do not make any guarantees on the conditions under which hash collisions can occur.

Most modern cryptographic hash functions have very large digest values compared to what is typically used in a CRC. For instance, the most widely used CRC I've seen is CRC-32, which has multiple variations, all of which produce a 32-bit checksum value. By contrast MD5 has a 128-bit digest value and SHA-1 has a 160-bit digest value. More modern cryptographic hash functions frequently have even larger digest sizes. Accordingly, CRC values have somewhat fallen by the wayside in many modern applications. However, it's still important to understand how the two are different. In particular, it's probably a bad idea to use a hash function for a CRC if you are going to truncate the digest to the size of a CRC checksum anyway. It's also worth noting that the CRC algorithm can be parameterized by larger values of n, so it's also possible to produce larger CRC values such as CRC-128 (or higher).
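These sizes are easy to see with standard command line tools: cksum prints a 32-bit CRC as a decimal number, while md5sum and sha1sum print 32 and 40 hex characters respectively (128 and 160 bits):

```shell
#!/bin/sh
set -eu

# 32-bit CRC checksum (printed as a decimal value plus byte count).
printf 'hello' | cksum

# 128-bit MD5 digest: 32 hex characters.
printf 'hello' | md5sum

# 160-bit SHA-1 digest: 40 hex characters.
printf 'hello' | sha1sum
```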

GitHub Private Repos

I've been a paying customer of GitHub pretty much since the site launched, when I signed up in March 2008. Back when I signed up I was working on a number of projects that I wanted to keep private, including a few where I wanted to collaborate with other people. That has changed over the years. When I recently reviewed what repos I still had private, I found that the only ones I still cared about were a handful of things like my blog and my dot files that I was developing on my own (i.e. without collaborating with others).

For a non-corporate entity like myself I found that there are really only a few value adds that GitHub is providing:

I'm not really collaborating with others on the repos I want to keep private; and if I were, I've found that I dislike the collaboration workflow that GitHub implements anyway. Hosting repos with git requires nothing more than ssh access to the remote server and git itself installed, both of which I already have on the server running eklitzke.org. Doing secure backups to the cloud is trivial nowadays with S3 and GCP. So I decided to cancel my GitHub plan and just host my remaining private repos myself.

If you haven't ever hosted your own git repo before, here's what you do. On the server where you want to host the repo:

mkdir ~/myproject.git
cd ~/myproject.git
git init --bare

Then in a checkout of your project (i.e. on the client) you do something like:

git remote add upstream myserver:myproject.git
git push upstream master

That's it. Now you can push/pull as usual, and to do a checkout you'd just do:

git clone myserver:myproject.git

Backups are also super easy. I'm using Google Cloud Storage, but you could just as easily use S3 or any other cloud storage service. Here's the actual script I'm using to backup each of the git projects I'm hosting in this way:


#!/bin/bash

set -eu

# Repo to back up (first argument) and where to write the tarball.
TARGET="$1"
BACKUP="/tmp/$(basename "$TARGET").tar.xz"

if [ ! -d "$TARGET" ]; then
    echo "Cowardly refusing when $TARGET does not exist."
    exit 1
fi

function cleanup {
    rm -f "${BACKUP}"
}
trap cleanup EXIT

tar -cvJf "${BACKUP}" "${TARGET}"
gsutil cp "${BACKUP}" gs://MY-PRIVATE-BUCKET/git/

I have a cron job that runs this script daily for each of my repos. You could easily GPG encrypt the backups by piping the tar command to gpg -e, but I don't actually have any sensitive content in these repos so I don't bother.

One of the nice things about git is that even if my server goes down, restoring a git repo from a backup made in this fashion is trivial. Unpacking the backup tarball creates an actual git repo in the "bare" format, and converting from the bare format to the regular format is trivial (I would do it by passing the path to the bare checkout to a regular git clone invocation).
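Here's a sketch of what such a restore looks like, using a freshly created bare repo to stand in for the unpacked backup tarball (the paths and names are made up for the example):

```shell
#!/bin/sh
set -eu

demo=/tmp/restore-demo
rm -rf "$demo"
mkdir "$demo"
cd "$demo"

# Create a bare repo with one commit; this stands in for the repo
# you'd get by unpacking the backup tarball.
git init -q --bare "$demo/myproject.git"
git clone -q "$demo/myproject.git" seed 2>/dev/null
(cd seed \
    && echo 'hello' > README \
    && git add README \
    && git -c user.name=me -c user.email=me@example.com commit -qm init \
    && git push -q origin HEAD)

# The actual "restore": a plain clone of the bare repo produces a
# normal working checkout.
git clone -q "$demo/myproject.git" restored
cat restored/README
```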

A few years ago I would have been happy to continue being a paying customer of GitHub just to support them as an organization, even if the value they were adding was small. Recently my thoughts on this have changed. Today I'm skeptical that GitHub is having a positive impact on the open source/free software movements, and I'm worried about the near-monopoly status they have on new open source projects. This is a complex topic, so perhaps I'll flesh out my ideas on it more in a future post.