
Documenting Code

I'm going to write about something a little different today. In this article I want to discuss how I think code should be documented internally in corporate environments. That is, suppose you're an engineer working at Cogswell Cogs: how should documentation be done there?

There are three obvious choices for how this can be done: wikis, generated documentation tools (Sphinx, Doxygen, and the like), and README files checked into the repository.

How should things actually be done? Let's look at each option in turn.

Wikis

I believe that Wikis are the worst option available for documentation.

The problem inherent to wikis is that they're typically edited in the browser, via an HTML textarea or a WYSIWYG editor. There's nothing wrong with textareas or WYSIWYG editors. What is wrong is that the source of truth for wiki pages is typically not checked into the source code repository. This means that code review is unlikely to include documentation review, and that engineers grepping the code for occurrences of a string won't see occurrences in the documentation.

These reasons are why it's fundamentally so difficult to keep wikis up to date with code. It's hard enough to get engineers to document their work in the first place; it's harder still to get them to do it when documentation isn't a part of the normal development workflow.

Generated Documentation

Documentation generation tools like Sphinx, Godoc, Javadoc, Doxygen, etc. are great tools that produce superb documentation. A lot of them have "autodoc" capabilities, a term for documentation that is generated automatically by analyzing the source code and the metadata embedded in it (typically comments formatted in a certain way). This is a powerful feature. Most of the high quality documentation you see for polished software projects is generated using tools of this category. This is also how big companies like Intel, Apple, and Microsoft generate documentation for their external APIs.

If you have the energy and wherewithal to maintain documentation in this format, I highly recommend it. However, I would also add that a lot of people don't have this energy. It's not uncommon for engineers to start with an initial burst of energy, write a lot of great documentation, and then let it fall out of date. Of course, the situation here is better than with a wiki, for the reasons I described earlier. But it's still a problem to watch out for.

The main problem I see with these generated documentation tools is that there's a somewhat high learning curve to them. Once you're over the curve they work great, probably better than any other tool. But the curve is still there. The problem I've seen personally is that it's hard to maintain this documentation in a company that's growing quickly. You start with some engineers who are passionate about writing great documentation and doing the right thing, who are willing to overcome this learning curve. A year later, when the team has grown and new engineers (or perhaps engineers from other teams) are contributing to the documented projects, the newcomers may not understand the documentation format and may not keep it up to date. That's why I think if you use one of these tools it's imperative to be rather strict about educating engineers on how to use it. If half your engineers don't understand how to use a tool like Sphinx then half the code contributions won't include documentation updates, and that will lead to out-of-date documentation.

Another pitfall you can run into with these tools is that in some cases the way that documentation and code are mixed can be confusing. If you're using autodoc style documentation (where documentation is generated from code metadata and comments) then the documentation is difficult to grep, since grepping the docs requires grepping code as well. If you're putting the docs outside of the code, in a dedicated directory, then you have the opposite problem: it's easy for engineers to miss that directory. The reason is that if you have your docs in a dedicated directory (say, docs/), that directory is outside the regular code flow and therefore is easily missed by people navigating code in their text editors. For this reason, if you use generated documentation tools I think it's critical to have a good search tool set up internally. Engineers relying on command line tools like "grep" are going to miss docs either way you configure things, so if you don't have a good internal search engine set up then people are going to have difficulty finding docs.

The last issue here is related to the fact that if you work at a company that maintains a lot of separate services or projects it's likely that some of those services or projects will go undocumented (simply because there are so many of them). This can create a negative cycle where engineers go to the internal documentation portal, fail to find something documented, and then start assuming in the future that other projects will also be undocumented. This causes people to stop using the internal documentation portal—even in cases where there is documentation! In other words, if there's any lack of consistency here then it can become a big trust issue. This is not an insurmountable problem, but it's one to be aware of. Again, good internal search tools can help here, since a good search tool will quickly become a source of truth for people.

README Files

The last and most primitive method you can use is a file checked into the top level of a project with a name like README, README.md, or README.rst. While this method is primitive, I'm actually a huge fan of it, and I think it's underappreciated.

The README file convention has been around since at least the early days of the internet; you'll see it, for instance, in old source code tarballs from the 1980s. However, in my mind it's really been popularized by GitHub, which also allows you to check in Markdown or reStructuredText files. On GitHub these Markdown or reStructuredText files are automatically turned into good looking HTML when you browse a project page. This same convention has been copied elsewhere. For instance, if you use Phabricator it will automatically generate HTML documentation for projects based on a checked in README.md or README.rst file at the root of a directory.

This convention used by GitHub and Phabricator makes it dead easy for engineers to get good looking docs. There's literally no configuration necessary—just drop a file with the appropriate name into a directory. It's really easy to get engineers to create these files, because the semantic overhead is so low. There's almost nothing to learn; certainly less, in any case, than learning a tool like Sphinx. Because this method is so simple, it's a lot easier to get people to use it.

Because the README file convention is to put the file at the root of directories (typically one at the project root, and occasionally in subdirectories as well), it's impossible not to notice this file. Engineers will have to see it as they navigate the source code.

Typically the formatting and linking capabilities of a README file are not as extensive as what you'd have with a tool like Sphinx or Doxygen, but you can do a pretty good job. GitHub and Phabricator support features such as syntax highlighting, tables, inline images, and intra-document links, which means that you can actually do quite a lot with this simple documentation format.

If you use README files you don't really need to have a documentation search solution (although it doesn't hurt if you do have one). The reason is that engineers will already be in the habit of looking at code repositories for code they're using, and therefore will have the expectation that they will find documentation alongside the code, in a predictable location.

The overarching argument I have here is that worse is better. README files are dead simple and rather limited, but that same simplicity makes these files much easier to get engineers to actually use and keep up to date.

Conclusions

Don't use wikis to document code. Wikis can work well for other things, but if you ask engineers to keep code documented on a wiki you'll find that the wiki quickly becomes out of date and misleading.

Documentation generation tools like Sphinx can produce beautiful, high quality documents. But beware of the steep learning curve: it can cause some engineers to not document their code! If you do use a documentation generation tool, make sure that you have strong internal training practices to get new engineers up to speed on how to use the documentation tools. Likewise, make sure you've thought of how engineers will search the documentation, and make sure the search works well.

README files can be a good, practical alternative to generated documentation. Because they're so simple everyone will understand how they work, where they live, and how to edit them. Modern formats like Markdown and reStructuredText mean that you can use README files and still get beautiful generated HTML.


Fedora, QEMU, and 9p Filesystem Passthrough

Previously I wrote an article about configuring NFS and autofs for use with a VM on Linux. The premise of that article is that you want to run a local VM, and you want to share files between the VM and the host machine. I described a mechanism for doing this by setting up an NFS server on the guest VM, and then using autofs on the host to automatically mount the guest filesystem as necessary. This does work, but I've since learned that it's not at all the best way to configure things.

First, I've realized that it's better to have the host OS also be the host for the filesystem. That is, the VM should mount the host filesystem, not the other way around. There are two reasons this is better:

The other thing I learned is that there's this thing called 9p filesystem passthrough, which allows you to "directly" expose the host filesystem to the guest. I put "directly" in quotes here because the filesystem still needs to be mounted using a non-local driver, in this case using the 9p protocol. The difference here, compared to NFS, is that on Linux 9p has a native virtio driver, which means that all of the filesystem operations are directly virtualized by the hypervisor. This means that everything is fast and there are no weird network protocols (like NFS): everything is directly exposed by the kernel from the host to the guest.

Setting Up 9p Passthrough On Fedora

There are already a bunch of guides for configuring 9p passthrough on Linux, and I found that they all work except one step on Fedora, which is why I'm writing this post. The short version is that using virt-manager you expose a "filesystem" device: for the "source path" you point it at the directory on the host you want to share, and for the "target path" you set up a unique tag like shared that will be the name of the share. Then on the guest you mount the filesystem using the 9p mount type with the options -o trans=virtio,version=9p2000.L,rw. I found this illustrated guide to be helpful; it even has pictures! If you don't want to use virt-manager, just Googling "qemu 9p passthrough" should yield plenty of results explaining the relevant qemu command-line options.
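
For example, with a target tag of shared, the mount on the guest would look something like this (the mountpoint /mnt/shared is just an illustrative choice):

sudo mkdir -p /mnt/shared
sudo mount -t 9p -o trans=virtio,version=9p2000.L,rw shared /mnt/shared

The equivalent /etc/fstab entry, if you want the share mounted at boot, would be something like:

shared  /mnt/shared  9p  trans=virtio,version=9p2000.L,rw  0  0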

Now once you do all of this, on Fedora you'll run into a problem. If you did everything correctly you'll find that while the guest can read files, it won't be able to write to them, regardless of what fsdev security model you use. In other words, the mount will act like a read-only mount, even if you have it mounted read-write and have the right user ids and directory permissions and whatnot. More concretely, any system calls that open files for writing will fail with an EPERM error.

After a lot of head scratching I found this virt-users mailing list email about the issue: by default on Fedora, libvirt will arrange for the QEMU process to be run as a special, unprivileged user, the qemu user. The problem is that if you're trying to share files in your home directory, the qemu user won't have the correct permissions to access those files. What you want, instead, is for the QEMU process to run as your own user id. This is easy to set up. Edit the file /etc/libvirt/qemu.conf and find the commented out lines for user and group, and make them look something like this:

user = "YOUR_USERNAME_HERE"
group = "YOUR_USERNAME_HERE"

Of course, the actual user and group on your system will probably be different (unless your Unix login happens to be YOUR_USERNAME_HERE).

After you make this change you'll have to restart any active QEMU processes, e.g. by restarting any guests you have in virt-manager. Then confirm that in the process listing the QEMU process is running as the expected user id, e.g. by examining ps -ef | grep qemu. If that looks OK, you should be able to log into the guest and write to files!

I hope this article helps save a fellow Fedora user some time, since it took me quite a while to hunt this down.


Inline C/Assembly In Bash

Recently I had the idea of trying to combine two of my favorite things: bash and assembly. My original idea was to try to make a way to inline assembly code into bash scripts. While I was researching this project I discovered Tavis Ormandy's ctypes.sh. This is a really clever project that makes use of the nearly undocumented enable -f feature in bash to add new shell builtins that expose the functionality provided by dlopen(3). The project is extremely ingenious, and I would recommend browsing the code to anyone who is interested in these kinds of low level things.

At first I wanted to try to extend bash in the same way that Tavis did to provide new builtins for inlining assembly code. It occurred to me, however, that it's not actually necessary to do this as long as the assembly can be built into a shared object. Furthermore I realized that because of the lack of a real type system in both assembly and bash, trying to actually inline assembly into bash would be really difficult for all but the most trivial functions. This is because bash primarily interacts with string data, and doing string manipulation and memory allocation in x86 is not fun. Tavis' code simplifies a lot of this already by implementing a type marshaling system. Plus it's already possible to inline assembly into C via the asm keyword, so by simply coming up with a way to inline C code I'd also be able to inline assembly as well.

Fortunately bash already has a syntax feature that makes writing "inline" code feasible: here documents. Here documents are a useful way to write multi-line literal statements without having to do lots of escaping. Using this feature we can add "inline" C code to our bash script, and then generate a DSO with a C compiler (both GCC and Clang should work). The DSO can be loaded by ctypes.sh, and then our code will run.

I have an example demonstrating this here: github.com/eklitzke/c.sh. In this example I have written a "hello world" function in C and a trivial implementation of popcnt (which counts the set bits in a word) in inline assembly. Hopefully the code should be pretty straightforward to follow.
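
To give a flavor of the approach without the repo, here's a minimal sketch (untested, with illustrative file names) of the idea: write the C in a here document, compile it into a DSO, and call it through the dlopen and dlcall builtins that ctypes.sh provides. It assumes ctypes.sh is installed and on your path.

#!/bin/bash
# Sketch: "inline" C in bash using a here document and ctypes.sh.
source ctypes.sh

# Write the inline C code out using a here document.
src=$(mktemp --suffix=.c)
cat > "$src" <<'EOF'
#include <stdio.h>
void hello(void) { puts("hello from inline C"); }
EOF

# Compile it into a shared object (GCC or Clang should both work).
dso="${src%.c}.so"
cc -shared -fPIC -o "$dso" "$src"

# Load the DSO and call the function from bash.
dlopen "$dso"
dlcall hello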

I have some more ambitious ideas for this project. I'd like to try to embed the Python interpreter into a bash process. This would allow one to write "inline" Python code in a bash script, and then the Python functions defined would be executable from the context of the bash script.


A Funny Thing About C and C++

I recently read an interesting paper about the difference between the ISO C standard and C as implemented by compilers, and it got me thinking more about how people learn C and C++. Most languages are very complex, and for most questions that developers have it's impractical to try to find answers in the actual language specification. To start, it's frequently hard to answer questions using language standards because you can't locate the relevant text without knowing the correct terminology. Even in cases where one does know the terminology to look for, many language standards are so verbose that consulting them is impractical for a non-expert.

Instead, what most developers do when they have a question about a language corner case is to write a program that exercises the corner case and then observe what the program actually does. Most languages have a de facto reference implementation, and typically the reference implementation mirrors the specification almost exactly. For instance, if you have a question about how an obscure aspect of Python works you can almost always test a program using CPython to understand the expected behavior. There are a couple of obscure things that are implementation-specific (e.g. what is allowed with respect to garbage collection), but in the vast majority of cases this approach works fine. This approach also works well with other languages like Ruby, PHP, JavaScript, and Go.

C and C++ are very different. There is no de facto reference implementation of C or C++. There are a lot of different C/C++ compilers out there, and unlike many other languages the C and C++ standards are frequently finalized before any compilers actually fully implement the new semantics (this is particularly true for C++). Additionally there is a huge amount of "undefined behavior" allowed by the language specifications. Therefore when you write a test program in C or C++, you can't be sure if the observed behavior you get is actually part of the language specification or simply the behavior of a particular compiler. The problem is compounded by the sheer complexity of the language specifications. The last time I looked, the ISO C standard was 700 pages long. That's just C! I can't even fathom how many pages the C++ standard must be, if it even exists in a single document.

Another interesting thing about C and C++ is that for real-world programs to execute you must link them. Most of the details of linking are not specified by the language standards. This is extremely evident if you look at complex C or C++ code that attempts to be cross-platform compatible between Windows and Unix—generally there will be a plethora of preprocessor macros near function definitions to ensure that methods have similar linkage semantics on the two platforms. In a number of cases there are linking behaviors available on Windows but not on Unix, and vice versa.

What most people learn when they write C or C++ is how a particular compiler works, not how the languages are actually specified. This is why writing portable C or C++ code is so challenging. It's no wonder there are so few true experts out there in the C and C++ communities.


CRCs vs Hash Functions

In this post I am going to address some misconceptions about the difference and relative utility of cyclic redundancy checks (CRCs) and general purpose hash functions, including cryptographic hash functions.

We'll start with an explanation of what the two are. Both CRCs and hash functions share the property that they take an input value and reduce it to a (usually shorter) output value. Typically inputs can be arbitrarily many bits in length, and the output is a small value, frequently in the range of 32 to 512 bits. By convention the output value for a CRC is called a "checksum", and the output value for a hash function is called a "digest".

CRCs are a type of error-detecting code used to implement checksums. CRCs are specifically designed to satisfy the property that they can detect transmission errors in data. The idea is that you're transmitting a signal over a potentially lossy medium (e.g. Ethernet or radio waves) and you want to detect any data loss or data corruption. The CRC is computed over the data payload and sent as part of the data transmission. The receiver can recompute the CRC and compare it to the received value to detect errors. No property of the CRC matters other than whether it differs when data corruption occurs. It isn't important if the CRC is biased in any way as long as transmission errors result in differing CRCs.

In general a CRC can be n-bits in length, but most commonly a 32-bit CRC is used. An n-bit CRC has the property that any error run of up to n bits is mathematically guaranteed to produce a distinct CRC checksum. This guarantee is possible because CRCs are computed using a special type of polynomial that is able to guarantee this property. As a special case, this typically means that an error of a single bit (or a single byte) will guarantee that the CRC checksum will fail, regardless of any other property of the transmitted data, including its length. This is a powerful property, since it guards against the most common type of transmission errors.
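
As a concrete illustration, here's a small sketch using zlib's crc32() that flips a single bit in a buffer and shows that the checksum changes (build with cc crc_demo.c -lz; the file name is arbitrary):

// Compute CRC-32 with zlib before and after a single-bit error.
#include <stdio.h>
#include <zlib.h>

int main(void) {
  unsigned char buf[] = "some payload to be transmitted";
  uLong before = crc32(0L, Z_NULL, 0);
  before = crc32(before, buf, sizeof(buf));

  buf[3] ^= 0x01;  // simulate a single-bit transmission error
  uLong after = crc32(0L, Z_NULL, 0);
  after = crc32(after, buf, sizeof(buf));

  // The two checksums are guaranteed to differ for a single-bit error.
  printf("before: %08lx  after: %08lx\n", before, after);
  return 0;
}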

General purpose hashes, including cryptographic hashes, are a little different. Again, they take an input value and reduce it to a (typically smaller) digest value. Hash functions optimize for a different property, however. Hashes try to optimize for the case of reducing bias in the output digest value. That is, hashes should attempt to have unbiased output values even when the input is biased. Hash functions also try to optimize to reduce hash collisions for differing input values, but there are usually no guarantees made on what conditions can lead to a hash collision other than probabilistic ones.

It's inappropriate to use a CRC in place of a general purpose hash function because CRCs usually have biased output. It's equally inappropriate to use a general purpose hash function in place of a CRC because general purpose hash functions usually do not make any guarantees on the conditions under which hash collisions can occur.

Most modern cryptographic hash functions have very large digest values compared to what is typically used in a CRC. For instance, the most widely used CRC I've seen is CRC-32, which has multiple variations, all of which produce a 32-bit checksum value. By contrast MD5 has a 128-bit digest value and SHA-1 has a 160-bit digest value. More modern cryptographic hash functions frequently have even larger digest sizes. Accordingly CRC values have somewhat fallen by the wayside in many modern applications. However, it's still important to understand how the two are different. In particular, it's probably a bad idea to use a hash function for a CRC if you are going to reduce the size of the digest to the same size as a CRC checksum anyway. It's also worth noting that the CRC algorithm can be parameterized by larger values of n, so it's also possible to produce large CRC values such as CRC-128 (or higher).


GitHub Private Repos

I've been a paying customer of GitHub pretty much since the site launched, when I signed up in March 2008. Back when I signed up I was working on a number of projects that I wanted to keep private, including a few where I wanted to collaborate with other people. That has changed over the years. When I recently reviewed what repos I still had private, I found that the only ones I still cared about were a handful of things like my blog and my dot files that I was developing on my own (i.e. without collaborating with others).

For a non-corporate entity like myself I found that there are really only a few value adds that GitHub is providing: a collaboration workflow (pull requests, issues, and code review), hosting for the repos themselves, and offsite backups.

I'm not really collaborating with others on the private repos I want to keep private; and if I were, I've found that I dislike the collaboration workflow that GitHub implements anyway. Hosting repos with git literally just requires having ssh access to the remote server and git itself installed, which I already have on the server running eklitzke.org. Doing secure backups to the cloud is trivial nowadays with S3 and GCP. So I decided to cancel my GitHub plan and just host my remaining private repos myself.

If you haven't ever hosted your own git repo before, here's what you do. On the server where you want to host the git repo, run:

mkdir ~/myproject.git
cd ~/myproject.git
git init --bare

Then in a checkout of your project (i.e. on the client) you do something like:

git remote add upstream myserver:myproject
git push upstream master

That's it. Now you can push/pull as usual, and to do a checkout you'd just do:

git clone myserver:myproject

Backups are also super easy. I'm using Google Cloud Storage, but you could just as easily use S3 or any other cloud storage service. Here's the actual script I'm using to back up each of the git projects I'm hosting in this way:

#!/bin/bash

set -eu
TARGET="$1.git"

if [ ! -d "$TARGET" ]; then
    echo "Cowardly refusing when $TARGET does not exist."
    exit 1
fi

BACKUP="${TARGET}.tar.xz"

function cleanup {
    rm -f "${BACKUP}"
}
trap cleanup EXIT

tar -cvJf "${BACKUP}" "${TARGET}"
gsutil cp "${BACKUP}" gs://MY-PRIVATE-BUCKET/git/

I have a cron job that runs this script daily for each of my repos. You could easily GPG encrypt the backups by piping the tar command to gpg -e, but I don't actually have any sensitive content in these repos so I don't bother.
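
If you did want encryption, the tar and upload lines could be changed to something like this (the recipient key ID is a placeholder):

tar -cJf - "${TARGET}" | gpg -e -r YOUR_KEY_ID > "${BACKUP}.gpg"
gsutil cp "${BACKUP}.gpg" gs://MY-PRIVATE-BUCKET/git/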

One of the nice things about git is that even if my server goes down, restoring a git repo from a backup made in this fashion is trivial. Unpacking the backup tarball creates an actual git repo in the "bare" format, and converting from the bare format to the regular format is trivial (I would do it by passing the path to the bare checkout to a regular git clone invocation).
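
Concretely, restoring and converting back to a normal working copy is just something like:

tar -xf myproject.git.tar.xz
git clone myproject.git myproject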

A few years ago I would have been happy to continue being a paying customer for GitHub just to support them as an organization, even if the value they were adding was small. Recently my thoughts on this have changed. Today I'm skeptical that GitHub is providing a positive impact on the open source/free software movements, and I'm worried about the near monopoly status they have on new open source projects. This is a complex topic, so perhaps I'll flesh out my ideas on it more in a future post.


Memory Protection and ASLR on Linux

Until recently the exact model of how ASLR and other memory protection mechanisms work on Linux was something that I knew only at a high level. Recently I've done a lot of work where I've had the need to bypass these mechanisms (in a cooperative setting), and I want to explain for readers exactly how the memory protection model on x86/Linux systems works, what it protects against, and the ways it can be bypassed.

There are two major mechanisms in place to protect memory access that are turned on by default on most x86-64 Linux systems. The first is the so-called NX bit, which is a setting that gives finer-grained permissions to mapped memory regions. The second is address space layout randomization (ASLR), which randomizes where certain parts of a program are loaded into memory. I'll discuss these two topics separately since they are complementary but completely orthogonal to one another.

The NX Bit

Traditional computer architectures that implemented memory protection mechanisms typically had two states that a mapped memory region could be in: read-only or writable. The exact details of the memory protection capabilities depend on the CPU architecture, but what I've described applies to most 32-bit x86 systems. In addition to this read/write toggle the system would also implement some other basic memory protections, such as ensuring that different processes cannot read or write to each other's memory regions.

The way this system is implemented is that each process has a sparse data structure allocated in the kernel called a page table. The page table contains a mapping from virtual memory regions to the associated physical memory (or to an offset in the swap area if the data has been paged out to disk). For each page in the page table there's also a read/write bit that is used by the MMU to enforce permissions for areas that ought to be read-only. Memory protection between processes works because the kernel will arrange for the MMU to use a given process' page table when that process is scheduled to run, meaning that in the absence of kernel bugs a process can't enter a state where it is using another process' page table.

There are a few different use cases for read-only memory regions, but probably the most important use case is read-only text segments. A text segment is the part of a process' memory that holds the actual machine code for the process. The reason it's desirable to map this area read-only is that it helps to mitigate attacks that attempt to inject malicious code into a process. If an attacker can inject malicious code into a process' code area then they can execute arbitrary code with the permissions of the process itself. In many cases these permissions are considerable, even if the process is not running as root. For instance, an attacker that can inject code can run arbitrary filesystem operations as the process' effective UID. When the text area is read-only one can be confident that the code in the text area that is executed has not been tampered with.

A big limitation here is that while we might think of memory as divided into read-only code and writable data, in fact some information about what code is executing is stored in the data area. In particular the calling convention on x86 works by storing the return address for a function call on the stack, and the stack must be mapped with write permissions. Thus an attacker that can stomp on memory on the stack can alter the return address for a function. This is an extremely well known problem and is known as a stack buffer overflow. Typically when people talk about "buffer overflows" they're specifically referring to stack buffer overflows, since stack overflows are usually the most dangerous kind of buffer overflow (although not the only kind).

On classic 32-bit x86 systems, if one can overflow a buffer allocated on the stack then one can actually arrange for arbitrary code to run. The way this works is that an attacker will overflow the buffer with malicious x86 code and then arrange for the return address on the stack to have the function return to the malicious code. Thus in this setting any stack buffer overflow can easily be turned into arbitrary code execution.

The NX bit, which is present on all 64-bit x86 systems (and some later 32-bit x86 systems), is an important measure which helps mitigate this attack. Systems that implement this functionality have three permissions bits: read, write, and execute. These permissions work exactly the same as the filesystem read/write/execute bits. What happens on these systems is that text areas are mapped to be readable + executable and data areas are mapped to be readable + writable but not executable. It's also possible to map data areas to be just readable.

With the NX bit the simplest and most dangerous stack buffer overflow attack is prevented, because any x86 code injected onto the stack cannot be executed. An attacker can still arrange for an arbitrary return address to be used as the function return address, but unless that return address actually points to a valid offset in the text area the process will segfault (i.e. terminate with SIGSEGV). The process can also terminate with SIGILL if an invalid offset in the text area is used, since any offset that isn't aligned on an actual instruction boundary will likely decode into an invalid instruction sequence quickly.

The most dangerous attack that is still possible with stack buffer overflows when the NX bit is enabled is called a return-to-libc attack. The way this works is that if the conditions are just right an attacker can arrange for a function's return address to be the address of a sequence of instructions in libc that will cause the program to do something dangerous (e.g. exec a shell). The reason this attack is dangerous is because libc has helper routines for nearly every system call, plus numerous other high level functions that can be rather powerful. The reason this attack is much more difficult to execute than a typical stack buffer overflow is that not only must the attacker know exactly what address in libc to return to, the attacker must also arrange for the registers and return instruction sequence to be just right so the libc code is in the correct state to actually execute the malicious instructions.

Address Space Layout Randomization (ASLR)

ASLR is a mechanism that is technically complementary to the NX bit, but is made much more powerful when the NX bit is enabled. The reason is that ASLR is designed precisely to make return-to-libc (or return to any other shared library) attacks much more difficult to execute.

The way that ASLR works is that when shared libraries are mapped into a process' memory they are loaded at randomized locations. For instance:

$ for x in {1..5}; do grep 'r-xp .*/libc' /proc/self/maps; done
7f2f5469b000-7f2f5484a000 r-xp 00000000 fe:00 31376                      /lib64/libc-2.19.so
7fba4ce49000-7fba4cff8000 r-xp 00000000 fe:00 31376                      /lib64/libc-2.19.so
7f063b82c000-7f063b9db000 r-xp 00000000 fe:00 31376                      /lib64/libc-2.19.so
7f45cbbd9000-7f45cbd88000 r-xp 00000000 fe:00 31376                      /lib64/libc-2.19.so
7f1c25f74000-7f1c26123000 r-xp 00000000 fe:00 31376                      /lib64/libc-2.19.so

The leftmost hex range in this output shows where libc was loaded in five separate invocations. As you can see, in each invocation libc is loaded at a different memory offset. This means that the address of a given function in libc (e.g. the wrapper for unlink(2)) will differ in every process invocation. If an attacker wants to call a function like unlink(2) they can't easily know exactly where the code for unlink(2) will actually be loaded in memory.

Another way to make this more apparent is to write a C program that prints out the address of a libc function each time it's run. Consider the following program listing:

#include <dlfcn.h>
#include <stdio.h>

// must use the exact version for libc!
const char libc_name[] = "libc-2.19.so";

int main(int argc, char **argv) {
  // find the libc that is already loaded in memory
  void *handle = dlopen(libc_name, RTLD_LAZY | RTLD_NOLOAD);
  if (handle == NULL) {
    printf("failed to find libc in memory!\n");
    return 1;
  }
  // locate the unlink(2) wrapper
  printf("unlink(2) is loaded at %p\n", dlsym(handle, "unlink"));
  return 0;
}

You can compile this with an invocation like gcc unlink.c -ldl. On a sequence of five different invocations I get five different addresses for unlink(2):

$ for x in {1..5}; do ./a.out; done
unlink(2) is loaded at 0x7f1d5501e540
unlink(2) is loaded at 0x7f384fbfe540
unlink(2) is loaded at 0x7f2c8f5de540
unlink(2) is loaded at 0x7f5669055540
unlink(2) is loaded at 0x7f8efd897540

Bypassing Memory Protection

Both of the protection mechanisms I just mentioned can be bypassed in a number of different ways, particularly in the "cooperative" case where the program author is intentionally trying to bypass the protections.

The NX bit can be bypassed by a caller using the mprotect(2) system call. The prototype for that function looks like this:

int mprotect(void *addr, size_t len, int prot);

In this function the prot argument consists of zero or more of the read/write/execute bits OR'd together. After making this system call the memory region specified will be remapped with the new permissions given by prot.

One fun thing that a caller can do is to set the text regions for a process to be writable. If this is done it completely bypasses the read-only state that code areas are usually mapped in. The caller can also map data regions (such as the stack) to be executable, which also effectively bypasses the usual protection mechanism. To actually do either of these things a caller must know the exact address they want to change the permissions for. Doing this is somewhat hampered by the fact that many parts of memory live at randomized offsets due to ASLR.
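
As a sketch of what the cooperative case looks like, the following toy program (x86-64 only) copies a tiny machine code routine into an ordinary anonymous mapping, flips the page to readable + executable with mprotect(2), and then calls it:

// Make a data page executable with mprotect(2) and run code from it.
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
  // mov eax, 42; ret
  static const unsigned char code[] = {0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3};

  // Get a page-aligned, writable (but not executable) region.
  void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (page == MAP_FAILED) {
    perror("mmap");
    return 1;
  }
  memcpy(page, code, sizeof(code));

  // Remap the page readable + executable.
  if (mprotect(page, 4096, PROT_READ | PROT_EXEC) != 0) {
    perror("mprotect");
    return 1;
  }

  int (*fn)(void) = (int (*)(void))page;
  printf("the injected code returned %d\n", fn());
  return 0;
}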

Since ASLR is the main protection mechanism that prevents an attacker from trivially remapping memory regions with mprotect(2), the rest of this section will discuss ways that ASLR can be bypassed.

As I demonstrated briefly earlier, a process can see its exact memory layout by looking at /proc/self/maps. This file is the easiest way to work around ASLR protections. Since the file has a well known name and is in an easily parseable format, accessing it makes it easy to know exactly where everything is loaded and what permissions are enforced on each memory region. In a "cooperative" setting where you are intentionally reading your own maps file it just takes a couple of lines of C or C++ code to find things and remap them.
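
For instance, a sketch of locating the executable libc mapping from within your own process might look like this (the parsing is deliberately minimal):

// Find the executable libc mapping by parsing /proc/self/maps.
#include <stdio.h>
#include <string.h>

int main(void) {
  FILE *maps = fopen("/proc/self/maps", "r");
  if (maps == NULL) {
    perror("fopen");
    return 1;
  }
  char line[512];
  while (fgets(line, sizeof(line), maps) != NULL) {
    // Look for an executable mapping belonging to libc.
    if (strstr(line, "r-xp") != NULL && strstr(line, "/libc") != NULL) {
      unsigned long start, end;
      if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
        printf("libc text is mapped at %lx-%lx\n", start, end);
      }
    }
  }
  fclose(maps);
  return 0;
}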

In practice it would be difficult for an attacker to use this file because it would require actually injecting code to open/parse/interpret the file. Normally the NX + ASLR mechanisms themselves ought to be sufficient to protect against this. In some cases an attacker might be able to use another process running as the same UID to parse a given process' maps file, but even that seems far-fetched since the attacker still needs another side channel to get the data into the process. There are also various ways to isolate the process directories in /proc to protect against this kind of side-channel attack.

Processing the maps file in /proc is not the only way to bypass ASLR. Processes can also locate functions in memory by looking at the function symbol tables. The demonstration C file I listed shows how to do this with dlopen(3) and dlsym(3), which makes things easy since libdl implements the actual low level bits that know how to do this function symbol parsing. However it's worth noting that libdl isn't doing anything magic here, and in the usage I showed it does not look at /proc/self/maps.

The way libdl works with glibc is that it contains the code that knows how to parse an actual ELF file. When you call dlopen(3) and dlsym(3) in the manner I demonstrated, libdl will locate libc-2.19.so by searching the library search path and then it will parse the symbol table to see what offset the unlink(2) function is located at. It can then combine that offset with the library's load address to find the function in the process' memory. Therefore it is possible in principle for an attacker to use this kind of mechanism even if an executable does not link against libdl. Again, in an actual malicious scenario it is unlikely that an attacker could actually accomplish this, since the logic to parse an ELF file and the symbol table is considerable and would have to be injected into a process.

There is another mechanism that is not well understood by many people; this is also what I think would be the most likely way that an attacker would try to bypass ASLR in real scenarios. For typical executables only shared libraries have their locations randomized. The text area for the executable itself will not use randomized memory locations. Therefore if an attacker has a copy of an executable, they can analyze the executable's code and determine effective places to return to. For instance, suppose an attacker wants to unlink a file. If the executable itself has code that calls unlink(2) then instead of trying to actually return to unlink(2), the attacker can instead return to the code that calls unlink(2), which will exist at a well known location. For many attacks this is still difficult since the attacker has to get registers and/or memory into the right state to pass the right arguments to the target function. Still, this is an effective way to bypass the randomization element that ASLR introduces. In some cases one could also use this mechanism even if the executable doesn't actually call the intended C function: the attacker needs only to find a sequence of executable bytes that can be interpreted as the desired function call. Thus one can even intentionally jump to a byte offset that will eventually crash with SIGILL, as long as the bytes decoded before the illegal instruction form a brief instruction run that can be parsed in the desired way.

If you are worried about this last attack you can build a position independent executable (PIE); with GCC this means compiling with -fPIE and linking with -pie. This causes all of an executable's symbols to be loaded at randomized locations using ASLR. The executable will still have a fixed, well-known entry point, but this entry point will just arrange for the C runtime (a.k.a. CRT) initialization functions to load and from there everything will have an unpredictable memory address. Building as a PIE does have non-negligible overhead, particularly on 32-bit systems, but it's indispensable for security-sensitive executables such as ssh.
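
With GCC the build and a quick sanity check look something like this (readelf reports a PIE as type DYN rather than EXEC):

gcc -fPIE -pie -o unlink unlink.c -ldl
readelf -h unlink | grep Type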

One final way to bypass memory protections, which I will mention just for completeness, is by using the ptrace(2) system call. This system call lets one process read and write another process' memory without restriction (i.e. it is not necessary to mprotect(2) anything). This system call is what GDB uses to probe remote processes. The ptrace(2) system call is very powerful, but because it's so powerful its usage is typically quite restricted. At a minimum you need to have the same effective UID as the remote process in order to ptrace it. Additionally there is a file called /proc/sys/kernel/yama/ptrace_scope which can be used to neuter the ptrace(2) system call. By default Debian and Ubuntu ship kernels with the Yama ptrace scope restricted so that a process can only ptrace its own descendants (and only the superuser can relax this setting). In general if you can ptrace another process you have unlimited ability to do anything as that process, so while ptrace can be used to bypass memory protections, any situation in which you can ptrace another process is a situation in which you already are able to compromise the process.


C++ and Thoughts On Java, Go, and Rust

Historically most programming situations that called for high performance, particularly in the field of systems programming, would have been written in C or C++. That has changed a lot over the last twenty five years or so. Java has increasingly been the programming language of choice for high performance server applications. Recently Go and Rust have made inroads as high performance alternatives to Java that claim to offer similar or better performance. Yet I still find that I really love writing C++, and at least for situations where speed or memory utilization are important I prefer writing C++ to any other language. In this post I will explore some of the aspects of C++ that I find appealing, and why Java, Go, and Rust don't make the cut for me.

Why Not C?

I'll start with the easiest question that I am frequently asked when the topic of C++ comes up, which is why I don't like writing C. Actually I don't mind writing C, particularly for smaller programs. Except for a few mostly unimportant corner cases you can think of C as a subset of C++, and therefore a lot of the things I like about C++ also apply to C.

Given the choice though, I do prefer C++. The most important reason for this preference is that C++ has much simpler error handling than C. A commonly occurring pattern in all programming languages is the need to clean up resources after an error has occurred. In languages with garbage collection features this will usually be handled by the garbage collector. C and C++ do not have garbage collectors, but C++ does have an elegant solution to this problem by way of object destructors. Resources that must be freed on error are typically wrapped in C++ by a class or struct that implements a destructor freeing the resource; this pattern is called RAII. The RAII pattern ensures that you can return out of a function at any point, and any objects allocated on the stack prior to the return will have their destructors run immediately. This is simply not possible in C. Instead, C encourages the use of goto statements for error handling. In general I find that these statements make the flow of execution harder to understand, especially if a function allocates more than one or two resources. The goto facility also requires that the programmer remember to use it in every situation where a resource is allocated, whereas C++ object destructors always run automatically, leaving fewer chances to make mistakes.
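
As a minimal sketch of the difference, here's the RAII pattern applied to a FILE handle (the FileHandle wrapper is just an illustration):

// RAII sketch: the destructor closes the file no matter how we return.
#include <cstdio>

class FileHandle {
 public:
  explicit FileHandle(const char *path) : f_(std::fopen(path, "r")) {}
  ~FileHandle() {
    if (f_ != nullptr) {
      std::fclose(f_);
    }
  }
  FileHandle(const FileHandle &) = delete;
  FileHandle &operator=(const FileHandle &) = delete;
  std::FILE *get() const { return f_; }

 private:
  std::FILE *f_;
};

bool FirstByteIsNul(const char *path) {
  FileHandle fh(path);
  if (fh.get() == nullptr) {
    return false;  // early return, no goto-based cleanup needed
  }
  return std::fgetc(fh.get()) == 0;  // the destructor runs here too
}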

Another important reason I prefer C++ to C is that C++ offers a much more complete standard library, particularly with respect to high performance data structures. The C standard library does not really give you any data structures at all. If you want to use things like dynamically sized vectors or hash tables you either need to implement them yourself, or you need to pull in a nonstandard third party library like GLib. As just discussed, these data structure libraries require the programmer to do a lot of manual error handling to ensure that resources are freed in all cases.

Most C data structure libraries make extensive use of preprocessor macros and/or function pointers to implement polymorphism. Both of these approaches have serious limitations. Preprocessor macros operate on a textual level and therefore can easily lead to surprising bugs and compiler errors if used incorrectly. Many text editors and IDEs have difficulty understanding macros for a few inescapable reasons. For instance, it's common to specify preprocessor defines as compilation flags, meaning that if your editor isn't tightly integrated with your build system it has no way to actually understand how macros will be expanded. Function pointers have a different set of problems. For one, the syntax for using function pointers is widely regarded as confusing. The other problem with function pointers is that they usually cause the program to have extra indirection for function invocation because they require an extra pointer lookup. Function pointers also prevent inlining, an important compiler optimization. For many data structure applications this causes significant performance overhead compared to C++ data structures implemented using templates. Some C programs try to work around this problem by way of extensive code generation which allows the compiler to inline functions, but doing this effectively with the cpp preprocessor is either impossible or incredibly verbose and error prone.

Due to these data structure problems, most C++ programs using the STL are faster than C programs using equivalent data structures. Furthermore templates offer a lot of safety features not available to C programs using macros/function pointers. It's always possible to write C code that is as fast as C++, but frequently doing so requires either manually inlining things or the use of novel code generation tools.
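
The classic illustration is sorting: qsort(3) has to call the comparator through a function pointer on every comparison, while std::sort takes the comparator as a template argument and can inline it. A rough sketch of the two side by side:

// Function-pointer comparison (C style) vs. an inlinable comparator (C++).
#include <algorithm>
#include <cstdlib>
#include <vector>

// qsort(3) must call this through a function pointer for every comparison.
static int CompareInts(const void *a, const void *b) {
  int lhs = *static_cast<const int *>(a);
  int rhs = *static_cast<const int *>(b);
  return (lhs > rhs) - (lhs < rhs);
}

void SortBoth(std::vector<int> &a, std::vector<int> &b) {
  std::qsort(a.data(), a.size(), sizeof(int), CompareInts);
  // The lambda is part of the instantiated template, so it can be inlined.
  std::sort(b.begin(), b.end(), [](int lhs, int rhs) { return lhs < rhs; });
}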

The final reason that I prefer C++ to C is that in cases where one does want to call into C code, it's incredibly easy to do so from C++. All major C libraries that I can think of are annotated to allow use from C++ code. Typically this just requires a small amount of boilerplate at the top and bottom of the header file, like this:

#ifdef __cplusplus
extern "C" {
#endif

/* C code goes here */

#ifdef __cplusplus
}
#endif

What this does is ensure that the C++ compiler won't attempt to do name mangling for code in the extern "C" block. This not only allows C++ code to include these C libraries, it can also be used to write C wrappers to C++ code. The ease of calling C code from C++ means that there aren't generally features or libraries that would be available to you in C but not available from C++; and if you do find such a library, it's trivial to modify it to be compatible with C++.

Why C++

C++ has a number of unique features that aren't available in competing languages like Java, Go, and Rust. Some of the features I'll outline here are available in some of these environments, but none of these languages offer all of these features.

The first feature that I really like is C++ templates. This is a widely reviled language feature, and indeed when overused templates can cause a lot of problems, particularly with respect to understandable code and compiler error messages. That being said, C++ templates are an extremely powerful metaprogramming capability, and much more powerful than C macros. I'd rather be given enough rope to hang myself than not have it at all. This is also my main objection to Go: Go has no serious generic metaprogramming capabilities unless you count go generate (which is arguably a lot more confusing than C++ templates). The ability to do polymorphic metaprogramming is an essential part of DRY. As a programmer there's nothing I hate more than writing tedious boilerplate that a compiler could implement automatically (and more succinctly than I could). Templates are an extremely powerful way to write short, correct, and fast programs.

Another feature that I've really come to appreciate recently is first class support for assembly language. The __asm__ keyword is reserved in both C and C++ explicitly to allow compilers to provide the ability to inline assembly code. This is a relatively unique feature. Of the alternative languages I mentioned, only Rust supports inline assembly. One should be judicious when using inline assembly, but in cases where it comes up it is absolutely indispensable.
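
As a tiny example, here's GCC/Clang-style extended inline assembly wrapping the popcnt instruction (x86-64 only; in real code you'd probably reach for __builtin_popcountll instead):

// Inline assembly sketch: count set bits with the popcnt instruction.
#include <cstdint>
#include <cstdio>

static inline uint64_t Popcnt(uint64_t x) {
  uint64_t result;
  __asm__("popcnt %1, %0" : "=r"(result) : "r"(x) : "cc");
  return result;
}

int main() {
  std::printf("popcnt(0xff) = %llu\n",
              static_cast<unsigned long long>(Popcnt(0xff)));
  return 0;
}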

Aside from supporting inline assembly, all major C and C++ debuggers have extremely high quality support for inspecting generated assembly and stepping through assembly. This is a critical feature for understanding how code is optimized and for debugging certain situations. For instance, if you want to understand how indirect jumps work, what's happening in LTO code, or how inlined code works you need to be able to inspect and step through the generated assembly. Once again Rust is the only alternative language out there that really provides first class support for this, in Rust's case by piggybacking off of GDB/LLDB, the GNU and LLVM C/C++ debuggers.

C++ doesn't have an included runtime, by which I specifically mean that there's no garbage collection that will interfere with your optimized code. This is a big deal for a couple of reasons. The first is that for high performance code garbage collectors can easily get in the way much more than they help. It's not uncommon to hear of Java programmers doing things like majorly refactoring their code to make use of things like object pools to try to reduce the frequency with which the garbage collector runs. While this can be an effective technique at mitigating GC pauses, it's an example of a situation where the "feature" of garbage collection causes developers to try to actively work around their programming environment. I contend that this is especially bad because it causes developers to express application logic in unintuitive ways.

The other major problem with garbage collected languages is that they have serious problems dealing with large heap sizes. This is particularly a problem for databases and database-like applications. By database here I mean any program that has a working data set larger than a few gigabytes, and in particular any program where the data size might exceed the amount of memory on a machine. If you try to actually map all of this memory into a runtime like Java the garbage collector will have huge problems with latency (or throughput) as it scans the heap for unreferenced objects. A common way that programs will work around this is by using a structured on-disk format and then using regular disk I/O to access the data. This can be somewhat effective because the kernel page cache will cache recently accessed data in memory. The page cache, however, is always less effective than mapping data into RSS memory because accessing data via the page cache requires extra system calls to be made. This is precisely the reason that high performance database engines like InnoDB (which is written in C++!) map this type of data into userspace memory and avoid using the page cache.

One other problem I've noticed with modern C++ alternatives is that they often have poor support for advanced operating system primitives. For instance, if we continue considering the use case of implementing a database, the Linux kernel offers a number of advanced features for optimizing I/O. Most people should be familiar with the system calls read(2), write(2), and lseek(2). These are the basic foundations of doing disk I/O. There are some less well known alternatives to these that are designed to optimize the number of context switches applications need to make. For instance, the system calls pread(2) and pwrite(2) allow applications to coalesce read/seek and write/seek into one system call in each case respectively. Likewise, Linux offers "vectorized I/O" capabilities via the system calls readv(2) and writev(2) which allow a read or write system call to specify the input/output data as a set of multiple buffers, which allows the application to avoid doing extra memory copies (or extra system calls). So far the system calls I've mentioned are not too well known, but they're critical for writing high performance applications. Linux even goes so far as to implement the baroque system calls preadv(2) and pwritev(2) which allow combining read/write operations with seeking and vectorized I/O, all in a single system call! This is extra fast but also extra unportable; for instance, OS X does not implement preadv(2) or pwritev(2). And all of these things go out the window when trying to target Windows, which has a totally different set of system calls for doing these types of operations. Therefore a lot of these system calls are unavailable in runtimes that try to offer portability as a major selling point. As far as I can tell you can't invoke preadv(2) or pwritev(2) from Java or Go (but you can in Rust if you don't mind calling into unsafe code). Java and Go do implement pread(2) and pwrite(2) but only if you use special interfaces; for instance, with Java you must use java.nio to use pread(2). These languages are at a fundamental disadvantage for implementing high performance database applications because the high performance system calls are either unavailable or difficult to use.
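
To make the vectorized calls concrete, here's a small sketch using preadv(2) to fill two separate buffers from an explicit file offset in a single system call (Linux-specific; the file name is illustrative):

// One preadv(2) call reads into two buffers at an explicit offset,
// replacing what would otherwise be an lseek plus two reads.
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void) {
  int fd = open("/var/tmp/example.dat", O_RDONLY);
  if (fd < 0) {
    perror("open");
    return 1;
  }

  char header[16], body[4096];
  struct iovec iov[2] = {
      {.iov_base = header, .iov_len = sizeof(header)},
      {.iov_base = body, .iov_len = sizeof(body)},
  };

  // Read both buffers starting at offset 512 without touching the file offset.
  ssize_t n = preadv(fd, iov, 2, 512);
  if (n < 0) {
    perror("preadv");
    return 1;
  }
  printf("read %zd bytes\n", n);
  close(fd);
  return 0;
}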

I've used some of the advanced I/O capabilities in Linux as an example here, but there are simpler examples that fall apart too. Try calling fork(2), a Unix system call coming up on 50 years of age, from Go, Rust, or Java. None of these languages support it, because forking a process requires careful handling of file descriptors that survive the fork. As far as I know none of these languages support fork specifically because they internally use evented I/O loops and no one knows how to expose this in a sane way to application developers. This is unfortunate because one of the main use cases for forking is forking a process early in its life cycle, before there even are a lot of file descriptors to worry about, for instance to create a daemon process. You might also run into problems forking a complex C++ application that has a lot of open file descriptors, but at least you have the option to handle the matter, which you don't have at all in other environments.

Why Not Rust

From the analysis I've presented so far it should be somewhat apparent that of the alternative languages to C++, Rust is the one that offers the most features I enjoy from C++. And indeed, I do think that Rust is pretty cool, and it's a lot more palatable to me than Go or Java. Rust is definitely more targeted towards C++ programmers than either Java or Go, and that is born of its history as an attempt to develop a real world alternative to C++ for use at Mozilla.

The main gripe I have with Rust is that while it does have a lot of really powerful features, it's unclear to me what major advantages it's trying to offer over C++. For instance, one of the major language features touted in Rust is pointer ownership with move semantics, which is a powerful way to avoid memory leaks via static verification that memory is freed, and with no runtime overhead. This is indeed a cool feature. It's also exactly equivalent to a C++ std::unique_ptr. Therefore if you want to write code that statically verifies correctness with respect to releasing allocated memory, and with no runtime overhead, you can just as easily use C++ as you can use Rust.
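
For reference, a quick sketch of the std::unique_ptr pattern being referred to (the Connection type is just a stand-in):

// Single ownership with move semantics and no runtime overhead: the
// allocation is freed exactly once, when the owning unique_ptr dies.
#include <memory>
#include <utility>
#include <vector>

struct Connection {
  int fd = -1;
};

int main() {
  std::unique_ptr<Connection> conn = std::make_unique<Connection>();
  std::vector<std::unique_ptr<Connection>> pool;
  pool.push_back(std::move(conn));  // ownership moves into the vector
  // pool.push_back(conn);          // would not compile: unique_ptr can't be copied
  return 0;
}  // everything owned by the pool is freed here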

In fact, if you've been following the recent language developments in C++ (e.g. the recent C++11 and C++14 language standards) you'll find that many of the advanced Rust language features were added to C++ before or at around the same time as their introduction in Rust. Unique pointers are one example of this, but it also applies to other features such as builtin concurrency primitives and move semantics. There are some unique things about Rust such as a lack of null pointers, but even that guarantee goes out the window if you need to call unsafe code, which you will if you are doing advanced systems programming. If you've kept abreast of the latest C++ developments you might appreciate the simplicity of Rust compared to C++, but it's hard to be impressed by the actual language features.

I'm going to keep my eye on Rust because I do think it shows a lot of promise, but for the time being the language features it offers over C++ aren't enough, in my opinion, to give up the tight integration C++ has with C and the native debugging support that GDB offers for C++.

Conclusions

The number of C++ programmers out there is probably diminishing, and certainly the core group of C++ developers in the wild is getting older. This is a direct consequence of the fact that the space historically occupied by C++ is being crowded by other high performance languages. That being said, I don't see C++ going anywhere anytime soon. C++ has been extremely successful at incorporating new features and modernizing itself over the past few years. C++ is also unique in its facilities for tight interoperation with C and assembly.

If there's any unifying principle in the C++ language specification and the way the language has been evolved, it's been to make C++ fast. A lot of the aspects of C++ that people dislike are areas where the language has chosen to give more options to allow compilers to generate fast code at the expense of making the language more complicated or less understandable. It's natural that many developers will feel like that's the wrong tradeoff: that developers spend more time trying to understand and debug programs than they spend trying to eke out every last bit of performance. That's definitely true in many applications, but the fact also remains that there are a lot of specialized situations where performance is paramount. In these situations it's hard to beat C++.


The Curious Case of Position Independent Executables

Most Unix systems, including Linux, use the ELF format for executables and object files. Normally the details of ELF files are invisible to developers, but certain tasks can call for one to peer into their inscrutable depths. One reason that you might need to parse ELF files is when trying to find symbols in another process using the ptrace(2) system call. In particular, by resolving the symbols in the ELF file for a remote process you can do things like figure out which symbols are available in the remote process and where in memory they can actually be found.

Another reason you might explore this route is to do hacky things with ELF executables, which is what I intend to describe in this article. While going down this road I learned some interesting and poorly documented things that I hope to shed some light on for other developers out there.

Quick Aside: Parsing ELF Files

ELF files are composed of a bunch of sections and headers that can be expressed as C structs. That means that you can use something like mmap(2) to directly map the on-disk representation of an ELF file into C data structures. The ELF specification goes into great detail about how all of these structures work. As it turns out, GNU libc (a.k.a. glibc) ships with a header file called elf.h that contains the C struct definitions for you.

GNU libc is licensed under the LGPL which means that if you aren't making modifications to it, you can dynamically link against it in your applications without having it affect the licensing terms of your own code. This means you can use elf.h (and the rest of glibc) freely in your own code. However, you may find that this approach is rather low level, and if you do take this approach you will have to learn a lot of the intricacies of the ELF format to find your way around the various sections and tables.

The details of elf.h are documented in elf(5), meaning that you can read the documentation with the invocation man 5 elf.
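To give a flavor of the low-level approach, here's a minimal sketch that mmaps a file and overlays the structs from elf.h onto the raw bytes. It handles only 64-bit ELF files and does no validation beyond checking the magic bytes:

    #include <elf.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>

    int main(int argc, char **argv) {
        if (argc != 2) {
            fprintf(stderr, "usage: %s ELF-FILE\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        // Map the file read-only; the structs in elf.h describe the bytes in place.
        void *data = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        auto *ehdr = static_cast<const Elf64_Ehdr *>(data);
        if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0) {
            fprintf(stderr, "not an ELF file\n");
            return 1;
        }

        // e_type distinguishes ET_REL, ET_EXEC, ET_DYN, ET_CORE, etc.
        printf("e_type = %u, entry point = %#lx, %u section headers\n",
               (unsigned)ehdr->e_type,
               (unsigned long)ehdr->e_entry,
               (unsigned)ehdr->e_shnum);
        return 0;
    }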

If you have the option, I strongly encourage you to instead look at GNU BFD, the obscurely named GNU "Binary File Descriptor" library. BFD provides the basis for GNU binutils, the standard command-line utilities for working with object files. In particular, GNU BFD is the library that is actually used by such tools as ld (the GNU linker), as (the GNU assembler), gdb (the GNU debugger), and other tools you may have heard of like nm, objdump, and readelf. This means that you're using the exact same library that is actually used by the linker/debugger on your machine. Additionally, BFD provides a high-level abstraction so you can just open a file and do things like get symbols without worrying about the low-level details of the ELF format.
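For comparison, here's roughly what listing the symbols in a file looks like with BFD. Treat this as a sketch: link with -lbfd, and note that some binutils versions are picky about the preprocessor defines at the top.

    // Some binutils versions refuse to compile unless these are defined
    // before bfd.h is included.
    #define PACKAGE "bfd-example"
    #define PACKAGE_VERSION "1.0"
    #include <bfd.h>

    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char **argv) {
        if (argc != 2) {
            fprintf(stderr, "usage: %s FILE\n", argv[0]);
            return 1;
        }

        bfd_init();
        bfd *abfd = bfd_openr(argv[1], nullptr);   // nullptr means "guess the target"
        if (abfd == nullptr || !bfd_check_format(abfd, bfd_object)) {
            fprintf(stderr, "could not open %s as an object file\n", argv[1]);
            return 1;
        }

        // BFD hands back the symbols without making us walk ELF section headers.
        long size = bfd_get_symtab_upper_bound(abfd);
        if (size > 0) {
            asymbol **syms = static_cast<asymbol **>(malloc(size));
            long count = bfd_canonicalize_symtab(abfd, syms);
            for (long i = 0; i < count; i++)
                printf("%s\n", bfd_asymbol_name(syms[i]));
            free(syms);
        }

        bfd_close(abfd);
        return 0;
    }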

There is one catch with BFD, which is that it is licensed under the terms of the GPL. This means that if you use GNU BFD and you want to distribute your work you will have to license your own code under the GPL, which is not the case if you include elf.h when building your application.

ELF File Types

In the header for every ELF file there's a field called e_type which indicates what type of ELF file it is. Besides ET_NONE, the types you should expect to see are ET_REL (relocatable object files, i.e. .o files), ET_EXEC (regular executables), ET_DYN (shared objects; position independent executables are also ET_DYN), and ET_CORE (core dumps).

Symbol Tables

Within the ELF file there will be a number of "sections", and among the section types there's a type that elf.h calls SHT_SYMTAB which holds the symbol tables. An ELF file can have more than one symbol table.

By convention if there is a symbol table called .dynsym it holds relocatable symbols. Relocatable symbols are symbols built in such a way that they can be relocated to arbitrary virtual memory addresses. Typically if you look at a .so shared object file you'll see that all of the visible symbols are put into .dynsym.

By convention if there is a symbol table called .symtab it holds non-relocatable or non-allocatable symbols.

One of the fields in the ELF header holds the "entry point" for the file, which is the virtual memory address that the system will transfer control to when starting an executable. That is, the entry point holds the virtual memory address for the code that bootstraps start up of the process. You can think of this as your main() function although in fact the compiler will generate some stub code that gets invoked before main().

When you create an executable all of the code that you write will typically be put at non-relocatable addresses. The way this works is that the linker decides it's going to put the entry point at an address like 0x400410, and then other symbols will be put at nearby memory locations. So you might end up with your function foo at 0x400800 and your function bar at 0x400896. If foo calls bar, the static linker can emit machine code that calls address 0x400896 directly, which is simple and fast.

When you create a shared library you don't know up front what memory addresses will be available to you. Your library might want to put a symbol at 0x400800, but if the program loading your library already has that memory address mapped then things won't work right. You can imagine what would happen: either the library would jump into arbitrary positions in the program, or the program would jump into arbitrary positions in the library code. Either case would lead to an unpredictable program that would quickly crash with a segfault or illegal instruction. There are different techniques to solve this problem, but the short version is that when you build a shared library the generated machine code won't use hard-coded memory addresses. Instead, the generated code looks up the real address of the target function at runtime. It is the job of the dynamic linker to make sure that when these symbols are loaded into memory everything is set up properly, so the stub code ends up with the correct addresses.

There is some overhead for relocating code like this, so in general it's faster to use hard-coded addresses, but this technique cannot be applied to shared objects.

Things like debugging symbols can also be put into the symbol tables. Since debugging symbols do not need to be mapped into virtual memory at runtime these symbols are called non-allocatable.

To recap: symbols in .dynsym are relocatable and are resolved at runtime by the dynamic linker, while .symtab holds non-allocatable symbols, such as debugging symbols, that are not needed at runtime.

When the dynamic linker loads a shared object it will only look at the .dynsym table. The same is true of dlopen(3), which will only find symbols that are defined in the .dynsym table.

Position Independent Executables

You can use the -fpie or -fPIE compiler flags (together with -pie at link time) to have GCC create what is called a "position independent executable" (a.k.a. PIE). When you do this GCC will only generate relocatable code. There is still an entry point for the program, but all the entry point does is set things up to run the relocatable code.

The main use case for this is ASLR hardening. When a PIE binary starts up under ASLR the kernel picks a random base address at which to load the program's code. This makes it harder to exploit a large class of security vulnerabilities common to C/C++ programs. Most Linux distributions do not typically compile binaries with this option because there is real, measurable overhead to invoking relocatable functions. Distributions like Debian and Ubuntu only compile particularly security sensitive binaries (e.g. ssh) as PIEs. (Traditional non-PIE binaries still get ASLR on Linux, but only for dynamically loaded libraries.)

There is another interesting use case here though. As I mentioned earlier, the dynamic linker and dlopen(3) can find symbols in the .dynsym table but not symbols in the .symtab table. However, by default executables don't put their symbols into .dynsym. If an executable is created as a PIE, the linker has the option of putting the symbols into the .dynsym table. If it does, the symbols are available both to the executable itself and to the dynamic linker and dlopen.

By default GNU ld will not put symbols into .dynsym for PIEs, even though the symbols are relocatable. However, by invoking ld -E you can ask the linker to export the symbols as dynamic symbols. This doesn't change the generated code; it simply adds the symbols to the .dynsym table, which takes up a small amount of additional disk space. If you want to export only certain symbols you can use the --dynamic-list option to control which symbols are exported.

By doing this you can create an ELF executable that can be both run on the command line as well as loaded dynamically by dlopen(3). There are a lot of strange things you can use this for. For instance, I am using this technique to create an executable that is also a Python module. This is mostly for fun—I could just as easily set up the build system to create an executable and a shared object separately—but I think it's pretty neat.
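Here's a sketch of what the dual-use trick looks like. The file names, the hello symbol, and the exact flags are my own illustration, not the code from the repository mentioned below.

    // hello.cc: an ordinary program, built as a PIE with its symbols exported:
    //   g++ -fPIE -pie -Wl,-E -o hello hello.cc
    #include <cstdio>

    extern "C" void hello() { printf("hello from a PIE\n"); }

    int main() {
        hello();
        return 0;
    }

    // loader.cc: a separate program that dlopen()s the executable above:
    //   g++ -o loader loader.cc -ldl
    #include <dlfcn.h>
    #include <cstdio>

    int main() {
        void *handle = dlopen("./hello", RTLD_NOW);
        if (handle == nullptr) {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return 1;
        }
        // hello() is visible here only because -Wl,-E put it into .dynsym.
        auto fn = reinterpret_cast<void (*)()>(dlsym(handle, "hello"));
        if (fn != nullptr) fn();
        return 0;
    }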

I have put a simple example demonstrating the concept on GitHub. If you modify the Makefile to not use -Wl,-E you will see that the dl program will fail to load the symbols.


GNOME Terminal Server

If you are using a Linux desktop then there's a pretty decent chance that the terminal emulator you're using is gnome-terminal, the GNOME terminal emulator.

I have a bad habit of opening way too many GNOME terminal emulator windows on my desktop. By the end of the day I'll easily hit a dozen open gnome-terminal windows. As a consequence of this I have noticed a few really interesting things about how gnome-terminal works.

First: while there is a command called gnome-terminal, you won't see any instances of gnome-terminal in your process listing. Instead you'll see a single process called gnome-terminal-server that is bound to the same TTY as all your other X11 processes.

Second: the gnome-terminal-server instance is multi-threaded, but not thread-per-window. On my machine it always uses four threads even though I may have more (or fewer) than four terminal instances.

Third: the gnome-terminal-server process can get "blocked" and therefore prevent terminal input or output on other terminals!

I will explain what I mean. I have previously written about how I use autofs to develop with NFS and virtual machines. Lately I have had an issue where, the first time I try to access the autofs mount after everything has fully booted, the initial I/O commands in the mounted directory sometimes get "blocked". What I mean by "blocked" is that any I/O on the autofs directory hangs when I first try to access it, and then I'll see all the processes finish their I/O at once. I have written another article about the "uninterruptible sleep" state on Unix, and in that article I specifically discussed how historically that process state is associated with NFS.

I haven't debugged the issue yet. It's probably something I can work out by looking through the appropriate log files on the NFS server (or turning up the right verbosity somewhere to get the logging to happen). Since it only happens once per boot and blocks for a brief time (usually less than a minute) I haven't devoted much time to tracking down the root cause yet. My theory is that it's related to my PS1 prompt variable doing overly complex git state tracking: I run git status at least once every time a prompt needs to be rendered, and the git status command has to do a ton of networked I/O operations to read the .git/ directory and its subdirectories, etc. This is just a hunch though.

Today I rebooted my machine, and in the first gnome-terminal instance that I opened I tried to cd into a git directory within the autofs system. This caused my system to pause, as I have previously observed.

What was new today was that when I launched new terminals they were all blocked as well! I tried to open more terminals to diagnose the issue as it was happening, but since I didn't have any working terminals open yet (this being the first terminal emulator process) I wasn't able to actually debug anything with gdb or strace. After a minute or so the initial window's cd completed and all of the other windows were immediately rendered as well.

What does this mean?

Well, first of all it's interesting that gnome-terminal was designed like this, i.e. without a thread or process per window. It's easy to see why someone might have made this "optimization": multiplexing many terminals through one process reduces memory footprint.

It also means that internally gnome-terminal has some kind of mutex that it's possible to hold while entering an uninterruptible I/O state. This is bad practice: because certain I/O operations can block for a long time (potentially forever), one should not hold any mutexes while doing an uninterruptible I/O operation.
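In pseudo-C++ terms the anti-pattern looks something like the sketch below. This is my illustration of the general problem, not gnome-terminal's actual code.

    #include <mutex>
    #include <unistd.h>

    std::mutex ui_lock;   // imagine this guards state shared by every window

    // Anti-pattern: every other window now waits on ui_lock while this read(2)
    // sits in uninterruptible sleep on a hung NFS mount.
    void update_window_bad(int fd) {
        std::lock_guard<std::mutex> guard(ui_lock);
        char buf[4096];
        read(fd, buf, sizeof(buf));   // may block for a very long time
        // ... update shared state from buf ...
    }

    // Better: do the blocking I/O outside the critical section, then take the
    // lock only long enough to publish the result.
    void update_window_better(int fd) {
        char buf[4096];
        ssize_t n = read(fd, buf, sizeof(buf));
        std::lock_guard<std::mutex> guard(ui_lock);
        // ... update shared state using the n bytes that were read ...
        (void)n;
    }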

Postscript

My blog has quieted down recently: it's been nearly a month since my last post, which is one of the longest gaps since I restarted blogging.

Do not fear, intrepid reader! I have not given up blogging.

Rather, I want to start producing higher quality posts in general. That means fewer ad-hoc posts like this one, and more posts that are systemic narratives of interesting systems.

The next series of posts that I hope to publish involves telling a very long multi-article story about the free software movement, what it means to me personally, and how I have seen it change over the years. I want to talk about the various actors in the free software movement, especially "corporate" influence on free software. I also want to write about the x86/Python/weird low level ptrace stuff I've been doing recently.

Doing this involves a lot of work:

The upshot is that to get better content I need a better copy editing system. I plan to make some behind-the-scenes changes to how I write blog posts so that I can (hopefully) write new content in Google Drive and then have it rendered the same way it is now. That still involves a fair amount of code that I have to write.

In other words, I hope to re-launch my blog with higher quality content soon. Stay tuned.


The Observable Universe

American astronomer Edwin Hubble was the first to notice that the universe is expanding. By "the universe is expanding" I am actually referring to the metric expansion of space.

This means that space is literally expanding all the time. The average distance between the Earth and any other object in the universe is increasing over time. This has been confirmed by many experiments, but the fundamental cause of metric expansion is currently unknown.

Based on symmetry/randomness arguments, we assume that the Earth is not the very center of the universe. In fact, unless we have evidence to suggest otherwise, we should assume that the Earth just has a random location in the universe.

There is something in the universe called the cosmic microwave background (CMB). The CMB is basically a weak microwave signal that we can detect by looking at any part of the sky. It exists because, according to our current theories, this radiation was released once the universe got cool enough, about 380,000 years after the big bang. With very powerful telescopes we can measure the cosmic microwave background radiation to a very high degree of precision, and with a bunch of math we can try to figure out how old the CMB is and how strong it is. This is the oldest light that we can see.

Here's where it gets interesting. We know the universe is expanding, and current evidence suggests that the rate of expansion is increasing. The rule you've read about, that nothing travels faster than the speed of light, does not apply to the metric expansion of space itself. Imagine the universe is a big sphere. If you're sitting on one edge of the sphere trying to look at a light pulse emitted at the other edge, and the space in between is expanding faster than the speed of light, you'll never be able to observe the signal.
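To put rough numbers on that intuition: Hubble's law says recession velocity grows linearly with distance, so beyond some distance space is receding from us faster than light. With a Hubble constant of roughly 70 km/s per megaparsec, the back-of-the-envelope version is:

    v = H_0 * d
    d = c / H_0 ≈ (3 x 10^5 km/s) / (70 km/s per Mpc) ≈ 4300 Mpc ≈ 14 billion light years

The real calculation is more subtle because the expansion rate changes over time, but the basic point stands: there is light out there that can never reach us.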

This is what we mean by the observable universe. The observable universe is what we can see, in principle, based on the metric expansion of the universe. For us the cosmic microwave background is the edge of the observable universe. What's really cool is that the observable universe is shrinking due to metric expansion: there are stars that we can see today, in theory, but won't be able to see tomorrow. The implication is that in many billions of years, if humans still exist, space will look a lot bigger and a lot emptier than it does today.

This is sometimes contrasted with the universe at large. We could live in a universe where there are things so far from us that we'll never see them: literally, given infinite time, their light would never reach us due to cosmic expansion. In fact, we know this is the case. The universe is so big that we'll never be able to see it all.

It's interesting to think of the ancient star systems out there, made of exotic materials, possibly hosting intelligent life, that are part of our universe and yet that we'll never be able to see or interact with.

The size of the actual universe is greater than that of the observable universe. One of the major questions in physics today is how much they differ. If the observable universe is roughly the size of the actual universe then the distinction is mostly hypothetical. On the other hand, if the actual universe is much larger than the observable universe, then it's possible that the answers to open problems in physics, like baryon asymmetry, lie in parts of the universe we will never be able to observe.


The x86-64 Red Zone

The x86-64 ABI makes an interesting guarantee: at any time you can safely use the 128 bytes just below %rsp, because signal and interrupt handlers are not allowed to clobber that area. This is called the red zone.

This is really useful for GDB scripting because it gives you 128 bytes of scratch space that you can use without calling malloc() or related routines. Besides being more efficient, this can be necessary because calls to userspace functions like malloc() don't always work right from GDB (e.g. if you attach to a process that is itself in the middle of a malloc() call, your call to malloc() from GDB could deadlock).

This is also really useful for functions that need a pointer to memory that you'd normally put on the stack. For instance, typically you'd allocate the struct rlimit that you pass to getrlimit(2) on the stack. You can modify the stack from GDB by adjusting %rsp, but the red zone means that in most cases there's no need.
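For reference, the ordinary in-process version of that call looks like the snippet below; the GDB trick is simply to pass an address such as $rsp-128 (inside the red zone) instead of the address of a local variable or a malloc()ed buffer.

    #include <sys/resource.h>
    #include <cstdio>

    int main() {
        // Normally the struct lives on the caller's stack.
        struct rlimit lim;
        if (getrlimit(RLIMIT_NOFILE, &lim) == 0) {
            printf("soft=%llu hard=%llu\n",
                   (unsigned long long)lim.rlim_cur,
                   (unsigned long long)lim.rlim_max);
        }
        // From GDB you can instead point getrlimit at memory in the red zone,
        // which avoids calling malloc() in the inferior entirely.
        return 0;
    }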

I wrote a GDB script for "fixing" the rlimit of remote processes that uses this feature. You can find the code here on GitHub.


Free Medical Data

A lot has been made over "free" software: what it is, how it's different from "open source" software, the merits of copyleft vs. non-copyleft free software, and so on.

One issue that has come to my attention recently is that many medical data sets are proprietary, and this leads to worse patient treatment options.

Here's an example. Let's say you have some sort of cancer, and there are several treatment options available (e.g. radiation therapy, chemotherapy, surgery). There is something called a "nomogram": doctors take a bunch of historical (anonymized) cases with their pre-treatment data points, the treatment option chosen, and the outcome, and from these numbers they can give you an answer like "option X has an A% chance of curing you", "option Y has a B% chance of curing you", and so on. Here's a concrete example. Say you have prostate cancer, indicated by elevated blood PSA levels and confirmed by a prostate biopsy. Based on those factors, and any other relevant factors (age, weight, etc.), a nomogram can be built that estimates, for your specific numbers, the chance that you'll be fully cured of prostate cancer (measured after 5 years) under each treatment option, e.g. radical prostatectomy versus radiation therapy.

I'm not sure of the math behind this, but I believe they use some sort of clustering algorithm to find similar patients and calculate a score based on their treatment results and how similar you were to those patients.

This is really cool, and it lets doctors choose the best treatment option for patients based on statistics from thousands of previous patients. In many cases there is some treatment option that is usually best, but under various special circumstances an alternative is better; this system lets the doctor really choose the best option. For instance, in my father's case a radical prostatectomy would normally have been the treatment of choice for prostate cancer, but based on his nomogram it was discovered that radiation therapy had a much better success rate.

Unfortunately, basically all of these nomogram databases are proprietary. The way it works is that a hospital internally collects these numbers and may share the data with other hospitals (I'm not sure under what IP terms). Then as a hospital you have to choose which nomogram database to use. Typically you'd be paying for such access, and the quality of the nomogram data depends on how many data points it contains.

Fox Chase Cancer Center has a large, free online nomogram database for various cancers. In addition to their own data, which is significant, Fox Chase has a way for other hospitals to submit their own nomogram data, which increases the total information and helps doctors make more accurate predictions. I don't know what the data licensing terms are; presumably you cannot directly download the Fox Chase cancer nomogram data. But at least you can use their online nomogram tools for free.

This issue also recently came up when I broke my scaphoid bone (a wrist fracture), and then my sister broke her shoulder (a humerus fracture). There are a number of treatment techniques for each. Fifty years ago, for either injury they would have just put you in a cast or sling, which turns out to have a pretty high long-term cure rate. Now for both injuries there are multiple treatment options: you can wear a cast and not have surgery, and if you do have surgery there are multiple surgical options. For instance, when my sister fractured her shoulder the main two surgical options were plate-and-screw fixation vs. inserting an intramedullary rod.

There are a bunch of studies of techniques like this that you can find at the National Center for Biotechnology Information (NCBI), which is part of the NIH. For instance, here's a study on intramedullary rods vs. plate-and-screw fixation for humerus fractures.

However, there are a number of problems with this:

It can be hard to quantitatively measure what it means to be "cured" after surgery for something like a bone fracture (e.g. how much mobility you regain, what the recovery period is, etc.), but I think these issues could be worked out. My doctor appeared to be well read on the available literature, but I would have felt a lot more confident about my surgery (and my sister's) if I knew the doctor was making his decision based on a statistical analysis of thousands of similar cases, rather than on what he "normally does" in cases like mine and what a couple of Elsevier articles with small data sets suggest.

The Department of Health has a lot of issues on its hands, but this is one that I think they should focus on seriously. Consider the following class of medical conditions:

In every case I believe the NIH should build an open (anonymized) database of the pre-treatment data, the treatment option chosen, and the efficacy of the treatment. In some cases (say, for very rare conditions) it may not be possible to do this while observing privacy concerns, but surely we can come to a common ground where we take common medical problems (many forms of cancer, bone fractures, etc.) and use these databases to make medical treatment decisions.

Hospitals can be made to submit such data to the NIH as a result of these treatments (in fact, I wouldn't be surprised if they already do). The NIH can enforce this by making its funding to hospitals contingent on this type of data sharing.

I sincerely hope that an effort like this happens in the future. It could save millions of lives, spare people unnecessary pain, and frame the current hot-button debate over "intellectual property" in a good and reasonable way.


Ptrace, Syscall, and Python

I was hunting down an obscure interaction between ptrace(2) and what happens when you attach to a process that is in the middle of a syscall. If you read the ptrace man page there are a lot of options related to how syscalls are handled, and I tried a variety of them (e.g. PTRACE_ATTACH vs. PTRACE_SEIZE followed by PTRACE_INTERRUPT). I was stumped on this problem for most of today, and eventually went to write a simple test case that didn't involve the Python interpreter so I could get a better understanding of what was actually happening.

Oddly, my simple test case worked totally fine: I had no issues with ptrace() while the program was in a syscall. In fact, attaching with ptrace() didn't even cause the syscall to fail with EINTR.
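A test case along these lines exercises the same situation; this is my own sketch of the idea (fork a child that parks itself in a syscall, attach, and look at the registers), not the exact code in the repository mentioned below:

    #include <csignal>
    #include <cstdio>
    #include <sys/ptrace.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main() {
        pid_t pid = fork();
        if (pid == 0) {
            pause();   // child parks itself inside a syscall
            _exit(0);
        }

        sleep(1);      // crude, but gives the child time to enter pause(2)

        // Attach; the child is stopped while still (conceptually) in pause(2).
        ptrace(PTRACE_ATTACH, pid, nullptr, nullptr);
        waitpid(pid, nullptr, 0);

        // On x86-64, orig_rax holds the number of the interrupted syscall.
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, pid, nullptr, &regs);
        printf("child stopped in syscall %lld\n", (long long)regs.orig_rax);

        ptrace(PTRACE_DETACH, pid, nullptr, nullptr);
        kill(pid, SIGKILL);
        waitpid(pid, nullptr, 0);
        return 0;
    }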

I eventually figured out what was going on: it's related to some weird magic in the Python interpreter. I would write about it here but I already wrote about the issue in great depth in the GitHub repo. So if you want to read the rest of the story go to eklitzke/ptrace-syscall and check out the README.