How SystemTap Userspace Probes Work

March 1, 2018

I've been working on adding SystemTap probes to Bitcoin (see my [probes_sdt] (https://github.com/eklitzke/bitcoin/tree/probes_sdt) branch on GitHub). SystemTap is the Linux analog of DTrace (of BSD/macOS fame). You can define probe points in the kernel or in userspace applications, and then write scripts in the SystemTap language that watch for events on these probe points. The SystemTap scripting language is quite powerful, with loops, functions, and data structures like hash tables. The scripts you write get compiled and run as native code. One of the most compelling features of SystemTap is that these probe points incur very low overhead, so they can be added to code and left enabled, even for non-debugging binaries.

Adding SystemTap probes to Bitcoin made me curious about how these probes are implemented, and how much overhead they incur. In particular, I wanted to learn:

What extra instructions are added to executables with SystemTap probes?
What happens when you actually trace a process using stap?
What's the overhead in the untraced case?
What's the overhead in the actively traced case?

There are some notes on the userspace probe implementation on the SystemTap wiki, but they weren't detailed enough for me. Thankfully I'm pretty handy with GDB, and was able to figure it by looking at the generated assembly code and process memory space. Here's what I learned.

A Look At Some Instrumented C++ Code

In this example I'll look at a probe I added to the Bitcoin IsInitialBlockDownload() function. This is a function that checks whether a Bitcoin node is still in its initial data download phase, or whether the node is "caught up" and fully synced. The source code for this function (with my modifications) is reproduced below:

bool IsInitialBlockDownload()
{
    // Once this function has returned false, it must remain false.
    static std::atomic<bool> latchToFalse{false};
    // Optimization: pre-test latch before taking the lock.
    if (latchToFalse.load(std::memory_order_relaxed))
        return false;

    LOCK(cs_main);
    if (latchToFalse.load(std::memory_order_relaxed))
        return false;
    if (fImporting || fReindex)
        return true;
    if (chainActive.Tip() == nullptr)
        return true;
    if (chainActive.Tip()->nChainWork < nMinimumChainWork)
        return true;
    if (chainActive.Tip()->GetBlockTime() < (GetTime() - nMaxTipAge))
        return true;
    LogPrintf("Leaving InitialBlockDownload (latching to false)\n");
    latchToFalse.store(true, std::memory_order_relaxed);
    if (PROBE_FINISH_IBD_ENABLED()) {
        PROBE_FINISH_IBD();
    }
    return false;
}

This function checks various heuristics about how far along the chain is, and eventually returns false once the node is sufficiently far along. The reason I've added a probe here is that I've been working on improving initial sync (or "IBD") time. By adding a probe right after the function latches to false, I can accurately measure the time from IBD start to IBD finish. As you might guess, the new lines of code I've added in my branch are the PROBE_FINISH_IBD_ENABLED() check and PROBE_FINISH_IBD() trigger right before the final return false. These macros are auto-generated by SystemTap as part of the build process.

The Generated x86 Assembly

I usually use GDB as my disassembler, so that's what we'll be looking at here. Note also that I'll be editing the disas output to only show instructions related to the probe logic, and will omit the surrounding code. Here's what the disassembled code for IsInitialBlockDownload() looks like:

$ gdb ./src/bitcoind

(gdb) disas /s IsInitialBlockDownload

1170	    if (PROBE_FINISH_IBD_ENABLED()) {
   0x00000000001a2954 <+420>:	cmpw   $0x0,0x4f6724(%rip)        # 0x699080 <bitcoin_finish_ibd_semaphore>
   0x00000000001a295c <+428>:	je     0x1a2831 <IsInitialBlockDownload()+129>

1171	        PROBE_FINISH_IBD();
   0x00000000001a2962 <+434>:	nop
   0x00000000001a2963 <+435>:	jmpq   0x1a2831 <IsInitialBlockDownload()+129>

Quick aside for a brief x86 primer. The disas output here uses AT&T syntax. x86 instructions can have variable length operands, and in AT&T syntax these are denoted with length suffixes appended to the instructions. So in this output, the w suffix on cmpw means a cmp instruction with a "word" (16-bit) operand. Similarly, jmpq means a jmp instruction with a "quad" (64-bit) operand.

The "if" statement compiles into a cmp that checks whether if memory address 0x699080 is numerically equal to 0. There's an annotation in the output telling us that 0x699080 is named bitcoin_finish_ibd_semaphore in the actual ELF file. If the value at that memory address is zero, the je instruction jumps past the body of the "if" statement. Otherwise instruction falls through to the nop instruction, which is a one byte no-op. After the nop is executed the code does an unconditional jmp.

Notice that both jumps (the je and the jmpq) both jump to the same location: 0x1a2831. It turns out this is the address of the final return false statement. GCC generally emits x86 code in a different order than the original source code for optimization purposes, which is why the instructions here can't just fall through to the return statement.

This code is a little strange because the flow control is like this:

Check the value of bitcoin_finish_ibd_semaphore
If it's non-zero, execute a no-op instruction, then jump to the final return false statement
If it's zero, immediately jump to the final return false statement

What's the point of conditionally executing a no-op statement, which by definition does nothing? If you're already familiar with how debugger like GDB work you might be able to guess; if not, read on.

A Look At Memory During SystemTap Tracing

First let's look at disas output for a running process. I've started a bitcoind process with PID 4077. When I execute the same disas as before, it looks slightly different:

$ gdb -p 4077

(gdb) disas /s IsInitialBlockDownload

1170	    if (PROBE_FINISH_IBD_ENABLED()) {
   0x000056352db8d954 <+420>:	cmpw   $0x0,0x4f6724(%rip)        # 0x56352e084080 <bitcoin_finish_ibd_semaphore>
   0x000056352db8d95c <+428>:	je     0x56352db8d831 <IsInitialBlockDownload()+129>

1171	        PROBE_FINISH_IBD();
   0x000056352db8d962 <+434>:	nop
   0x000056352db8d963 <+435>:	jmpq   0x56352db8d831 <IsInitialBlockDownload()+129>

The instructions are exactly the same as before, except for one thing: the memory addresses. The memory addresses have changed due to ASLR, which I've covered previously. They're essentially all offset by some randomized amount (0x56352d9eb000 in this example).

Let's also take a look at bitcoin_finish_ibd_semaphore:

(gdb) x/hx &bitcoin_finish_ibd_semaphore
0x56352e084080 <bitcoin_finish_ibd_semaphore>:	0x0000

Since this shows a value of 0x0000, we know the cmpw statement will always jump past the nop.

Attaching A SystemTap Script

Now I'm going to attach a SystemTap script that probes the point defined previously. Here's my script:

# This probe is run when stap initially attaches.
probe begin {
  println("attached to bitcoind process...")
}

# This probe is run when our IBD probe is triggered.
probe process("./src/bitcoind").mark("finish_ibd") {
  println("IBD finished")
}

The begin probe is helpful so I know when SystemTap has finished initializing, and has actually attached. I run the script like this:

# Run our stap script and attach to process 4077.
$ stap -x 4077 demo.stp
attached to bitcoind process...

Now let's disas the same area in memory again:

(gdb) disas /s IsInitialBlockDownload

1170	    if (PROBE_FINISH_IBD_ENABLED()) {
   0x000056352db8d954 <+420>:	cmpw   $0x0,0x4f6724(%rip)        # 0x56352e084080 <bitcoin_finish_ibd_semaphore>
   0x000056352db8d95c <+428>:	je     0x56352db8d831 <IsInitialBlockDownload()+129>

1171	        PROBE_FINISH_IBD();
   0x000056352db8d962 <+434>:	int3   
   0x000056352db8d963 <+435>:	jmpq   0x56352db8d831 <IsInitialBlockDownload()+129>

Things are different now! The nop instruction turned into int3. This is also known as the "trap" instruction. Here's how it works. When the processor executes this instruction, it interrupts the kernel. The kernel can handle this however it wants. Normally this will cause SIGTRAP to be delivered to the process. Debuggers like GDB use this mechanism to implement break points. SystemTap does something slightly different, which I'll explain in a moment.

What about the bitcoin_finish_ibd_semaphore instruction?

(gdb) x/hx &bitcoin_finish_ibd_semaphore
0x56352e084080 <bitcoin_finish_ibd_semaphore>:	0x0001

This has turned into 0x0001, instead of 0x0000 before. What this means is that when the program hits the cmpw instruction, it will now execute the body of the "if" statement, and that will interrupt the kernel.

How SystemTap Interrupts Work

When you run a SystemTap script, here's what happens:

The stap command transpiles the .stp code to C code for a Linux kernel module.
The C code is compiled to a native kernel module (a .ko file).
The kernel module registers interest in the trap event, and the probe definition you defined gets run as native x86 code from the kernel context.

If this sounds like a dangerous kernel execution vulnerability waiting to be discovered, it's because it is! For this reason only users belonging to certain privileged groups are allowed to execute SystemTap scripts. The reason SystemTap implements probes this way is because it makes them very fast.

What About Other Probe Points?

I ran some additional permutations of the above tests, and found out more about how the semaphores work. When SystemTap runs a script, it looks at which probe points you're actually monitoring. Only the probe points you're monitoring end up changed.

For example, elsewhere in the code I've added probe points to a method called CCoinsViewCache::FetchCoin. If I look at that function (for the same process, while my previous stap command is still attached), I can see that it's still unchanged, even though IsInitialBlockDownload() was modified by SystemTap:

(gdb) disas /s CCoinsViewCache::FetchCoin

49	    if (PROBE_CACHE_MISS_ENABLED() && m_enable_probing) {
   0x000056352dc2bea8 <+328>:	cmpb   $0x0,0x80(%rbx)
   0x000056352dc2beaf <+335>:	je     0x56352dc2bdd6 <CCoinsViewCache::FetchCoin(COutPoint const&) const+118>

49	    if (PROBE_CACHE_MISS_ENABLED() && m_enable_probing) {
   0x000056352dc2bdc8 <+104>:	cmpw   $0x0,0x4582ae(%rip)        # 0x56352e08407e <bitcoin_cache_miss_semaphore>
   0x000056352dc2bdd0 <+112>:	jne    0x56352dc2bea8 <CCoinsViewCache::FetchCoin(COutPoint const&) const+328>

50	        PROBE_CACHE_MISS();
   0x000056352dc2beb5 <+341>:	nop
   0x000056352dc2beb6 <+342>:	jmpq   0x56352dc2bdd6 <CCoinsViewCache::FetchCoin(COutPoint const&) const+118>

This code looks a little different because I'm also checking a member variable (meaning there are two cmp checks), but the basic flow is the same, and the "if" body looks identical. As you can see, this one still has a nop instruction which is unchanged.

As you'd expect, if I look at bitcoin_cache_miss_semaphore its value is zero:

(gdb) x/hx &bitcoin_cache_miss_semaphore
0x56352e08407e <bitcoin_cache_miss_semaphore>:	0x0000

What Happens When SystemTap Detaches?

If I subsequently detach my original stap process, the IsInitialBlockDownload() function is restored to its original state, and the bitcoin_finish_ibd_semaphore is also becomes zero. I won't reproduce the disas here because it's exactly the same as it was before I traced the process. In general whenever a semaphore count hits zero, its corresponding int3 instruction is reverted back to a nop.

How Much Overhead Do These Checks Incur?

When considering the overhead of SystemTap probes, we need to consider two cases: processes that are not actively being traced, and processes that are.

When a process is untraced, each probe point incurs the overhead of one cmp instruction and one jmp. Since these instructions are frequently found together, they end up being fused in modern processors. On a modern processor (e.g. Intel Skylake) the combination will take anywhere from 0.5 to 2 cycles (citation). My executive summary is: the overhead for an untraced process is effectively zero.

When a process is actively being traced, each probe point incurs the overhead of one cmp and a context switch. The details of context switch overhead depend on what kernel version and CPU you have, but the best resource I found (from 2010) shows an overhead of anywhere from 2 to 50 microseconds, depending on CPU affinity and some other details. The code defined in the probe point can potentially be quite complex (the SystemTap language allows you to run loops, invoke other functions, mutate hash maps, etc.). So the overhead in this case is potentially quite large. SystemTap actually accounts for this, and monitors how much time your probes are taking. If your probes take too long the stap script will be killed automatically. This can be tuned with stap command line options, e.g. if you have complex probe logic and you don't mind the overhead.

In any event, I think the overhead for the traced event is not that important, because it only happens when you're actively debugging an application. It won't be hit the vast majority of the time, and the important thing is just that the overhead doesn't interfere with timing results you get from the application itself.