I've been working on adding SystemTap probes to Bitcoin (see my [probes_sdt] (https://github.com/eklitzke/bitcoin/tree/probes_sdt) branch on GitHub). SystemTap is the Linux analog of DTrace (of BSD/macOS fame). You can define probe points in the kernel or in userspace applications, and then write scripts in the SystemTap language that watch for events on these probe points. The SystemTap scripting language is quite powerful, with loops, functions, and data structures like hash tables. The scripts you write get compiled and run as native code. One of the most compelling features of SystemTap is that these probe points incur very low overhead, so they can be added to code and left enabled, even for non-debugging binaries.
Adding SystemTap probes to Bitcoin made me curious about how these probes are implemented, and how much overhead they incur. In particular, I wanted to learn:
- What extra instructions are added to executables with SystemTap probes?
- What happens when you actually trace a process using
stap
? - What's the overhead in the untraced case?
- What's the overhead in the actively traced case?
There are some notes on the userspace probe implementation on the SystemTap wiki, but they weren't detailed enough for me. Thankfully I'm pretty handy with GDB, and was able to figure it by looking at the generated assembly code and process memory space. Here's what I learned.
A Look At Some Instrumented C++ Code
In this example I'll look at a probe I added to the Bitcoin
IsInitialBlockDownload()
function. This is a function that checks whether a
Bitcoin node is still in its initial data download phase, or whether the node is
"caught up" and fully synced. The source code for this function (with my
modifications) is reproduced below:
bool IsInitialBlockDownload()
{
// Once this function has returned false, it must remain false.
static std::atomic<bool> latchToFalse{false};
// Optimization: pre-test latch before taking the lock.
if (latchToFalse.load(std::memory_order_relaxed))
return false;
LOCK(cs_main);
if (latchToFalse.load(std::memory_order_relaxed))
return false;
if (fImporting || fReindex)
return true;
if (chainActive.Tip() == nullptr)
return true;
if (chainActive.Tip()->nChainWork < nMinimumChainWork)
return true;
if (chainActive.Tip()->GetBlockTime() < (GetTime() - nMaxTipAge))
return true;
LogPrintf("Leaving InitialBlockDownload (latching to false)\n");
latchToFalse.store(true, std::memory_order_relaxed);
if (PROBE_FINISH_IBD_ENABLED()) {
PROBE_FINISH_IBD();
}
return false;
}
This function checks various heuristics about how far along the chain is, and
eventually returns false once the node is sufficiently far along. The reason
I've added a probe here is that I've been working on improving initial sync (or
"IBD") time. By adding a probe right after the function latches to false, I can
accurately measure the time from IBD start to IBD finish. As you might guess,
the new lines of code I've added in my branch are the
PROBE_FINISH_IBD_ENABLED()
check and PROBE_FINISH_IBD()
trigger right before
the final return false
. These macros are auto-generated by SystemTap as part
of the build process.
The Generated x86 Assembly
I usually use GDB as my disassembler, so that's what we'll be looking at here.
Note also that I'll be editing the disas
output to only show instructions
related to the probe logic, and will omit the surrounding code. Here's what the
disassembled code for IsInitialBlockDownload()
looks like:
$ gdb ./src/bitcoind
(gdb) disas /s IsInitialBlockDownload
1170 if (PROBE_FINISH_IBD_ENABLED()) {
0x00000000001a2954 <+420>: cmpw $0x0,0x4f6724(%rip) # 0x699080 <bitcoin_finish_ibd_semaphore>
0x00000000001a295c <+428>: je 0x1a2831 <IsInitialBlockDownload()+129>
1171 PROBE_FINISH_IBD();
0x00000000001a2962 <+434>: nop
0x00000000001a2963 <+435>: jmpq 0x1a2831 <IsInitialBlockDownload()+129>
Quick aside for a brief x86 primer. The disas
output here uses AT&T syntax.
x86 instructions can have variable length operands, and in AT&T syntax these are
denoted with length suffixes appended to the instructions. So in this output,
the w
suffix on cmpw
means a cmp
instruction with a "word" (16-bit)
operand. Similarly, jmpq
means a jmp
instruction with a "quad" (64-bit)
operand.
The "if" statement compiles into a
cmp that checks whether if memory
address 0x699080 is numerically equal to 0. There's an annotation in the output
telling us that 0x699080 is named bitcoin_finish_ibd_semaphore
in the actual
ELF file. If the value at that memory address is zero, the
je instruction jumps past the body
of the "if" statement. Otherwise instruction falls through to the
nop instruction, which is a one
byte no-op. After the nop
is executed the code does an unconditional
jmp.
Notice that both jumps (the je
and the jmpq
) both jump to the same location:
0x1a2831. It turns out this is the address of the final return false
statement. GCC generally emits x86 code in a different order than the original
source code for optimization purposes, which is why the instructions here
can't just fall through to the return statement.
This code is a little strange because the flow control is like this:
- Check the value of
bitcoin_finish_ibd_semaphore
- If it's non-zero, execute a no-op instruction, then jump to the final
return false
statement - If it's zero, immediately jump to the final
return false
statement
What's the point of conditionally executing a no-op statement, which by definition does nothing? If you're already familiar with how debugger like GDB work you might be able to guess; if not, read on.
A Look At Memory During SystemTap Tracing
First let's look at disas
output for a running process. I've started a
bitcoind
process with PID 4077. When I execute the same disas
as before, it
looks slightly different:
$ gdb -p 4077
(gdb) disas /s IsInitialBlockDownload
1170 if (PROBE_FINISH_IBD_ENABLED()) {
0x000056352db8d954 <+420>: cmpw $0x0,0x4f6724(%rip) # 0x56352e084080 <bitcoin_finish_ibd_semaphore>
0x000056352db8d95c <+428>: je 0x56352db8d831 <IsInitialBlockDownload()+129>
1171 PROBE_FINISH_IBD();
0x000056352db8d962 <+434>: nop
0x000056352db8d963 <+435>: jmpq 0x56352db8d831 <IsInitialBlockDownload()+129>
The instructions are exactly the same as before, except for one thing: the memory addresses. The memory addresses have changed due to ASLR, which I've covered previously. They're essentially all offset by some randomized amount (0x56352d9eb000 in this example).
Let's also take a look at bitcoin_finish_ibd_semaphore
:
(gdb) x/hx &bitcoin_finish_ibd_semaphore
0x56352e084080 <bitcoin_finish_ibd_semaphore>: 0x0000
Since this shows a value of 0x0000, we know the cmpw
statement will always
jump past the nop
.
Attaching A SystemTap Script
Now I'm going to attach a SystemTap script that probes the point defined previously. Here's my script:
# This probe is run when stap initially attaches.
probe begin {
println("attached to bitcoind process...")
}
# This probe is run when our IBD probe is triggered.
probe process("./src/bitcoind").mark("finish_ibd") {
println("IBD finished")
}
The begin
probe is helpful so I know when SystemTap has finished initializing,
and has actually attached. I run the script like this:
# Run our stap script and attach to process 4077.
$ stap -x 4077 demo.stp
attached to bitcoind process...
Now let's disas
the same area in memory again:
(gdb) disas /s IsInitialBlockDownload
1170 if (PROBE_FINISH_IBD_ENABLED()) {
0x000056352db8d954 <+420>: cmpw $0x0,0x4f6724(%rip) # 0x56352e084080 <bitcoin_finish_ibd_semaphore>
0x000056352db8d95c <+428>: je 0x56352db8d831 <IsInitialBlockDownload()+129>
1171 PROBE_FINISH_IBD();
0x000056352db8d962 <+434>: int3
0x000056352db8d963 <+435>: jmpq 0x56352db8d831 <IsInitialBlockDownload()+129>
Things are different now! The nop
instruction turned into
int3. This is also known
as the "trap" instruction. Here's how it works. When the processor executes this
instruction, it interrupts the kernel. The kernel can handle this however it
wants. Normally this will cause SIGTRAP
to be delivered to the process.
Debuggers like GDB use this mechanism to implement break points. SystemTap does
something slightly different, which I'll explain in a moment.
What about the bitcoin_finish_ibd_semaphore
instruction?
(gdb) x/hx &bitcoin_finish_ibd_semaphore
0x56352e084080 <bitcoin_finish_ibd_semaphore>: 0x0001
This has turned into 0x0001, instead of 0x0000 before. What this means is that
when the program hits the cmpw
instruction, it will now execute the body of
the "if" statement, and that will interrupt the kernel.
How SystemTap Interrupts Work
When you run a SystemTap script, here's what happens:
- The
stap
command transpiles the.stp
code to C code for a Linux kernel module. - The C code is compiled to a native kernel module (a
.ko
file). - The kernel module registers interest in the trap event, and the probe definition you defined gets run as native x86 code from the kernel context.
If this sounds like a dangerous kernel execution vulnerability waiting to be discovered, it's because it is! For this reason only users belonging to certain privileged groups are allowed to execute SystemTap scripts. The reason SystemTap implements probes this way is because it makes them very fast.
What About Other Probe Points?
I ran some additional permutations of the above tests, and found out more about how the semaphores work. When SystemTap runs a script, it looks at which probe points you're actually monitoring. Only the probe points you're monitoring end up changed.
For example, elsewhere in the code I've added probe points to a method called
CCoinsViewCache::FetchCoin
. If I look at that function (for the same process,
while my previous stap
command is still attached), I can see that it's still
unchanged, even though IsInitialBlockDownload()
was modified by SystemTap:
(gdb) disas /s CCoinsViewCache::FetchCoin
49 if (PROBE_CACHE_MISS_ENABLED() && m_enable_probing) {
0x000056352dc2bea8 <+328>: cmpb $0x0,0x80(%rbx)
0x000056352dc2beaf <+335>: je 0x56352dc2bdd6 <CCoinsViewCache::FetchCoin(COutPoint const&) const+118>
49 if (PROBE_CACHE_MISS_ENABLED() && m_enable_probing) {
0x000056352dc2bdc8 <+104>: cmpw $0x0,0x4582ae(%rip) # 0x56352e08407e <bitcoin_cache_miss_semaphore>
0x000056352dc2bdd0 <+112>: jne 0x56352dc2bea8 <CCoinsViewCache::FetchCoin(COutPoint const&) const+328>
50 PROBE_CACHE_MISS();
0x000056352dc2beb5 <+341>: nop
0x000056352dc2beb6 <+342>: jmpq 0x56352dc2bdd6 <CCoinsViewCache::FetchCoin(COutPoint const&) const+118>
This code looks a little different because I'm also checking a member variable
(meaning there are two cmp
checks), but the basic flow is the same, and the
"if" body looks identical. As you can see, this one still has a nop
instruction which is unchanged.
As you'd expect, if I look at bitcoin_cache_miss_semaphore
its value is zero:
(gdb) x/hx &bitcoin_cache_miss_semaphore
0x56352e08407e <bitcoin_cache_miss_semaphore>: 0x0000
What Happens When SystemTap Detaches?
If I subsequently detach my original stap
process, the
IsInitialBlockDownload()
function is restored to its original state, and the
bitcoin_finish_ibd_semaphore
is also becomes zero. I won't reproduce the
disas
here because it's exactly the same as it was before I traced the
process. In general whenever a semaphore count hits zero, its corresponding
int3
instruction is reverted back to a nop
.
How Much Overhead Do These Checks Incur?
When considering the overhead of SystemTap probes, we need to consider two cases: processes that are not actively being traced, and processes that are.
When a process is untraced, each probe point incurs the overhead of one cmp
instruction and one jmp
. Since these instructions are frequently found
together, they end up being fused in modern processors. On a modern processor
(e.g. Intel Skylake) the combination will take anywhere from 0.5 to 2 cycles
(citation). My executive
summary is: the overhead for an untraced process is effectively zero.
When a process is actively being traced, each probe point incurs the overhead of
one cmp
and a context switch. The details of context switch overhead depend on
what kernel version and CPU you have, but the best resource I
found
(from 2010) shows an overhead of anywhere from 2 to 50 microseconds, depending
on CPU affinity and some other details. The code defined in the probe point can
potentially be quite complex (the SystemTap language allows you to run loops,
invoke other functions, mutate hash maps, etc.). So the overhead in this case is
potentially quite large. SystemTap actually accounts for this, and monitors how
much time your probes are taking. If your probes take too long the stap
script
will be killed automatically. This can be tuned with stap
command line
options, e.g. if you have complex probe logic and you don't mind the overhead.
In any event, I think the overhead for the traced event is not that important, because it only happens when you're actively debugging an application. It won't be hit the vast majority of the time, and the important thing is just that the overhead doesn't interfere with timing results you get from the application itself.