Using Ptrace For Fun And Profit

I've been working on a project at work where I am implementing a heap profiler (written in C) for Python. It works in a way that is very similar to the tracemalloc project. The main difference is that my implementation does not require patching the Python interpreter. The tracemalloc project works by changing the definition of struct PyObject to embed a pointer to the allocation information. This has the unfortunate effect of changing the Python ABI (e.g. Python wheels that contain compiled C code need to be recompiled). My implementation avoids changing the struct PyObject definition and therefore maintains ABI compatibility, but comes at the cost of being slightly less memory/CPU efficient. I wrote briefly about the technique I'm using in an earlier blog post. Hopefully I'll be able to write more about the project later (or even get it open sourced).

The first version of my profiler worked by installing a signal handler for SIGUSR1. When the signal handler ran it would dump all of the heap profiling information to a file in /tmp. This works, but it has a couple of limitations:

The original implementation worked by overwriting Python's memory allocation routines using LD_PRELOAD. I got requests to be able to attach to an already running process and start tracking memory information from that point on. This is something that you can't do with LD_PRELOAD since it has to be present at the time the process is started.

Someone advised that I look into the ptrace() system call. The ptrace system call is how GDB works on Linux (and other Unix systems). The ptrace system call is also how strace is implemented. The way it works is that the process that calls ptrace() (which is referred to as the "tracer") attaches to another process (which is referred to as the "tracee"). The tracer then has essentially unlimited power to control, modify, and alter the tracee. Here is an incomplete list of some of the things the tracer can do:

Basically this system call would let me do what I want. Using ptrace you can dynamically attach to a process and patch the PLT and GOT to point to your own custom methods which effectively emulates what LD_PRELOAD does. Using ptrace you can also control the tracee to send in information. For instance, you can invoke a memory dump routine and supply that routine with any arguments you want, e.g. a filename, flags, etc.
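To make the tracer/tracee relationship concrete, here is a minimal sketch of attaching to a process, reading one word of its memory, and detaching again. The helper names peek_word and demo_peek_child are made up for illustration and are not taken from the profiler. The demo forks its own child so it has something it is allowed to trace.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: attach to `pid`, read one word of its memory at
 * `addr`, then detach. Returns 0 on success and -1 on failure. */
int peek_word(pid_t pid, const void *addr, long *out) {
  if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1) return -1;

  /* PTRACE_ATTACH sends SIGSTOP; wait until the tracee has really stopped. */
  int status;
  if (waitpid(pid, &status, 0) == -1 || !WIFSTOPPED(status)) {
    ptrace(PTRACE_DETACH, pid, NULL, NULL);
    return -1;
  }

  /* PEEKDATA returns the word itself, so errors are reported via errno. */
  errno = 0;
  long word = ptrace(PTRACE_PEEKDATA, pid, (void *)addr, NULL);
  int failed = (word == -1 && errno != 0);

  ptrace(PTRACE_DETACH, pid, NULL, NULL);
  if (failed) return -1;
  *out = word;
  return 0;
}

/* Demo: fork a child that just sleeps, then read a known value out of it. */
int demo_peek_child(void) {
  /* After fork() this variable lives at the same address in both processes. */
  static long marker = 0x1234abcdL;

  pid_t child = fork();
  if (child == -1) return -1;
  if (child == 0) {
    for (;;) pause(); /* child: wait around to be traced */
  }

  long got = 0;
  int rc = peek_word(child, &marker, &got);

  kill(child, SIGKILL);
  waitpid(child, NULL, 0);
  return (rc == 0 && got == marker) ? 0 : -1;
}
```

Attaching to an unrelated process works the same way, but it needs the right permissions (same user or root, and on some distributions a relaxed ptrace_scope setting).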

Black Magic

In theory, this is all pretty straightforward. In practice, it is anything but.

First of all, to do anything remotely interesting with ptrace you will need to know some assembler (x86 in my case). This is because if you want to run code in the tracee you will need to inject that code into the process. This means at the very minimum you need to know how to generate either a CALL instruction to call a userspace method or a SYSCALL instruction (SYSENTER or int 0x80 on 32-bit x86) to make a system call. In both cases you need to know the calling convention, i.e. what registers need to be set to make a function call (and possibly what you'll need to push onto the stack).

Let's say you want your tracer to call a userspace method in the tracee. This is surprisingly difficult.

To call a method, you need to know its address. When you're writing your own code that's easy to do:

int x;
printf("address of x is %p\n", (void *)&x);
printf("address of printf is %p\n", (void *)printf);

However, you can't do this with the tracee. Normally what happens when you run a program is that any methods that are part of your code are put into a fixed location when the linker links your object code. Methods that are part of shared libraries are linked into your code at runtime by ld.so when your process starts up. What happens is that ld.so resolves all of the references to library calls (e.g. printf()) and fills in the real addresses in the memory space of your process, typically in the GOT entries that the PLT stubs jump through, so that all of your CALL and JMP instructions end up in the right place.

In fact, it gets even more complicated. For security reasons Linux implements this thing called ASLR (address space layout randomization). This means that when a shared library is loaded it's put into a random memory location. This is done intentionally to make it difficult to find and call arbitrary methods. It has a good purpose: it's meant to improve the security of the system, so that attacks like return to libc are more difficult. But it makes it really hard to call methods using ptrace.

There is also really no programmatic way to find and enumerate methods in C. In theory what you can do is parse the ELF data from the binary you're attached to and its shared libraries, but in practice this is so complex that it's not even worth trying. Tools like nm, objdump, and gdb actually do this. But for regular mortals it is unreasonably complicated to try to do this at runtime in a real program.

However, not all is lost. If you want to call methods in the tracee that are compiled in as part of its source code (i.e. that are static to the executable) you can disassemble the binary and hardcode those locations into your ptracer program. This is kind of awful because it means that you will have to update the hardcoded constants any time the binary changes (e.g. it's updated or recompiled with new flags), but it does work.

If you want to call methods in shared libraries, I found a hack that lets you figure out how to find the locations of methods. Here's how it works. Let's say you want to call a method defined in libc. For our example we'll consider calling the fprintf() method. You look at the file /proc/<pid>/maps for the other process, and that file will tell you where the ASLR decided to actually load the library. This will only work if you have the same user id as that process, or if you are root, since you're not allowed to look at the memory mapping for arbitrary processes as an unprivileged user. Then you look at /proc/self/maps and again, find where your process decided to load libc. Then what you do is in your own process you take the address of fprintf() and subtract from it the address of where libc was loaded in your process. This will give you the number of bytes past the start of libc where fprintf() is defined. Let's say, for example, that when you do this you get 20544.
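The first half of this trick, finding where a library was loaded, can be sketched roughly as follows. The names map_entry, parse_maps_line and find_lib_base are hypothetical, and the sketch assumes that the first mapping whose path contains the search string is the start of the library (the maps file is sorted by address).

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* The fields of one /proc/<pid>/maps line that this sketch cares about. */
struct map_entry {
  unsigned long start, end, offset;
  char perms[5];
  char path[256]; /* empty string for anonymous mappings */
};

/* Parse a line of the form
 *   start-end perms offset dev inode [pathname]
 * Returns 0 on success and -1 if the line does not match. */
int parse_maps_line(const char *line, struct map_entry *e) {
  e->path[0] = '\0';
  int n = sscanf(line, "%lx-%lx %4s %lx %*s %*s %255[^\n]",
                 &e->start, &e->end, e->perms, &e->offset, e->path);
  return n >= 4 ? 0 : -1;
}

/* Return the lowest start address of any mapping in `pid` whose path
 * contains `needle` (e.g. "libc"), or 0 if nothing matches. */
unsigned long find_lib_base(pid_t pid, const char *needle) {
  char maps_path[64], line[512];
  snprintf(maps_path, sizeof maps_path, "/proc/%d/maps", (int)pid);

  FILE *f = fopen(maps_path, "r");
  if (!f) return 0; /* no such process, or no permission */

  unsigned long base = 0;
  while (fgets(line, sizeof line, f)) {
    struct map_entry e;
    if (parse_maps_line(line, &e) == 0 && strstr(e.path, needle)) {
      base = e.start; /* first hit is the lowest address */
      break;
    }
  }
  fclose(f);
  return base;
}
```

Calling find_lib_base(getpid(), "libc") gives the tracer's own libc base, and passing the tracee's pid gives the other one. A substring match is crude (it would also match a library such as libcap), so a real tool would want a stricter comparison.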

Now that you know that fprintf() is defined 20544 bytes after the start of libc, you can guess that fprintf() in the tracee is located 20544 bytes after the start of that process's libc. So using the libc start you found from looking at the proc maps file, you can compute the address of fprintf() in the tracee!
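The arithmetic itself is tiny. As a sketch (remote_symbol_addr is a made-up name, and the addresses in the usage note are only examples):

```c
#include <stdint.h>

/* Translate the address of a symbol in our own process into the address
 * of the same symbol in the tracee, assuming both processes have mapped
 * the exact same version of the library. */
uintptr_t remote_symbol_addr(uintptr_t local_sym, uintptr_t local_base,
                             uintptr_t remote_base) {
  uintptr_t offset = local_sym - local_base; /* e.g. 20544 for fprintf() */
  return remote_base + offset;
}
```

If fprintf() sits 20544 bytes past the start of libc in the tracer, and the tracee loaded libc at 0x7fd3edb64000, this returns 0x7fd3edb64000 + 20544 = 0x7fd3edb69040.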

You can also use this technique for other arbitrary shared libraries by just linking your tracer against them. For instance, let's say I want my tracer to remotely call a method that's defined by Python, such as PyObject_Malloc(). If I link my ptracer using -lpython2.7 or -lpython3.4m then I can use the same trick to find where a running instance of /usr/bin/python has PyObject_Malloc() loaded.

There is one major caveat here. On Linux you can update shared libraries at any time. Old processes that have already loaded the shared library will continue to have the previous version of the library loaded into memory. So if you try this technique against a process that's been running for a long time (say, weeks or months) it's possible that a library method you want to call might have changed locations if the shared library has been updated since the tracee was started.

Show Me The Code!

I put some code up on GitHub that shows exactly how to employ this technique. The project is at eklitzke/ptrace-call-userspace and if you want to just dive directly into the code look at call_fprintf.c. The code runs the equivalent of:

fprintf(stderr, "instruction pointer = %p\n", rip);

in the traced process, where rip is the value of the instruction pointer when the process was initially attached by the tracer.

If you look at the code you can see that it's actually pretty short, and overall isn't that complicated. However, it took me a lot of time to actually get this working. You can't use GDB with a program you are ptrace attached to, since only one process is allowed to ptrace a given tracee at a time. So debugging was more difficult than with a normal C program: you're limited to debugging core dumps if you crash the tracee. If you make a mistake, you're likely to not just get the pedestrian SIGSEGV, but the much more exciting SIGILL, which means that you generated illegal x86 code! I also had effectively zero experience with x86 before starting this project. There's not even a lot of x86 in this project, but I spent a lot of time disassembling other C programs to understand how they work. I also found the whole CALL thing in x86 confusing since the target address is encoded in a weird way. You don't call an absolute address like 0x7fd3edb64000. Instead, the instruction encodes a signed 32-bit displacement relative to the address of the next instruction: you take your call target and subtract the address of the CALL instruction plus the size of the CALL instruction itself (5 bytes for a rel32 call).
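The CALL encoding can be sketched like this. The function name encode_call_rel32 is hypothetical and this is not the code from the GitHub project, just an illustration of the displacement calculation for the E8 (CALL rel32) opcode.

```c
#include <stdint.h>

/* Encode `CALL target` as a 5-byte rel32 instruction that will be placed
 * at address `insn_addr`. Returns 0 on success, or -1 if the target is
 * too far away to fit in a signed 32-bit displacement. */
int encode_call_rel32(uint8_t out[5], uint64_t insn_addr, uint64_t target) {
  /* The displacement is relative to the instruction after the CALL. */
  int64_t rel = (int64_t)(target - (insn_addr + 5));
  if (rel < INT32_MIN || rel > INT32_MAX) return -1;

  uint32_t bits = (uint32_t)(int32_t)rel;
  out[0] = 0xE8; /* opcode for CALL rel32 */
  for (int i = 0; i < 4; i++) {
    out[1 + i] = (uint8_t)(bits >> (8 * i)); /* little-endian displacement */
  }
  return 0;
}
```

For example, a CALL placed at 0x1000 that targets 0x2000 has a displacement of 0x2000 - 0x1005 = 0xFFB, so the bytes are E8 FB 0F 00 00. A target more than about 2 GiB away from the instruction cannot be reached this way; in that case you have to load the address into a register and use an indirect call instead.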

I had a lot of fun writing this and felt like I learned a lot about reading disassembled code. I also have a newfound appreciation for the objdump command and all of the amazing things you can do with it, and I learned a lot of fun new GDB commands for looking at register state and frobbing memory and whatnot. I hope other people find this example code useful in their own projects.