LD_PRELOAD Hacks

I did some fun things tonight with LD_PRELOAD and GDB that made me feel a bit like a real hacker.

Some Context

I'm trying to identify memory bloat in Python applications. If you're familiar with Python you know that there are already a bunch of heap profiling tools for Python, and they all kind of suck. They are all implemented using the gc module, which gives you three useful functions:

gc.get_objects()
gc.get_referrers(*objs)
gc.get_referents(*objs)

Pretty much all of the existing heap profiling tools take the same approach: periodically inspect the heap and generate a count of all of the object types and/or the references between the objects. Then you're supposed to look at this output, figure out "oh, it looks like there are a lot of SomeClass objects allocated", and then maybe look at the object references and try to figure out why that's happening.

In a pinch this will work, but it's really hard to use. A much better way to do heap profiling is how the Valgrind massif tool does it. What massif will do is tell you, at any given point in time, where all of the allocated memory was actually allocated, tracking the full stack for each allocation. The difference here is huge: you can look at the massif output and know "20% of my memory was allocated in foo.cc:214" or "30% of the memory was allocated by libleak.so" and then get an idea of where in your code to actually start looking to understand the problem. The Python tools don't give you this: they can tell you that you have a lot of dicts and lists and ints and whatnot, but they won't tell you where those objects actually came from in your code, i.e. which lines of code you should be looking at.
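For reference, a typical massif run looks like this (./myprogram is a stand-in, the massif.out suffix is the PID of the profiled process, and ms_print ships with Valgrind):

valgrind --tool=massif ./myprogram
ms_print massif.out.12345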

Actually, some smart people knew about this a long time ago and came up with the tracemalloc tool. If you are lucky enough to be running Python 3.4 or later this feature is built into Python. If you're on Python 2.x or an earlier Python 3.x release you have to patch your Python interpreter to use tracemalloc.

The way tracemalloc works is it extends the struct PyObject type to have a pointer to some allocation information. This is the natural way to implement heap tracking, and it's also the most efficient way to do it. However, there's a serious drawback: because it changes the size of struct PyObject, it breaks ABI compatibility with pre-compiled C extensions linked against Python. So if your code uses a library like MySQLdb or numpy you'll need to rebuild those libraries from source.
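Conceptually the change looks something like this (an illustration, not the actual patch; tracemalloc's real bookkeeping is more involved):

/* The normal object header in CPython 2.7 is roughly: */
typedef struct _object {
    Py_ssize_t ob_refcnt;         /* reference count */
    struct _typeobject *ob_type;  /* the object's type */
    /* tracemalloc-style embedding adds something like: */
    traceback_t *ob_traceback;    /* hypothetical: where this object
                                     was allocated */
} PyObject;

Every C extension compiled against the original header computed field offsets and object sizes assuming the smaller struct, which is why the ABI breaks.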

I had another idea. My idea was to use LD_PRELOAD to load in patched versions of the following three functions:

PyObject_Malloc
PyObject_Realloc
PyObject_Free

Then what I would do is dlopen() libpython2.7.so and have my versions call into the implementations defined there. My versions can then track the information I care about (the allocation size and some basic Python stack information) for pointers allocated and freed by these functions. The information is stored in a hash table keyed on the void * and mapping to a pointer to my custom allocation struct.
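Here's a minimal sketch of the idea (illustrative only: no PyObject_Realloc, no locking, no actual stack capture, and error handling omitted):

#include <dlfcn.h>
#include <stdlib.h>
#include <glib.h>

struct alloc_info {
    size_t size;
    /* ... Python stack information would go here ... */
};

static void *(*real_malloc)(size_t);
static void (*real_free)(void *);
static GHashTable *allocs;  /* void* -> struct alloc_info* */

static void init(void) {
    void *lib = dlopen("libpython2.7.so.1.0", RTLD_NOW | RTLD_GLOBAL);
    real_malloc = dlsym(lib, "PyObject_Malloc");
    real_free = dlsym(lib, "PyObject_Free");
    allocs = g_hash_table_new_full(g_direct_hash, g_direct_equal, NULL, free);
}

void *PyObject_Malloc(size_t size) {
    if (!real_malloc)
        init();
    void *p = real_malloc(size);
    if (p) {
        struct alloc_info *info = malloc(sizeof(*info));
        info->size = size;
        g_hash_table_insert(allocs, p, info);  /* track this allocation */
    }
    return p;
}

void PyObject_Free(void *p) {
    if (!real_free)
        init();
    g_hash_table_remove(allocs, p);  /* frees the alloc_info too */
    real_free(p);
}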

This is quite a bit less efficient than the tracemalloc approach because it has to use a hash table rather than structural embedding, so it uses more memory and more CPU time. However, it's nice because you don't need to patch Python (scary), you don't change the Python ABI, and if you don't LD_PRELOAD the patched versions there is literally zero runtime overhead (since you're running an unpatched Python).

The Problem

I got this all working pretty easily on my Fedora 23 workstation, using GLib's GHashTable to implement the hash table data structure. I ran into problems when running it on Ubuntu 12.04 (a.k.a. "Precise").

The first problem I had is that /usr/bin/python2.7 on Fedora is actually a tiny (~8KB) binary that links against libpython2.7.so. On Ubuntu it does not: Ubuntu statically links the Python core into the python binary instead. It turns out that this behavior is toggled by the configure script in the Python build chain. If you want the dynamic linking behavior (as I do) you want to run configure like:

./configure --enable-shared

You can then confirm that the binary you build links against libpython2.7.so by examining the output of ldd.
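For example, from the build directory:

ldd ./python | grep libpython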

Once I got this working, I ran into another problem. I was able to build my preload DSO fine on Ubuntu, but when I actually tried to load it with LD_PRELOAD I'd get errors about unresolved symbols related to GLib. ELF binaries and libraries have DT_NEEDED entries that list the libraries they depend on. This was working correctly on Fedora, but not on Ubuntu. From my reading of the man pages on that system I thought I needed -Wl,--copy-dt-needed-entries; it turns out that I actually needed -Wl,--no-as-needed. The names of these options and the default Debian/Ubuntu behavior have changed multiple times over the last few years, so it's unclear exactly which options are needed for which ld and Debian/Ubuntu release. I just ended up using both in the Makefile, as shown below.
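A link line using both options looks something like this (the source file name is illustrative; the GLib flags come from pkg-config):

gcc -shared -fPIC -o libpytraceobject.so pytraceobject.c \
    -Wl,--no-as-needed -Wl,--copy-dt-needed-entries \
    $(pkg-config --cflags --libs glib-2.0) -ldl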

I got this working and saw that my preload DSO could start up, initialize itself, initialize the GLib hash table, and all that. Great. I could not, however, get the python binary that I compiled to actually use the PyObject_* symbols exported by my DSO.

I eventually solved the problem using some GDB hacks. I'll explain:

evan@azrael ~ $ export LD_PRELOAD=pytraceobject/libpytraceobject.so

evan@azrael ~ $ gdb ./Python-2.7.11/python
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/evan/Python-2.7.11/python...done.

Here I've pointed LD_PRELOAD at my DSO and started GDB with the python binary I compiled. Great. Now what I want to do is have it run until main starts so I can see what symbols are loaded:

(gdb) break main
Breakpoint 1 at 0x400630: file ./Modules/python.c, line 23.

(gdb) run
Starting program: /home/evan/Python-2.7.11/python
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffa000
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, main (argc=1, argv=0x7fffffffea98) at ./Modules/python.c:23
23              return Py_Main(argc, argv);

Here you can see we've entered main(). Let's look up PyObject_Malloc:

(gdb) info functions PyObject_Malloc
All functions matching regular expression "PyObject_Malloc":

Non-debugging symbols:
0x00007ffff7bd7daa  PyObject_Malloc
0x00007ffff7809cd0  PyObject_Malloc

There are two symbols, which is what we expect. I inspected the process with pmap and confirmed that the address for the first one listed is from the DSO I created, and the second one is from libpython2.7.so.
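If you want to check this yourself, you can run pmap against the inferior's PID from another terminal and see which mapping each address falls inside:

pmap <pid> | grep -e pytraceobject -e libpython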

(gdb) info addr PyObject_Malloc
Symbol "PyObject_Malloc" is at 0x00007ffff7bd7daa in a file compiled without debugging.

Here GDB is saying that it thinks PyObject_Malloc resolves to the one in my DSO. OK, let's set a break point and see what actually happens:

(gdb) break PyObject_Malloc
Breakpoint 2 at 0x7ffff7bd7dae (2 locations)

(gdb) c
Continuing.

Breakpoint 2, 0x00007ffff7809cd0 in PyObject_Malloc () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0

We can see here that the version it actually stopped at is the one loaded by /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0. You can see exactly why this is by disassembling the call site:

(gdb) up
#1  0x00007ffff77ea4c9 in _PyObject_GC_Malloc () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0

(gdb) disas 0x00007ffff77ea4c9
Dump of assembler code for function _PyObject_GC_Malloc:
   0x00007ffff77ea4b0 <+0>:     movabs $0x7fffffffffffffdf,%rax
   0x00007ffff77ea4ba <+10>:    push   %rbx
   0x00007ffff77ea4bb <+11>:    cmp    %rax,%rdi
   0x00007ffff77ea4be <+14>:    ja     0x7ffff77ea520 <_PyObject_GC_Malloc+112>
   0x00007ffff77ea4c0 <+16>:    add    $0x20,%rdi
   0x00007ffff77ea4c4 <+20>:    callq  0x7ffff7809cd0 <PyObject_Malloc>
=> 0x00007ffff77ea4c9 <+25>:    test   %rax,%rax
   0x00007ffff77ea4cc <+28>:    mov    %rax,%rbx
   0x00007ffff77ea4cf <+31>:    je     0x7ffff77ea520 <_PyObject_GC_Malloc+112>
   0x00007ffff77ea4d1 <+33>:    movq   $0xfffffffffffffffe,0x10(%rax)
   0x00007ffff77ea4d9 <+41>:    mov    0x3763e5(%rip),%eax        # 0x7ffff7b608c4
   0x00007ffff77ea4df <+47>:    mov    0x3763db(%rip),%edx        # 0x7ffff7b608c0
   0x00007ffff77ea4e5 <+53>:    add    $0x1,%eax
   0x00007ffff77ea4e8 <+56>:    cmp    %edx,%eax
   0x00007ffff77ea4ea <+58>:    mov    %eax,0x3763d4(%rip)        # 0x7ffff7b608c4
   0x00007ffff77ea4f0 <+64>:    jle    0x7ffff77ea510 <_PyObject_GC_Malloc+96>

The important thing here is the call at <+20>: it's a direct call to a fixed address. It's not going through the PLT.

In hindsight, the problem was pretty obvious. Since the system libpython2.7.so wasn't compiled as position-independent code, it doesn't use the PLT for its own symbol lookups. That means that once you call into it, all of the Python symbols are resolved within that DSO as if they were static. While I was debugging this I was aware from the beginning that my custom binary was using the system libpython2.7.so and not the one I had compiled. However, due to my relatively weak knowledge of how dynamic linking actually works, I had assumed it would not be an issue: I merely wanted a binary that would invoke Python via libpython2.7.so so that I could override symbols with LD_PRELOAD.

In fact, the system version is itself a relocatable shared object. The problem is that it's compiled with -shared but not -fpic. This means that all of the symbol relocations internally are resolved at load time. When compiling with -fpic (or -fPIC) the code will be fully position independent, meaning that all function calls will be made via the PLT. Eli Bendersky has two great articles on this topic, covering the difference between load-time relocation of shared libraries and position independent code in shared libraries. When building Python with ./configure --enable-shared you're actually instructing Python to generate fully position independent code.
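One way to see this from the outside: a shared object containing non-PIC code needs text relocations, which show up as a TEXTREL entry in its dynamic section:

readelf -d /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0 | grep TEXTREL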

This is easily fixable in my case by setting LD_LIBRARY_PATH so that the custom python binary I built prefers the libpython2.7.so that I built, which is compiled as a PIC shared library. In a real production deployment we would configure the dynamic linker so that this isn't needed, but for debugging it's fine.
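With that set, ldd should show the binary picking up my PIC build (the path here is just where I built it):

export LD_LIBRARY_PATH=/home/evan/Python-2.7.11
ldd ./Python-2.7.11/python | grep libpython

Let me show you, by comparison, what the GDB output looks like when using the shared libpython2.7.so: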

Dump of assembler code for function _PyObject_GC_Malloc:
   0x00007ffff7b267b0 <+0>:     movabs $0x7fffffffffffffdf,%rax
   0x00007ffff7b267ba <+10>:    push   %rbx
   0x00007ffff7b267bb <+11>:    cmp    %rax,%rdi
   0x00007ffff7b267be <+14>:    ja     0x7ffff7b26820 <_PyObject_GC_Malloc+112>
   0x00007ffff7b267c0 <+16>:    add    $0x20,%rdi
   0x00007ffff7b267c4 <+20>:    callq  0x7ffff7a35430 <PyObject_Malloc@plt>
=> 0x00007ffff7b267c9 <+25>:    test   %rax,%rax
   0x00007ffff7b267cc <+28>:    mov    %rax,%rbx
   0x00007ffff7b267cf <+31>:    je     0x7ffff7b26820 <_PyObject_GC_Malloc+112>
   0x00007ffff7b267d1 <+33>:    movq   $0xfffffffffffffffe,0x10(%rax)
   0x00007ffff7b267d9 <+41>:    mov    0x297da5(%rip),%eax        # 0x7ffff7dbe584 <generations+36>
   0x00007ffff7b267df <+47>:    mov    0x297d9b(%rip),%edx        # 0x7ffff7dbe580 <generations+32>
   0x00007ffff7b267e5 <+53>:    add    $0x1,%eax
   0x00007ffff7b267e8 <+56>:    cmp    %edx,%eax
   0x00007ffff7b267ea <+58>:    mov    %eax,0x297d94(%rip)        # 0x7ffff7dbe584 <generations+36>
   0x00007ffff7b267f0 <+64>:    jle    0x7ffff7b26810 <_PyObject_GC_Malloc+96>

You can see that here, instead of a direct call into the function body, it's calling through the PLT:

callq  0x7ffff7a35430 <PyObject_Malloc@plt>

This means that it's going to resolve the symbol through the PLT, letting the dynamic linker (and therefore LD_PRELOAD) take effect. And indeed it works. If you want to learn about the vagaries of how the PLT works, you can try running a dynamically linked binary (e.g. /bin/ls) on your local system and then disassembling any call site that goes through the PLT. In particular, I found this online tutorial helpful in getting me to the above point in my GDB analysis.
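A quick way to find such call sites without GDB is to grep the disassembly for PLT stubs, which objdump renders with an @plt suffix:

objdump -d /bin/ls | grep '@plt>'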

Tomorrow I'll have fun extending my DSO to be a bit more interesting by recording actual tracing information. I'll also try to benchmark this approach to see how much overhead the hash table adds.