I did some fun things tonight with
LD_PRELOAD and GDB that made me feel
a bit like a real hacker.
I’m trying to identify memory bloat in Python applications. If you’re familiar
with Python you know that there are already a bunch of heap profiling tools for
Python, and they all kind of suck. They are all implemented using the
gc module, which gives you three functions:
gc.get_objects() gives you a list of all objects tracked by the garbage collector
gc.get_referrers() lets you figure out what has a reference to an object
gc.get_referents() lets you figure out what references an object has
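As a quick illustration of those three calls, here is a minimal sketch. The names `payload` and `outer` are made up for the example, and the function wrapper is only there to keep it self-contained.

```python
import gc

def demo_gc_introspection():
    payload = ["some", "data"]   # hypothetical object we want to inspect
    outer = [payload]            # hypothetical container that refers to it

    # get_objects(): the container objects the collector is tracking.
    is_tracked = any(obj is outer for obj in gc.get_objects())

    # get_referents(): what `outer` points at directly.
    referents = gc.get_referents(outer)

    # get_referrers(): what points at `payload` (this includes `outer`).
    referrers = gc.get_referrers(payload)
    has_outer = any(ref is outer for ref in referrers)

    return is_tracked, referents, has_outer, payload

is_tracked, referents, has_outer, payload = demo_gc_introspection()
```

Note that the referrers list usually contains more than you expect (frames, temporary lists), which is part of why these tools are awkward to use.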
Pretty much all of the existing heap profiling tools do something where you
periodically inspect the heap and generate a count of all of the object types
and/or the references between the objects. Then you’re supposed to look at this
output and figure out “oh it looks like there are a lot of objects of one
type allocated” and then maybe look at the object references and try to figure out
why that’s happening.
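The "count the object types" step that these tools perform can be sketched in a few lines. This is my own minimal version, not the code of any particular profiler, and `type_histogram` is a name I made up.

```python
import gc
from collections import Counter

def type_histogram(limit=10):
    """Return the `limit` most common type names among GC-tracked objects."""
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return counts.most_common(limit)

histogram = type_histogram(5)
for type_name, count in histogram:
    print(type_name, count)
```

The output tells you what kinds of objects exist, but nothing about which line of your code created them.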
In a pinch this will work, but it’s really hard to use. A much better way to do
heap profiling is how the
Valgrind massif tool does it.
What massif will do is tell you, at any given point in time, where all of the
allocated memory was actually allocated. It will track the full stack for each
allocation. The difference here is huge: you can look at the massif output and
know “20% of my memory was allocated in
foo.cc:214” or “30% of the memory was allocated in
libleak.so” and then get an idea of where in your code to
actually start looking to understand the problem. The Python tools don’t give
this to you, because the Python tools will tell you that you have a lot of dicts
and lists and ints and whatnot, but won’t tell you where they actually came from
in your code, i.e. what lines of code you should be looking at.
Actually, some smart people knew about this a long time ago and came up with the
tracemalloc tool. If you are lucky
enough to be running Python 3.4 or later this feature is built into Python. If
you’re on Python 2.x or an earlier Python 3.x release you have to patch your
Python interpreter to use tracemalloc.
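For comparison, this is roughly what using the built-in tracemalloc module looks like on Python 3.4 or later. It is a minimal sketch: `allocate_some_lists` and `top_allocation_sites` are hypothetical helpers that exist only to have something to measure.

```python
import tracemalloc

def allocate_some_lists(n):
    # Hypothetical workload: n small lists that stay alive.
    return [[0] * 100 for _ in range(n)]

def top_allocation_sites(limit=3):
    """Trace a small workload and report the biggest allocation sites."""
    tracemalloc.start()
    try:
        keep_alive = allocate_some_lists(1000)
        snapshot = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    stats = snapshot.statistics("lineno")  # grouped by file and line
    sites = []
    for stat in stats[:limit]:
        frame = stat.traceback[0]
        sites.append(("%s:%d" % (frame.filename, frame.lineno), stat.size))
    return sites

sites = top_allocation_sites()
for location, size in sites:
    print(location, size)
```

Each entry names a file and line number, which is the massif-style answer the gc-based tools cannot give you.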
The way tracemalloc works is it extends the
struct PyObject type to have a
pointer to some allocation information. This is the natural way to implement
heap tracking. It’s also the most efficient way to do it. However, there’s a
serious drawback: because it changes the size of a
struct PyObject it will
break ABI compatibility with pre-compiled C libraries linking against Python
code. So if your code uses a library like
numpy you’ll need to
rebuild those libraries from source.
I had another idea. My idea was to use
LD_PRELOAD to load in patched versions
of three PyObject_* functions, PyObject_Malloc among them.
My versions would then look up the original implementations in
libpython2.7.so and call into them. My versions can then track the
information I care about (size and some basic Python stack information) for
pointers allocated/freed by these functions. The information is stored in a hash
table keyed on
void * and mapping to a pointer to my custom allocation struct.
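The real bookkeeping lives in C inside the preload DSO, but the idea is easy to model in Python. The sketch below is only a model under my own assumptions: a dict stands in for the hash table, integers stand in for the void * keys, and the `on_malloc`, `on_realloc` and `on_free` hook names are invented for illustration.

```python
class AllocationTable:
    """Model of a pointer-keyed table of live allocations."""

    def __init__(self):
        self._live = {}  # address -> (size, where it was allocated)

    def on_malloc(self, address, size, where):
        self._live[address] = (size, where)

    def on_realloc(self, old_address, new_address, size, where):
        # The old pointer is no longer valid once the block has moved.
        self._live.pop(old_address, None)
        self._live[new_address] = (size, where)

    def on_free(self, address):
        self._live.pop(address, None)

    def bytes_by_site(self):
        """Sum the sizes of live allocations per allocation site."""
        totals = {}
        for size, where in self._live.values():
            totals[where] = totals.get(where, 0) + size
        return totals

table = AllocationTable()
table.on_malloc(0x1000, 64, "app.py:10")
table.on_malloc(0x2000, 32, "app.py:10")
table.on_malloc(0x3000, 128, "lib.py:7")
table.on_free(0x2000)
totals = table.bytes_by_site()
```

Every allocation and free costs a table lookup here, which is the overhead the next paragraph is talking about.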
This is quite a bit less efficient than the tracemalloc approach because it has
to use a hash table rather than structural embedding, so it uses more memory and
uses more CPU time. However, it’s nice because you don’t need to patch Python
(scary), you don’t change the Python ABI, and if you don’t preload the
patched versions there is literally zero runtime overhead (since you’re running
an unpatched Python).
I got this all working pretty easily on my Fedora 23 workstation using GLib and
GHashTable to implement the hash table data structure. I ran into problems when
running it on Ubuntu 12.04 (a.k.a. “Precise”).
The first problem I had is that
/usr/bin/python2.7 on Fedora is actually a
tiny (~8 KB) binary that links against
libpython2.7.so. It does not link against
libpython2.7.so on Ubuntu; instead, Ubuntu statically compiles
the Python core into the
python binary. It turns out that this behavior is
toggled in the Python build chain in the configure script. If you want the
dynamic linking behavior (as I do) you want to run configure like:
./configure --enable-shared
You can then confirm that the binary you build links against
libpython2.7.so by examining its dynamic dependencies, for example in the output of ldd.
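You can also ask a running interpreter how it was configured. This is a small sketch using the standard sysconfig module; the helper name is mine, and it assumes a build that records `Py_ENABLE_SHARED` (POSIX builds do).

```python
import sysconfig

def interpreter_uses_shared_libpython():
    """True if this interpreter was configured with --enable-shared."""
    # The config variable is 1 for shared builds, 0 (or missing) otherwise.
    return bool(sysconfig.get_config_var("Py_ENABLE_SHARED"))

uses_shared = interpreter_uses_shared_libpython()
print("links against a shared libpython:", uses_shared)
```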
Once I got this working, I ran into another problem. I was able to build my
preload DSO fine on Ubuntu, but when I actually tried to load it with
LD_PRELOAD I’d get errors about unresolved symbols related to GLib. ELF
binaries/libraries have
DT_NEEDED entries in their dynamic section which list what libraries they
depend on. This was working correctly on Fedora, but not on Ubuntu. From my
reading of the man pages on the system I thought I needed
-Wl,--copy-dt-needed-entries. It turns out that I actually needed
-Wl,--no-as-needed. The names of these options and the default Debian/Ubuntu
behavior have changed multiple times over the last few years, so it’s kind of
unclear exactly which options are needed for which
ld and Debian/Ubuntu release. I just ended up using both in the Makefile.
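A quick way to check whether the DT_NEEDED entries made it into the DSO is to look at the output of `readelf -d`. Here is a small sketch that pulls the library names out of that output. The sample text is made up to match the usual readelf layout; it is not captured from my actual build.

```python
import re

# Matches lines such as:
#  0x0000000000000001 (NEEDED)   Shared library: [libc.so.6]
_NEEDED = re.compile(r"\(NEEDED\)\s+Shared library:\s+\[([^\]]+)\]")

def needed_libraries(readelf_dynamic_output):
    """Return the DT_NEEDED library names found in `readelf -d` output."""
    return _NEEDED.findall(readelf_dynamic_output)

# Hypothetical sample, for illustration only.
SAMPLE = """\
 0x0000000000000001 (NEEDED)             Shared library: [libglib-2.0.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000e (SONAME)             Library soname: [libpytraceobject.so]
"""

needed = needed_libraries(SAMPLE)
print(needed)
```

If the GLib entry is missing from that list, the dynamic linker never loads GLib and you get the unresolved symbol errors.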
I got this working and saw that my preload DSO could start up, initialize
itself, initialize the GLib hash table, and all that. Great. I could not,
however, get the
python binary that I compiled to actually use the
PyObject_* symbols exported by my DSO.
I eventually solved the problem using some GDB hacks. I’ll explain:
evan@azrael ~ $ export LD_PRELOAD=pytraceobject/libpytraceobject.so
evan@azrael ~ $ gdb ./Python-2.7.11/python
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
Reading symbols from /home/evan/Python-2.7.11/python...done.
Here I’ve exported my DSO and started GDB with the python binary I compiled.
Great. Now what I want to do is have it run until main starts so I can see what
symbols are loaded:
(gdb) break main
Breakpoint 1 at 0x400630: file ./Modules/python.c, line 23.
Starting program: /home/evan/Python-2.7.11/python
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffa000
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, main (argc=1, argv=0x7fffffffea98) at ./Modules/python.c:23
23 return Py_Main(argc, argv);
Here you can see we’ve entered
main(). Let’s look up PyObject_Malloc:
(gdb) info functions PyObject_Malloc
All functions matching regular expression "PyObject_Malloc":
There are two symbols, which is what we expect. I inspected the process with
pmap and confirmed that the address for the first one listed is from the DSO
I created, and the second one is from libpython2.7.so.
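The pmap check boils down to "which mapping contains this address". The same lookup can be sketched against text in the /proc/PID/maps format, which holds the same information pmap shows. The sample mappings below are invented ranges chosen to contain the two addresses from this session; they are not my real pmap output.

```python
def mapping_for_address(maps_text, address):
    """Return the path of the mapping that contains `address`, if any."""
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # not a mapping line
        start_text, end_text = fields[0].split("-")
        start, end = int(start_text, 16), int(end_text, 16)
        if start <= address < end:
            return fields[5] if len(fields) > 5 else "[anonymous]"
    return None

# Hypothetical mappings, for illustration only.
SAMPLE_MAPS = """\
7ffff7700000-7ffff7a00000 r-xp 00000000 08:01 1001 /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
7ffff7bd7000-7ffff7bd9000 r-xp 00000000 08:01 1002 /home/evan/pytraceobject/libpytraceobject.so
"""

first = mapping_for_address(SAMPLE_MAPS, 0x7ffff7bd7daa)
second = mapping_for_address(SAMPLE_MAPS, 0x7ffff7809cd0)
print(first)
print(second)
```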
(gdb) info addr PyObject_Malloc
Symbol "PyObject_Malloc" is at 0x00007ffff7bd7daa in a file compiled without debugging.
Here GDB is saying that it thinks
PyObject_Malloc resolves to the one in my
DSO. OK, let’s set a breakpoint and see what actually happens:
(gdb) break PyObject_Malloc
Breakpoint 2 at 0x7ffff7bd7dae (2 locations)
Breakpoint 2, 0x00007ffff7809cd0 in PyObject_Malloc () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
We can see here that the version it actually stopped at is the one loaded by
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0. You can see exactly why this
is by disassembling the call site:
#1 0x00007ffff77ea4c9 in _PyObject_GC_Malloc () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
(gdb) disas 0x00007ffff77ea4c9
Dump of assembler code for function _PyObject_GC_Malloc:
0x00007ffff77ea4b0 <+0>: movabs $0x7fffffffffffffdf,%rax
0x00007ffff77ea4ba <+10>: push %rbx
0x00007ffff77ea4bb <+11>: cmp %rax,%rdi
0x00007ffff77ea4be <+14>: ja 0x7ffff77ea520 <_PyObject_GC_Malloc+112>
0x00007ffff77ea4c0 <+16>: add $0x20,%rdi
0x00007ffff77ea4c4 <+20>: callq 0x7ffff7809cd0 <PyObject_Malloc>
=> 0x00007ffff77ea4c9 <+25>: test %rax,%rax
0x00007ffff77ea4cc <+28>: mov %rax,%rbx
0x00007ffff77ea4cf <+31>: je 0x7ffff77ea520 <_PyObject_GC_Malloc+112>
0x00007ffff77ea4d1 <+33>: movq $0xfffffffffffffffe,0x10(%rax)
0x00007ffff77ea4d9 <+41>: mov 0x3763e5(%rip),%eax # 0x7ffff7b608c4
0x00007ffff77ea4df <+47>: mov 0x3763db(%rip),%edx # 0x7ffff7b608c0
0x00007ffff77ea4e5 <+53>: add $0x1,%eax
0x00007ffff77ea4e8 <+56>: cmp %edx,%eax
0x00007ffff77ea4ea <+58>: mov %eax,0x3763d4(%rip) # 0x7ffff7b608c4
0x00007ffff77ea4f0 <+64>: jle 0x7ffff77ea510 <_PyObject_GC_Malloc+96>
The important thing here is that it’s doing a direct call to a fixed address. It’s not going
through the PLT.
In hindsight, the problem was pretty obvious. Since the system
libpython2.7.so
wasn’t compiled to be a shared library it won’t use the PLT to do symbol
lookups. That means that if you call into it, all of the Python symbols will be
resolved within that DSO as if they are static. While I was debugging this I was
aware from the beginning that my custom binary was using the system
libpython2.7.so and not the one I had compiled. However, due to my relatively
weak knowledge about how dynamic linking actually works I had assumed it would
not be an issue since I merely wanted a binary that would invoke Python using
libpython2.7.so so I could override symbols via
LD_PRELOAD.
In fact, the system version is itself a relocatable shared object. The problem
is that it’s compiled with
-shared but not
-fpic. This means that all of the
symbol relocations internally are resolved at load time. When compiling with
-fpic (or -fPIC) the code will be fully position independent, meaning that
all function calls will be made via the PLT. Eli Bendersky has two great
articles on this topic, covering the difference between
load-time relocation of shared libraries
and
position independent code in shared libraries.
When building Python with
./configure --enable-shared you’re actually
instructing Python to generate fully position independent code.
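To make the difference concrete, here is a toy model of the two cases. It is only a model: real resolution is done by the dynamic linker, and the load order and export tables below are simplified assumptions based on the setup described in this post.

```python
# Preloaded DSO comes first in the lookup order (simplified assumption).
LOAD_ORDER = ["libpytraceobject.so", "libpython2.7.so"]
EXPORTS = {
    "libpytraceobject.so": {"PyObject_Malloc"},
    "libpython2.7.so": {"PyObject_Malloc", "_PyObject_GC_Malloc"},
}

def call_target(symbol, caller, goes_through_plt):
    """Name the library whose definition of `symbol` the caller lands in."""
    if not goes_through_plt:
        # Direct call: the target was fixed inside the caller's own library.
        return caller
    # PLT call: the first library in load order exporting the symbol wins.
    for library in LOAD_ORDER:
        if symbol in EXPORTS[library]:
            return library
    return None

system_build = call_target("PyObject_Malloc", "libpython2.7.so", False)
custom_build = call_target("PyObject_Malloc", "libpython2.7.so", True)
print(system_build)
print(custom_build)
```

With the direct call the preloaded definition is never consulted, which matches what the breakpoint showed earlier.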
This is easily fixable in my case by using
LD_LIBRARY_PATH and setting it so
that the custom
python binary I built will prefer the
libpython2.7.so that I
built, which is compiled to be a PIC shared library. In a real production
deployment we would set up the dynamic linker to not need this, but for
debugging it’s fine. Let me show you, by comparison, what the GDB output looks
like when using the shared
libpython2.7.so that I built:
Dump of assembler code for function _PyObject_GC_Malloc:
0x00007ffff7b267b0 <+0>: movabs $0x7fffffffffffffdf,%rax
0x00007ffff7b267ba <+10>: push %rbx
0x00007ffff7b267bb <+11>: cmp %rax,%rdi
0x00007ffff7b267be <+14>: ja 0x7ffff7b26820 <_PyObject_GC_Malloc+112>
0x00007ffff7b267c0 <+16>: add $0x20,%rdi
0x00007ffff7b267c4 <+20>: callq 0x7ffff7a35430 <PyObject_Malloc@plt>
=> 0x00007ffff7b267c9 <+25>: test %rax,%rax
0x00007ffff7b267cc <+28>: mov %rax,%rbx
0x00007ffff7b267cf <+31>: je 0x7ffff7b26820 <_PyObject_GC_Malloc+112>
0x00007ffff7b267d1 <+33>: movq $0xfffffffffffffffe,0x10(%rax)
0x00007ffff7b267d9 <+41>: mov 0x297da5(%rip),%eax # 0x7ffff7dbe584 <generations+36>
0x00007ffff7b267df <+47>: mov 0x297d9b(%rip),%edx # 0x7ffff7dbe580 <generations+32>
0x00007ffff7b267e5 <+53>: add $0x1,%eax
0x00007ffff7b267e8 <+56>: cmp %edx,%eax
0x00007ffff7b267ea <+58>: mov %eax,0x297d94(%rip) # 0x7ffff7dbe584 <generations+36>
0x00007ffff7b267f0 <+64>: jle 0x7ffff7b26810 <_PyObject_GC_Malloc+96>
You can see here that instead of a direct call it’s doing:
callq 0x7ffff7a35430 <PyObject_Malloc@plt>
This means that it’s going to resolve the symbol using the dynamic linker table,
i.e. via the PLT. And indeed it works. If you want to learn about the vagaries
of how the PLT works, you can try running a dynamically linked binary (e.g.
/bin/ls) on your local system and then disassembling any call site that goes
through the PLT. In particular, I found
this online tutorial
helpful in getting me to the above point in my GDB analysis.
Tomorrow I’ll have fun extending my DSO to be a bit more interesting by
recording actual tracing information. I’ll also try to benchmark this approach
to see how impactful the hash table is.