I did some fun things tonight with LD_PRELOAD and GDB that made me feel a bit like a real hacker.
Some Context
I'm trying to identify memory bloat in Python applications. If you're familiar with Python you know that there are already a bunch of heap profiling tools for Python, and they all kind of suck. They are all implemented using the gc module, which gives you three useful functions:
gc.get_objects() gives you a list of all live objects in the heap
gc.get_referrers() lets you figure out what has a reference to an object
gc.get_referents() lets you figure out what references an object has
Pretty much all of the existing heap profiling tools work by periodically
inspecting the heap and generating a count of all of the object types and/or
the references between the objects. Then you're supposed to look at this
output, figure out "oh, it looks like there are a lot of SomeClass objects
allocated," and then maybe look at the object references and try to figure out
why that's happening.
In a pinch this will work, but it's really hard to use. A much better approach
to heap profiling is the one taken by the Valgrind massif tool.
What massif will do is tell you, at any given point in time, where all of the
allocated memory was actually allocated; it tracks the full stack for each
allocation. The difference here is huge: you can look at the massif output and
know "20% of my memory was allocated in foo.cc:214" or "30% of the memory was
allocated by libleak.so" and then get an idea of where in your code to
actually start looking to understand the problem. The Python tools don't give
you this: they will tell you that you have a lot of dicts and lists and ints
and whatnot, but won't tell you where those objects actually came from in your
code, i.e. what lines of code you should be looking at.
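For reference, running a program under massif looks roughly like this (the
script name is a placeholder; massif writes its snapshots to a massif.out.<pid>
file, which ms_print turns into a readable report):
valgrind --tool=massif python myscript.py
ms_print massif.out.<pid>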
Actually, some smart people knew about this a long time ago and came up with the tracemalloc tool. If you are lucky enough to be running Python 3.4 or later, this feature is built into Python. If you're on Python 2.x or an earlier Python 3.x release, you have to patch your Python interpreter to use tracemalloc.
The way tracemalloc works is that it extends the struct PyObject type to have a
pointer to some allocation information. This is the natural way to implement
heap tracking, and it's also the most efficient way to do it. However, there's
a serious drawback: because it changes the size of a struct PyObject, it breaks
ABI compatibility with pre-compiled C libraries linked against Python. So if
your code uses a library like mysqldb or numpy, you'll need to rebuild those
libraries from source.
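To make the ABI problem concrete, here is a rough sketch of the kind of change
involved. This is an illustration, not the actual tracemalloc patch; the field
names are simplified stand-ins for CPython's real object header:
#include <sys/types.h>

struct alloc_trace;                 /* per-allocation info tracked by the profiler */

typedef struct object_header {
    ssize_t ob_refcnt;              /* stand-in for PyObject's reference count */
    void *ob_type;                  /* stand-in for PyObject's type pointer */
    struct alloc_trace *ob_trace;   /* the added pointer: sizeof() grows, so C
                                       extensions compiled against the original
                                       header now disagree about object layout */
} object_header;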
I had another idea: use LD_PRELOAD to load in patched versions of the following
three functions:
PyObject_Malloc()
PyObject_Realloc()
PyObject_Free()
Then what I would do is dlopen() libpython2.7.so and have my versions call into
the implementations defined there. My versions can then track the information I
care about (size and some basic Python stack information) for pointers
allocated/freed by these functions. The information is stored in a hash table
keyed on void * and mapping to a pointer to my custom allocation struct.
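Here's a rough sketch of the idea; error handling, PyObject_Realloc(), locking,
and the actual Python stack capture are omitted:
#include <dlfcn.h>
#include <stddef.h>
#include <glib.h>

struct alloc_info {
    size_t size;
    /* Python stack information would also live here. */
};

static void *(*real_malloc)(size_t);
static void (*real_free)(void *);
static GHashTable *allocs;  /* void * -> struct alloc_info * */

__attribute__((constructor))
static void init(void)
{
    /* Grab the real implementations out of libpython2.7.so. */
    void *handle = dlopen("libpython2.7.so", RTLD_LAZY);
    real_malloc = (void *(*)(size_t))dlsym(handle, "PyObject_Malloc");
    real_free = (void (*)(void *))dlsym(handle, "PyObject_Free");
    allocs = g_hash_table_new_full(g_direct_hash, g_direct_equal, NULL, g_free);
}

void *PyObject_Malloc(size_t size)
{
    void *ptr = real_malloc(size);
    if (ptr != NULL) {
        struct alloc_info *info = g_new0(struct alloc_info, 1);
        info->size = size;
        g_hash_table_insert(allocs, ptr, info);  /* track the live allocation */
    }
    return ptr;
}

void PyObject_Free(void *ptr)
{
    g_hash_table_remove(allocs, ptr);  /* the g_free destructor frees the info */
    real_free(ptr);
}
The g_direct_hash/g_direct_equal pair hashes the raw pointer values, which is
exactly what you want when the key is a void *, and the g_free destructor means
evicting an entry frees its alloc_info automatically.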
This is quite a bit less efficient than the tracemalloc approach because it has
to use a hash table rather than structural embedding, so it uses more memory
and more CPU time. However, it's nice because you don't need to patch Python
(scary), you don't change the Python ABI, and if you don't LD_PRELOAD the
patched versions there is literally zero runtime overhead (since you're running
an unpatched Python).
The Problem
I got this all working pretty easily on my Fedora 23 workstation, using GLib's GHashTable to implement the hash table data structure. I ran into problems when running it on Ubuntu 12.04 (a.k.a. "Precise").
The first problem I had is that /usr/bin/python2.7 on Fedora is actually a tiny
(~8KB) binary that links against libpython2.7.so. On Ubuntu it does not link
against libpython2.7.so; instead, Ubuntu statically compiles the Python core
into the python binary. It turns out that this behavior is toggled in the
Python build chain by the configure script. If you want the dynamic linking
behavior (as I do), you run configure like:
./configure --enable-shared
You can then confirm that the binary you build links against libpython2.7.so by examining the output of ldd.
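For example, from the build directory:
ldd ./python | grep libpython
With --enable-shared you should see something like a "libpython2.7.so.1.0 => ..."
line; against Ubuntu's stock binary the same grep comes up empty.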
Once I got this working, I ran into another problem. I was able to build my
preload DSO fine on Ubuntu, but when I actually tried to load it with
LD_PRELOAD I'd get errors about unresolved symbols related to GLib. ELF
binaries/libraries have DT_NEEDED entries which list the libraries they depend
on. These were being generated correctly on Fedora, but not on Ubuntu. From my
reading of the man pages on that system I thought I needed
-Wl,--copy-dt-needed-entries. It turns out that I actually needed
-Wl,--no-as-needed. The names of these options and the default Debian/Ubuntu
behavior have changed multiple times over the last few years, so it's kind of
unclear exactly which options are needed for which ld and Debian/Ubuntu
release; I just ended up using both in the Makefile.
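A link line using both flags looks something like this (the source file name
here is illustrative; -ldl is for the dlopen() call):
gcc -shared -fPIC -Wl,--no-as-needed -Wl,--copy-dt-needed-entries \
    -o libpytraceobject.so pytraceobject.c \
    $(pkg-config --cflags --libs glib-2.0) -ldl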
I got this working and saw that my preload DSO could start up, initialize
itself, initialize the GLib hash table, and all that. Great. I could not,
however, get the python binary that I compiled to actually use the PyObject_*
symbols exported by my DSO.
I eventually solved the problem using some GDB hacks. I'll explain:
evan@azrael ~ $ export LD_PRELOAD=pytraceobject/libpytraceobject.so
evan@azrael ~ $ gdb ./Python-2.7.11/python
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/evan/Python-2.7.11/python...done.
Here I've exported my DSO and started GDB with the python binary I compiled. Great. Now what I want to do is have it run until main starts so I can see what symbols are loaded:
(gdb) break main
Breakpoint 1 at 0x400630: file ./Modules/python.c, line 23.
(gdb) run
Starting program: /home/evan/Python-2.7.11/python
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffa000
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, main (argc=1, argv=0x7fffffffea98) at ./Modules/python.c:23
23 return Py_Main(argc, argv);
Here you can see we've entered main(). Let's look up PyObject_Malloc:
(gdb) info functions PyObject_Malloc
All functions matching regular expression "PyObject_Malloc":
Non-debugging symbols:
0x00007ffff7bd7daa PyObject_Malloc
0x00007ffff7809cd0 PyObject_Malloc
There are two symbols, which is what we expect. I inspected the process with
pmap and confirmed that the address for the first one listed is from the DSO I
created, and the second one is from libpython2.7.so.
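From another terminal that check is just (substituting the debuggee's PID):
pmap <pid> | grep -e libpytraceobject -e libpython
and you can compare the symbol addresses GDB printed against the mapped address
ranges.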
(gdb) info addr PyObject_Malloc
Symbol "PyObject_Malloc" is at 0x00007ffff7bd7daa in a file compiled without debugging.
Here GDB is saying that it thinks PyObject_Malloc resolves to the one in my
DSO. OK, let's set a breakpoint and see what actually happens:
(gdb) break PyObject_Malloc
Breakpoint 2 at 0x7ffff7bd7dae (2 locations)
(gdb) c
Continuing.
Breakpoint 2, 0x00007ffff7809cd0 in PyObject_Malloc () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
We can see here that the version it actually stopped at is the one in
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0. You can see exactly why this is
by disassembling the call site:
(gdb) up
#1 0x00007ffff77ea4c9 in _PyObject_GC_Malloc () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
(gdb) disas 0x00007ffff77ea4c9
Dump of assembler code for function _PyObject_GC_Malloc:
0x00007ffff77ea4b0 <+0>: movabs $0x7fffffffffffffdf,%rax
0x00007ffff77ea4ba <+10>: push %rbx
0x00007ffff77ea4bb <+11>: cmp %rax,%rdi
0x00007ffff77ea4be <+14>: ja 0x7ffff77ea520 <_PyObject_GC_Malloc+112>
0x00007ffff77ea4c0 <+16>: add $0x20,%rdi
0x00007ffff77ea4c4 <+20>: callq 0x7ffff7809cd0 <PyObject_Malloc>
=> 0x00007ffff77ea4c9 <+25>: test %rax,%rax
0x00007ffff77ea4cc <+28>: mov %rax,%rbx
0x00007ffff77ea4cf <+31>: je 0x7ffff77ea520 <_PyObject_GC_Malloc+112>
0x00007ffff77ea4d1 <+33>: movq $0xfffffffffffffffe,0x10(%rax)
0x00007ffff77ea4d9 <+41>: mov 0x3763e5(%rip),%eax # 0x7ffff7b608c4
0x00007ffff77ea4df <+47>: mov 0x3763db(%rip),%edx # 0x7ffff7b608c0
0x00007ffff77ea4e5 <+53>: add $0x1,%eax
0x00007ffff77ea4e8 <+56>: cmp %edx,%eax
0x00007ffff77ea4ea <+58>: mov %eax,0x3763d4(%rip) # 0x7ffff7b608c4
0x00007ffff77ea4f0 <+64>: jle 0x7ffff77ea510 <_PyObject_GC_Malloc+96>
The important thing here is that it's doing a direct call to a fixed address; it's not going through the PLT.
In hindsight, the problem was pretty obvious. Since the system libpython2.7.so
wasn't compiled to be a proper position-independent shared library, it won't
use the PLT to do symbol lookups. That means that if you call into it, all of
the Python symbols are resolved within that DSO as if they were static. While I
was debugging this I was aware from the beginning that my custom binary was
using the system libpython2.7.so and not the one I had compiled. However, due
to my relatively weak knowledge of how dynamic linking actually works, I had
assumed it would not be an issue, since I merely wanted a binary that would
invoke Python via libpython2.7.so so I could override symbols with LD_PRELOAD.
In fact, the system version is itself a relocatable shared object. The problem
is that it's compiled with -shared but not -fpic. This means that all of the
internal symbol relocations are resolved at load time. When compiling with
-fpic (or -fPIC) the code is fully position independent, meaning that all
function calls go via the PLT. Eli Bendersky has two great articles on this
topic, covering the difference between load-time relocation of shared libraries
and position independent code in shared libraries. When building Python with
./configure --enable-shared you're actually instructing Python to generate
fully position independent code.
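One way to see the difference for yourself is to disassemble both libraries and
grep for call sites that target the PLT stub (the second path assumes the built
library sits at the top of the build tree):
objdump -d /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0 | grep 'PyObject_Malloc@plt'
objdump -d ~/Python-2.7.11/libpython2.7.so | grep 'PyObject_Malloc@plt'
The system library should turn up nothing, while the --enable-shared build
should show a call through PyObject_Malloc@plt at every internal call site.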
This is easily fixable in my case by setting LD_LIBRARY_PATH so that the custom
python binary I built prefers the libpython2.7.so I built, which is compiled as
a PIC shared library. In a real production deployment we would configure the
dynamic linker so this isn't needed, but for debugging it's fine.
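Concretely, assuming the shared library ended up at the top of the build tree:
export LD_LIBRARY_PATH=/home/evan/Python-2.7.11
After that, ldd ./Python-2.7.11/python should point at the freshly built
libpython2.7.so.1.0 rather than the system copy.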
Let me show you, by comparison, what the GDB output looks like when using the
PIC libpython2.7.so:
Dump of assembler code for function _PyObject_GC_Malloc:
0x00007ffff7b267b0 <+0>: movabs $0x7fffffffffffffdf,%rax
0x00007ffff7b267ba <+10>: push %rbx
0x00007ffff7b267bb <+11>: cmp %rax,%rdi
0x00007ffff7b267be <+14>: ja 0x7ffff7b26820 <_PyObject_GC_Malloc+112>
0x00007ffff7b267c0 <+16>: add $0x20,%rdi
0x00007ffff7b267c4 <+20>: callq 0x7ffff7a35430 <PyObject_Malloc@plt>
=> 0x00007ffff7b267c9 <+25>: test %rax,%rax
0x00007ffff7b267cc <+28>: mov %rax,%rbx
0x00007ffff7b267cf <+31>: je 0x7ffff7b26820 <_PyObject_GC_Malloc+112>
0x00007ffff7b267d1 <+33>: movq $0xfffffffffffffffe,0x10(%rax)
0x00007ffff7b267d9 <+41>: mov 0x297da5(%rip),%eax # 0x7ffff7dbe584 <generations+36>
0x00007ffff7b267df <+47>: mov 0x297d9b(%rip),%edx # 0x7ffff7dbe580 <generations+32>
0x00007ffff7b267e5 <+53>: add $0x1,%eax
0x00007ffff7b267e8 <+56>: cmp %edx,%eax
0x00007ffff7b267ea <+58>: mov %eax,0x297d94(%rip) # 0x7ffff7dbe584 <generations+36>
0x00007ffff7b267f0 <+64>: jle 0x7ffff7b26810 <_PyObject_GC_Malloc+96>
You can see that here, instead of a direct call to a fixed address, it's doing:
callq 0x7ffff7a35430 <PyObject_Malloc@plt>
This means that it's going to resolve the symbol through the dynamic linker,
i.e. via the PLT. And indeed, it works. If you want to learn about the vagaries
of how the PLT works, you can try running a dynamically linked binary (e.g.
/bin/ls) on your local system and then disassembling any call site that goes
through the PLT. In particular, I found this online tutorial helpful in getting
me to the above point in my GDB analysis.
Tomorrow I'll have fun extending my DSO to be a bit more interesting by recording actual tracing information. I'll also try to benchmark this approach to see how much overhead the hash table adds.