If you've been following my blog/twitter, you may have picked up on the fact that I'm doing a bunch of disassembly work, particularly regarding the Python interpreter. I found something really interesting while doing this work.
Debian Jessie (as of this writing, the latest stable Debian release) ships with
Python 2.7.9 compiled by GCC 4.9.2. Here's the start of the disassembly for PyObject_Malloc():
Dump of assembler code for function PyObject_Malloc:
0x0000000000499750 <+0>: test %rdi,%rdi
0x0000000000499753 <+3>: js 0x4182be
This might look complicated if you don't know x86, but it's actually really simple. PyObject_Malloc() accepts a single argument: the number of bytes to allocate. These two instructions check whether the argument passed in (via the %rdi register) is negative, and if so the code jumps to 0x4182be.
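The sign test can be reproduced in a few lines of C. This is a sketch of the idea, assuming a 64-bit size_t as on amd64; looks_negative is my own name for illustration, not a CPython function:

```c
#include <stdint.h>
#include <stddef.h>

/* Returns nonzero if `nbytes` would look negative to `test %rdi,%rdi; js`,
   i.e. if its most significant bit is set. */
int looks_negative(size_t nbytes)
{
    /* Reinterpreting the same bits as signed makes the top bit the sign
       bit; the CPU's `js` instruction branches on exactly that bit, which
       `test %rdi,%rdi` copies into the sign flag. */
    return (int64_t)nbytes < 0;
}
```

So no explicit comparison against a limit is needed; checking the sign flag of the unmodified argument is enough.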
The actual C code in Python that implements this is pretty obfuscated, so I won't list it here (if you're really curious: look in pymem.h), but conceptually the prototype for PyObject_Malloc() is:
void* PyObject_Malloc(size_t nbytes);
The code checks whether the argument you passed to PyObject_Malloc() looks like a negative value (technically, whether your nbytes argument exceeds (size_t)PY_SSIZE_T_MAX). If that's the case, PyObject_Malloc() will return NULL.
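Conceptually, the guard looks something like the following. This is my own reconstruction, not the real CPython code (which lives in the pool allocator); plain malloc() stands in for the real allocation logic:

```c
#include <stddef.h>
#include <stdlib.h>

/* Largest positive value a signed size can hold: all bits set, shifted
   right once to clear the top bit.  Stands in for PY_SSIZE_T_MAX. */
#define SSIZE_MAX_SKETCH ((size_t)-1 >> 1)

void *object_malloc_sketch(size_t nbytes)
{
    /* Requests above SSIZE_MAX_SKETCH have the top bit set, so they look
       negative when the same bits are read as signed -- which is why the
       compiler can implement this as `test %rdi,%rdi; js <fail>`. */
    if (nbytes > SSIZE_MAX_SKETCH)
        return NULL;
    /* ...the real allocator dispatches to its pools here... */
    return malloc(nbytes);
}
```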
Ok, makes sense. These two lines of assembler are checking if the value is negative, and then jumping to 0x4182be. Let's look at what's going on at 0x4182be:
(gdb) disas 0x4182be,0x4182be+3
Dump of assembler code from 0x4182be to 0x4182c1:
0x00000000004182be: xor %eax,%eax
0x00000000004182c0: retq
Again this is pretty simple. The first instruction sets %eax to zero, and the second instruction returns from the function. This is effectively the code that implements the C statement:
return NULL;
But wait. What's going on here? The code makes sense, but why did we jump to 0x4182be? That's not even part of PyObject_Malloc()! I looked at the ELF sections, and according to both objdump and nm this is some weird code that doesn't actually belong to any real function. The objdump utility lists the code at 0x4182be as belonging to the very first bytes in the .text area and attributes it to <PyDescr_NewMethod-0x20e0>, which means this isn't actually code in PyDescr_NewMethod(); it's code that comes before it, and doesn't belong to any function at all. In other words, GCC has generated a prologue of commonly used instructions that it can jump to.
Essentially, GCC decided that there's a bunch of this return NULL business going on, and instead of generating the code over and over again, it generates it just once and jumps to it. Which kind of makes sense: it saves space, which is good. But it makes stack traces hard to understand and definitely makes the disassembly much harder to follow.
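To see the kind of duplication GCC is eliminating, consider two unrelated functions that share an identical failure path. This is a made-up example, not CPython code:

```c
#include <stddef.h>
#include <stdlib.h>

/* Each `return NULL` compiles on x86-64 to the pair `xor %eax,%eax; retq`.
   An optimizer doing cross-function code sharing can emit that pair once
   and branch to it from both functions instead of duplicating it. */
void *alloc_small(size_t n)
{
    if (n > 256)
        return NULL;          /* candidate tail for sharing */
    return malloc(n);
}

void *alloc_page(size_t n)
{
    if (n > 4096)
        return NULL;          /* identical tail to the one above */
    return malloc(n);
}
```

Whether your toolchain actually merges the tails can be checked by disassembling the binary (e.g. with objdump -d) and looking for a branch that lands outside either function, exactly like the jump to 0x4182be above.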
I recompiled Python 2.7.9 on a new machine that has GCC 5.3.1, and it generates code that looks a lot more like what you'd expect:
Dump of assembler code for function PyObject_Malloc:
0x0000000000460f70 <+0>: test %rdi,%rdi
0x0000000000460f73 <+3>: js 0x4610d8 <PyObject_Malloc+360>
So now it's jumping to code within PyObject_Malloc(). The code there is:
0x00000000004610d8 <+360>: xor %eax,%eax
0x00000000004610da <+362>: retq
Just as we expect.
At first I was convinced that this was some weird thing that older GCC versions were doing. But, it turns out, that's not the case! When I compiled Python 2.7.9 from source on Jessie using GCC 4.9.2 I got essentially the same code as I did when compiling with GCC 5.3.1, i.e. without the weird jump to the .text prologue.
I also rebuilt the code with gcc -O3 on Jessie and once again did not get the weird jump to the .text prologue, so it's not a matter of what optimization level you use.
I looked at the Debian patches, and I believe the compiler option causing this optimization is -flto, which enables link-time optimization. However, when I try to compile Python myself with CFLAGS=-flto I get errors like this when Python invokes the ar command. I still have problems when exporting AR=gcc-ar, which is the recommended workaround. I will update this post if I figure out how to build Python with -flto enabled, to verify that this is the option which generates this code.