So I've been doing a fair amount of x86-64 programming lately. As a result, I've learned a lot about the 32-bit x86 architecture, as well as the changes that were made for 64-bit machines.
Even if you don't know anything about x86 assembly programming, you may have heard from others that compared to other architectures it is a huge mess. There are a couple of reasons for this.
First of all, the language has been kept mostly backwards compatible. Intel started designing processors in the 70s (their first commercially available microprocessor, the Intel 4004, was released in 1971). So there are a lot of carry-overs from that time period.
Today we think of Intel as a company whose primary business is making CPUs. But that's not how they started. In fact, the initial business Intel was in was making memory (SRAM). At the time it was incredibly expensive to manufacture RAM, and many systems were more constrained by RAM than by any other resource. If you have a slow CPU your program may take a long time to run, but it will run. On the other hand, if you don't have enough memory, you may not be able to run your program at all. This was doubly true back in the day, when virtual memory didn't exist, at least not for microcomputers. For more than the first ten years of its existence, Intel's primary business was making RAM.
By the 1980s things had changed and other companies had figured out how to make RAM at competitive prices. Intel shifted its business model towards making and selling CPUs.
However, because of their early history as a company making memory chips, and because of the desire to attempt to keep a backwards compatible machine language, the design of x86 is hyper-focused on instruction size. In pretty much every conceivable area, if there's a way to encode some instructions more compactly, that's how it's done.
I'll demonstrate this by way of the venerable `CALL` instruction, which is used to call another function. The `CALL` instruction needs an operand: the address to call.
Now, imagine it's the 1970s or 1980s, and you're extremely memory constrained. So you want to make these `CALL` instructions really compact. Since you're so memory constrained, there's a good chance that the function you're calling is going to be nearby in memory. For instance, if you're really lucky, and if the compiler laid things out for you correctly (or more likely: you did, in your hand-written assembly), the function you're calling is quite likely to be within +/- 32KB of you. I mean, how many programs are there that are even more than 64KB in size, right? And by +/- 32KB, I mean quite literally the range -32768 to +32767 bytes. This is actually encoded relative to the next instruction, which makes it even weirder. The encoding is `E8 rel16`, where `rel16` is the `int16_t` that encodes the difference from the current instruction pointer. Again, it actually encodes the difference from the *next* instruction, so it encodes the difference from the current instruction pointer + 3. The total size of this `CALL` is a mere 3 bytes!
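As a sketch of that arithmetic: here's how an assembler might compute the displacement (`0xE8` is the standard `CALL rel16/rel32` opcode byte; the addresses are made up for illustration):

```python
import struct

def encode_call_rel16(insn_addr, target_addr):
    """Encode a 3-byte relative CALL: opcode 0xE8 plus an int16_t
    displacement measured from the *next* instruction (insn_addr + 3)."""
    rel16 = target_addr - (insn_addr + 3)
    if not -32768 <= rel16 <= 32767:
        raise ValueError("target out of rel16 range")
    return b"\xe8" + struct.pack("<h", rel16)

# Calling a function 0x100 bytes ahead of the CALL instruction:
code = encode_call_rel16(0x1000, 0x1100)
assert len(code) == 3  # the whole CALL is a mere 3 bytes
```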
Now, in 64-bit mode that would be weird. Programs are larger, memory is less scarce, and you're less likely to be calling functions within 32KB of you. Plus you want to keep those opcodes small, so you want to reuse them if possible. So Intel (really: AMD) deprecated the old 16-bit form, preserved the same opcode, and now the new instruction looks like this:
The encoding is `E8 rel32`, where `rel32` encodes the `int32_t` that is the difference from the next instruction; that is, it's relative to the value of `%rip + 5`.
So this is still kind of weird because of the relative addressing. But whatever, it kind of makes sense. Most programs don't have a `.text` section larger than 2GB (even today), so this is actually pretty reasonable.
However, it's the year 2016, and sometimes we might want to call a function that is more than 2GB away from the current instruction pointer. Actually, it's not just sometimes. On Linux, when calling into a shared library on a system with ASLR, memory addresses are typically mapped at 0x7f0000000000 or higher, which definitely can't be reached by a `rel32` immediate operand.
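To make the range problem concrete, here's a quick sketch checking whether a target is reachable from a call site (the call-site address is hypothetical; the library address is the typical ASLR region mentioned above):

```python
def fits_rel32(insn_addr, target_addr, insn_len=5):
    """A rel32 displacement is measured from the next instruction
    and must fit in a signed 32-bit integer."""
    rel = target_addr - (insn_addr + insn_len)
    return -2**31 <= rel < 2**31

# A call site in a hypothetical non-PIE binary's text segment...
call_site = 0x499750
# ...trying to reach a shared library mapped up high by ASLR:
libc_func = 0x7F0000000000

assert not fits_rel32(call_site, libc_func)        # way out of rel32 range
assert fits_rel32(call_site, call_site + 0x1000)   # nearby targets are fine
```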
Surely there's a way to `CALL` with a 64-bit operand. Actually, no. Here's what you have to do instead:
```
movabs $0x0123456789ab, %rax
call   *%rax
```
Here's what that looks like disassembled:
```
0:  48 b8 ab 89 67 45 23    movabs $0x123456789ab,%rax
7:  01 00 00
a:  ff d0                   callq  *%rax
```
So we have to spend two bytes to encode the `MOV` opcode, eight bytes to encode the operand, and two more bytes to encode the `CALL` opcode, for a total of 12 bytes.
The same is true for a `JMP`. To do a 64-bit jump, you would do:

```
movabs $0x0123456789ab, %rax
jmpq   *%rax
```
Here's what that looks like disassembled:
```
0:  48 b8 ab 89 67 45 23    movabs $0x123456789ab,%rax
7:  01 00 00
a:  ff e0                   jmpq   *%rax
```
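These byte sequences are simple enough to emit by hand. As a sketch, here's one way to pack them; the `48 b8`, `ff d0`, and `ff e0` opcode bytes come straight from the disassembly above:

```python
def abs_call(target):
    """movabs $target, %rax ; callq *%rax"""
    return b"\x48\xb8" + target.to_bytes(8, "little") + b"\xff\xd0"

def abs_jmp(target):
    """movabs $target, %rax ; jmpq *%rax"""
    return b"\x48\xb8" + target.to_bytes(8, "little") + b"\xff\xe0"

code = abs_call(0x0123456789AB)
assert len(code) == 12  # the 12-byte dance
assert code.hex() == "48b8ab89674523010000ffd0"
```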
This is really annoying, because it takes a lot of instructions, and we need to burn a register to do it. If we need to preserve the value of the register we're using, we have to emit even more instructions to save it somehow (e.g. push it onto the stack), which makes things even larger.
Here's where it gets really awful, for me at least. Intel has a class of instructions that it calls `Jcc`, short for "jump if condition is met". These are things with mnemonics like `js` (jump if signed), `jz` (jump if zero), `jne` (jump if not equal), etc.
Guess how you encode a `Jcc` instruction with a 64-bit target? Well, you can't. You just can't. You can jump to another location that does the 12-byte dance I demonstrated above, but you can't directly jump to a 64-bit location, no matter how much you might want to. The technique of jumping to another location that does the jump for you is called a trampoline.
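One common shape for such a trampoline (a sketch, not necessarily how any particular tool does it) is to invert the condition and hop over an absolute jump: `js target64` becomes a `jns` that skips the 12-byte `movabs`/`jmpq` sequence. Note that this clobbers `%rax`, so it's only safe when `%rax` is dead at that point:

```python
def widen_js(target):
    """Rewrite `js target` (64-bit target) as:
         jns  skip            # inverted condition, hops over the stub
         movabs $target, %rax
         jmpq *%rax
       skip:
    Clobbers %rax."""
    stub = bytes([0x79, 0x0C])  # jns rel8: skip the next 12 bytes (js is 0x78)
    stub += b"\x48\xb8" + target.to_bytes(8, "little")  # movabs $target, %rax
    stub += b"\xff\xe0"         # jmpq *%rax
    return stub

assert len(widen_js(0x7F0000001000)) == 14
```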
Here's why this is currently really annoying for me. The function prologue of `PyObject_Malloc` looks like this:

```
Dump of assembler code for function PyObject_Malloc:
   0x0000000000499750 <+0>:   test   %rdi,%rdi
   0x0000000000499753 <+3>:   js     0x4182be
   0x0000000000499759 <+9>:   lea    -0x1(%rdi),%rax
   0x000000000049975d <+13>:  cmp    $0x1ff,%rax
   0x0000000000499763 <+19>:  ja     0x499859 <PyObject_Malloc+265>
```
I was hoping to do a trick where I'd replace the first 12 bytes of the function with a jump to a 64-bit address I've loaded via `mmap(2)`. The code I've mapped there would then do the following:

- it would take the instructions in the first 12 bytes of `PyObject_Malloc` (the `test`, `js`, and `lea` instructions), and execute them, by poking them into the mapped area using ptrace
- it would do `push %rdi`, to save the argument that was passed in
- it would do `movabs` of the offset into `PyObject_Malloc` just past the displaced instructions, into `%rax`
- it would do `callq *%rax`
- it would do `pop %rdi`
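Putting those steps together, here's a rough sketch of how the mapped area's bytes might be assembled (the `push`/`pop`/`ret` encodings are the standard one-byte opcodes; the continuation address is hypothetical, and the actual profiling hook is elided):

```python
def build_mapped_area(displaced_bytes, continue_addr):
    """Sketch of the mapped trampoline: run the displaced instructions,
    save the argument register, call back into the original function,
    then recover the argument so a profiler could see both it and the
    return value (which is in %rax)."""
    code = bytes(displaced_bytes)     # the 12 bytes displaced from PyObject_Malloc
    code += b"\x57"                   # push %rdi  (save the allocation size)
    code += b"\x48\xb8" + continue_addr.to_bytes(8, "little")  # movabs $continue, %rax
    code += b"\xff\xd0"               # callq *%rax
    code += b"\x5f"                   # pop %rdi   (the original call value)
    # ... profiling hook would go here ...
    code += b"\xc3"                   # retq, back to the original caller
    return code

# NOPs stand in for the displaced instructions; 0x49975c = PyObject_Malloc + 12
area = build_mapped_area(b"\x90" * 12, 0x49975C)
assert len(area) == 12 + 1 + 10 + 2 + 1 + 1
```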
Now, when it gets to the `callq *%rax` line, it will call into an offset of `PyObject_Malloc`. At some point, either in `PyObject_Malloc` or in a function that it jumps to, a `retq` instruction will be hit. That will return right back to my `pop %rdi` instruction in the mapped area! At this point I know the original call value sent to `PyObject_Malloc` (I've popped it into `%rdi`), and I also know the return value of `PyObject_Malloc` (by the calling convention, it's in `%rax`).
This makes it easy to set up my profiler, since now I can call my profiling function that records the allocation call value and return value. And then when my code hits the `retq`, it will return to the actual function that originally made the call to `PyObject_Malloc`. Pretty neat, right?
However, what the hell am I supposed to do about that `js` instruction? If there were 64-bit forms of the `Jcc` instructions I could re-encode it in a wider way (and replace equivalently more data in the `PyObject_Malloc` function), but I can't do that.
So now I have to write my own trampolining code. I'm definitely capable of writing the trampolining code, but it's starting to make things really complex. Things are especially complex when you need to track which registers are safe to use (which thankfully Capstone makes easy).
I'm also considering trying to just mmap into an area within 2GB of the actual binary to avoid this whole problem. This would avoid a lot of problems, since I wouldn't have to widen any instructions, and therefore I wouldn't have to use any trampolines. I would still have to re-encode the jumps and calls, but that isn't difficult (at this point I've already written that code). There's no guarantee that you can mmap within this region (although looking at /proc/pid/maps, it looks like in practice this will never be an issue). This technique also means that I'm basically depending on my /usr/bin/python not being compiled with -fPIC, which would make it less portable, but in my case I know that's the case. So I might just start doing it this way and then make it more general later.
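A sketch of that mmap idea using `ctypes` (the flag constants are the Linux x86-64 values, and the hint address is made up; without `MAP_FIXED` the kernel treats the address purely as a hint, so the result still has to be range-checked against the binary's location):

```python
import ctypes

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]

PROT_READ, PROT_WRITE, PROT_EXEC = 0x1, 0x2, 0x4
MAP_PRIVATE, MAP_ANONYMOUS = 0x02, 0x20   # Linux x86-64 values
MAP_FAILED = ctypes.c_void_p(-1).value

def map_near(hint, size=4096):
    """Try to map an rwx page near `hint`; returns the actual address,
    which may differ from the hint and must be range-checked."""
    addr = libc.mmap(ctypes.c_void_p(hint), size,
                     PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
    if addr == MAP_FAILED:
        raise OSError(ctypes.get_errno(), "mmap failed")
    return addr
```

Whether the returned address actually landed within +/- 2GB of the binary's text segment still has to be verified, e.g. by consulting /proc/self/maps.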
Anyway, I hope you had fun learning about the joys of x86 programming. And once again, a huge shoutout to Capstone, which makes runtime disassembly really easy to do.