So I've been doing a fair amount of x86-64 programming lately. As a result, I've learned a lot both about the 32-bit x86 architecture and about the changes that were made for 64-bit machines.
Even if you don't know anything about x86 assembly programming, you may have heard from others that compared to other architectures it is a huge mess. There are a couple of reasons for this.
First of all, the language has been kept mostly backwards compatible. Intel started designing processors in the 70s (their first commercially available microprocessor, the Intel 4004, was released in 1971). So there are a lot of carry-overs from that time period.
Today we think of Intel as a company whose primary business is making CPUs. But that's not how they started. In fact, the initial business Intel was in was making memory (SRAM). At the time it was incredibly expensive to manufacture RAM, and many systems were more constrained by RAM than any other resource. If you have a slow CPU your program may take a long time to run, but it will run. On the other hand, if you don't have enough memory, you may not be able to run your program at all. This is doubly true back in the day when virtual memory didn't exist, at least not for microcomputers. For more than the first ten years of its existence, Intel's primary business was making RAM.
By the 1980s things had changed and other companies had figured out how to make RAM at competitive prices. Intel shifted its business model towards making and selling CPUs.
However, because of their early history as a company making memory chips, and because of the desire to attempt to keep a backwards compatible machine language, the design of x86 is hyper-focused on instruction size. In pretty much every conceivable area, if there's a way to encode some instructions more compactly, that's how it's done.
I'll demonstrate this by way of the venerable `CALL` instruction, which is used to call another function. The `CALL` instruction needs an operand: the address to call to.
Now, imagine it's the 1970s or 1980s, and you're extremely memory constrained, so you want to make these `CALL` instructions really compact. Since you're so memory constrained, there's a good chance that the function you're calling is going to be nearby in memory. If you're really lucky, and if the compiler laid things out for you correctly (or more likely: you did, in your hand-written assembly), the function you're calling is quite likely to be within ±32 KB of you. I mean, how many programs are even more than 64 KB in size, right? And by ±32 KB, I mean quite literally in the range -32768 to +32767 bytes. This is actually encoded relative to the next instruction, which makes it even weirder. The encoding is like this:

```
0xE8 rel16
```

where `rel16` is the `int16_t` that encodes the displacement. Again, it's actually the difference from the next instruction, so what's encoded is the target minus (the current instruction pointer + 3). The total size of this `CALL` is a mere 3 bytes!
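This arithmetic is easy to get backwards, so here's a small Python sketch of it (the function name is mine, not anything standard):

```python
import struct

def encode_call_rel16(insn_addr: int, target: int) -> bytes:
    """Encode a 16-bit relative CALL (0xE8 rel16), as used in 16-bit code.

    The displacement is relative to the *next* instruction, i.e. the
    address of the CALL plus its 3-byte length.
    """
    rel = target - (insn_addr + 3)
    if not -0x8000 <= rel <= 0x7FFF:
        raise ValueError("target out of rel16 range")
    return b"\xe8" + struct.pack("<h", rel)

# A call at 0x0100 targeting 0x0200 encodes a displacement of 253 (0xFD):
print(encode_call_rel16(0x0100, 0x0200).hex())  # e8fd00
```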
Now, in 64-bit mode that would be weird. Programs are larger, memory is less scarce, and you're less likely to call functions within 32 KB of you. Plus you want to keep those opcodes small, so you want to reuse them if possible. So Intel (really, AMD) deprecated the old form, preserved the same opcode, and the new instruction looks like this:

```
0xE8 rel32
```

where `rel32` is the `int32_t` that encodes the difference from the next instruction; that is, from the address of the `CALL` plus 5.
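Decoding goes the other way; here's a quick sketch (the bytes and addresses are made up for illustration):

```python
import struct

def decode_call_rel32(insn_addr: int, insn: bytes) -> int:
    """Given a 5-byte E8 call and the address it sits at, recover the target.

    rel32 is signed and relative to the next instruction (insn_addr + 5).
    """
    assert len(insn) == 5 and insn[0] == 0xE8
    rel = struct.unpack("<i", insn[1:])[0]
    return insn_addr + 5 + rel

# e8 fb fe ff ff at 0x401020 is a backward call: rel32 = -261.
print(hex(decode_call_rel32(0x401020, bytes.fromhex("e8fbfeffff"))))  # 0x400f20
```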
So this is still kind of weird because of the relative addressing. But whatever, it kind of makes sense. Most programs don't have a `.text` section larger than 2GB (even today), so this is actually pretty reasonable.
However, it's the year 2016, and sometimes we might want to call a function that is more than 2GB from the current instruction pointer. Actually, it's not just sometimes. On Linux, when calling into a shared library on a system with ASLR, library addresses are typically mapped at 0x7f0000000000 or higher, which definitely can't be reached by a `rel32` immediate operand.
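You can see how hopeless this is with a little arithmetic (the two addresses here are just typical values, not anything exact):

```python
# A rel32 displacement is a signed 32-bit value: at most ~2 GiB either way.
INT32_MAX = 2**31 - 1

text_addr = 0x400000           # typical non-PIE executable load address
library_addr = 0x7F0000000000  # typical ASLR'd shared library mapping

displacement = library_addr - text_addr
print(displacement > INT32_MAX)  # True: the library is ~127 TiB away
```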
Surely there's a way to `CALL` with a 64-bit operand.
Actually, no. Here's what you have to do instead:

```
movabs $0x0123456789ab, %rax
call *%rax
```

Here's what that looks like disassembled:

```
0:   48 b8 ab 89 67 45 23    movabs $0x123456789ab,%rax
7:   01 00 00
a:   ff d0                   callq  *%rax
```

So we have to spend two bytes to encode the `MOV` opcode, eight bytes to encode the operand, and two more bytes to encode the `CALL` opcode, for a total of 12 bytes.
The same is true for a `JMP`. To do a 64-bit jump, you would do:

```
movabs $0x0123456789ab, %rax
jmpq *%rax
```

Here's what that looks like disassembled:

```
0:   48 b8 ab 89 67 45 23    movabs $0x123456789ab,%rax
7:   01 00 00
a:   ff e0                   jmpq   *%rax
```
This is really annoying, because it takes a lot of bytes, and we need to clobber a register to do it. If we need to preserve the value of the register we're using, we have to use even more instructions to save it somehow (e.g. push it onto the stack), which makes things even larger.
Here's where it gets really awful, for me at least. Intel has a class of instructions that it calls `Jcc`, short for "jump if condition is met". These are instructions with mnemonics like `js` (jump if sign flag set), `jz` (jump if zero), `jne` (jump if not equal), etc.
Guess how you encode a `Jcc` instruction with a 64-bit target? Well, you can't. You just can't. You can jump to another location that does the 12-byte dance I demonstrated above, but you can't directly jump to a 64-bit location, no matter how much you might want to. The technique of jumping to another location that does the jump for you is called a [trampoline](https://en.wikipedia.org/wiki/Trampoline_(computing)).
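One way to picture a trampoline at the byte level: overwrite the patch site with a 5-byte near jump to a trampoline that lives within ±2 GiB, and put the 12-byte absolute jump there. Here's a sketch of that layout (all three addresses are made up for illustration, and note that the thunk clobbers `%rax`):

```python
import struct

def jmp_rel32(insn_addr: int, target: int) -> bytes:
    """5-byte near jump: E9 rel32, relative to the next instruction."""
    rel = target - (insn_addr + 5)
    if not -2**31 <= rel < 2**31:
        raise ValueError("trampoline must be within ±2 GiB of the patch site")
    return b"\xe9" + struct.pack("<i", rel)

def abs_jmp_thunk(target: int) -> bytes:
    """12-byte absolute jump: movabs $target, %rax; jmp *%rax."""
    return b"\x48\xb8" + struct.pack("<Q", target) + b"\xff\xe0"

patch_site = 0x499750        # hypothetical: start of the hooked function
trampoline = 0x499000        # hypothetical: a mapped page nearby
far_target = 0x7F0011223344  # hypothetical hook, unreachable by rel32

patched = jmp_rel32(patch_site, trampoline)  # goes at patch_site (5 bytes)
thunk = abs_jmp_thunk(far_target)            # goes at trampoline (12 bytes)
print(len(patched), len(thunk))  # 5 12
```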
Here's why this is currently really annoying for me. The function prologue of `PyObject_Malloc` looks like this:

```
Dump of assembler code for function PyObject_Malloc:
   0x0000000000499750 <+0>:   test   %rdi,%rdi
   0x0000000000499753 <+3>:   js     0x4182be
   0x0000000000499759 <+9>:   lea    -0x1(%rdi),%rax
   0x000000000049975d <+13>:  cmp    $0x1ff,%rax
   0x0000000000499763 <+19>:  ja     0x499859 <PyObject_Malloc+265>
```
I was hoping to do a trick where I'd replace the first 12 bytes with the sequence shown above, jumping to a 64-bit address I've mapped via `mmap(2)`. The code at the mapped address then does the following:

- it would take the instructions in the first 12 bytes of `PyObject_Malloc` (that is, the `test`, `js`, and `lea` instructions) and execute them, having poked them into the mapped area using ptrace
- it would do `push %rdi`
- it would do `mov $0x49975d,%rax`
- it would do `callq *%rax`
- it would do `pop %rdi`
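The stub described in that list could be laid out something like this. This is a hedged sketch, not my actual implementation: the resume address is the `<+13>` offset from the dump above, NOP bytes stand in for the stolen instructions (the real ones get copied out with ptrace), and the final profiling-hook call is elided:

```python
import struct

def build_stub(stolen: bytes, resume_addr: int) -> bytes:
    """Sketch of the mapped-area stub: run the stolen instructions, save the
    size argument, call back into the rest of the function, then restore."""
    code = bytearray()
    code += stolen                                        # original first 12 bytes
    code += b"\x57"                                       # push %rdi (save the size)
    code += b"\x48\xb8" + struct.pack("<Q", resume_addr)  # movabs $resume_addr,%rax
    code += b"\xff\xd0"                                   # callq *%rax
    code += b"\x5f"                                       # pop %rdi (size back in %rdi)
    # ...here you'd call the profiling hook with %rdi and %rax, then ret...
    return bytes(code)

stub = build_stub(b"\x90" * 12, 0x49975D)  # NOPs stand in for the stolen bytes
print(len(stub))  # 26
```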
Now, when it gets to the `callq *%rax` line it will call into an offset of `PyObject_Malloc`. At some point, either in `PyObject_Malloc` or in a function that `PyObject_Malloc` jumps to, a `retq` instruction will be hit. That will return right back to my `pop %rdi` instruction in the mapped area! At this point I know the original argument passed to `PyObject_Malloc` (I've popped it into `%rdi`) and I also know the return value of `PyObject_Malloc` (by convention it's in `%rax`).
This makes it easy to set up my profiler, since now I can call my profiling function that records the allocation argument and the return value. And then when my code executes `retq`, it will return to the actual function that originally made the call to `PyObject_Malloc`. Pretty neat, right?
However, what the hell am I supposed to do about that `js` instruction? If there were 64-bit `Jcc` instructions I could re-encode it in a wider form (and replace correspondingly more data in the `PyObject_Malloc` function), but I can't do that. So now I have to write my own trampolining code. I'm definitely capable of writing the trampolining code, but it's starting to make things really complex. Things are especially complex when you need to track which registers are safe to use (which thankfully Capstone makes possible).
I'm also considering just trying to mmap into an area within 2GB of the actual binary to avoid this whole problem. That would sidestep a lot of complexity, since I wouldn't have to widen any instructions, and therefore I wouldn't need any trampolines. I would still have to re-encode the jumps and calls, but that isn't difficult (at this point I've already written that code). There's no guarantee that you can mmap within this region (although looking at `/proc/pid/maps`, it looks like in practice this will never be an issue). This technique also means that I'm basically depending on my `/usr/bin/python` being built without `-fPIC`, which makes the approach less portable, but in my case I know that's how it was built. So I might just start doing it this way and then make it more portable later.
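Finding such a hole comes down to parsing `/proc/<pid>/maps`, where each line starts with a `start-end` address range. Here's a toy version of the search that works on the maps text directly (the function name and the simplistic "first gap between mappings" strategy are mine; a real version would also check the gap against the process's mmap limits):

```python
def find_gap_within_2gib(maps_text: str, base: int, size: int = 0x1000):
    """Scan /proc/<pid>/maps-style text for a free gap of `size` bytes
    that is within a signed 32-bit displacement of `base`."""
    ranges = []
    for line in maps_text.splitlines():
        span = line.split()[0]  # e.g. "00400000-004a0000"
        start, end = (int(x, 16) for x in span.split("-"))
        ranges.append((start, end))
    ranges.sort()
    # Look for a hole between consecutive mappings.
    for (_, prev_end), (next_start, _) in zip(ranges, ranges[1:]):
        if next_start - prev_end >= size and abs(prev_end - base) < 2**31:
            return prev_end
    return None

# Two mappings with a hole between them, both near a 0x400000 base:
maps = "00400000-004a0000 r-xp ...\n00700000-00800000 rw-p ...\n"
print(hex(find_gap_within_2gib(maps, base=0x400000)))  # 0x4a0000
```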
Anyway, hope you had fun learning about the joys of x86 programming. And once again, a huge shoutout to Capstone, which makes runtime disassembly really easy to do.