The 64-bit Extensions to x86 Are a Mess

So I've been doing a fair amount of x86-64 programming lately. As a result, I've learned both a lot about the 32-bit x86 architecture, as well as the changes they made for 64-bit machines.

Even if you don't know anything about x86 assembly programming, you may have heard from others that compared to other architectures it is a huge mess. There's a couple of reasons for this.

First of all, the language has been kept mostly backwards compatible. Intel started designing processors in the 70s (their first commercially available microprocessor, the Intel 4004, was released in 1971). So there are a lot of carry-overs from that time period.

Today we think of Intel as a company whose primary business is making CPUs. But that's not how they started. In fact, the initial business Intel was in was making memory (SRAM). At the time it was incredibly expensive to manufacture RAM, and many systems were more constrained by RAM than any other resource. If you have a slow CPU your program may take a long time to run, but it will run. On the other hand, if you don't have enough memory, you may not be able to run your program at all. This is doubly true back in the day when virtual memory didn't exist, at least not for microcomputers. For more than the first ten years of its existence, Intel's primary business was making RAM.

By the 1980s things had changed and other companies had figured out how to make RAM at competitive prices. Intel shifted its business model towards making and selling CPUS.

However, because of their early history as a company making memory chips, and because of the desire to attempt to keep a backwards compatible machine language, the design of x86 is hyper-focused on instruction size. In pretty much every conceivable area, if there's a way to encode some instructions more compactly, that's how it's done.

I'll demonstrate this by way of the venerable CALL instruction, which is used to call another method. The CALL instruction needs an operand---the address to call to.

Now, imagine it's the 1970s or 1980s, you're extremely memory constrained. So you want to make these CALL instructions really compact. Since you're so memory contrained, there's a good chance that the function you're calling is going to be nearby in memory. For instance, you could imagine that if you're really lucky, and if the compiler laid things out for you correctly (or more likely: you did in your hand written assembler) the function you're calling is quite likely to be within +/- 32kb from you. I mean, how many programs are there that are even more than 64kb in size, right? And by +/32kb, I mean quite literally in the range -32768 to +32767 bytes. This is actually encoded relative to the next instruction, which makes it even weirder. The encoding is like this:

0xe8 rel16

Where rel16 is the int16_t that encodes the difference from the current instruction pointer. Again, it actually encodes the difference from the next instruction, so it encodes the difference between the current instruction pointer + 3. The total size of this CALL is a mere 3 bytes!

Now, in 64-bit mode that would be weird. Programs are larger, memory is less scarce, and you're less likely to call functions within 32kb of you. Plus you want to keep those opcodes small, so you want to reuse them if possible. So Intel (really---AMD) deprecated the old mode, preserved the same opcode, and now the new instruction is like:

0xE8 rel32

Where rel32 encodes the int32_t that is the difference from the next instruction; that is, the value of %rip + 5.

So this is still kind of weird because of the relative addressing. But whatever, it kind of makes sense. Most programs don't have a .text extension larger than 2GB (even today), so this is actually pretty reasonable.

However, it's the year 2016, and sometimes we might want to call a function that is more than 2GB from the current instruction pointer. Actually, it's not just sometimes. On Linux, when calling a shared library on a system with ASLR, memory addresses are typically mapped to 0x7f0000000000 or higher which definitely can't be called by a rel32 immediate operand.

Surely there's a way to CALL with a 64-bit operand.

Actually, no. Here's what you have to do instead:

movabs $0x0123456789ab, %rax
call *%rax

Here's what that looks like disassembled:

   0:    48 b8 ab 89 67 45 23     movabs $0x123456789ab,%rax
   7:    01 00 00
   a:    ff d0                    callq  *%rax

So we have to spend two bytes to encode the MOV opcode, eight bytes to encode the operand, and two more bytes to encode the CALL opcode, for a total of 12 bytes.

The same is true for a JUMP. To do a 64-bit jump, you would do

movabs $0x0123456789ab, %rax
jmpq *%rax

Here's what that looks like disassembled:

   0:    48 b8 ab 89 67 45 23     movabs $0x123456789ab,%rax
   7:    01 00 00
   a:    ff e0                    jmpq   *%rax

This is really annoying, because it uses a lot of instructions, and we need to use a register to do it. If we need to save the value of the register we're using we have to use more instructions to save it somehow (e.g. push it onto the stack) which makes things even larger.

Here's where it gets really awful, for me at least. Intel has a class of instructions that it calls Jcc, short for "jump if condition is met". These are things with mnemonics like js (jump if signed), jz (jump if zero), jne (jump if not equal), etc.

Guess how you encode a Jcc instruction with a 64-bit target? Well, you can't. You just can't. You can jump to another location that does the 12 byte dance I demonstrated above, but you can't directly jump to a 64-bit location, no matter how much you might want to. The technique of jumping to another location that does the jump for you is called a [trampoline](https://en.wikipedia.org/wiki/Trampoline_(computing)).

Here's why this is currently really annoying for me. The function prologue of PyObject_Malloc looks like this:

Dump of assembler code for function PyObject_Malloc:
   0x0000000000499750 <+0>:     test   %rdi,%rdi
   0x0000000000499753 <+3>:     js     0x4182be
   0x0000000000499759 <+9>:     lea    -0x1(%rdi),%rax
   0x000000000049975d <+13>:    cmp    $0x1ff,%rax
   0x0000000000499763 <+19>:    ja     0x499859 <PyObject_Malloc+265>

I was hoping to do a trick where I'd replace the first 12 bytes with my trick that jumps to a 64-bit address I've loaded via mmap(2). The address I've mapped then does the following trick:

Now, when it gets to the callq *%rax line it will call into an offset of PyObject_Malloc. At some point, either in PyObject_Malloc, or in a function that is jumped to by PyObject_Malloc, a retq instruction will be hit. That will return right back to my pop %rdi instruction in the mapped area! At this point I know the original call value sent to PyObject_Malloc (I've popped it into %rdi) and I also know the return value of PyObject_Malloc (by convention it's in %rax).

This makes it easy to set up my profiler, since now I can call my profiling function that records the allocation call value and return value. And then when my code calls retq, it will return to the actual function that originally made the call to PyObject_Malloc. Pretty neat, right?

However, what the hell am I supposed to do about that js instruction? If there were 64-bit Jcc instructions I could re-encode it in a wider way (and replace equivalently more data in the PyObject_Malloc function), but I can't do that. So now I have to write my own trampolining code. I'm definitely capable of writing the trampolining code, but it's starting to make things really complex. Things are especially complex when you need to track what registers are safe to use (which thankfully Capstone makes possible).

I'm also considering trying to just mmap into an area within 2GB of memory of the actualy binary to avoid this whole problem. This would avoid a lot of problems since I won't have to widen any instructions, and therefore I wouln't have to use any trampolines. I would have to re-encode the jumps and calls, but that isn't difficult (at this point I've already written that code). There's no guarantee that you can mmap within this region (although look at /proc/pid/maps, it looks like in practice this will never be an issue). This technique also means that I'm basically depending on my /usr/bin/python to be built without -fPIC, which would make it less portable, but in my case I know that's the case. So I might just start doing it this way and then make it more portable later.

Anyway, hope you had fun learning about the joys of x86 programming. And once again huge shoutout to Capstone which makes runtime disassembly really easy to do.