The GCC developers have been working on getting LTO (link time optimization) into GCC and binutils for some time now. It works pretty well at this point, but for a while it was rough around the edges.
I've been looking at LTO-generated code a bit recently, and I want to explain how it works (as I understand it, from the perspective of someone disassembling code). I'm going to assume a 64-bit architecture here. The same optimizations would apply on a 32-bit architecture, but since the encoded instruction sizes differ, the places where LTO kicks in might differ across these architectures.
The most obvious example of how LTO is useful is inlining functions across compilation units. This means that if a function is used only once, or the function itself is very small, the compiler can decide, as an optimization, to replace calls to the function with the actual function contents (although some modern compilers can inline if you declare an inline function in a header with the implementation in a .c or .cc file). Inlined functions are old school at this point; the novel part here is doing it across compilation units.
However, there's a much more interesting optimization I found that GCC implements with LTO, which is what I will focus on in this article. There is a commonly occurring pattern in your code where you may run some sequence of logic and then return a value based on that logic.
For a particular instance of this pattern, consider how the malloc() function might work. The signature of malloc() is:
void* malloc(size_t nbytes);
It would be useful if the malloc() function could fail in some cases. For instance, if you invoke malloc(0), you would want it to return NULL. There's a similar check you can do to see if someone supplied what appears to be a "negative" nbytes argument (even though, by definition, size_t is unsigned). To be really clear here: in the usual case, if you tried to actually put malloc(0) or malloc(-1) in your code, your compiler would probably be smart enough to warn you (if not error). These compiler checks are generally not available, though, when you consider that malloc() could be passed as a "functor" to another data structure, e.g. a generic C hash table built using functors.
Let's focus on disallowing the malloc(0) case since it's relatively easy to understand. In C we would want some code like this:
void* malloc(size_t nbytes) {
    if (nbytes == 0)
        return NULL;
    void* ptr;
    /* more code here: actually allocate memory and point ptr at it */
    return ptr;
}
The code for malloc() in my pseudo-asm might start like
test %rdi,%rdi
je fail              # jump if nbytes == 0
...
fail:
xor %rax,%rax        # encoded in three bytes
ret                  # encoded in one byte
The last two instructions set the return value to 0 and return to the caller, effectively returning a null pointer. They take four bytes total to encode.
Four bytes isn't a lot, but it happens all the time. You can consider this
xor/ret
pair as implementing the following code:
return 0; // or equivalently, return NULL
How many times do you think that code comes up in a large program? The answer is all the time. Also, returning 0 is the optimal case. It takes even more bytes to return 1 or -1. So if we could stop generating this code over and over again we'd save space every time we could omit it. This saves disk space and memory space. It also makes data more likely to be in one of the CPU memory caches. There are a lot of reasons why this optimization is valuable.
Depending on the exact structure of this function, the conditional jump will
take either two or six bytes to encode. It takes two bytes if
the operand is a rel8 and six bytes if the operand is a rel32. A rel8 is a
weird representation of a signed byte in Intel assembler, but you can basically
think of it as representing the range -128 to +127 with respect to the next
instruction. This is true of basically all of the "relative" jump instructions
like jz
, jnz
, jo
, and so forth: they'll be encoded in two bytes if the
operand is a rel8 and encoded in six bytes if the operand is a rel32. This
class of instructions is referred to as Jcc
in the Intel instruction manual,
which curiously stands for "jump if condition is met".
I'm going to repeat this to make it explicit, because it was really confusing to
me when I learned x86, despite already being proficient in C. In a language like
C you have pointers that represent absolute addresses. When you're on a 64-bit C
system, a pointer is always 8 bytes. That's just how it is. So if you're used to
C you might guess that something like a "jump" instruction would have an 8
byte target on a 64-bit system. I'm not going to get into the details of x86
(it's incredibly complicated), but the short version is that if you have a
conditional jump instruction like jne
you have two options: a rel8
that is encoded as a signed 8-bit number relative to the end of the jne
instruction, or a rel32 instruction which is the same but encoded as a signed
32-bit integer. So the operand to a conditional jump is always how
many bytes forward or backwards to jump. If for some reason you want to know the
absolute address that ends up being the target of the jump you need to do the
math yourself. As an aside: at the end, I'll explain how it's possible to
implement 64-bit conditional jumps; also, some instructions use right shifted
targets, which I won't cover for space reasons.
Back to our hypothetical malloc() implementation. If we can encode the conditional jump as a rel8 then we just do the usual thing: use two bytes to encode the jump and four bytes to encode the xor/ret.
A lot of the time we won't be so lucky though. Particularly when a function is relatively complicated, the chances of getting a rel8 encoding for our conditional jump aren't as good. If you look at the disassembly for a complex C or C++ function there are a lot of jumps that are hundreds (or thousands) of bytes away, so it's really common to need the six-byte Jcc rel32 encoding anyway.
Since we have to use six bytes to encode this "huge" Jcc rel32
instruction, it
turns out that we can implement a really interesting optimization using LTO.
In the .text section of an ELF binary there's nothing that prohibits you from putting extra bytes at the start of the section. So you're allowed to put any arbitrary number of bytes before the first actual function in the .text section.
Let's imagine that we call this the LTO section, even though the section is unnamed; in reality it's literally just a string of bytes before the first actual function defined in the .text section, and those bytes happen to encode x86 instructions. In the LTO section GCC will emit a series of common instruction sequences that generally ret out of the LTO section.
The LTO section might look like this (extra comments and whitespace added for clarity):
xor %rax, %rax # set return value to 0
ret
mov $1, %rax # set return value to 1
ret
mov $-1, %rax # set return value to -1
ret
So if we wanted to do a rel32 jump to code that returns -1, 0, or 1 we can
instead just do the same jump into the LTO section. The jump instruction is just
as expensive, since the operand is a rel32 in either case. We save 4 bytes in
the return 0
case and in this implementation we save 8 bytes in the return 1
and return -1
cases, which curiously
take more bytes to encode.
Looking at code like this as a human is really confusing because the LTO section
isn't a regular function---instead it's a bunch of garbled code that returns
(or somehow jumps back) to non-LTO code.
64-bit Jumps
Previously I mentioned that you can't conditionally jump to a 64-bit address. This is a real bummer with shared libraries. On Linux, statically linked code is typically mapped at low addresses while shared libraries are mapped at addresses outside the range of a rel32, so conditionally jumping to a shared library function is problematic.
You also can't directly call an arbitrary 64-bit address using an immediate operand; nor can you directly jump to an arbitrary 64-bit address.
However, you can directly call the value in a register. Likewise, you can directly jump to the address in a register.
Here's how you combine them. You do a jump (hopefully rel8, but rel32 is fine too) to a location that moves a 64-bit value into a register, then you jump to the address in that register. The pseudo-asm would look like:
... # condition here
jne label
...
label:
mov $0x7ff19543d000, %rax
jmp *%rax
So this lets us conditionally jump to 0x7ff19543d000
at the expense of using
an extra register, %rax
in this case. Usually this is OK; in the worst case,
you might have to figure out how to save a register before replacing its
contents. But if you're writing asm, this should already be second nature to
you.