How GCC LTO Works

The GCC developers have been working on getting LTO (link time optimization) into GCC and binutils for some time. It works pretty well these days, but for a while it was rough around the edges.

I've been looking at LTO-generated code a bit recently, and I want to explain how it works (as I understand it, from the perspective of someone disassembling code). I'm going to assume 64-bit architectures here. The same optimizations would apply on a 32-bit architecture, but since the encoded instruction sizes differ, the places where LTO kicks in might differ across architectures as well.

The most obvious example of how LTO is useful is inlining functions across compilation units. This means if a function is used only once, or the function itself is very small, the compiler could decide, as an optimization, to replace calls to the function with the actual function contents (although some modern compilers can inline even if you declare an inline function in a header with the implementation in a .c or .cc file). Inlined functions are old school at this point; the novel part here is doing it across compilation units.

However, there's a much more interesting optimization I found that GCC implements with LTO, which is what I will focus on in this article. There's a commonly occurring pattern in code where you run some sequence of logic and then return a value based on that logic.

For a particular instance of this pattern, consider how the malloc() function might work. The signature to malloc() is:

void* malloc(size_t nbytes);

It would be useful if the malloc() function could fail in some cases. For instance, if someone invokes malloc(0), you might want to return NULL. There's a similar check you can do to see if the caller supplied what appears to be a "negative" nbytes argument (even though, by definition, size_t is unsigned). To be really clear here: in the usual case, if you tried to actually write malloc(0) or malloc(-1) in your code, your compiler would probably be smart enough to warn you (if not error out). These compiler checks are generally not available, though, when you consider that malloc() could be passed as a function pointer (a "functor") to another data structure, e.g. a generic C hash table built using function pointers.

Let's focus on disallowing the malloc(0) case, since it's relatively easy to understand. In C we would want some code like this:

void* malloc(size_t nbytes) {
  if (nbytes == 0)
    return NULL;
  /* more code here */
  return ptr;
}

The code for malloc() in my pseudo-asm might start like this:

test %rdi,%rdi
je fail          # nbytes == 0: return NULL
...
fail:
xor %rax,%rax    # encoded in three bytes
ret              # encoded in one byte

The last two instructions set the return value to 0 and return to the caller, effectively returning a null pointer. They take four bytes total to encode.

Four bytes isn't a lot, but it happens all the time. You can consider this xor/ret pair as implementing the following code:

return 0;  // or equivalently, return NULL

How many times do you think that code comes up in a large program? The answer is all the time. Also, returning 0 is the optimal case. It takes even more bytes to return 1 or -1. So if we could stop generating this code over and over again we'd save space every time we could omit it. This saves disk space and memory space. It also makes data more likely to be in one of the CPU memory caches. There are a lot of reasons why this optimization is valuable.

Depending on the exact structure of this function, the conditional jump will take either two bytes or six bytes to encode. It takes two bytes if the operand is a rel8 and six bytes if the operand is a rel32. A rel8 is a signed 8-bit displacement, so you can think of it as representing the range -128 to +127 relative to the next instruction. This is true of basically all of the "relative" jump instructions like jz, jnz, jo, and so forth: they'll be encoded in two bytes if the operand is a rel8 and in six bytes if the operand is a rel32. This class of instructions is referred to as Jcc in the Intel instruction manual, where cc stands for "condition code" (the manual glosses Jcc as "jump if condition is met").

I'm going to repeat this to make it explicit, because it was really confusing to me when I learned x86, despite already being proficient in C. In a language like C you have pointers that represent absolute addresses. On a 64-bit system, a pointer is always 8 bytes. That's just how it is. So if you're used to C you might guess that something like a jump instruction would have an 8-byte target on a 64-bit system. I'm not going to get into the details of x86 encoding (it's incredibly complicated), but the short version is that a conditional jump instruction like jne has exactly two options: a rel8 operand, encoded as a signed 8-bit number relative to the end of the jne instruction, or a rel32 operand, which is the same thing encoded as a signed 32-bit integer. So the operand to a conditional jump is always how many bytes forward or backward to jump. If for some reason you want to know the absolute address that ends up being the target of the jump, you need to do the math yourself. As an aside: at the end, I'll explain how it's possible to implement 64-bit conditional jumps; also, some instruction sets use right-shifted targets, which I won't cover for space reasons.

Back to our hypothetical malloc() implementation. If we can encode the conditional jump using a rel8, then we should just do the usual thing: use two bytes to encode the jump and four bytes to encode the xor/ret.

A lot of the time we won't be so lucky, though. Particularly when a function is relatively complicated, the chances of getting a rel8 encoding for our conditional jump aren't as good. If you look at the disassembly of a complex C or C++ function, there are a lot of jumps that are hundreds (or thousands) of bytes away, so it's really common to need the six-byte Jcc rel32 encoding anyway.

Since we have to use six bytes to encode this "huge" Jcc rel32 instruction, it turns out that we can implement a really interesting optimization using LTO.

In the .text section of an ELF binary there's nothing that prohibits you from putting extra bytes at the start of the section. So you're allowed to put any arbitrary number of bytes before the first actual function in .text.

Let's imagine that we call this the LTO section, even though the section is unnamed; in reality it's literally just a string of bytes before the first actual function defined in the .text section, and those bytes happen to encode x86 instructions. In this LTO section GCC will emit a series of common instruction sequences that generally ret out of the LTO section.

The LTO section might look like this (extra comments and whitespace added for clarity):

xor %rax, %rax   # set return value to 0
ret

mov $1, %rax     # set return value to 1
ret

mov $-1, %rax    # set return value to -1
ret

So if we want to do a rel32 jump to code that returns -1, 0, or 1, we can instead do the same jump into the LTO section. The jump instruction is just as expensive, since the operand is a rel32 in either case. We save 4 bytes in the return 0 case, and in this implementation we save 8 bytes in the return 1 and return -1 cases, which take even more bytes to encode. Looking at code like this as a human is really confusing, because the LTO section isn't a regular function---instead it's a bunch of garbled-looking code that returns (or somehow jumps back) to non-LTO code.
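To make the sharing concrete, here's a hedged sketch (the function names, labels, and instruction choices are made up) of two unrelated functions whose error paths both jump into the same shared tail:

```
foo:                    # some function in .text
    ...
    test %rdi,%rdi
    je   .Lret0         # rel32 jump into the shared bytes
    ...

bar:                    # a completely unrelated function
    ...
    cmp  %rsi,%rdx
    ja   .Lret0         # same target, reused
    ...

.Lret0:                 # lives before the first real function in .text
    xor  %rax,%rax      # shared: return 0 / NULL
    ret                 # (shown last here only for readability)
```

Each call site pays for its six-byte Jcc rel32 either way; the four-byte xor/ret tail is emitted once instead of once per function.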

64-bit Jumps

Previously I mentioned that you can't conditionally jump to an arbitrary 64-bit address. This is a real bummer with shared libraries. On Linux, statically linked functions will typically be mapped at low addresses, while shared libraries will be mapped at addresses outside the range of a rel32. So conditionally jumping to a shared library function is problematic.

You also can't directly call an arbitrary 64-bit address using an immediate operand; nor can you directly jump to an arbitrary 64-bit address.

However, you can directly call the value in a register. Likewise, you can directly jump to the address in a register.

Here's how you combine the two. You do a conditional jump (hopefully a rel8, but a rel32 is fine too) to a location that moves a 64-bit value into a register. Then you jump to the value in that register. So the pseudocode would look like:

...  # condition here
jne label
...
label:
movabs $0x7ff19543d000, %rax   # full 64-bit immediate
jmp *%rax

So this lets us conditionally jump to 0x7ff19543d000 at the expense of using an extra register, %rax in this case. Usually this is OK; in the worst case, you might have to figure out how to save a register before replacing its contents. But if you're writing asm, this should already be second nature to you.