All About Linkers

February 15, 2017

In this post I want to explain what linkers and loaders do, and why languages like C and C++ have a linking phase. We'll also see how the linking process works in Go, Java, and Rust.

Motivation

Compiling code is an extremely CPU-intensive task. Even very modest C or C++ projects can take a few seconds to compile, and big projects take much longer. Compiling code is slow because modern optimizing compilers can do extremely high quality code analysis, code generation, and register allocation, and all of that takes a lot of CPU time.

However, when iterating on a project, typically very little code actually changes from one compilation to another:

Third-party libraries change very rarely, if ever.
For any given change, usually only a small number of files in the project are affected, even though the project may have a large total number of files.

Both points are important. Almost all C and C++ programs need to use the functionality provided by libc, which offers routines like printf(), access to system calls, etc. GNU libc (glibc) is over a million lines of code. If compiling "Hello, world" required the compiler to actually look at all of the source code included by a program then compiling "Hello, world" would actually require building the million lines of code in libc. That would be incredibly slow and wasteful---it would make the simplest of programs take an hour to build. Google Chrome clocks in at about fifteen million lines of code. If you're working on a project like this you don't want to have to recompile all fifteen million lines of code just because you changed a single file.

The solution to this problem is to come up with an intermediate representation for "mostly compiled" code. This way you can precompile libraries like glibc, so that building "Hello, world" no longer requires compiling a million lines of code. You can use this same technique for files within a project. This means that when you're working on a project like Google Chrome, you do the slow CPU-intensive task of compiling all fifteen million lines of code once. Subsequently, when making a small change, the compiler will only need to recompile the changed files (and possibly things that depend on them). To tie this all together we'd an efficient system that can combine all of the precompiled code with the code from the project into a single output executable.

This is exactly how [linkers](https://en.wikipedia.org/wiki/Linker_(computing)) and object files work. When the compiler translates source code into native code, what it will actually do is generate what's called an object file. When you change a source file, typically only the object file for that source file needs to be updated. To generate an executable with real, fully compiled code, the linker gathers all of the object files and uses them to generate an executable. The most common way of using third-party libraries is via shared libaries, which are actually a special type of object file. In the sections below, we'll examine some of these aspects in more detail.

Object Files

Object files contain native code (e.g. actual x86 machine code) in a "mostly compiled" format. The main reason the code is "mostly compiled" instead of "fully compiled" is that real executable machine code requires absolute memory addresses for function calls and global variables, and those absolute memory addresses aren't available in object files.

On the x86 architecture (including x86-64), function calls are made by specifying the 32-bit delta between the caller and the callee. For instance, suppose the current instruction is located at memory address 0x57a172, and we want to call another function located at memory address 0x577e50. With two's complement encoding, the 32-bit delta between these addresses is represented as the negative number 0xd9dcffff. The CALL instruction is encoded using the prefix 0xe8 followed by this delta, so the fully encoded instruction is 0xe8d9dcffff.

Other architectures may use absolute addressing, but the same problem still applies: to make a function call to an absolute address, we need to actually know what that address is. Because an object file just contains code for a fragment of the program, it won't have that information available until the link phase. Thus object files typically have slots for things like function calls and the addresses of global variables.

When the link phase occurs, the linker will take all of the object files, combine them, and then fill in these placeholder slots after all of the memory addresses of global symbols have been determined. It's worth noting here that this process can be very demanding on memory, particularly for C++ projects that make excessive use of templates. This is because linking a program requires loading all of the code into memory, plus a lot of intermediate data structures to hold the layout of objects. C++ templates are a form of code generation, and thus when using them it's possible to explode the amount of machine code the compiler is generating. Really big C++ projects (e.g. Google Chrome, LibreOffice) frequently require multiple gigabytes of memory to link. One good aspect of this design, however, is that this memory intensive phase is deferred to just the linking step. This means that the compiler uses a more modest amount of memory when compiling individual translation units.

Shared Libraries and the Dynamic Loader

Usually libraries are linked in as "shared objects". These are a lot like the regular object files just described, except they're [relocatable](https://en.wikipedia.org/wiki/Relocation_(computing)) in memory. This means that special machine code is generated which adds an extra indirection layer for accessing functions and global variables. The indirection makes it possible to link code that doesn't have absolute addresses available. This is how pretty much all applications link against libc, and usually other libraries as well.

The low-level implementation details of how this work are outside the scope for this post, but I want to paint the broad strokes here. As already described, to actually make a function call on x86 you need to know the absolute memory address of the function you're calling. Modern systems use something called address space layout randomization (ASLR) to randomize the location that shared libraries are loaded in memory. This is done for security, to prevent return oriented programming (ROP) security attacks. The idea here is that by loading shared libraries at a randomized location in memory, it becomes significantly more difficult (usually impossible) for the attacker to use a buffer overflow attack to make a nefarious return to a library like libc, e.g. to access files on the filesystem or gain additional privileges. The issue is that when ASLR is in effect, the compiler and linker don't know what the address of functions like printf() will be until the program is actually loaded and run, since these addresses will be randomized every time the program is loaded.

This problem is avoided by generating position-independent code (PIC). Here's how it usually works. When calling a method like printf(), the compiler actually generates code that looks up the address of printf() in a table in memory that holds the true address for the function. It would be slow if every function call had to go through a table lookup, so the way the compiler implements this is by using a technique similar to self-modifying code. The first time printf() is called, the code will call a function that looks up another jump in a table, using a fixed offset for that function. Initially the table is initialized with entries that always jump to stub loader. The stub loader looks up the true address of printf(), and then patches the jump table with the true offset to printf(). Thus the stub loader is only called once. This means that the first time a function is called this way it takes a little longer to run, but subsequently the jump table is correctly initialized and the cost of invoking the function is just one extra jump instruction. There are some linker options (e.g. ld -z now) that change this behavior a bit, as described further below. I just explained this at a very, very high level; if you want more details, I encourage you to read Eli Bendersky's excellent article about this subject, which goes into a lot more detail.

The lookup tables are loaded into memory by a special program called the dynamic linker. This program is also called a [loader](https://en.wikipedia.org/wiki/Loader_(computing)), since it loads programs. The dynamic linker special program that the kernel runs when a new executable is launched. The dynamic linker takes care of mapping shared libraries into memory at randomized locations, and bootstrapping the function lookup tables.

If this sounds interesting to you and you'd like to learn more of the details, I highly recommend reading Eli Bendersky's linkers and loaders series.

Header Files and Includes

In C and C++ programs you normally use the #include preprocessor directive to include header files for libraries. For instance, to use printf() you need a line like:

#include <stdio.h>

What's interesting about this is that unlike most other languages, the code in <stdio.h> does not have the implementation of printf(). Instead it just has the definition for the prototype of printf(). This is just enough information for the compiler to do a sanity check that all of the variables and functions you are using are actually defined somewhere, and to do type checking. The compiler assumes that what's in the header files matches what's in the actual object files you intend to link together.

The reason things work this way is that if <stdio.h> actually had the code for printf(), that code would be included in the translation unit for the file including it. This would cause printf() to be defined multiple times, which would result in a linker error.

Bad things will happen if the header files don't match what's actually in the object files. The two most common problems that I run into are that a particular function is defined by a header file, but for some reason isn't actually available in the object files; or that a given method or variable is accidentally defined in two or more object files, as in the printf() example above. Both of these situations will result in cryptic linker errors (especially if you're using C++ which uses name mangling).

There are other more obscure errors that can also happen when header files and object files are inconsistent. For instance, imagine you compiled a method foo() that accepts two parameters, but the header file says it only takes one parameter. The compiler does type checking purely based on what's in the header files, so it will insist that you call foo() with just one parameter. Then when the program actually runs it will be linked against a version of foo() that actually expects two parameters. In this situation the most likely outcome is that foo() accesses its second parameter out of a register that isn't properly initialized, causing the program to do something unpredictable.

Normally these kinds of problems arise when projects don't have their build systems properly configured. For this reason, I strongly encourage using a real build system like autoconf/automake or CMake for any non-trivial C or C++ projects. For very simple projects you can hand write your own Makefiles, but doing so for anything with more than a few source files tends to be error prone.

Modern Linker Advancements

Compilers are constantly evolving new ways to generate more optimized code. But the linker's job is much more prosaic: the linker mostly exists to glue together object files in the final phase of code generation. So you might think that linkers haven't changed a lot, or don't have any interesting, new technology in them. This is sort of true, and sort of not true. I want to highlight a few "recent" advancements in linkers to give you an idea of the state of the field.

Linker Hardening

In recent years linkers have gotten a few hardening options. The two important ones you'll use on a GNU/Linux system are ld -z relro, possibly in conjunction with ld -z now for extra hardened executables. Debian has an interesting post on these hardening features that can be used when building Debian packages.

The basic idea of these two options is that relocation data needs to be initially mapped as writable when a program launches. However, after the program starts some or all of that data can be remapped as read only. Without -z relro an attacker could potentially exploit something like a buffer overflow to remap entries in the relocation tables. By using -z relro the linker will remap all of the non-dynamic relocation sections as read only.

The -z now flag causes the loader to fully resolve all symbols when the program is started. When used in conjunction with -z relro this will cause the entire relocation table to be mapped read-only, which improves things even further. Note that there is a small runtime penalty to using this flag, since it makes process startup slower.

Link-Time Optimization

Traditionally compilers for C and C++ can only do code optimizations for code within a single [translation unit](https://en.wikipedia.org/wiki/Translation_unit_(programming)) (also called a "compilation unit"). A translation unit is basically all of the code that maps to a single object file, so usually this would be a .c file plus anything from other source files it includes.

A consequence of this is that traditional compilers that use object files cannot do "whole program optimization", since each translation unit is optimized independently. The most common example here of an optimization that might be missed this way is inlining: function calls that cross object file boundaries normally cannot be inlined. This is something that link-time optimization fixes.

When using the -flto flag with GCC, the compiler will include the full intermediate representation (IR) for each object file in a special section within the object file. Then during the final linking phase, the compiler and linker can use the IR in all of the object files to identify additional sites where inlining would be helpful. There are a few other optimizations available, such as more advanced dead code elimination than would be possible by just examining each translation unit independently.

This is a really promising area for new optimizations, and in my opinion is one of the most exciting features to make it into GCC in the last few years.

Other Languages

Other languages with native code use linkers of some kinds. I'm going to compare and contrast Go (which uses a more traditional linker design) with Java (which has a non-traditional linker).

Go

Everything described so far also applies to Go---it generates object files and has a linking phase. However, Go does this all automatically so you don't need to know about header files or worry about the details or your build system or linker options. Most Go programmers probably are not even aware that Go has a linker, but it does!

If you look at $GOPATH/pkg/ you'll find the static object archives generated by Go when it compiles code. You should see a lot of .a files in this directory hierarchy:

find $GOPATH/pkg -type f -name '*.a'

One interesting thing about Go is that it has its own linker and assembler, descended from the Plan 9 toolchain. Russ Cox has some interesting commentary on Hacker News about why Go has its own linker instead of using one of the standard ones like GNU ld or [gold](https://en.wikipedia.org/wiki/Gold_(linker)).

Java

Java does not generate traditional object files and does not have a traditional linker, but under the hood it's doing the same thing. Java projects get compiled into JVM bytecode, packaged in the Java archive (jar) format. When the JIT'ing code, the JVM needs to do the same tasks as a compiler and linker: it has to put real executable machine code in memory with valid jumps and call instructions.

Traditional compiled languages like C and C++ put as much of this logic as possible in the regular linker, and defer the small amount of remaining work to the dynamic loader. The JVM does all of these linking tasks at runtime. This is one reason why Java processes take longer to start up than compiled C or C++ programs.

Recent Android releases use a technique they call Android runtime (ART) which allows the bytecode for Android applications to be compiled ahead of time, just like C, C++, and Go programs. This is literally done by a real compiler and linker built into Android, based on LLVM. This makes Android "Java" applications work much like traditional C or C++ programs.

Rust

Rust doesn't use traditional ELF object files, but instead has a functionally equivalent concept called a crate. However, the Rust compiler can be asked to emit a traditional object file like so:

rustc --emit obj foo.rs

This will cause the Rust compiler to emit a regular object file called foo.o. This object file can then be linked with a regular linker. This makes it easy to call Rust code from native C code, since the Rust code can be compiled into an object format that a C compiler/linker can understand.