Call Stacks and Link Registers

When your computer executes code, there's an implicit call stack at any given point in time. The call stack holds the necessary information for functions to return to their caller. For instance, let's say we have this C code:

void foo(int n) {
  for (int i = 0; i < n; i++) {
    printf("i = %d\n", i);
  }
}

After each call to printf(), the routine needs to return to a specific place in foo(). With nesting of functions this gets even more complicated:

void bar(int n) {
  if (n % 2) {
    foo(n);
    puts("odd branch");
  } else (
    foo(n / 2);
    puts("even branch");
  }
}

Now printf() needs to remember how to return to foo(), and foo() has to remember how to return to two different places within bar(). And presumably there's some other function that's called bar(), and bar() needs to know how to return to that.

The obvious data structure to hold this information is a stack. The way this might work is you'd push a return address onto this stack when making a function call, and then when you wanted to return to your caller you'd pop the return value off the stack.

In fact, this is exactly how pretty much every computer architecture works. The complication, however, is that there isn't actually a dedicated stack for call information. Instead the regular program stack is used to store the return addresses. Thus these return addresses are intermingled with regular program data on the stack.

Specifically here's how it works in x86 assembly. You call a function like this:

call function_or_address           # function call

The call does two things: it pushes the the address of the next instruction onto the stack, and then it jumps to the call address. Later, when the call site is ready to return, it executes the return instruction:

ret                                # return to caller

This instruction also does two things: it pops the top of the stack, and then jumps to the address that was popped off.

Link Registers

This is all pretty reasonable so far: since function calls can have arbitrary nesting, we need to use a stack of some kind to store the function call stack data. There's not a whole lot of other options.

Actually, there's a neat trick that basically every modern architecture other than x86 implements. The technique is to use something called a "link register", henceforth called a LR. On some platforms there's a special register for this (e.g. the LR in ARM is actually called LR) while on some architectures it's done by convention (e.g. x1 is used for this purpose on RISC-V). The basic idea is that the return address for a function will be stored in this register, and then a "return" instruction will jump to the location pointed to by that register.

But wait: how does this work with function nesting? Imagine a calls b, and b calls c. In the call to c the LR will be set to some address in b. Then after the call completes the LR will have been clobbered, and b will have forgotten how to return to a.

The answer is that since the LR is indeed clobbered, in the general case you need to save the LR somehow before making another function call. Typically you would save the LR by pushing it onto the stack before the function call, and popping it after---just like we described for the x86 architecture. So we're back to where we started. Or are we?

Even with this restriction, one nice property of the LR system is that if a function calls multiple other functions we only need to save the LR once. For example, imagine this code (in a hypothetical assembly language):

example:
  push %lr                         # save the LR
  call foo                         # call foo()
  call bar                         # call bar()
  pop %lr                          # restore the LR
  ret                              # return to caller

In this example we did two function calls but only had to do one save/restore. So things are already better than x86. Pretty great, right?

Additionally, "leaf" functions that don't call other methods don't need to touch the stack at all, since all of the data they need will have been loaded directly into CPU registers.

The downside to this approach is that the explicit saves/restores of the LR mean that there are more instructions to execute. This is generally true of CISC vs RISC architectures: the CISC design leads to more compact instruction sequences, but (hopefully) the extra simplicity of the RISC architecture offsets that by allowing for faster or more efficient instructions.