repz stosb

Towers of Hanoi in x86 Assembly

One of the things I leaned this year is x86 assembly. The way that I learned it is wasn't actually by writing it, but by reading a lot of disassembly output. This means that although I'm pretty proficient at following along in assembly listings, I'm not that good at actually writing it from scratch. I decided to try to correct that, and today I wrote my first real x86 assembly program.

I decided to implement Towers of Hanoi for a few reasons:

The last point was especially important because I wanted to write my solution without linking against libc.

Throughout this post I'll have sample listings from my solution. If you want to see the whole thing though (say, you'd like to compile it yourself) you can find it here on GitHub.

This blog is a bit long, and you may want to skip towards the end. The structure of this post follows:

Getting Started: Towers Of Hanoi In C

I decided to start my journey by first implementing Towers of Hanoi in C. I did this paying special care to write it in a way that I though would translate easily into assembly. In particular, I wanted to minimize the calls I made to libc. I ended up only using two calls: calloc() to allocate my arrays, and putchar() for printing.

I have three towers, each is represented as a struct tower allocated on the stack that's like this:

struct tower {
  int size;
  int *stack;

The size field holds the number of rings stored in that tower's stack, and the stack field is a pointer to an array holding the stack. The base of the array is the value of the width of the bottom-most ring, and moving up in memory yields the values for the successive rings.

I define three operations: peek(), pop(), and push(). They have very straightforward implementations:

// return the value of the top ring
int peek(struct tower *tower) {
  if (tower->size) {
    return tower->stack[tower->size - 1];
  return INT_MAX;

// remove the top ring and return its value
int pop(struct tower *tower) {
  int val = tower->stack[tower->size - 1];
  tower->stack[tower->size - 1] = 0;
  return val;

// push a new ring (without error checking)
void push(struct tower *tower, int val) {
  tower->stack[tower->size] = val;

I implemented an iterative solution to Towers of Hanoi, which is the most intuitive way to think about the problem. This is a good explanation of the algorithm, with diagrams. The idea is that there are three towers: the source tower, the aux tower, and the dest tower. The source tower is the one that starts out with rings. The other two towers are the aux/dest towers, and which one is which depends on if there is an even or odd number of rings.

The algorithm works by successively picking pairs of towers and then doing whatever is the valid move between those two towers. So given a pair, only one move will be legal, and the algorithm picks the legal move. That logic is implemented here by move():

// make the only valid move between a/b
void move(struct tower *a, struct tower *b) {
  if (peek(a) > peek(b)) {
    push(a, pop(b));
  } else {
    push(b, pop(a));

The algorithm itself is like:

The algorithm takes (1<<N) - 1 moves for N rings, so we maintain a counter and stop after enough moves have passed. The full code for the solve loop is like this:

int solve(int n) {
  struct tower s, a, d;
  init_tower(&s, n);
  init_tower(&a, n);
  init_tower(&d, n);

  int i;
  for (i = 0; i < n; i++) {
    push(&s, n - i);

  i = 0;
  int total_moves = (1 << n) - 1;
  for (;;) {
    print_all(&s, &a, &d, n);
    move(&s, &d);
    if (++i == total_moves) {
    print_all(&s, &a, &d, n);
    move(&s, &a);
    if (++i == total_moves) {
    print_all(&s, &a, &d, n);
    move(&a, &d);
    if (++i == total_moves) {
  print_all(&s, &a, &d, n);
  return 0;

There's one last thing needed here, the code for print_all(). I'm not going to include that here since it's boring, but the idea is you print the towers in order SDA if N is even, and in order SAD if N is odd.

Getting Started: Towers Of Hanoi In x86 Assembly

For the x86 code I implemented a similar solution. Since I don't have a lot of experience writing x86 I did things really weirdly, and have a calling convention kind of like the C one, but different.

The solve loop starts like this (with the number of rings in %rdi):

  push %rdi
  mov %rsp, %rbp

  # S tower
  sub $64, %rsp
  mov %rsp, %rdi
  call init_tower

  # A tower
  sub $64, %rsp
  mov %rsp, %rdi
  call init_tower

  # D tower
  sub $64, %rsp
  mov %rsp, %rdi
  call init_tower

So basically the value at %(rbp) will be the ring count, -64(%rbp) is the first tower, -128(%rbp) is the second tower, and -192(%rbp) is the third tower.

The init_tower method zeroes the 64 bytes starting at %rdi:

###  zero the memory in a tower
  xor %rax, %rax                # zero %rax
  movl $64, %ecx
  rep stosb

Next I need to initialize the rings in the first tower:

  # initialize S tower
  lea -64(%rbp), %rax
  mov (%rbp),%rcx
  mov %rcx, (%rax)              # set the size of the tower
  add $4, %rax
  mov %rcx, (%rax)
  add $4, %rax
  loop .init_s

This copies the ring count into the first 4 bytes of the first tower, and then for each 4 byte integer after that in descending order it stores the ring count. So let's say we're doing Towers Of Hanoi with N=3. Then the first tower will have memory initialized like {3, 3, 2, 1, 0, 0, ...}. The leading value is the number of rings, and then each value to the left is the width of the next ring.

Next I prepare to loop through all of the possible moves. I did this by storing the loop variable in %r15 and the total number of rings in %r14:

  mov (%rbp), %cl               # copy N into %cl
  mov $1, %r14                  # shift operand
  shl %cl, %r14                 # 1 << N
  dec %r14                      # now %r14 holds (1<<N)-1
  xor %r15,%r15                 # the loop variable

The core loop looks the same as it does in the C version. There's three similar blocks that each look like this:

  lea -64(%rbp), %rdi           # A tower
  lea -192(%rbp), %rsi          # D tower
  call move_tower

  inc %r15                      # increase loop counter
  cmp %r14, %r15                # compare to loop end
  jge .leave_solve              # leave if done

This one does the iteration checking S and D. The other blocks are the same except the first two lea instructions will be different to refer to S and A or A and D.

The next function is move_tower, which is the same as move() from the C code. It moves a ring from the tower with the smaller ring to the tower with the larger ring:

### rdi and rsi should be towers to move between
  ##  compare the top rings in the two towers
  mov %rdi, %r9
  call peek_tower
  mov %rax, %r10
  mov %rsi, %rdi
  call peek_tower
  mov %r9, %rdi
  cmp %rax, %r10
  jl .less_branch
  ## swap rdi and rsi
  mov %rdi, %rax
  mov %rsi, %rdi
  mov %rax, %rsi
  ## source is rdi, dest is rsi
  call pop_tower
  push %rdi
  push %rsi
  mov %rsi, %rdi
  mov %rax, %rsi
  call push_tower
  pop %rsi
  pop %rdi
  jmp print_all

Here are peek_tower, pop_tower, and push_tower. They're all very straightforward translations from the C code:

###  peek the top of the tower
  movl (%rdi),%ecx
  cmpl $0, %ecx
  jz .peek_empty
  dec %ecx
  movq 4(%rdi, %rcx, 4), %rax
  movl $100000, %eax

### like peek tower, but remove the value
  movl (%rdi),%ebx
  dec %ebx
  movl 4(%rdi, %rbx, 4), %eax   # copy the value to %rax
  movl $0, 4(%rdi, %rbx, 4)     # zero the value
  mov %ebx, (%rdi)              # decrease count

### the tower in %rdi, the val in %rsi
  movl (%rdi),%ecx
  mov %rsi, 4(%rdi, %rcx, 4)
  inc %ecx
  mov %ecx, (%rdi)

That's pretty much it. Most of the rest of the code is logic for printing out the towers, and there's too much boiler plate to include here. The core of it is this method:

  mov %rdi, %rsi
  movl $1, %eax                 # SYS_write
  movl %eax, %edi               # fd = 1
  movl %eax, %edx               # n = 1

This prints the 1-byte buffer at %rdi. To convert ring values to characters I add the value of '0', same as in the C code. Printing a tower involves calling a loop with print_one pointing to a ring and then print_one pointing to a space. After the loop I call print_one with it pointing to a newline. The same trick of checking if there is an even or odd number of rings is needed in the print_all method, and I do that by checking the value in %(rbp) which is the same as when the program started.


GDB Stuff

It actually took me quite a while to get the tower initialization and printing code up, which is the first code that I wrote. Since I didn't have the printing code working the only recourse I had for debugging was using GDB. I ended up staying in GDB for pretty much the whole writing/debugging process and learned a few things.

The first piece of advice I have is to use the layout regs mode in GDB when debugging assembly. It makes it really easy to singe-step and figure out what's changing when. I made a few early mistakes with misunderstanding which registers would be clobbered in a few situations, and the layout regs mode made it possible to solve this problem.

Another thing I found useful with GDB is inserting function labels at various places. You can put a function label "inside" another function and the code will fall through when calling the original function. This makes getting break points at specific parts of assembly routines really easy: insert a label with any name, and then set up a GDB break point at that label. The only downside I found to this is that it makes disassembly a bit strange, since if you disassemble the original function the disassembly output stops at the label you created.

I got to use the reverse debugging mode in GDB for the first time while writing this, to debug a segfault. The idea is that you record at some point before your program crashes. Then you can use reverse-stepi and reverse-stepn to backtrack from where the segfault happens. My segfault ended up being due to a misadjusted stack pointer when executing a ret instruction.

Code Lessons

I kind of made up my own calling convention as I wrote the program. It's similar to the regular x86 C calling convention but in some situations I decided to clobber extra registers when I knew the callers didn't need them. That works for a program that's really small like this, but it's not ideal. And it won't work at all if you need regular C interop.

In retrospect the program would almost certainly have been simpler had I just followed the C calling convention exactly.

I also neglected to use frame pointers, and I think that would have also made debugging simpler. Certainly the GDB output would have been more understandable.

I found the 32-bit/64-bit stuff really difficult to get right. A number of times I made mistakes where I used 64-bit values in 32-bit contexts and ran into problems. I still don't fully understand the rules on when I can use the 32-bit forms and when I have to use the 64-bit forms. This is something I need to come back to.

Next Steps

I plan to compare the code that GCC generated at optimizations levels -O0 and -O2 to my own code to see where I could improve. My initial thoughts are "probably everywhere" since even as I was writing this code I recognized a bunch of areas where I was doing redundant loads. I'm also going to start thinking of another program to do in x86, hopefully something more complicated with more memory management.