One of the things I leaned this year is x86 assembly. The way that I learned it is wasn't actually by writing it, but by reading a lot of disassembly output. This means that although I'm pretty proficient at following along in assembly listings, I'm not that good at actually writing it from scratch. I decided to try to correct that, and today I wrote my first real x86 assembly program.
I decided to implement Towers of Hanoi for a few reasons:
- It's complicated enough to be interesting
- It's simple enough that I knew I'd be able to do it
- It doesn't require any complicated memory management or string formatting
The last point was especially important because I wanted to write my solution without linking against libc.
Throughout this post I'll have sample listings from my solution. If you want to see the whole thing though (say, you'd like to compile it yourself) you can find it here on GitHub.
This blog is a bit long, and you may want to skip towards the end. The structure of this post follows:
- We'll tour the C solution
- We'll see the interesting bits of the x86 solution
- I'll discuss some learnings from the experience
Getting Started: Towers Of Hanoi In C
I decided to start my journey by first implementing Towers of Hanoi in C. I did
this paying special care to write it in a way that I though would translate
easily into assembly. In particular, I wanted to minimize the calls I made to
libc. I ended up only using two calls: calloc()
to allocate my arrays, and
putchar()
for printing.
I have three towers, each is represented as a struct tower
allocated on the
stack that's like this:
struct tower {
int size;
int *stack;
};
The size
field holds the number of rings stored in that tower's stack, and the
stack
field is a pointer to an array holding the stack. The base of the array
is the value of the width of the bottom-most ring, and moving up in memory
yields the values for the successive rings.
I define three operations: peek()
, pop()
, and push()
. They have very
straightforward implementations:
// return the value of the top ring
int peek(struct tower *tower) {
if (tower->size) {
return tower->stack[tower->size - 1];
}
return INT_MAX;
}
// remove the top ring and return its value
int pop(struct tower *tower) {
assert(tower->size);
int val = tower->stack[tower->size - 1];
tower->stack[tower->size - 1] = 0;
tower->size--;
return val;
}
// push a new ring (without error checking)
void push(struct tower *tower, int val) {
tower->stack[tower->size] = val;
tower->size++;
}
I implemented an iterative solution to Towers of Hanoi, which is the most intuitive way to think about the problem. This is a good explanation of the algorithm, with diagrams. The idea is that there are three towers: the source tower, the aux tower, and the dest tower. The source tower is the one that starts out with rings. The other two towers are the aux/dest towers, and which one is which depends on if there is an even or odd number of rings.
The algorithm works by successively picking pairs of towers and then doing
whatever is the valid move between those two towers. So given a pair, only one
move will be legal, and the algorithm picks the legal move. That logic is
implemented here by move()
:
// make the only valid move between a/b
void move(struct tower *a, struct tower *b) {
if (peek(a) > peek(b)) {
push(a, pop(b));
} else {
push(b, pop(a));
}
}
The algorithm itself is like:
- make the valid move between S and D
- make the valid move between S and A
- make the valid move between A and D
- repeat...
The algorithm takes (1<<N) - 1
moves for N
rings, so we maintain a counter
and stop after enough moves have passed. The full code for the solve loop is
like this:
int solve(int n) {
struct tower s, a, d;
init_tower(&s, n);
init_tower(&a, n);
init_tower(&d, n);
int i;
for (i = 0; i < n; i++) {
push(&s, n - i);
}
i = 0;
int total_moves = (1 << n) - 1;
for (;;) {
print_all(&s, &a, &d, n);
move(&s, &d);
if (++i == total_moves) {
break;
}
print_all(&s, &a, &d, n);
move(&s, &a);
if (++i == total_moves) {
break;
}
print_all(&s, &a, &d, n);
move(&a, &d);
if (++i == total_moves) {
break;
}
}
print_all(&s, &a, &d, n);
return 0;
}
There's one last thing needed here, the code for print_all()
. I'm not going to
include that here since it's boring, but the idea is you print the towers in
order SDA if N is even, and in order SAD if N is odd.
Getting Started: Towers Of Hanoi In x86 Assembly
For the x86 code I implemented a similar solution. Since I don't have a lot of experience writing x86 I did things really weirdly, and have a calling convention kind of like the C one, but different.
The solve loop starts like this (with the number of rings in %rdi
):
solve:
push %rdi
mov %rsp, %rbp
# S tower
sub $64, %rsp
mov %rsp, %rdi
call init_tower
# A tower
sub $64, %rsp
mov %rsp, %rdi
call init_tower
# D tower
sub $64, %rsp
mov %rsp, %rdi
call init_tower
So basically the value at %(rbp)
will be the ring count, -64(%rbp)
is the
first tower, -128(%rbp)
is the second tower, and -192(%rbp)
is the third
tower.
The init_tower
method zeroes the 64 bytes starting at %rdi
:
### zero the memory in a tower
init_tower:
xor %rax, %rax # zero %rax
movl $64, %ecx
rep stosb
ret
Next I need to initialize the rings in the first tower:
# initialize S tower
lea -64(%rbp), %rax
mov (%rbp),%rcx
mov %rcx, (%rax) # set the size of the tower
add $4, %rax
.init_s:
mov %rcx, (%rax)
add $4, %rax
loop .init_s
This copies the ring count into the first 4 bytes of the first tower, and then
for each 4 byte integer after that in descending order it stores the ring count.
So let's say we're doing Towers Of Hanoi with N=3
. Then the first tower will
have memory initialized like {3, 3, 2, 1, 0, 0, ...}
. The leading value is the
number of rings, and then each value to the left is the width of the next ring.
Next I prepare to loop through all of the possible moves. I did this by storing
the loop variable in %r15
and the total number of rings in %r14
:
mov (%rbp), %cl # copy N into %cl
mov $1, %r14 # shift operand
shl %cl, %r14 # 1 << N
dec %r14 # now %r14 holds (1<<N)-1
xor %r15,%r15 # the loop variable
The core loop looks the same as it does in the C version. There's three similar blocks that each look like this:
.solve_loop:
lea -64(%rbp), %rdi # A tower
lea -192(%rbp), %rsi # D tower
call move_tower
inc %r15 # increase loop counter
cmp %r14, %r15 # compare to loop end
jge .leave_solve # leave if done
This one does the iteration checking S and D. The other blocks are the
same except the first two lea
instructions will be different to refer to S
and A or A and D.
The next function is move_tower
, which is the same as move()
from the C
code. It moves a ring from the tower with the smaller ring to the tower with the
larger ring:
### rdi and rsi should be towers to move between
move_tower:
## compare the top rings in the two towers
mov %rdi, %r9
call peek_tower
mov %rax, %r10
mov %rsi, %rdi
call peek_tower
mov %r9, %rdi
cmp %rax, %r10
jl .less_branch
.greater_branch:
## swap rdi and rsi
mov %rdi, %rax
mov %rsi, %rdi
mov %rax, %rsi
.less_branch:
## source is rdi, dest is rsi
call pop_tower
push %rdi
push %rsi
mov %rsi, %rdi
mov %rax, %rsi
call push_tower
pop %rsi
pop %rdi
jmp print_all
Here are peek_tower
, pop_tower
, and push_tower
. They're all very
straightforward translations from the C code:
### peek the top of the tower
peek_tower:
movl (%rdi),%ecx
cmpl $0, %ecx
jz .peek_empty
dec %ecx
movq 4(%rdi, %rcx, 4), %rax
ret
.peek_empty:
movl $100000, %eax
ret
### like peek tower, but remove the value
pop_tower:
movl (%rdi),%ebx
dec %ebx
movl 4(%rdi, %rbx, 4), %eax # copy the value to %rax
movl $0, 4(%rdi, %rbx, 4) # zero the value
mov %ebx, (%rdi) # decrease count
ret
### the tower in %rdi, the val in %rsi
push_tower:
movl (%rdi),%ecx
mov %rsi, 4(%rdi, %rcx, 4)
inc %ecx
mov %ecx, (%rdi)
ret
That's pretty much it. Most of the rest of the code is logic for printing out the towers, and there's too much boiler plate to include here. The core of it is this method:
print_one:
mov %rdi, %rsi
movl $1, %eax # SYS_write
movl %eax, %edi # fd = 1
movl %eax, %edx # n = 1
syscall
ret
This prints the 1-byte buffer at %rdi
. To convert ring values to characters I
add the value of '0'
, same as in the C code. Printing a tower involves calling
a loop with print_one
pointing to a ring and then print_one
pointing to a
space. After the loop I call print_one
with it pointing to a newline. The same
trick of checking if there is an even or odd number of rings is needed in the
print_all
method, and I do that by checking the value in %(rbp)
which is the
same as when the program started.
Learnings
GDB Stuff
It actually took me quite a while to get the tower initialization and printing code up, which is the first code that I wrote. Since I didn't have the printing code working the only recourse I had for debugging was using GDB. I ended up staying in GDB for pretty much the whole writing/debugging process and learned a few things.
The first piece of advice I have is to use the layout regs
mode in GDB when
debugging assembly. It makes it really easy to singe-step and figure out what's
changing when. I made a few early mistakes with misunderstanding which registers
would be clobbered in a few situations, and the layout regs
mode made it
possible to solve this problem.
Another thing I found useful with GDB is inserting function labels at various places. You can put a function label "inside" another function and the code will fall through when calling the original function. This makes getting break points at specific parts of assembly routines really easy: insert a label with any name, and then set up a GDB break point at that label. The only downside I found to this is that it makes disassembly a bit strange, since if you disassemble the original function the disassembly output stops at the label you created.
I got to use the reverse debugging mode in GDB for the first time while writing
this, to debug a segfault. The idea is that you record
at some point before
your program crashes. Then you can use reverse-stepi
and reverse-stepn
to
backtrack from where the segfault happens. My segfault ended up being due to a
misadjusted stack pointer when executing a ret
instruction.
Code Lessons
I kind of made up my own calling convention as I wrote the program. It's similar to the regular x86 C calling convention but in some situations I decided to clobber extra registers when I knew the callers didn't need them. That works for a program that's really small like this, but it's not ideal. And it won't work at all if you need regular C interop.
In retrospect the program would almost certainly have been simpler had I just followed the C calling convention exactly.
I also neglected to use frame pointers, and I think that would have also made debugging simpler. Certainly the GDB output would have been more understandable.
I found the 32-bit/64-bit stuff really difficult to get right. A number of times I made mistakes where I used 64-bit values in 32-bit contexts and ran into problems. I still don't fully understand the rules on when I can use the 32-bit forms and when I have to use the 64-bit forms. This is something I need to come back to.
Next Steps
I plan to compare the code that GCC generated at optimizations levels -O0
and
-O2
to my own code to see where I could improve. My initial thoughts are
"probably everywhere" since even as I was writing this code I recognized a bunch
of areas where I was doing redundant loads. I'm also going to start thinking of
another program to do in x86, hopefully something more complicated with more
memory management.