A Review Of The Intel/x86 KPTI Fiasco

In the last week, a lot of attention has been drawn to the KPTI (formerly called KAISER) patches being integrated into the Linux kernel, and a number of prominent articles about them have been circulating.

This is a very technical issue, as it involves low-level details of page tables, out-of-order execution pipelines, and the translation lookaside buffer (TLB). I want to attempt to explain the issue (as I understand it) in terms that are accessible to programmers who aren't already well versed in these topics.

How Virtual Memory Works on x86

The most important safety/abstraction feature provided by modern CPUs and operating systems is virtual memory. The basic idea of virtual memory is that the memory of every process is isolated from the memory of other processes. This means that an errant program can crash itself, but can't corrupt kernel memory or the memory of another running process.

This section will introduce some background material about how virtual memory is actually implemented. After covering this material, I'll explain how it's related to the suspected Intel bug.

Page Tables

To implement virtual memory, the kernel maintains a map from virtual memory address ranges to physical addresses. A virtual memory address is a word-sized value, so 64 bits on an x86-64 system. The physical address is the actual address in the physical RAM chips on the system. This mapping is stored in a data structure called a page table. The page table is arranged as a shallow tree that can be traversed to get a PTE ("page table entry"). A PTE says something like "the 4096 bytes of memory starting at 0x7fa22145e000 should be mapped with read-only permissions to physical address 0x59658ea0".
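
To make this concrete, here's a minimal sketch in C of the kind of information a single x86-64 PTE encodes. The bit positions follow the architecture manuals, but the macro and function names are my own, not anything out of the kernel:

#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT    (1ULL << 0)            /* mapping is valid */
#define PTE_WRITABLE   (1ULL << 1)            /* writes are allowed */
#define PTE_USER       (1ULL << 2)            /* accessible from user mode */
#define PTE_FRAME_MASK 0x000ffffffffff000ULL  /* bits 12..51: physical frame */

/* Translate a virtual address to a physical address using one leaf PTE. */
static uint64_t pte_translate(uint64_t pte, uint64_t vaddr, bool want_write)
{
    if (!(pte & PTE_PRESENT))
        return 0;  /* not mapped: the MMU would raise a page fault */
    if (want_write && !(pte & PTE_WRITABLE))
        return 0;  /* read-only mapping, like the example above */
    return (pte & PTE_FRAME_MASK) | (vaddr & 0xfff);  /* frame + offset in page */
}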

A typical read instruction in x86 will look like this (AT&T syntax):

;; copy the data pointed to by %rax into %rdi
mov (%rax), %rdi

In this case, the %rax register holds a memory address like 0x7fa22145e080, and the instruction says to copy the data held at that memory address into another register, %rdi in this case. To actually execute this instruction, the CPU needs to translate the memory address into a physical address. To do this, it walks the page table to find the translation from 0x7fa22145e080 to the right physical address, and also checks the permissions on the mapping (i.e. to ensure the mapping has read permissions). The logic that walks the page table is actually built into the CPU itself (not the kernel), in a special hardware unit called the MMU (memory management unit). The x86 architecture manuals from Intel and AMD explain how the page table needs to be arranged and encoded in memory. The kernel makes sure the page table is set up the way the MMU expects, and then the MMU itself has silicon to do page table traversals.
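
Here's a rough sketch in C of the four-level walk the MMU does in hardware on today's CPUs. This is a simplified model, not real kernel code: phys_to_virt() is a hypothetical helper standing in for however the walker reads physical memory, and large pages, accessed/dirty bits, and fault handling are all ignored:

#include <stdint.h>

#define PTE_PRESENT    (1ULL << 0)
#define PTE_FRAME_MASK 0x000ffffffffff000ULL

/* Hypothetical helper: gives us a readable pointer to a physical page. */
extern uint64_t *phys_to_virt(uint64_t phys);

static uint64_t walk_page_table(uint64_t cr3, uint64_t vaddr)
{
    uint64_t table = cr3 & PTE_FRAME_MASK;     /* CR3 points at the top-level table */
    int shift[] = { 39, 30, 21, 12 };          /* PML4, PDPT, PD, PT index positions */

    for (int level = 0; level < 4; level++) {
        uint64_t index = (vaddr >> shift[level]) & 0x1ff;  /* 9 bits per level */
        uint64_t entry = phys_to_virt(table)[index];
        if (!(entry & PTE_PRESENT))
            return 0;                          /* unmapped: would be a page fault */
        table = entry & PTE_FRAME_MASK;
    }
    return table | (vaddr & 0xfff);            /* physical frame + offset within page */
}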

The Translation Lookaside Buffer (TLB)

Modern x86 page tables are up to five levels deep, meaning that a virtual to physical translation requires walking a tree of height four or five. This means quite literally that if you want to read one memory address, first you need to do several other memory accesses just to walk the page table. If that's all there was to it, reading from memory would carry roughly a 5x overhead, which would be abysmal. There's an easy solution to this problem though: caching! CPUs have a special hardware unit in the MMU called a TLB (translation lookaside buffer). The TLB caches results from the page table using a special type of computer memory called CAM (content-addressable memory) that effectively implements a hash map in silicon, making TLB lookups ultra fast. The result is that a page table traversal only happens on a TLB miss.
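
Here's a toy model of the idea, with the TLB as a tiny direct-mapped cache from virtual page number to physical frame number. Real TLBs are set-associative CAM structures that also track permissions, and every name below is made up, but the lookup-before-walk logic is the important part:

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64

struct tlb_entry { uint64_t vpn; uint64_t pfn; bool valid; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Assumed slow path: the multi-level page table walk sketched earlier. */
extern uint64_t walk_page_table(uint64_t cr3, uint64_t vaddr);

static uint64_t translate(uint64_t cr3, uint64_t vaddr)
{
    uint64_t vpn = vaddr >> 12;
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

    if (e->valid && e->vpn == vpn)                 /* TLB hit: no walk needed */
        return (e->pfn << 12) | (vaddr & 0xfff);

    uint64_t paddr = walk_page_table(cr3, vaddr);  /* TLB miss: walk, then cache */
    e->vpn = vpn;
    e->pfn = paddr >> 12;
    e->valid = true;
    return paddr;
}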

Context Switches and TLB Flushes

On a system with virtual memory, processes are supposed to be fully independent. This means that two processes can use the same virtual memory addresses. As an example, on a 64-bit Linux system, if you compile a regular (non-PIE) C program you'll typically see that main() ends up at or very near memory address 0x400000. This means that if you're running multiple processes on a host, there's a good chance that many of them map 0x400000 as part of the "text" area of the process. If these processes are running different executables, the virtual memory address 0x400000 will be mapped to different physical addresses for different processes.
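
You can check this yourself with a couple of lines of C, assuming a non-PIE build (e.g. gcc -no-pie addr.c); with a PIE build plus ASLR, the address will move around between runs instead:

#include <stdio.h>

int main(void)
{
    /* On a typical non-PIE x86-64 Linux build this prints an address near 0x400000. */
    printf("main is at %p\n", (void *)main);
    return 0;
}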

Since multiple processes can map the same virtual memory addresses, there can't just be a single page table for the entire CPU. Instead, each process gets its own, fully independent page table. That way two processes can map the same virtual memory address to different physical addresses. On x86 this is implemented with a special register (CR3) that points to the active page table for the processor. If the kernel wants to switch between process A and process B, it has to interrupt process A, flush the TLB, and then schedule process B. The TLB flush is required to ensure that there's no residual TLB state from process A once process B is scheduled. If a TLB flush did not happen, then process B might be able to read or write memory that belongs to process A. In general, when you switch from one user process to another you have to do a TLB flush.
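
On x86-64, the switch itself boils down to writing the physical address of the new page table into CR3, and that write flushes the non-global TLB entries as a side effect (ignoring the PCID feature, which complicates the picture). A minimal kernel-mode sketch, with a made-up function name:

#include <stdint.h>

/* Load a new top-level page table. Writing CR3 implicitly flushes the
   non-global TLB entries, which is exactly the flush described above. */
static inline void load_page_table(uint64_t pgd_phys)
{
    __asm__ volatile("mov %0, %%cr3" : : "r"(pgd_phys) : "memory");
}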

Minimizing TLB Flushes During Context Switches

Flushing the TLB is expensive, because it means that subsequent memory accesses will need to do the whole complicated page table walking thing. Therefore you only want to do a TLB flush when absolutely necessary.

As just described, when switching from process A to process B, a TLB flush is always necessary. But what if we're just switching back and forth from a single process to the kernel? Can a user process share the TLB with the kernel, or does a user to kernel context switch also require a TLB flush?

As a motivating example, consider a dedicated MySQL server. On a dedicated host, the MySQL daemon process will be the userspace program running 99% of the time. A database makes a huge number of system calls to read data from disk, and each system call is a switch from user mode into the kernel and back. On a system like this, there will be a huge performance win if the MySQL daemon can share the TLB with the kernel.

The x86-64 Solution

There are a few different ways to make user/kernel context switches not require a TLB flush. For instance, we could have two TLBs: one for the kernel, and one for userspace. Another way to do this would be to set a flag so the TLB knows if it's in kernel mode or user mode, and then the TLB would cross-check that flag against the cached entry on a hit. But the simplest way of all would be to just divide up the virtual memory space into a kernel region and a user region.

The exact way this works is architecture dependent, but x86 systems all do some variation of the last solution, meaning they divide up the memory space. I'll consider legacy 32-bit x86 systems first, since the problem is more pronounced there. Legacy systems often reserve the lower 2 GB of the virtual address space (0 to 0x7fffffff) for userspace, and the upper 2 GB (0x80000000 to 0xffffffff) for the kernel. This works OK until you purchase a system with 4 GB of memory, and then you find out that your userspace processes can still only map a measly 2 GB of memory. Later systems changed this to a 3 GB/1 GB split, which is a little better (but not great). Later generation 32-bit Intel systems got a weird extension called PAE, which adds more hacks to cram in enough physical address bits to use more than 4 GB of RAM on a 32-bit system. The whole thing is kind of a nightmare though: it's complicated, and it tries to force 32-bit systems to do something they just weren't designed to do.

Other 32-bit embedded architectures saw how bad this problem was and redesigned their MMUs as a result. But on x86, everyone moved to 64-bit CPUs before a redesign of the MMU around separate user/kernel tables became necessary. That's because 64 bits is enough to address 16 exabytes of memory. As a refresher, the scale goes: kilobyte, megabyte, gigabyte, terabyte, petabyte, exabyte. You'll be hard pressed to find a server with 1 TB of RAM in it today, much less 1 PB or 1 EB. x86-64 systems take advantage of this and split up the huge virtual memory space into different regions. Currently x86-64 CPUs only recognize 48 bits of a virtual memory address and 52 bits of a physical memory address.

Operating systems like Linux take advantage of this to split up virtual memory regions. The Linux kernel documentation describes the exact x86-64 memory map. The details aren't super important: the main thing to notice is that memory addresses that start with 0xffff are for the kernel, and memory addresses that start with 0x0000 are for userspace. This means that on x86-64 systems, a TLB flush is not needed during a user to kernel context switch, because there's no ambiguity between userspace virtual memory addresses and kernel virtual memory addresses. This is basically a new spin on the 2 GB/2 GB or 3 GB/1 GB user/kernel split from 32-bit architectures, except it actually kind of works, since a 64-bit address space is so big.
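
With 48-bit virtual addresses, that split looks roughly like this (a sketch; the helper names are my own): the low canonical half belongs to userspace, the high canonical half to the kernel, and everything in between is a non-canonical hole that faults on access:

#include <stdbool.h>
#include <stdint.h>

/* With 4-level (48-bit) paging, bits 48..63 must be copies of bit 47. */
static bool is_user_address(uint64_t addr)
{
    return addr < 0x0000800000000000ULL;   /* 0x0000000000000000 .. 0x00007fffffffffff */
}

static bool is_kernel_address(uint64_t addr)
{
    return addr >= 0xffff800000000000ULL;  /* 0xffff800000000000 .. 0xffffffffffffffff */
}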

The Current Intel x86 Bug

The KPTI patches that are being worked on make two major changes. The first is to maintain separate page tables for kernel mappings and user mappings, instead of one shared set. The second is to force a TLB flush when switching between user mode and kernel mode. These changes suggest that there's some kind of problem with user/kernel TLB isolation, implying that there's a way for userspace to learn something from residual kernel TLB entries. Additionally, the original KAISER paper was about bypassing KASLR, so that is a known attack vector. I've previously written about ASLR; KASLR is just ASLR applied to kernel memory.
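
As a very rough sketch of what the first change means in practice (heavily simplified, with hypothetical names; the real patches use tricks like PCID to soften the TLB cost): each process carries two top-level page tables, and every transition between user mode and kernel mode swaps CR3:

#include <stdint.h>

struct mm {
    uint64_t kernel_pgd;   /* maps kernel and user memory; active while in the kernel */
    uint64_t user_pgd;     /* maps user memory plus a tiny kernel entry trampoline */
};

/* The CR3-writing helper sketched earlier. */
extern void load_page_table(uint64_t pgd_phys);

static void on_kernel_entry(struct mm *mm)   { load_page_table(mm->kernel_pgd); }
static void on_return_to_user(struct mm *mm) { load_page_table(mm->user_pgd); }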

I'm going to switch gears a bit, and speculate about what this likely means from a real world perspective. Many people have suggested that the bug is related to a problem discovered by Anders Fogh, outlined in a blog post of his. This speculation is based on the fact that Anders Fogh is cited in the original KAISER paper that led to the development of the KPTI patches. The details are complicated, and Anders only gives an outline of the problem in his post, but here's how I understand the flaw. Based on instruction timing, it's possible from userspace to figure out if kernel address X is mapped in the TLB. Once you've found such an address, it may be possible to do some things with out-of-order speculative instructions to trick the kernel into reading address X for you and loading it into one of your registers.
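
I don't know the exact technique, but the basic building block behind this kind of timing side channel is easy to show: use the CPU's timestamp counter to measure how long a single memory access takes, and use the latency to infer whether the translation or data was already cached. A minimal sketch (not Anders Fogh's actual code):

#include <stdint.h>
#include <x86intrin.h>

/* Returns the TSC cycles taken by one read of *p. A fast result suggests the
   address was already warm in the caches/TLB; a slow one suggests it was not. */
static uint64_t time_access(const volatile uint8_t *p)
{
    unsigned int aux;
    uint64_t start = __rdtscp(&aux);
    (void)*p;                        /* the access being timed */
    uint64_t end = __rdtscp(&aux);
    return end - start;
}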

Impact Of Reading Kernel Memory

If Anders' post is right, at a minimum it means that you would be able to read arbitrary kernel memory mapped in the TLB. Since you can partially control what code the kernel runs by making system calls, you could potentially force the kernel to load various addresses into the TLB. This might mean you could read most (or all) of kernel memory.

Reading arbitrary kernel memory is a pretty big deal, because it becomes an attack vector to read disk or network buffers that you normally don't have access to. For instance, you might be able to read private keys or access credentials from files that are supposed to only be readable by root. Maybe you could read someone's Bitcoin or SSH private keys. There are a lot of possibilities.

Row Hammer

Back in 2014, a paper was published by CMU researchers describing a problem called "row hammer". The basic idea is that the extreme storage density of modern DRAM chips has led to a problem where, in some cases, repeatedly accessing carefully chosen memory addresses in a tight loop can cause bit flips in physically adjacent memory cells. This was first successfully exploited by Project Zero researchers in 2015. The Project Zero attack worked by using row hammer to flip page table permission bits. The basic idea is that you first figure out where your own page table is loaded into memory, and then try to flip PTE bits using row hammer. A successful attacking process can flip the write permission bit on its own page table. Then the process can update its own page table to give itself unlimited read/write access to the entire system. That's game over, because once you can write to arbitrary memory addresses you have full control over the kernel.
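
The core "hammer" loop from the original paper is tiny. Here it is sketched in C: repeatedly read two addresses that sit in different rows of the same DRAM bank, flushing them from the cache each time so every iteration really activates the DRAM rows. Picking a pair of addresses that actually share a bank is the hard part and is omitted here:

#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush */

static void hammer(volatile uint8_t *x, volatile uint8_t *y, long iterations)
{
    for (long i = 0; i < iterations; i++) {
        (void)*x;                       /* activate row X */
        (void)*y;                       /* activate row Y */
        _mm_clflush((const void *)x);   /* evict so the next read goes to DRAM */
        _mm_clflush((const void *)y);
    }
}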

Since the first paper was published, there's been something of a cat-and-mouse game where security researchers keep finding new ways to carry out row hammer attacks, and operating system and hardware vendors keep trying to mitigate them. For instance, a paper titled Another Flip in the Wall of Rowhammer Defenses was published on October 2, 2017, including authors from TU Graz. These are the same people behind KAISER and the KPTI patches, hence the speculation that row hammer and KPTI are related.

Tin-foil hat time. It's possible that existing Intel CPUs are susceptible to some kind of super row hammer attack. The super row hammer attack would be something that's very effective at flipping bits, and very difficult to mitigate. This would mean you can do more than just read arbitrary kernel memory: you can also write to arbitrary kernel memory. That would be very, very bad. The absolute worst case nightmare scenario is one where a JavaScript program could trigger the super row hammer bug. That would make it possible for JavaScript code to gain root privileges on most or all Intel systems. In this nightmare scenario, the entire world becomes a huge Intel botnet basically overnight.

If a super row hammer attack does exist, it can at least be partially mitigated by the KPTI patches. It is not a full mitigation, but enough to avoid the scariest scenarios.

Hypervisor Exploits

The other super bug possible here is a hypervisor exploit, which was suggested as a pet theory in the original pythonsweetness Tumblr post. I am not well enough informed about how x86 hypervisors work to comment on the plausibility of this. But let's consider what it means anyway.

If there is a hypervisor exploit, it would potentially mean that Linux cloud providers like Amazon AWS and Google GCP could be compromised by their customers. As a tenant, you could potentially gain read or read/write access to other VMs colocated on the same physical server as your VM instance. This would be pretty catastrophic, since it would mean anyone with an account on AWS or GCP could start compromising VMs from other tenants.

Whether or not this is related to the KPTI patches, it's an extremely scary scenario that everyone should be aware of. Given recent x86 bugs, it seems entirely plausible that a hypervisor exploit will be found at some point in the future. If you use a cloud provider, it's worth keeping in mind as a serious weak point of co-tenant cloud instances.

Conclusions

Right now we have to wait and see what security researchers publish in the future. Things might be really bad, or they might not be---we just don't know yet. However, if you run Intel CPU systems, it's definitely a good idea to make a best effort to keep your kernels up to date until this issue is either resolved or blows over.