I saw a recent thread on Hacker News about memory access latency, and saw some interesting, but in my opinion misinformed, comments about NUMA processors. At work I've had to do a fair amount of work optimizing workloads for both NUMA and single-socket CPUs, and I've seen our hardware evolve through multiple revisions, including both multi-socket NUMA and single-socket architectures. So I'm far from an expert in NUMA, but I feel like I have some hands-on experience.
Traditionally, NUMA has referred to architectures with the following properties:
- There is a single motherboard.
- The motherboard has multiple CPUs plugged into multiple CPU sockets.
- The CPUs access a single logical memory space with cache coherency.
- Each memory DIMM is physically connected to a single CPU socket (but typically each CPU is connected to more than one DIMM).
- Although there's a single logical memory space, memory latency is non-uniform: accessing memory physically connected to the CPU a thread is running on (the fast, "local" case) is cheaper than accessing memory physically connected to another CPU (the slower, "remote" case).
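On Linux you can see this layout directly. Here's a small sketch using libnuma (link with -lnuma) that prints each node's memory and the node-to-node distances reported by the firmware; on a single-socket machine you'll see one node, while on a 2S NUMA machine you'll see two nodes with a local/remote asymmetry:

```cpp
// Sketch: inspect the NUMA topology Linux exposes, using libnuma.
// numa_distance() returns 10 for local memory and larger values for
// memory that is farther away (i.e. attached to another socket).
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::printf("no NUMA support on this system\n");
        return 1;
    }
    int nodes = numa_num_configured_nodes();
    for (int n = 0; n < nodes; n++) {
        long long free_bytes = 0;
        long long total = numa_node_size64(n, &free_bytes);
        std::printf("node %d: %lld MiB total, %lld MiB free\n",
                    n, total >> 20, free_bytes >> 20);
        for (int m = 0; m < nodes; m++) {
            std::printf("  distance(node %d -> node %d) = %d\n",
                        n, m, numa_distance(n, m));
        }
    }
    return 0;
}
```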
As a concrete example, suppose you're evaluating a CPU with 48 cores and 96 hyperthreads. You might connect this CPU to six 16 GB memory DIMMs, for 96 GB of memory total across six memory channels. Or you might buy a roughly equivalent CPU with the same core/hyperthread count and a roughly similar clock speed, except this other CPU has UPI (Intel's Ultra Path Interconnect) and supports 2-socket configurations. You could then use this other CPU to build a NUMA system with 2 of these CPUs, each with six 16 GB DIMMs attached, for a total of 96 cores (192 hyperthreads) and 192 GB of addressable memory.
Price
Having twice as many cores and twice as much addressable memory sounds pretty sweet. There are some problems though. First of all, these NUMA CPUs with UPI interconnects are going to cost a lot more. It's hard to make an exact apples-to-apples comparison, but if I look at the Wikipedia Sapphire Rapids page they list model number 8461V as having 48 cores (96 with hyperthreading), a base clock speed of 2.2 GHz, boosting to 2.8-3.7 GHz, and a 300 W TDP. The claimed MSRP is $4491. This CPU doesn't have UPI support, so it only works with single-socket motherboards.
A roughly equivalent CPU that supports 2-socket configurations is the 8468, which has the same core count, a base clock speed of 2.1 GHz (boosting to 3.1-3.8 GHz), and a 350 W TDP. The MSRP for this model on Intel's website is $7214. There are a few minor differences in the specs of the CPUs, but they're fairly close to equivalent, yet the 2S 8468 costs 60% more (and uses more power). This is because CPUs that support multi-socket configurations need a lot more special purpose silicon (for the UPI links, more advanced cache coherency hardware, etc.) and generally exist in a niche market where people are kind of forced to pay more. The value you get per unit of performance (however you want to measure it) is a lot less when you order NUMA systems. Multi-socket motherboards also cost a lot more than single-socket motherboards, which is another factor to consider.
Performance
In addition to price, the performance characteristics of NUMA and non-NUMA systems are going to be wildly different. Accessing remote memory (i.e. memory on a DIMM that is physically connected to another socket) is more expensive than accessing memory on a local DIMM, which is what makes memory access non-uniform. If you're streaming a lot of memory and you fit within the supported UPI bandwidth, the marginal latency cost may not be that high, but if you're doing random accesses it will be higher. There are also certain operations that are going to be much more expensive. I recently did a simple benchmark of atomic loads and stores, and on a 2S system the overhead of doing atomic loads and stores with std::memory_order_seq_cst was up to 2x as high when benchmarking with a large number of threads randomly doing loads and stores, compared to the same benchmark on the exact same CPU SKU in a single-socket configuration. It's hard to generalize about how these latency costs apply to a generic application, because it depends a lot on what the application is doing, but it's something worth considering.
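A simplified sketch of that kind of benchmark looks something like this (this is not the exact code; the thread count, array size, and operation mix here are arbitrary):

```cpp
// Sketch: many threads doing sequentially consistent loads and stores at
// random indices of a shared array of atomics, timing the average cost.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <thread>
#include <vector>

int main() {
    constexpr int kThreads = 64;            // arbitrary; enough to span both sockets on a 2S box
    constexpr std::size_t kSlots = 1 << 20; // shared array of atomics
    constexpr int kOpsPerThread = 1 << 20;

    std::vector<std::atomic<uint64_t>> slots(kSlots);
    auto worker = [&](unsigned seed) {
        std::mt19937_64 rng(seed);
        for (int i = 0; i < kOpsPerThread; i++) {
            std::size_t idx = rng() % kSlots;
            // Alternate seq_cst stores and loads at random locations.
            if (i & 1) {
                slots[idx].store(i, std::memory_order_seq_cst);
            } else {
                (void)slots[idx].load(std::memory_order_seq_cst);
            }
        }
    };

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < kThreads; t++) threads.emplace_back(worker, t);
    for (auto &t : threads) t.join();
    double elapsed = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    double total_ops = double(kThreads) * kOpsPerThread;
    std::printf("%.1f ns/op\n", elapsed * 1e9 / total_ops);
}
```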
If you really want to get good performance on a NUMA system you want to strive to minimize cross-socket memory accesses. What we have done where I work is to isolate as many processes as possible to run on a single socket using Linux cpusets. On Linux you can explicitly configure the mempolicy of a process, but the default behavior is that when a thread allocates memory from the kernel, the kernel allocates it on the NUMA node that thread is currently running on. If you have a process that is unconstrained as to which CPUs it can run on, over time it will end up running on more than one CPU socket and therefore allocating memory on multiple NUMA nodes (each node has its own memory "zones" in Linux). In a 2S system this means that roughly half the memory will be allocated on node 0 and the other half on node 1, so roughly 50% of memory accesses will be local and 50% remote. If instead you constrain process A to run only on socket 0 and process B to run only on socket 1, then each process will only make local memory allocations and only do local memory accesses. This is great for performance, but it also to a certain extent defeats the entire point of using a NUMA system. After all, if you're using a NUMA system you presumably have an application that uses so many CPU cores and so much memory that it can't easily be made to run on just a single socket.
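As a rough illustration of the constrained setup (not our actual cpuset configuration), here's a sketch using libnuma that confines a process's threads and allocations to node 0; launching a binary under `numactl --cpunodebind=0 --membind=0` has essentially the same effect from the shell:

```cpp
// Sketch: keep this process's threads and memory on NUMA node 0 so it
// never pays the cross-socket (remote) access penalty.
// Uses libnuma (link with -lnuma).
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    // Run all of this process's threads on node 0's CPUs...
    if (numa_run_on_node(0) != 0) {
        std::perror("numa_run_on_node");
        return 1;
    }
    // ...and prefer node 0's memory for its allocations. (With the default
    // local-allocation policy this mostly follows from the CPU binding anyway.)
    numa_set_preferred(0);

    // Explicit per-node allocation is also possible:
    void *buf = numa_alloc_onnode(1 << 20, 0);
    if (buf) numa_free(buf, 1 << 20);
    return 0;
}
```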
There are a lot of more advanced things you can do with memory policies, cpusets (which happen to be one of the threaded cgroup controllers), or regular thread affinities. For example, if you have an application that does multiple logical pieces of work with some (but not a lot of) memory sharing, you might configure your application with multiple thread pools and isolate different thread pools to run on different sockets, as in the sketch below. These strategies can help you get high performance on multi-socket NUMA systems, but they're also a lot of work to implement and typically require laborious manual/static partitioning of resources.
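Here's a minimal sketch of the per-socket thread pool idea using plain pthread affinities. The CPU numbering (node 0 = CPUs 0-47, node 1 = CPUs 48-95) is hypothetical; check /sys/devices/system/node/ or lscpu for the real layout on your machine:

```cpp
// Sketch: two worker pools, one pinned to each socket's CPUs, so each pool's
// first-touch allocations land on its own node.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for cpu_set_t / pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Pin a std::thread to a contiguous range of CPU ids.
static void pin_to_cpus(std::thread &t, int first_cpu, int last_cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = first_cpu; cpu <= last_cpu; cpu++) CPU_SET(cpu, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
    std::vector<std::thread> pool0, pool1;
    for (int i = 0; i < 8; i++) {
        pool0.emplace_back([] { /* socket-0 work on socket-0 memory */ });
        pin_to_cpus(pool0.back(), 0, 47);    // hypothetical node 0 CPUs
        pool1.emplace_back([] { /* socket-1 work on socket-1 memory */ });
        pin_to_cpus(pool1.back(), 48, 95);   // hypothetical node 1 CPUs
    }
    for (auto &t : pool0) t.join();
    for (auto &t : pool1) t.join();
    return 0;
}
```

In real code you would typically set the affinity from inside each worker before it allocates anything, so that its first-touch allocations are guaranteed to land on the right node.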
Conclusions
For certain types of very large applications you might need to buy expensive NUMA hardware. However, NUMA hardware tends not only to be more expensive, it also tends to perform worse and be much harder to configure. If possible, it's cheaper, faster, and simpler to use multiple single-socket computers. Granted, in a service-oriented architecture doing RPCs can also be quite expensive, since any kind of remote access has a higher cost than a local access, even on a NUMA system. All of these things should be taken into account if you are trying to make informed architecture decisions.
It's also worth considering that over time core counts and maximum addressable memory continue to go up. Something that might have necessitated running on NUMA hardware a few years ago may not require it today.
If you are going to buy NUMA hardware it's definitely worth asking yourself if you know how to actually use it effectively. There are loads of cargo cult websites advising people to do things like run MySQL with numactl --interleave, which solves a very specific problem at a potentially huge cost in performance. If you don't understand how these kinds of policies work, you should seriously reconsider whether you really want to use NUMA hardware.
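Roughly speaking, numactl --interleave sets the process's default memory policy so that page allocations are round-robined across NUMA nodes. A minimal libnuma sketch of the same effect (shown to explain the mechanism, not as a recommendation):

```cpp
// Sketch: interleave this process's page allocations across all NUMA nodes,
// which is roughly what launching it under `numactl --interleave=all` does.
#include <numa.h>   // libnuma, link with -lnuma
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    // Spread subsequent allocations round-robin across every node instead of
    // filling up the local node first.
    numa_set_interleave_mask(numa_all_nodes_ptr);

    // ... allocate large buffers, start the application, etc.
    return 0;
}
```

Interleaving caps the worst case, since no single node fills up first and no thread ends up with all of its memory remote, but on a 2S system it also guarantees that roughly half of all memory accesses are remote, which is exactly the trade-off to understand before copying the command.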