Hardware Intermediate #4: NUMA — Memory Is Not Uniform

Saturday, June 6, 2026

Everything we said about memory up to Part 3 carried one implicit assumption: memory is equally fast no matter which core accesses it. On a single-socket server that is mostly true, but it breaks the moment a server has two or more sockets. This post is about that non-uniformity — NUMA. It is what the single line “2 sockets” on a server spec sheet actually means for performance.

What NUMA is: memory gains a notion of distance

NUMA (Non-Uniform Memory Access) is the memory architecture of multi-socket servers. Each CPU socket has memory attached directly to it, and the bundle of a socket plus its memory is called a NUMA node.

Structure of a two-socket server

┌─ Node 0 ──────────────┐              ┌─ Node 1 ──────────────┐
│  CPU socket 0         │ interconnect │  CPU socket 1         │
│  (cores 0-15)         │◀────────────▶│  (cores 16-31)        │
│  256GB memory (local) │              │  256GB memory (local) │
└───────────────────────┘              └───────────────────────┘

When a core in node 0 reads node 0’s memory, that is a local access. When it reads node 1’s memory, the request must cross the socket-to-socket interconnect, which makes it a remote access. Remote access has higher latency than local (roughly 1.5–2x, depending on the platform) and lower bandwidth as well. This adds one more layer to the memory-hierarchy picture from Basics #3: even within the same RAM, there is near RAM and far RAM.

The operating system knows this and acts on it. By default Linux allocates each page on the node of the core that first touches it (first-touch policy), and the scheduler also tries to keep a task on the same node. The problems arise when that effort breaks down.

When it becomes a problem

Three scenarios are typical of NUMA surfacing as a performance incident.

A process using more memory than one node holds: on a server with two 256GB nodes, a database using 400GB inevitably spans both. Whichever core it runs on, at least 144GB of that 400GB sits on the other node, so somewhere between a third and half of its accesses are remote (assuming accesses are spread evenly over the data).
One node running dry: if processes pile onto node 0, node 0 runs out first even while node 1 still has memory to spare. The kernel responds with remote allocations or page reclaim on node 0 (in bad cases, swap), and that is how the puzzling symptom of “total memory is fine but swap is churning” is produced.
Thread migration: when the scheduler moves a thread to another node to balance load, the thread’s memory stays on the original node. Accesses after that are remote until the thread moves back or the kernel migrates the pages (automatic NUMA balancing, where enabled). This is why the pinning from Part 2 matters not only for cache but for NUMA as well.
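
To put rough numbers on the first case in the list above, the sketch below estimates the remote share of accesses and the resulting average latency. It assumes accesses are spread uniformly over the working set, a local latency of 90 ns, and a remote penalty of 1.8x. All three are illustrative assumptions rather than measurements, and the function names are hypothetical.

```python
def remote_fraction(workload_gb: float, local_gb: float) -> float:
    """Share of uniformly distributed accesses that are remote when
    `local_gb` of the workload's memory sits on the accessing core's node."""
    if workload_gb <= 0:
        raise ValueError("workload_gb must be positive")
    local = min(local_gb, workload_gb)
    return (workload_gb - local) / workload_gb


def average_latency_ns(remote_share: float, local_ns: float = 90.0,
                       remote_factor: float = 1.8) -> float:
    """Weighted average of local and remote latency (assumed values)."""
    return (1 - remote_share) * local_ns + remote_share * local_ns * remote_factor


# 400 GB database on two 256 GB nodes, seen from a core on node 0.
best_case = remote_fraction(400, 256)   # node 0 filled completely: 144 of 400 GB remote
even_split = remote_fraction(400, 200)  # data split evenly: 200 of 400 GB remote

for label, share in (("node 0 full", best_case), ("even split", even_split)):
    print(f"{label}: {share:.0%} remote, ~{average_latency_ns(share):.0f} ns average")
```

Under these assumptions the remote share lands between about a third and a half. Real access patterns are rarely uniform, so treat the result as a bound on expectations, not a prediction.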

What these symptoms share is that the usual metrics look healthy while throughput falls short. If CPU utilization and memory both have headroom yet the same workload runs slower than on a single-socket box, it is time to suspect NUMA.

How to see it: numastat and numactl

Check the structure with numactl --hardware and the behavior with numastat.

numastat

$ numastat
                    node0          node1
numa_hit         98214532       97103211
numa_miss         1203334        5421887
numa_foreign      5421887        1203334
other_node        1456220        5673001

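Two rows matter. numa_hit counts allocations that landed on the node the process preferred. numa_miss counts allocations that landed on this node even though the process preferred another one, typically because the preferred node had run dry; each such allocation is also counted as numa_foreign on the node that could not serve it. In the sample, node1’s numa_miss matches node0’s numa_foreign, so node 0 is the one running short. If miss is growing to a non-trivial fraction of hit, the second scenario above (node imbalance) is in progress. The counters are cumulative, so compare two readings taken some time apart rather than the absolute values. Per process, numastat -p <PID> shows how much memory sits on which node.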
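
The miss-to-hit comparison is easy to script. The sketch below parses the sample output shown above and reports, per node, numa_miss as a share of all allocations that landed on that node. Defining the ratio as miss / (hit + miss) is one reasonable choice, not a standard, and the function names are hypothetical.

```python
SAMPLE = """\
                    node0          node1
numa_hit         98214532       97103211
numa_miss         1203334        5421887
numa_foreign      5421887        1203334
other_node        1456220        5673001
"""


def parse_numastat(text: str) -> dict[str, dict[str, int]]:
    """Turn `numastat` output into {node: {counter: value}}."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    nodes = lines[0]                      # header row: node0 node1 ...
    stats = {node: {} for node in nodes}
    for name, *values in lines[1:]:
        for node, value in zip(nodes, values):
            stats[node][name] = int(value)
    return stats


def miss_ratio(counters: dict[str, int]) -> float:
    """numa_miss as a share of all allocations that landed on this node."""
    total = counters["numa_hit"] + counters["numa_miss"]
    return counters["numa_miss"] / total if total else 0.0


stats = parse_numastat(SAMPLE)
for node, counters in stats.items():
    print(f"{node}: miss ratio {miss_ratio(counters):.1%}, "
          f"foreign {counters['numa_foreign']}")
```

On a live system the same parser could be fed the text of two numastat runs taken a minute apart, with the ratio computed on the differences, since the raw counters accumulate from boot.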

To dictate placement directly, use numactl.

terminal

# Bind the process to node 0's cores and memory
numactl --cpunodebind=0 --membind=0 ./my-server

# Run with memory spread evenly across all nodes (interleave)
numactl --interleave=all ./my-database

These two options are opposite strategies, and choosing between them is the crux of NUMA handling. A workload that fits within one node gets bound so that everything is local. A workload larger than one node gets interleaved so that its pages spread evenly across nodes; every access then pays roughly the same average cost, instead of some parts of the data being conspicuously slower than others. This is the logic behind some databases’ operations guides recommending interleaved execution.
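
The bind-or-interleave decision can be written down as a small helper. This is a minimal sketch, assuming you already know the workload's memory footprint and the free memory per node; the function name and the `headroom` threshold are hypothetical, and only the numactl flags come from the commands above.

```python
import shlex


def numactl_command(workload_gb, node_free_gb, command, headroom=0.8):
    """Build a numactl argv: bind to one node if the workload fits, else interleave.

    `headroom` (hypothetical tuning knob) is the share of a node's free memory
    the workload may claim before we stop considering that node a fit.
    """
    # Pick the node with the most free memory as the binding candidate.
    node, free_gb = max(node_free_gb.items(), key=lambda item: item[1])
    if workload_gb <= free_gb * headroom:
        return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *command]
    return ["numactl", "--interleave=all", *command]


free = {0: 240, 1: 250}  # GB free per node, illustrative values
print(shlex.join(numactl_command(64, free, ["./my-server"])))
print(shlex.join(numactl_command(400, free, ["./my-database"])))
```

Leaving headroom below 100% is deliberate: a node bound with --membind has nowhere else to go once it fills up, so a workload that only just fits is better treated as one that does not.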

In virtualization and the cloud

The same structure shows through into virtual machines. The hypervisor tries to keep vCPUs and guest memory on the same physical node, but a large VM spans multiple nodes just as a physical server does. The common design in that case is to expose a virtual NUMA topology to the guest so the guest kernel can be aware of it and act accordingly.

For a cloud user the implication is simple. On small instances you will almost never meet NUMA; on large instances close to a whole physical server (tens of vCPUs and up), running numactl --hardware inside the guest may show multiple nodes. If you need to squeeze performance out of an instance that size, the tools in this post work the same way inside the cloud.
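
If numactl is not installed on an instance, the same topology can usually be read from sysfs. The sketch below assumes a Linux guest that exposes /sys/devices/system/node with one nodeN directory per NUMA node and a cpulist file in each; the function names are hypothetical.

```python
import re
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a sysfs CPU list such as '0-15,32-47' into individual CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty cpulist
        first, _, last = part.partition("-")
        cpus.extend(range(int(first), int(last or first) + 1))
    return cpus


def numa_nodes(root: str = "/sys/devices/system/node") -> dict[int, list[int]]:
    """Map NUMA node id -> CPU ids, or {} if the sysfs directory is absent."""
    nodes = {}
    base = Path(root)
    if not base.is_dir():
        return nodes
    for entry in base.iterdir():
        match = re.fullmatch(r"node(\d+)", entry.name)
        cpulist = entry / "cpulist"
        if match and cpulist.is_file():
            nodes[int(match.group(1))] = parse_cpulist(cpulist.read_text())
    return dict(sorted(nodes.items()))


nodes = numa_nodes()
print(f"{len(nodes)} NUMA node(s) visible")
for node, cpus in nodes.items():
    print(f"  node {node}: {len(cpus)} CPUs")
```

A result of one node means the guest sees uniform memory, whatever the physical host looks like underneath.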

Common pitfalls

Judging placement by total memory alone: 512GB of total headroom does not mean “512GB anywhere.” You have to look at per-node free memory to see the imbalance.
Believing binding is always better: pinning and --membind are strategies for workloads that fit in one node. Bind a workload bigger than a node and you exhaust that one node and invite swap or failed allocations. Choose binding or interleave by size.
Treating NUMA as server-room knowledge only: large instances, bare metal, and the GPU servers of #8 (where it matters which node a GPU hangs off) all bring NUMA with them into the cloud era.
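
The first pitfall is easy to check mechanically. numactl --hardware prints free memory per node, and the same figures are available in /sys/devices/system/node/nodeN/meminfo. The sketch below parses text in that file's format; the sample values are made up for illustration, and the function names are hypothetical.

```python
import re

# Illustrative excerpts in the format of /sys/devices/system/node/nodeN/meminfo.
SAMPLE = {
    0: "Node 0 MemTotal:       268435456 kB\nNode 0 MemFree:          2097152 kB\n",
    1: "Node 1 MemTotal:       268435456 kB\nNode 1 MemFree:        201326592 kB\n",
}

LINE = re.compile(r"Node\s+\d+\s+(\w+):\s+(\d+)\s+kB")


def node_meminfo_gib(text: str) -> dict[str, float]:
    """Parse one node's meminfo text into {field: GiB}."""
    return {name: int(kb) / 1024**2 for name, kb in LINE.findall(text)}


def free_by_node(meminfo_by_node: dict[int, str]) -> dict[int, float]:
    """Fraction of each node's memory that is free."""
    result = {}
    for node, text in meminfo_by_node.items():
        info = node_meminfo_gib(text)
        result[node] = info["MemFree"] / info["MemTotal"]
    return result


fractions = free_by_node(SAMPLE)
total_free = sum(node_meminfo_gib(t)["MemFree"] for t in SAMPLE.values())
print(f"total free: {total_free:.0f} GiB")
for node, frac in fractions.items():
    print(f"node {node}: {frac:.1%} free")
```

In this made-up sample the total looks comfortable while node 0 is nearly exhausted, which is exactly the situation a total-only dashboard hides.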

Wrap-up

The picture from this post:

A multi-socket server’s memory is divided into nodes, and local and remote access run at different speeds.
The typical incidents are a process bigger than a node, node imbalance (total fine, swap churning), and thread migration.
Diagnose with numastat’s miss ratio, then use numactl to bind (small workloads) or interleave (large workloads).
The same structure appears in large virtual machines and cloud instances.

Next: measuring storage performance

The next post, “Hardware Intermediate #5: Measuring Storage Performance — fio, Queue Depth, Inside SSDs,” moves on to the third resource. Basics #4 established the concepts of IOPS and latency; this time we measure them ourselves with fio, see how queue depth changes the numbers, and look at why an SSD’s internal behavior — write amplification and TRIM — can make yesterday’s performance differ from today’s.