Hardware Advanced #3: Memory Deep Dive — Page Cache, THP, and Bandwidth

9 min read

In Intermediate #3 we built the judgment calls for memory operations: reading headroom from available, spotting a dirty-page surge, and interpreting OOMKilled against cgroup limits. This post goes inside the mechanisms those judgments rest on. How exactly does the page cache handle reads and writes? Why does THP, the kernel quietly growing your pages, end up as the culprit behind latency spikes? And how do you confirm a memory bandwidth bottleneck when throughput stays flat even though the cores are mostly waiting rather than computing?

The page cache: the path every file I/O takes

When an application calls read(), the kernel checks the page cache before going anywhere near the disk. If the page is there (a cache hit), a single memory copy finishes the job and the disk is never touched. If it isn’t (a miss), the kernel sends an I/O down to the block layer, loads the result into the cache, and returns it. And if the kernel decides the access pattern looks sequential, it reads blocks ahead of the request as well (readahead). This is why sequential reads beat random reads not just on raw I/O but on cache hit rate too.

Writes go the other direction. A write() ends the moment the data lands in the page cache and the page is marked dirty; flushing it to disk is the job of the writeback threads, later.

The two paths through the page cache
Read   read() ─▶ page cache lookup ─▶ hit: return via memory copy (no disk access)
                                  └▶ miss: block I/O + readahead ─▶ load into cache, return

Write  write() ─▶ write into page cache + mark dirty ─▶ (returns immediately)
                                  └▶ writeback threads flush to disk later

That’s why the write call itself returns at memory speed — and why dirty-page buildup and surges exist as operational phenomena. The behavior and how to handle it were covered in Intermediate #3, so here we only need to confirm the path.
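The gap between write() returning and the data reaching disk can be observed directly. The following is a minimal sketch, assuming Linux and a filesystem backed by a real block device; the function name and payload size are illustrative, and on tmpfs or a very fast device the two timings may be nearly identical.

```python
import os
import tempfile
import time


def timed_write_and_sync(path, payload):
    """Return (bytes_written, write_seconds, fsync_seconds) for one buffered write."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        t0 = time.perf_counter()
        written = os.write(fd, payload)  # copies into the page cache, pages become dirty
        t1 = time.perf_counter()
        os.fsync(fd)                     # blocks until this file's dirty pages are flushed
        t2 = time.perf_counter()
    finally:
        os.close(fd)
    return written, t1 - t0, t2 - t1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "demo.bin")
        n, w, s = timed_write_and_sync(target, b"\0" * (8 << 20))
        print(f"wrote {n} bytes: write() {w * 1e3:.2f} ms, fsync() {s * 1e3:.2f} ms")
```

On a disk-backed filesystem the fsync() time is typically the larger of the two, which is the writeback cost that a plain write() defers.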

There is one exception on this path. A file opened with O_DIRECT bypasses the page cache and talks to the disk directly. Databases use this all the time: they have their own buffer pool, so going through the kernel cache too would put the same data in memory twice. If you’ve ever noticed “the page cache on this DB server is strangely small,” that may not be a failure — it may be this design working as intended.
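As a rough illustration of what an O_DIRECT read involves, the sketch below opens a file with the flag and reads into a page-aligned buffer. It assumes Linux; the helper name read_direct is hypothetical, and because some filesystems reject O_DIRECT, the function returns None instead of failing.

```python
import mmap
import os


def read_direct(path, length=4096):
    """Read up to `length` bytes with O_DIRECT; return None if it is unsupported."""
    o_direct = getattr(os, "O_DIRECT", None)  # Linux-specific flag
    if o_direct is None:
        return None
    try:
        fd = os.open(path, os.O_RDONLY | o_direct)
    except OSError:
        return None  # some filesystems refuse O_DIRECT at open time
    buf = mmap.mmap(-1, length)  # anonymous mapping: page-aligned, as O_DIRECT expects
    try:
        n = os.readv(fd, [buf])  # the transfer bypasses the page cache
        return buf[:n]
    except OSError:
        return None  # alignment or filesystem restrictions can also fail at read time
    finally:
        buf.close()
        os.close(fd)
```

The aligned buffer and block-multiple length are the practical cost of bypassing the cache: the application takes over the bookkeeping the kernel would otherwise do.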

THP: large pages that aren’t free

The default Linux page is 4KB, and the results of translating virtual addresses to physical ones are cached in the TLB (Translation Lookaside Buffer). The problem is that TLB entries are limited to a few thousand per core. With 4KB pages that covers only around ten MB, so a process with a working set in the tens of GB ends up repeating TLB misses and page-table walks on memory access after memory access.

THP (Transparent Huge Pages) exists to cut that cost: the kernel automatically promotes 512 contiguous 4KB pages into a single 2MB page. The same number of TLB entries now covers 512 times the range, and for memory-heavy workloads that genuinely yields a few percent of extra throughput.
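The TLB coverage arithmetic can be worked through in a few lines. The entry count of 2,048 is an assumption chosen for illustration; real CPUs vary, and many have different entry counts for different page sizes.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach_bytes(entries, page_size):
    """Memory covered by a TLB with `entries` slots at a given page size."""
    return entries * page_size


if __name__ == "__main__":
    entries = 2048  # assumed TLB size, for illustration only
    small = tlb_reach_bytes(entries, 4 * KIB)
    huge = tlb_reach_bytes(entries, 2 * MIB)
    print(f"4KB pages: {small // MIB} MiB covered")
    print(f"2MB pages: {huge // MIB} MiB covered ({huge // small}x)")
```

Under that assumption, 4KB pages cover 8 MiB and 2MB pages cover 4 GiB, which is the 512x gap described above.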

But there’s a price. A 2MB page needs 2MB of physically contiguous memory. On a server that has been running long enough to fragment, such contiguous regions are rare, so the kernel runs compaction — shuffling scattered pages around to manufacture contiguous space. When compaction happens synchronously on the allocation path, that allocation stalls for tens of milliseconds, and even the background khugepaged thread burns CPU and locking cost while it merges pages. It’s the classic trade: average throughput improves while tail latency gets worse.

You can check the current mode and usage here:

Checking THP status
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

$ grep AnonHugePages /proc/meminfo
AnonHugePages:   8388608 kB     # anonymous memory promoted to 2MB pages
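For monitoring scripts, the same two values can be extracted programmatically. This is a minimal sketch with hypothetical helper names; it parses text in the format shown above, and reading the live files is left to the caller.

```python
import re


def active_thp_mode(enabled_text):
    """Return the mode shown in [brackets], e.g. 'always', or None if absent."""
    match = re.search(r"\[(\w+)\]", enabled_text)
    return match.group(1) if match else None


def anon_huge_kb(meminfo_text):
    """Return the AnonHugePages value in kB, or None if the field is missing."""
    for line in meminfo_text.splitlines():
        if line.startswith("AnonHugePages:"):
            return int(line.split()[1])
    return None


if __name__ == "__main__":
    mode = active_thp_mode("[always] madvise never")
    kb = anon_huge_kb("AnonHugePages:   8388608 kB")
    print(f"mode={mode}, {kb // 2048} huge pages of 2MB in use")
```

With the sample values above, 8388608 kB corresponds to 4,096 pages of 2MB, or 8GB of anonymous memory backed by huge pages.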

This is why database vendors recommend disabling THP. Databases care about tail latency more than averages, and systems like Redis that snapshot via fork suffer doubly: 2MB pages make the copy-on-write unit bigger, which bloats memory usage as well. The middle ground is the madvise mode — instead of automatic promotion everywhere, only regions explicitly requested via madvise() get huge pages. The answer to “should THP be on or off” comes down to whether your workload sells averages or sells tails.
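In madvise mode the application opts in per region. A minimal sketch of that request from Python (3.8 or later, Linux) follows; the function name and region size are illustrative, and the call only registers a hint, so whether the kernel actually backs the region with 2MB pages still depends on alignment and available contiguous memory.

```python
import mmap

HUGE_PAGE = 2 * 1024 * 1024


def map_with_thp_hint(size=4 * HUGE_PAGE):
    """Map private anonymous memory and request THP for it.

    Returns (mapping, hint_accepted).
    """
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    advice = getattr(mmap, "MADV_HUGEPAGE", None)  # only defined on Linux builds
    accepted = False
    if advice is not None:
        try:
            region.madvise(advice)  # marks the range as eligible for huge pages
            accepted = True
        except OSError:
            pass  # for example, a kernel built without THP support
    return region, accepted


if __name__ == "__main__":
    region, accepted = map_with_thp_hint()
    region[:1] = b"\x01"  # touching the memory is what triggers actual allocation
    print("THP hint accepted:", accepted)
    region.close()
```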

Explicit hugepages: large pages reserved up front

If THP’s problem is that it tries to manufacture huge pages “sneakily, at runtime,” the answer is to make them in advance. Explicit hugepages use vm.nr_hugepages to reserve 2MB (or 1GB) pages while fragmentation is still absent — right after boot, say — and only applications that ask for them get to use them. Reserved pages are never swapped and never need compaction, so you collect the TLB benefit without THP’s latency spikes.

Reservation status lives in /proc/meminfo:

Checking hugepage reservations
$ grep -i hugepages /proc/meminfo
HugePages_Total:    8192      # reserved 2MB pages (= 16GB)
HugePages_Free:     2048      # not yet in use
HugePages_Rsvd:      512      # requested but not yet touched
Hugepagesize:       2048 kB

The caveat: whatever you reserve comes straight out of general-purpose memory. Reserve 16GB as above, and to every ordinary process that can’t use hugepages the server simply has 16GB less RAM. The rule is to reserve only the memory the consumer has actually committed to.
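The accounting behind those fields is simple enough to script. The sketch below uses a hypothetical helper and the sample values from the listing above; it treats HugePages_Rsvd as pages that are still counted as free but already promised to a consumer.

```python
SAMPLE = """\
HugePages_Total:    8192
HugePages_Free:     2048
HugePages_Rsvd:      512
Hugepagesize:       2048 kB
"""


def hugepage_summary(meminfo_text):
    """Summarize hugepage reservation fields from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key.startswith("HugePages_") or key == "Hugepagesize":
            fields[key] = int(rest.split()[0])
    kb_per_gib = 1024 * 1024
    total, free = fields["HugePages_Total"], fields["HugePages_Free"]
    return {
        "reserved_gib": total * fields["Hugepagesize"] / kb_per_gib,
        "in_use_pages": total - free,
        "uncommitted_pages": free - fields.get("HugePages_Rsvd", 0),
    }


if __name__ == "__main__":
    print(hugepage_summary(SAMPLE))
```

For the sample, 16GB is carved out of general-purpose memory, 6,144 pages are in use, and 1,536 pages are neither used nor promised. A persistently large uncommitted count suggests the reservation is bigger than the consumer needs.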

The consumers are a known cast. PostgreSQL and Oracle offer settings to put their shared buffers on hugepages, KVM places guest memory there, and high-performance packet frameworks like DPDK put their buffer pools there. There’s a side benefit too: the page tables themselves shrink. On a database where hundreds of processes map the same tens-of-GB shared memory segment, the per-process page tables alone can eat several GB — make the pages 512 times bigger and those tables shrink accordingly.
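The page-table saving can be estimated with back-of-the-envelope arithmetic. The sketch assumes 8-byte entries (as on x86-64), counts only the lowest-level tables, and uses an illustrative 32GB segment mapped by 100 processes; none of these figures come from a measured system.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
ENTRY_BYTES = 8  # size of one page-table entry on x86-64


def leaf_table_bytes(mapped_bytes, page_size, processes):
    """Approximate lowest-level page-table memory across all mapping processes."""
    entries_per_process = mapped_bytes // page_size
    return entries_per_process * ENTRY_BYTES * processes


if __name__ == "__main__":
    segment, procs = 32 * GIB, 100  # assumed workload shape
    small = leaf_table_bytes(segment, 4 * 1024, procs)
    huge = leaf_table_bytes(segment, 2 * MIB, procs)
    print(f"4KB pages: {small / GIB:.2f} GiB of page tables")
    print(f"2MB pages: {huge / MIB:.1f} MiB of page tables")
```

Under those assumptions the tables drop from about 6.25 GiB to 12.5 MiB, the same factor of 512 as the page size.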

Swap policy, deeper: how swappiness really works, plus zswap

Intermediate #3 summarized swappiness as “a preference for what to evict first.” At the implementation level, the kernel keeps reclaimable pages on two lists — anonymous pages (process memory) and file pages (page cache) — and swappiness is the weight that decides in what ratio the two lists are scanned during reclaim. That’s why setting it to 0 does not turn swap off. If reclaiming every file page still isn’t enough, the kernel will swap anonymous pages anyway.

Since kernel 5.8 the range extends to 0–200, and any value above 100 is a declaration that “reclaiming anonymous pages is cheaper than reclaiming file pages.” In the era of swap on spinning disks that was nonsense; now that swap lands on NVMe or compressed memory, it has become a reasonable choice.
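One way to build intuition for the weight is a deliberately simplified model on the 0–200 scale: anonymous pages get a weight equal to swappiness and file pages get 200 minus swappiness. The function below is only that model, not the kernel's actual reclaim code, which also factors in observed reclaim cost and several special cases.

```python
def scan_split(swappiness):
    """Simplified model: return (anon_share, file_share) of reclaim scanning."""
    if not 0 <= swappiness <= 200:
        raise ValueError("swappiness must be between 0 and 200")
    anon_weight = swappiness
    file_weight = 200 - swappiness
    return anon_weight / 200, file_weight / 200


if __name__ == "__main__":
    for value in (0, 60, 100, 200):
        anon, file_ = scan_split(value)
        print(f"swappiness={value:3d}: anon {anon:.0%}, file {file_:.0%}")
```

In this model 100 is the balance point where both lists are weighted equally, which is why values above 100 express a preference for reclaiming anonymous memory first.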

That compressed memory is zswap. Pages headed for swap are first stored in a compressed pool inside RAM, and only when the pool fills do the oldest ones go to disk. If your workload compresses around 2–3x, a large share of swap I/O is served at decompression speed rather than disk speed. This is why desktops and some cloud environments that overcommit memory ship with it enabled by default.
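The effect of the compression ratio can be put into numbers. This is a rough sketch with assumed inputs: a 4GB compressed pool and a 2.5x ratio, both hypothetical, and it ignores allocator overhead and pages that do not compress.

```python
def zswap_effect(pool_gib, compression_ratio):
    """Return (page data held in the pool, net RAM freed), both in GiB."""
    stored = pool_gib * compression_ratio
    return stored, stored - pool_gib


if __name__ == "__main__":
    stored, freed = zswap_effect(pool_gib=4, compression_ratio=2.5)
    print(f"pool holds {stored:.1f} GiB of pages, net gain {freed:.1f} GiB")
```

With those inputs the pool absorbs 10 GiB of swapped-out pages at the price of 4 GiB of RAM, so only demand beyond that reaches the disk.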

Memory bandwidth: cores idle, bus saturated

So far we’ve covered memory’s capacity and latency, but there’s a third axis: bytes per second — bandwidth. A socket’s memory bandwidth is roughly the product of channel count and memory speed. Eight channels of DDR5-4800 put the theoretical ceiling around 300GB/s. It sounds like a big number, but run a few dozen cores through analytics queries or scientific code that sweeps large arrays simultaneously, and you’ll fill it.
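The ceiling quoted above follows from a one-line formula: channels times transfers per second times bytes per transfer. The sketch assumes a 64-bit (8-byte) data path per channel, which is the usual figure for DDR4 and DDR5 DIMMs; the function name is illustrative.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth per socket in GB/s (decimal)."""
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


if __name__ == "__main__":
    print(f"{peak_bandwidth_gb_s(8, 4800):.1f} GB/s")  # eight channels of DDR5-4800
```

For eight channels of DDR5-4800 this gives 307.2 GB/s, which is the "around 300GB/s" used in this post.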

The symptoms are distinctive. Adding cores doesn’t help, and even at 100% CPU utilization, perf stat shows an unusually low IPC (instructions per cycle). The cores aren’t working — they’re burning cycles waiting for data to arrive from memory.

perf stat: the signature of a bandwidth bottleneck
$ perf stat -a sleep 10
   1,284,332,109,442   cycles
     412,587,221,830   instructions   #  0.32 insn per cycle

Compared to the 2–4 IPC of a cache-friendly workload, an IPC in the 0.3 range signals that the cores spend most of their cycles waiting. If the perf topdown analysis from post #1 shows backend bound — specifically memory bound — dominating, the picture is confirmed, and tools like Intel’s pcm-memory will show you per-socket memory traffic directly in GB/s.
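If you collect perf stat output in scripts, IPC is easy to derive from the two counters. The parser below is a minimal sketch that assumes the plain-text layout shown above with the event names cycles and instructions; other perf versions or event selections may format lines differently.

```python
import re

SAMPLE = """\
   1,284,332,109,442   cycles
     412,587,221,830   instructions   #  0.32 insn per cycle
"""


def ipc_from_perf_stat(text):
    """Compute instructions per cycle from perf stat text output."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"\s*([\d,]+)\s+(cycles|instructions)\b", line)
        if match:
            counts[match.group(2)] = int(match.group(1).replace(",", ""))
    return counts["instructions"] / counts["cycles"]


if __name__ == "__main__":
    print(f"IPC = {ipc_from_perf_stat(SAMPLE):.2f}")
```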

To measure the ceiling this server can actually sustain, use the STREAM benchmark — the classic tool that runs nothing but simple copy-and-add loops over arrays far larger than cache to measure sustainable bandwidth.

Example STREAM results
Function    Best Rate MB/s
Copy:           241854.3
Scale:          238102.7
Add:            252331.9
Triad:          251887.4      # ~84% of the 300GB/s theoretical ceiling

Around 80% of theoretical is the normal range. If your workload’s measured traffic is sitting right at that number, the bottleneck won’t yield to compute-side tuning or more cores. The prescriptions point elsewhere: populate DIMMs evenly across every channel so all channels are live, and restructure memory access to be cache-friendly so the traffic itself shrinks.
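Comparing a STREAM result against the theoretical ceiling is a small calculation. The sketch below parses the four kernel rates from output in the format shown above and expresses one of them as a fraction of peak; the helper names are hypothetical, and the 300 GB/s peak is the rounded figure from this post.

```python
SAMPLE = """\
Function    Best Rate MB/s
Copy:           241854.3
Scale:          238102.7
Add:            252331.9
Triad:          251887.4
"""

KERNELS = ("Copy", "Scale", "Add", "Triad")


def stream_rates(text):
    """Return {kernel_name: best rate in MB/s} for the four STREAM kernels."""
    rates = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() in KERNELS:
            rates[name.strip()] = float(rest.split()[0])
    return rates


def fraction_of_peak(rate_mb_s, peak_gb_s):
    """Measured rate as a fraction of the theoretical peak."""
    return rate_mb_s / (peak_gb_s * 1000)


if __name__ == "__main__":
    triad = stream_rates(SAMPLE)["Triad"]
    print(f"Triad reaches {fraction_of_peak(triad, 300):.0%} of peak")
```

The same function applied to your workload's measured traffic, for example from pcm-memory, tells you how close it sits to the sustainable ceiling rather than the theoretical one.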

Where this meets NUMA

Every topic in this post comes back at the node level in Intermediate #4’s NUMA discussion. Bandwidth is computed per node, not per server, so one node’s bus can be saturated while another sits idle. Hugepage reservations and THP’s fragmentation and compaction also happen per node — you can end up with no contiguous space on node 0 while node 1 still has plenty. On multi-socket servers, keep numastat and per-node metrics open beside every tool from this post.
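Per-node fragmentation can be read from /proc/buddyinfo, which lists free blocks per node and zone by order. The sketch below counts how many free 2MB-sized blocks each node could supply, assuming 4KB base pages (so order 9 equals 2MB); the helper name is hypothetical and the sample numbers are made up for illustration.

```python
SAMPLE = """\
Node 0, zone   Normal   9000   4000   1500    600    200     50     10      2      0      0      0
Node 1, zone   Normal   1200    900    700    500    400    300    200    150     90     40     12
"""

ORDER_2MB = 9  # 2**9 pages of 4KB each = 2MB


def free_2mb_blocks_per_node(buddyinfo_text):
    """Return {node: number of free 2MB-aligned blocks} from buddyinfo text."""
    totals = {}
    for line in buddyinfo_text.splitlines():
        parts = line.replace(",", " ").split()
        if len(parts) < 5 or parts[0] != "Node":
            continue
        node = int(parts[1])
        counts = [int(x) for x in parts[4:]]  # free blocks at order 0, 1, 2, ...
        blocks = sum(
            count << (order - ORDER_2MB)  # one order-10 block holds two 2MB blocks
            for order, count in enumerate(counts)
            if order >= ORDER_2MB
        )
        totals[node] = totals.get(node, 0) + blocks
    return totals


if __name__ == "__main__":
    print(free_2mb_blocks_per_node(SAMPLE))
```

In the made-up sample, node 0 has plenty of free memory in small blocks but nothing contiguous at 2MB, while node 1 can still supply 64 such blocks. That is the per-node asymmetry described above.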

Common pitfalls

  • Disabling THP everywhere, unconditionally — that’s the database recommendation generalized to all servers. Batch and analytics workloads that don’t care about tail latency get real gains from THP. Decide which axis your workload is sensitive to first.
  • Treating swappiness 0 as swap disabled — 0 is a scan weight, not a switch. Under extreme pressure swap still runs; to truly turn it off you must remove the swap area itself, and the cost of losing that buffer is exactly what Intermediate #3 laid out.
  • Seeing 100% CPU and adding cores — if IPC is low and memory bound dominates, the bottleneck is bandwidth, not cores. The new cores just share the same bus and wait together.

Wrap-up

The picture we built in this post:

  • File I/O flows through the page cache — reads via hit/miss and readahead, writes via dirty marking and writeback. O_DIRECT bypasses the path.
  • THP reduces TLB misses but can inflate tail latency through compaction. Decide based on whether averages or tails matter to you.
  • Explicit hugepages reserve up front to collect the TLB benefit without THP’s side effects — the standard tool for databases, virtualization, and packet processing.
  • swappiness is a reclaim scan weight, and zswap is a compressed pool in front of swap that lowers its cost.
  • A memory bandwidth bottleneck is identified by low IPC and a dominant memory bound share, its ceiling is measured with STREAM, and the prescriptions are channel population and access patterns.

Next: ZFS deep dive

The next post, “Hardware Advanced #4: ZFS Deep Dive,” moves to storage. We’ll cover the architecture that makes the filesystem itself responsible for data integrity through checksums and copy-on-write, the relationship between the ARC cache and memory, and the decisions operators have to make in RAID-Z and pool design.
