Hardware Intermediate #3: Memory Deep Dive — available, Dirty Pages, Container Limits


After the CPU in Part 2, it’s memory’s turn. Hardware Basics #3 established the memory hierarchy, swap, the OOM Killer, and the idea that “spare memory becomes the page cache.” This post answers the questions those concepts raise in operations: how do you tell whether memory is genuinely short, where do the writes that suddenly flood the disk come from, and why does a container die when the server still has memory to spare?

The one column to watch in free

Translated into operational terms, the conclusion of Basics #3 is this: in free output, the value to judge by is not the free column but the available column.

$ free -h
       total   used   free   shared  buff/cache  available
Mem:    62Gi   18Gi  1.2Gi    0.5Gi        43Gi       42Gi

Judged by the 1.2GiB in the free column alone, the server looks dangerously short. But most of the 43GiB in buff/cache is page cache, which the kernel gives back as soon as applications need the memory. The estimate that accounts for this is available, 42GiB here: the kernel’s answer to “how much can a new process use right now without swapping?” It is slightly lower than free plus buff/cache because not every cached page can be dropped immediately. Monitoring alarms should key on available rather than free, so that a normal state where the page cache is doing its job isn’t mistaken for an incident.
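As a rough sketch of an alarm keyed on available rather than free, the snippet below parses text in /proc/meminfo format (the file free reads its numbers from) and reports MemAvailable as a share of MemTotal. The function names, the sample values (chosen to mirror the output above), and the 10% threshold are illustrative assumptions, not recommendations from this series.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: value_in_kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_ratio(fields):
    """Share of total memory the kernel estimates is usable without swapping."""
    return fields["MemAvailable"] / fields["MemTotal"]


def memory_alarm(fields, min_available_ratio=0.10):
    """True when available memory drops below the (assumed) threshold."""
    return available_ratio(fields) < min_available_ratio


# Sample roughly matching the free -h output above (62Gi total, 42Gi available).
SAMPLE = """\
MemTotal:       65011712 kB
MemFree:         1258291 kB
MemAvailable:   44040192 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print(f"free share:      {info['MemFree'] / info['MemTotal']:.1%}")
    print(f"available share: {available_ratio(info):.1%}")
    print("alarm:", memory_alarm(info))
```

On a live host the same functions could be fed the contents of /proc/meminfo instead of SAMPLE. An alarm written against MemFree would fire on this sample, while one written against MemAvailable would not.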

So how does a real shortage show itself? In Part 1’s terms, a shrinking available is the utilization signal, but saturation is the more decisive one. If swap in/out occurs continuously (the si/so columns in vmstat) and available keeps trending down, it’s a genuine shortage. Some swap being occupied is not in itself a problem: it can be the residue of unused pages parked there earlier. What matters is not how much is sitting in swap but the flow in and out.
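The "flow, not occupancy" check can be sketched from the cumulative pswpin/pswpout counters in /proc/vmstat, which count pages swapped in and out since boot. The helper names and the "three consecutive samples" rule below are assumptions for illustration. Note that these counters are in pages, while vmstat's si/so columns are reported per second in a size unit.

```python
def swap_counters(vmstat_text):
    """Extract cumulative swap-in/out page counts from /proc/vmstat-style text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rates(before, after, interval_s):
    """Pages per second swapped (in, out) between two snapshots."""
    return (
        (after["pswpin"] - before["pswpin"]) / interval_s,
        (after["pswpout"] - before["pswpout"]) / interval_s,
    )


def sustained_swapping(rate_samples, min_samples=3):
    """True if each of the last `min_samples` samples shows swap traffic."""
    recent = rate_samples[-min_samples:]
    if len(recent) < min_samples:
        return False
    return all(si > 0 or so > 0 for si, so in recent)
```

A single nonzero sample is deliberately not enough here: one burst of swap-out can simply be the kernel parking idle pages, which is the residue case described above.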

Dirty pages: writes pile up in memory first

When an application writes to a file, the content doesn’t go straight to disk; it accumulates in the page cache as dirty pages (pages updated in memory but not yet written to disk). The kernel flushes them to disk gradually in the background, so writes feel fast — but two operational phenomena come with the design.

  • Write bursts (burst flush): when dirty pages build up to the kernel’s threshold, a large batch goes to disk in a short window, and processes that are still writing can be throttled until it drains. A normally idle disk periodically spikes to 100% utilization, and at that moment the latency of other I/O jumps. It’s a regular culprit behind reports of “the service stutters briefly every few minutes.”
  • Possible loss — if power is lost before the flush, those dirty pages are gone. This is why databases insist on fsync, and we’ll meet it again in Part 6 alongside battery-backed cache.

If bursts are the problem, lowering the kernel’s dirty thresholds (vm.dirty_ratio, vm.dirty_background_ratio) shifts flushing toward “a little at a time, more frequently.” But per the principle we’ll establish in Part 9, knobs like these get turned only after measurement has confirmed the cause.
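To get a feel for how much data these percentages allow to pile up, here is a back-of-the-envelope calculation. It assumes the ratios apply to roughly the "available" amount of memory (the kernel's own accounting of dirtyable memory is more detailed), and the 10/20 and 2/5 pairs are example values only, not settings recommended by this series.

```python
GIB = 1024 ** 3


def dirty_thresholds(dirtyable_bytes, dirty_background_ratio, dirty_ratio):
    """Approximate byte thresholds for the two dirty-page settings.

    Returns (background, hard): background flushing starts around the first
    value, and writers start being held back around the second.
    """
    background = dirtyable_bytes * dirty_background_ratio // 100
    hard = dirtyable_bytes * dirty_ratio // 100
    return background, hard


if __name__ == "__main__":
    dirtyable = 42 * GIB  # the "available" figure from the free output above
    for bg_ratio, hard_ratio in ((10, 20), (2, 5)):
        bg, hard = dirty_thresholds(dirtyable, bg_ratio, hard_ratio)
        print(f"ratios {bg_ratio}/{hard_ratio}: "
              f"background ~{bg / GIB:.1f} GiB, hard ~{hard / GIB:.1f} GiB")
```

On a machine this size, percentage-based thresholds translate into several gigabytes of pending writes, which is why a flush can saturate a disk for a noticeable stretch.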

swappiness: tuning swap’s temperament

vm.swappiness is not “swap on or off” but the preference for what to evict first when memory gets tight. The kernel has two options: shrink the page cache, or push unused anonymous pages (process memory) out to swap. A high value (default 60) uses swap willingly; a low one shrinks the page cache first.

The convention of lowering swappiness on database servers (say, to 10) comes from this behavior. If the DB process’s memory gets swapped out, query latency falls to disk speed, so giving up page cache instead is the preferred trade-off. Conversely, pushing it near 0 on a general-purpose server is not recommended: remove the buffer that swap provides and a shortage goes straight to OOM.

OOM Killer victim selection has a knob too. Lower a process’s oom_score_adj (say, to -500; the range is -1000 to 1000, and -1000 exempts the process entirely) and that important process moves toward the back of the candidate list. It’s the standard answer to “when memory blows up, at least keep the DB alive.”
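The adjustment itself is a write to /proc/<pid>/oom_score_adj. A minimal sketch follows; the function names and the proc_root parameter (there so the code can be tried against a scratch directory) are assumptions of this example. Lowering the value for a real process normally requires elevated privileges.

```python
from pathlib import Path

OOM_ADJ_MIN, OOM_ADJ_MAX = -1000, 1000


def set_oom_score_adj(pid, value, proc_root="/proc"):
    """Write an OOM score adjustment for `pid` and return the path written."""
    if not OOM_ADJ_MIN <= value <= OOM_ADJ_MAX:
        raise ValueError(
            f"oom_score_adj must be within {OOM_ADJ_MIN}..{OOM_ADJ_MAX}, got {value}"
        )
    path = Path(proc_root) / str(pid) / "oom_score_adj"
    path.write_text(f"{value}\n")
    return path


def get_oom_score_adj(pid, proc_root="/proc"):
    """Read back the current adjustment for `pid`."""
    return int((Path(proc_root) / str(pid) / "oom_score_adj").read_text())
```

For a long-running service, the setting is usually better placed in the service manager's configuration than written by hand, so that it survives restarts.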

Container memory: the limit is the cgroup, not the server

In the container era, the most common memory incidents happen at the container level, not the server level. A container’s memory limit is enforced by a cgroup (control group, the Linux feature that caps resource usage per group of processes), and when the limit is exceeded, that container’s process is killed by OOM even if the server as a whole has memory left. This is the OOMKilled you meet in Kubernetes.

Two points that trip people up in operations:

  • free inside a container shows host values. Run free inside a container and you typically see the host’s total memory, so an application can easily overestimate how much it is allowed to use. The actual limit has to be read from the cgroup file (memory.max on cgroup v2, memory.limit_in_bytes on cgroup v1) or the orchestrator’s configuration.
  • Page cache counts against the container’s usage too. A container doing heavy file I/O can approach its limit on cache alone, even with a small application footprint. When the limit is reached, the kernel reclaims that container’s cache first, so it usually doesn’t die — but it does explain usage graphs that look glued to the limit.
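Reading the real limit from inside a container might look like the sketch below. It assumes the container's own cgroup is mounted at /sys/fs/cgroup, checks the cgroup v2 file first and then the cgroup v1 equivalent, and treats very large v1 values as "no limit". That cutoff and the function name are assumptions of this example.

```python
from pathlib import Path

# Heuristic (an assumption of this sketch): cgroup v1 reports "no limit"
# as a huge number rather than the word "max".
V1_UNLIMITED_FLOOR = 1 << 62


def read_memory_limit(cgroup_dir="/sys/fs/cgroup"):
    """Return the cgroup memory limit in bytes, or None if no limit is found."""
    root = Path(cgroup_dir)
    candidates = (
        root / "memory.max",                        # cgroup v2
        root / "memory" / "memory.limit_in_bytes",  # cgroup v1
    )
    for path in candidates:
        if not path.is_file():
            continue
        raw = path.read_text().strip()
        if raw == "max":  # cgroup v2 spelling of "unlimited"
            return None
        limit = int(raw)
        return None if limit >= V1_UNLIMITED_FLOOR else limit
    return None
```

An application that sizes its heap or caches from this value, rather than from the host total that free reports, is far less likely to walk into its own limit.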

The remedy is also different from the server case. A server-level shortage is usually resolved by adding memory, but a container OOMKilled is mostly a matter of aligning the limit with the application’s actual usage (heap settings and the like).
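Comparing limit and actual usage is easier once usage is split into application memory and page cache. On cgroup v2, memory.stat exposes both as the anon and file fields, in bytes. The sketch below works on text in that format; the function names and the shape of the returned dictionary are assumptions for illustration.

```python
def parse_memory_stat(text):
    """Parse cgroup v2 memory.stat-style text into {field: bytes}."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def usage_breakdown(current_bytes, limit_bytes, stat):
    """Express total usage, process memory, and page cache as shares of the limit."""
    return {
        "used_ratio": current_bytes / limit_bytes,
        "anon_ratio": stat.get("anon", 0) / limit_bytes,  # process memory
        "file_ratio": stat.get("file", 0) / limit_bytes,  # page cache
    }
```

A container whose used_ratio sits near 1.0 while anon_ratio stays low matches the "glued to the limit" graph described above. A high anon_ratio is the case where the limit or the application's heap settings need another look.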

Common pitfalls

  • Deciding to add memory from the free column — over-investment that mistakes page cache for a shortage. Judge by available and the swap flow.
  • Treating any swap usage as an incident — occupied swap is residue; moving swap is the symptom. Check whether si/so occur continuously.
  • Diagnosing a container OOM as a server memory problem — even with host memory to spare, exceeding the cgroup limit kills. If the thing that died is a container, compare its limit and actual usage first.

Wrap-up

The picture from this post:

  • The criterion for memory headroom is available, not free, and the symptom of shortage is sustained swap in/out.
  • Writes pile up in memory as dirty pages and go down in batches, which can create periodic disk bursts.
  • swappiness is the preference for what to evict first under pressure, and oom_score_adj tunes OOM priority.
  • Container memory incidents happen at the cgroup limit. Look at the container’s limit and actual usage, not the server’s.

Next: NUMA

The next post, “Hardware Intermediate #4: NUMA — Memory Is Not Uniform,” takes the memory story one level deeper. On servers with more than one CPU socket, not all memory is equally fast. We’ll cover the structure where performance depends on which core accesses which memory, and what that means for databases and virtualization.
