Hardware Advanced #1: CPU Microarchitecture and perf — Why the Same 100% Isn't the Same

In Hardware Intermediate we built a diagnostic method: read the metrics of the four resources and isolate the bottleneck. The advanced series extends that diagnosis in both directions. If the basics gave you concepts and the intermediate series gave you diagnosis, the advanced series goes down into kernel- and silicon-level observation, and up into the data center where servers live together.

The series runs 7 posts. CPU microarchitecture and perf (#1), eBPF observability (#2), the page cache, hugepages, and memory bandwidth (#3), and ZFS in depth (#4) cover the kernel-and-below side; data center power (#5), cooling and racks (#6), and firmware, BMCs, and the server lifecycle (#7) cover the data center side.

The question for this first post is this: on two servers both showing 100% utilization, are the two CPUs doing the same amount of work? The answer is “not necessarily,” and the tool that puts the difference into numbers is perf. The broader landscape of diagnostic tools, perf included, is covered in RHEL Advanced #3, so this post isn’t about how to run the tool — it’s about how to interpret the numbers in its output through the lens of microarchitecture.

The two faces of 100% utilization: IPC

The CPU utilization the OS reports is “the fraction of time a task was scheduled on the core.” It makes no distinction between a core that actually retired instructions during that time and one that spun idle waiting for data to arrive from memory. A task occupies the core even while it waits on memory, so both show up as 100% in the utilization number.

The metric that exposes the difference is IPC (instructions per cycle — the number of instructions retired per clock cycle). A modern server CPU can retire 4 or more instructions per cycle, so the rough intuition goes like this:

  • IPC well below 1.0 is a sign the core is spending its cycles waiting rather than working. Memory access is usually the cause.
  • IPC well above 1.0 is a sign the core is packed with computation. That 100% is genuine compute saturation.

At the same 100% and the same clock speed, IPC 0.5 versus 2.0 means a 4x difference in the amount of work the core actually got done. The prescriptions diverge too: in the former case, adding cores just adds more waiting; in the latter, adding cores or improving the algorithm is the straightforward fix.
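
The 4x figure follows directly from the definition of IPC, provided both cores run at the same clock. A minimal Python sketch of that arithmetic; the 2.5 GHz clock is an assumed value for illustration, not a number from any particular server.

```python
def instructions_per_second(ipc: float, clock_hz: float, utilization: float = 1.0) -> float:
    """Instructions retired per second by one core.

    cycles spent on the task per second = clock_hz * utilization
    instructions retired per second     = those cycles * ipc
    """
    return ipc * clock_hz * utilization


CLOCK_HZ = 2.5e9  # assumed clock, identical on both servers

stalled = instructions_per_second(ipc=0.5, clock_hz=CLOCK_HZ)
saturated = instructions_per_second(ipc=2.0, clock_hz=CLOCK_HZ)

print(f"IPC 0.5: {stalled:.2e} instructions/s")
print(f"IPC 2.0: {saturated:.2e} instructions/s")
print(f"ratio:   {saturated / stalled:.1f}x")
```

Utilization and clock cancel out of the ratio, which is why neither number can reveal the gap on its own.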

Pipelines and superscalar: the assembly line inside the core

To see why IPC fluctuates, you only need to open up one layer of the core. A core doesn’t finish one instruction before starting the next; it splits instruction processing into stages and overlaps them like an assembly line. That’s the pipeline. Put several of those lines side by side and push multiple instructions in every cycle, and you have a superscalar design.

The weakness of an assembly line is stopping. If the data the next instruction needs hasn’t arrived, or the line was filled down the wrong branch, the line goes empty or gets flushed entirely. Low IPC is the accumulated result of these stalls, and the two big causes of stalls are the subjects of the next two sections: cache misses and branch mispredictions.

The cache hierarchy: what one miss costs in cycles

Basics #2 established the concept of cache as “small, fast memory next to the CPU.” What the advanced level needs is a feel for the cost at each tier. The numbers vary by generation, but the orders of magnitude look like this:

Where the data is | Rough latency | Intuition
L1 cache          | ~4 cycles     | A sticky note on your desk
L2 cache          | ~12 cycles    | A bookshelf in the same room
L3 cache          | ~40 cycles    | A cabinet at the end of the hallway
DRAM              | ~200+ cycles  | An archive in another building

The key point is the 50x gap between L1 and DRAM. A single miss that goes all the way to DRAM burns time in which hundreds of instructions could have been retired. In code that misses frequently, the core spends more time waiting than working, and IPC sinks below 1.0. The classic culprits are random-order access over large arrays and pointer-chasing data structures that hop all over memory.
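
To get a feel for how quickly misses come to dominate, the per-tier latencies can be folded into an average cost per access. This is a rough sketch: the latencies are the order-of-magnitude figures from the table above, while the hit fractions in both profiles are invented for illustration and are not measurements.

```python
# Rough per-tier latencies in cycles, taken from the table above.
LATENCY_CYCLES = {"L1": 4, "L2": 12, "L3": 40, "DRAM": 200}


def average_access_cycles(served_by: dict[str, float]) -> float:
    """Weighted average latency, given the fraction of accesses served by each tier."""
    total = sum(served_by.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"fractions must sum to 1.0, got {total}")
    return sum(LATENCY_CYCLES[tier] * frac for tier, frac in served_by.items())


# Hypothetical access profiles (assumed numbers, not measurements).
cache_friendly = {"L1": 0.95, "L2": 0.03, "L3": 0.015, "DRAM": 0.005}
pointer_chasing = {"L1": 0.60, "L2": 0.10, "L3": 0.10, "DRAM": 0.20}

friendly_cost = average_access_cycles(cache_friendly)
chasing_cost = average_access_cycles(pointer_chasing)

print(f"cache-friendly:  {friendly_cost:.1f} cycles per access")
print(f"pointer-chasing: {chasing_cost:.1f} cycles per access")
```

The model is deliberately simple: it ignores the latency that out-of-order execution and prefetching can hide, so treat the result as an upper-bound intuition rather than a prediction.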

Branch prediction: a wrong guess flushes the line

At a conditional, the core doesn’t wait. The branch predictor uses past patterns to bet “it’ll go this way” and fills the pipeline ahead of time with instructions from that direction. A correct prediction is free; a wrong one means flushing the misfilled line and refilling it. That costs roughly 15–20 cycles per miss.

Modern predictors get regular patterns right almost every time, so the branch miss rate usually stays under 1%. Conditional branches over unsorted data, or code whose path depends on unpredictable input, push that rate into the several-percent range — and shave IPC accordingly.
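
How much IPC a few percent of mispredictions can shave off is easy to estimate with a back-of-the-envelope model. The sketch below assumes that one in five instructions is a branch, that a miss costs 17 cycles (a midpoint of the 15 to 20 range), and that the core would otherwise sustain IPC 2.0; all three are illustrative assumptions.

```python
def branch_stall_cycles_per_instruction(
    branch_fraction: float, miss_rate: float, penalty_cycles: float
) -> float:
    """Average cycles lost to mispredictions, per retired instruction."""
    return branch_fraction * miss_rate * penalty_cycles


def ipc_after_stalls(base_ipc: float, stall_cycles_per_instruction: float) -> float:
    """IPC once the extra stall cycles are added to each instruction's cost."""
    return 1.0 / (1.0 / base_ipc + stall_cycles_per_instruction)


BRANCH_FRACTION = 0.20  # assumed: one in five instructions is a branch
PENALTY_CYCLES = 17     # assumed midpoint of the 15 to 20 cycle range
BASE_IPC = 2.0          # assumed IPC with perfect prediction

predictable_stall = branch_stall_cycles_per_instruction(BRANCH_FRACTION, 0.01, PENALTY_CYCLES)
unpredictable_stall = branch_stall_cycles_per_instruction(BRANCH_FRACTION, 0.05, PENALTY_CYCLES)

print(f"1% miss rate: IPC {ipc_after_stalls(BASE_IPC, predictable_stall):.2f}")
print(f"5% miss rate: IPC {ipc_after_stalls(BASE_IPC, unpredictable_stall):.2f}")
```

Under these assumptions a 1% miss rate barely dents IPC, while 5% takes off roughly a quarter of it.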

perf stat: pulling out the numbers behind utilization

The CPU has built-in hardware counters (the PMU) that count these events, and perf stat reads them out.

$ perf stat -p 4321 -- sleep 10

 Performance counter stats for process id '4321':

         39,812.43 msec task-clock                #    3.981 CPUs utilized
    98,234,567,890      cycles                    #    2.468 GHz
    49,876,543,210      instructions              #    0.51  insn per cycle
     8,123,456,789      branches                  #  204.043 M/sec
       123,456,789      branch-misses             #    1.52% of all branches
     2,345,678,901      cache-references
       987,654,321      cache-misses              #   42.10% of all cache refs

      10.001234567 seconds time elapsed

Read three lines, in this order.

  • insn per cycle: this is IPC. The 0.51 in the output above means the core is spending most of its cycles waiting.
  • cache-misses ratio: at 42%, more than four in ten of the references counted by cache-references missed. These generic events typically count last-level cache traffic, so each of those misses is a trip out to DRAM, and you can name cache misses as the culprit behind the low IPC.
  • branch-misses ratio: 1.52% is ordinary. In this case, the branches are innocent.
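
The three ratios are plain divisions over the raw counters, which is handy when the counts come from a log or monitoring pipeline without perf's annotations. A small sketch using the counts from the sample output above; the dictionary layout and the helper name are just one convenient choice.

```python
# Raw counts copied from the sample perf stat output above.
counters = {
    "cycles": 98_234_567_890,
    "instructions": 49_876_543_210,
    "branches": 8_123_456_789,
    "branch-misses": 123_456_789,
    "cache-references": 2_345_678_901,
    "cache-misses": 987_654_321,
}


def ratio(numerator: str, denominator: str) -> float:
    """Divide one counter by another, refusing an empty denominator."""
    if counters[denominator] == 0:
        raise ZeroDivisionError(f"{denominator} counted no events")
    return counters[numerator] / counters[denominator]


ipc = ratio("instructions", "cycles")
cache_miss_pct = 100 * ratio("cache-misses", "cache-references")
branch_miss_pct = 100 * ratio("branch-misses", "branches")

print(f"IPC:               {ipc:.2f}")
print(f"cache-miss ratio:  {cache_miss_pct:.1f}%")
print(f"branch-miss ratio: {branch_miss_pct:.2f}%")
```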

One caution applies. If the effective clock (cycles divided by the task-clock CPU time, 2.468 GHz in the output above) is below the base clock, the throttling or governor issues we saw in Intermediate #2 may be overlapping the picture. Microarchitectural interpretation only makes sense on the premise that the clock is healthy, so check the effective clock first.
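
That check can be written down in a few lines. The sketch below divides cycles by the CPU time from task-clock (not by wall-clock time, which would overstate the clock for a multi-threaded process). The 2.4 GHz base clock is a hypothetical value; the real one comes from the CPU's specification.

```python
def effective_clock_ghz(cycles: int, task_clock_msec: float) -> float:
    """Average clock while the task was on a core: cycles / CPU time."""
    cpu_seconds = task_clock_msec / 1000.0
    return cycles / cpu_seconds / 1e9


BASE_CLOCK_GHZ = 2.4  # hypothetical base clock; look up the real value for your CPU

# cycles and task-clock taken from the sample perf stat output above.
clock = effective_clock_ghz(cycles=98_234_567_890, task_clock_msec=39_812.43)

if clock < BASE_CLOCK_GHZ:
    verdict = "below base clock: rule out throttling or governor issues first"
else:
    verdict = "clock looks healthy: IPC and miss ratios can be read at face value"

print(f"effective clock {clock:.2f} GHz, {verdict}")
```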

perf record and flame graphs: finding where it happens

If perf stat tells you the character of the bottleneck — “this process is bound by memory stalls” — then perf record answers which part of the code is responsible. It periodically samples the running function and call stack, and perf report tallies which functions consumed the most cycles.

A flame graph lays those samples out in a single picture. The width of a box is the share of time that function (and everything it called) occupied; the vertical axis is call-stack depth. Find the wide peaks and you’ve found the code burning the cycles. The practical loop is: use perf stat to pin down the character of the bottleneck, then the flame graph to pin down its location.
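
The idea behind box width can be sketched without any plotting. The example below takes stack samples in the semicolon-separated "folded" form commonly used as flame graph input and computes, for each function, the share of samples in which it appears anywhere on the stack. The function names and sample counts are invented for illustration.

```python
from collections import Counter

# Hypothetical folded stacks: "root;caller;callee" -> number of samples.
# Function names and counts are invented for illustration.
folded_samples = {
    "main;handle_request;parse_json": 120,
    "main;handle_request;lookup;hash_probe": 610,
    "main;handle_request;lookup;compare_keys": 150,
    "main;flush_log": 120,
}


def inclusive_share(folded: dict[str, int]) -> dict[str, float]:
    """Fraction of all samples in which each function appears on the stack."""
    total = sum(folded.values())
    inclusive = Counter()
    for stack, count in folded.items():
        # set() so a recursive function is counted once per sample.
        for frame in set(stack.split(";")):
            inclusive[frame] += count
    return {frame: n / total for frame, n in inclusive.items()}


shares = inclusive_share(folded_samples)
for frame, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{frame:16s} {share:6.1%}")
```

A real flame graph draws one box per position in the call tree, so a function reached through two different callers gets two boxes; this sketch merges them by name, which is closer to the inclusive view in perf report.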

Case study: different prescriptions for IPC 0.5 and 2.0

Let’s take two servers, both at 100% utilization, and carry the interpretation all the way through.

  • Server A: IPC 0.5, cache-misses 40% — the core looks busy, but most of its time goes to waiting on round trips to DRAM. Adding cores only adds more waiting cores. The prescription lies on the memory-access side: improving data-structure locality, straightening out access order, and the hugepages and memory-bandwidth checks we’ll cover in #3.
  • Server B: IPC 2.0, cache-misses 3% — the core is packed with computation. This 100% is genuine saturation with no hardware trick to relieve it, so the prescription is more cores, a better algorithm, or distributing the work.

The intermediate-level metrics can’t tell these two servers apart. Utilization, load, and clock can all be identical. IPC and the miss ratios are what finally separate them — and what stop you from spending money on the wrong prescription (adding cores to server A).
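
The decision logic of the case study can be condensed into a first-pass triage function. This is a sketch only: every threshold (IPC 1.0, a 20% cache-miss ratio, a 3% branch-miss ratio) is an illustrative assumption that would need tuning per CPU and workload, and the function name is hypothetical.

```python
def triage(ipc: float, cache_miss_ratio: float, branch_miss_ratio: float) -> str:
    """Map perf stat ratios to a first-guess prescription.

    All thresholds are illustrative assumptions, not established cutoffs.
    Ratios are fractions (0.40 means 40%).
    """
    if ipc >= 1.0:
        return "compute-bound: add cores, improve the algorithm, or distribute the work"
    if cache_miss_ratio >= 0.20:
        return "memory-bound: improve locality and access order, check hugepages and bandwidth"
    if branch_miss_ratio >= 0.03:
        return "branch-bound: make the branches more predictable"
    return "unclear: low IPC without an obvious miss signal, profile further"


server_a = triage(ipc=0.5, cache_miss_ratio=0.40, branch_miss_ratio=0.015)
server_b = triage(ipc=2.0, cache_miss_ratio=0.03, branch_miss_ratio=0.010)

print("Server A ->", server_a)
print("Server B ->", server_b)
```

Checking IPC first mirrors the reading order used for perf stat: the miss ratios only matter once IPC says the core is waiting.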

Wrap-up

The picture we built in this post:

  • Utilization is only core occupancy time, not the amount of work. The density of work is what IPC shows.
  • The two big causes of low IPC are cache misses (a DRAM round trip costs 200+ cycles) and branch mispredictions (15–20 cycles each).
  • In perf stat, read three lines — IPC, cache-miss ratio, branch-miss ratio — and check that the effective clock is healthy first.
  • perf record and flame graphs give you the bottleneck’s location; perf stat gives you its character.
  • At the same 100%, IPC 0.5 calls for a memory prescription and 2.0 for a compute prescription.

Next up: eBPF observability

The next post, “Hardware Advanced #2: eBPF Observability,” widens the lens from inside the CPU to the whole kernel. Where perf read hardware counters, eBPF plants small programs inside the kernel to watch system calls, disk I/O latency, and network paths while they run. It’s how you dissect a production server without restarting it.
