Hardware Intermediate #1: Reading Performance Metrics — Turning Slow into Numbers

In Hardware Basics we built the mental model that “slow and expensive almost always come down to one of four resources: CPU, memory, storage, network.” The intermediate series takes that model onto the operations floor. When a server actually slows down, the job is to read the metrics, identify which of the four is blocked, and prescribe a fix at the hardware level.

The series runs 9 posts. It starts with reading metrics (#1), moves through CPU (#2), memory (#3), NUMA (#4), storage performance (#5) and RAID operations (#6), storage networks (#7), and GPU servers (#8), and closes with #9, a walkthrough that diagnoses a slowed server from start to finish.

The subject of this first post is interpretation, not tools. How to bring up top is covered by RHEL Advanced #3. This post covers what the numbers on that screen say about what’s happening in the hardware.

Three questions for every resource: utilization, saturation, errors

Wherever you look among the four resources, the questions are the same three. In performance analysis this frame is called USE (Utilization, Saturation, Errors).

| Question | Meaning | Example |
| --- | --- | --- |
| Utilization | The fraction of time the resource spent working | CPU 80%, disk busy 60% |
| Saturation | The amount of work queued up waiting for the resource | run queue, disk I/O queue |
| Errors | Failures in the resource’s operation | disk I/O errors, packet drops |

Of the three, the one operators miss most often is saturation. Utilization asks “is it busy right now?”; saturation asks “is a queue forming right now?” What creates the latency users feel is the queue — saturation. At 100% utilization with no queue, you’re simply using the resource efficiently; at 70% utilization with a long queue, you already have a bottleneck. With bursty traffic, queues form even when average utilization is low.
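To make the utilization-versus-saturation distinction concrete, here is a minimal sketch of a toy queue. The function name `simulate_queue`, the capacity of 10 jobs per tick, and the two traffic patterns are all made up for illustration; real workloads are noisier, but the shape of the result is the same.

```python
def simulate_queue(arrivals, capacity_per_tick=10):
    """Toy model: the resource finishes up to capacity_per_tick jobs per tick.

    Returns (utilization, max_queue), where max_queue is the largest number
    of jobs left waiting at the end of any tick.
    """
    queue = 0
    served_total = 0
    max_queue = 0
    for arriving in arrivals:
        queue += arriving
        served = min(queue, capacity_per_tick)
        queue -= served
        served_total += served
        max_queue = max(max_queue, queue)
    utilization = served_total / (capacity_per_tick * len(arrivals))
    return utilization, max_queue


# Same total work (700 jobs over 100 ticks), two hypothetical arrival patterns.
steady = [7] * 100                                        # 7 jobs every tick
bursty = [70 if t % 10 == 0 else 0 for t in range(100)]   # 70 jobs every 10th tick

steady_util, steady_queue = simulate_queue(steady)
bursty_util, bursty_queue = simulate_queue(bursty)

print(f"steady: utilization={steady_util:.0%}, max queue={steady_queue}")
print(f"bursty: utilization={bursty_util:.0%}, max queue={bursty_queue}")
```

Both patterns report the same average utilization, yet only the bursty one ever makes a job wait. A dashboard that shows utilization alone cannot tell the two apart.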

This frame runs through the whole series. The CPU in #2 and the disks in #5 both come down to repeating the same three questions: how high is utilization, is there a queue, are there errors.

Load average: the CPU queue plus the disk queue

The most famous and most misread metric on Linux is the load average: the 1-, 5-, and 15-minute moving averages of the number of tasks that are running, waiting for a CPU, or blocked in uninterruptible sleep (typically on disk I/O).

$ uptime
 14:02:11 up 41 days,  3:17,  1 user,  load average: 8.42, 6.10, 4.55

There are two keys to reading it.

  • It only means something relative to the core count. A load of 8 on an 8-core server is “exactly fully busy”; on a 2-core server it means “6 tasks are standing in line.” The core concept from Basics #2 becomes the baseline here.
  • On Linux, the load includes tasks waiting on disk. If the CPU is idle but the load is spiking, the culprit is likely storage, not the CPU. Load average is not a “CPU metric” — it’s the sum of the CPU queue and the I/O queue.

The slope across the three numbers is information too. If the 1-minute value is higher than the 15-minute value, the queue is growing; if it’s lower, the queue is draining.
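The two reading rules and the slope can be sketched as a small helper. The function name `interpret_load` and the core counts passed in are assumptions for illustration; the load values are the ones from the `uptime` output above. On a live Linux host the inputs could come from `os.getloadavg()` and `os.cpu_count()`.

```python
def interpret_load(load1, load5, load15, cores):
    """Read a load average the way the text describes.

    load5 is accepted for completeness; the trend here compares only the
    1-minute and 15-minute values.
    """
    per_core = load1 / cores
    # Rough estimate of tasks beyond what the cores can run at once.
    waiting = max(0.0, load1 - cores)
    if load1 > load15:
        trend = "growing"
    elif load1 < load15:
        trend = "draining"
    else:
        trend = "steady"
    return {"per_core": per_core, "waiting": waiting, "trend": trend}


# Values from the uptime output above; the core counts are hypothetical.
on_8_cores = interpret_load(8.42, 6.10, 4.55, cores=8)
on_2_cores = interpret_load(8.42, 6.10, 4.55, cores=2)

print(on_8_cores)
print(on_2_cores)
```

The same three numbers read as "just over fully busy" on 8 cores and as "roughly six tasks standing in line" on 2 cores, and in both cases the trend is growing. The `waiting` figure still does not say whether those tasks are waiting on CPU or on disk; that split comes from the CPU breakdown.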

The breakdown of CPU utilization: not all busy is the same

CPU utilization isn’t one lump number; it has a breakdown. Depending on where the time went, the prescription changes completely.

| Field | Meaning | What a high value signals |
| --- | --- | --- |
| us (user) | Running application code | Real computation. Optimize the code or add cores |
| sy (system) | Running kernel code | System-call storms, frequent context switching |
| wa (iowait) | Time spent idle because the only work left was waiting on I/O | The bottleneck is storage, not the CPU |
| st (steal) | Time the CPU couldn’t run because the hypervisor didn’t grant it | CPU contention on a virtual machine. More in #2 |

wa in particular gets misread because of its name. iowait is not time the CPU spent working — it’s time the CPU sat idle because there was nothing to do but wait on I/O. If wa is at 40%, it doesn’t mean “the CPU is busy”; it means “the CPU is idle and the disk can’t keep up,” so your attention should shift to storage in #5.
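For readers who want to see where these percentages come from, here is a hedged sketch that derives them from two snapshots of the aggregate `cpu` line in `/proc/stat`. The helper names and the two sample lines are invented for illustration, and the sketch assumes a kernel that exposes at least the first eight counters (user, nice, system, idle, iowait, irq, softirq, steal).

```python
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into cumulative tick counts."""
    parts = line.split()
    # parts[0] is the label "cpu"; the next eight columns are the counters.
    # Guest time is already counted inside user, so it is left out here.
    return dict(zip(FIELDS, map(int, parts[1:9])))


def breakdown(before, after):
    """Percentage of elapsed CPU time spent in each state between two snapshots."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Two made-up snapshots, standing in for reads taken a few seconds apart.
sample_a = "cpu  1000 0 500 8000 400 0 100 0 0 0"
sample_b = "cpu  1300 0 600 8200 800 0 100 0 0 0"

pct = breakdown(parse_cpu_line(sample_a), parse_cpu_line(sample_b))
print({name: round(value, 1) for name, value in pct.items()})
```

In this invented interval, user plus system is 40%, idle is 20%, and iowait is 40%. The counters are cumulative since boot, so only the difference between two snapshots describes what is happening now.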

A first checklist for the four resources

Group the metrics by resource and you get the list of first questions to ask in front of a slowed server.

| Resource | Utilization | Saturation | Errors |
| --- | --- | --- | --- |
| CPU | us+sy ratio | load vs core count, run queue | (rare) MCE logs |
| Memory | usage vs available | swap in/out activity | OOM kill records |
| Storage | disk busy % | I/O queue length, rising latency | I/O errors, retries |
| Network | bandwidth usage | send/receive queues, retransmits | packet drops, error counters |

The commands that fill each cell vary by environment (vmstat, iostat, ss, or a monitoring console like CloudWatch in the cloud), but the cells themselves are the same everywhere. Working through this table from top to bottom is the skeleton of the walkthrough in #9.
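One way to keep the table at hand is to encode it as data and walk it in order. The sketch below is only an illustration: the `walk` function and the shape of the `findings` dictionary are assumptions, and deciding whether a cell counts as "flagged" is still a judgment made by the operator looking at real metrics.

```python
# The checklist table, row by row, in the same top-to-bottom order.
CHECKLIST = {
    "CPU": {
        "utilization": "us+sy ratio",
        "saturation": "load vs core count, run queue",
        "errors": "MCE logs (rare)",
    },
    "Memory": {
        "utilization": "usage vs available",
        "saturation": "swap in/out activity",
        "errors": "OOM kill records",
    },
    "Storage": {
        "utilization": "disk busy %",
        "saturation": "I/O queue length, rising latency",
        "errors": "I/O errors, retries",
    },
    "Network": {
        "utilization": "bandwidth usage",
        "saturation": "send/receive queues, retransmits",
        "errors": "packet drops, error counters",
    },
}


def walk(findings):
    """Return the flagged cells as (resource, question, metric), in table order."""
    flagged = []
    for resource, cells in CHECKLIST.items():
        for question, metric in cells.items():
            if findings.get(resource, {}).get(question, False):
                flagged.append((resource, question, metric))
    return flagged


# Hypothetical observations from one slowed server.
findings = {"Storage": {"saturation": True}, "CPU": {"utilization": True}}
flagged = walk(findings)
for resource, question, metric in flagged:
    print(f"{resource}: {question} -> check {metric}")
```

The output follows the table order rather than the order in which the findings were recorded, which keeps the walk-through repeatable from one incident to the next.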

Common pitfalls

  • Watching utilization and ignoring saturation — “CPU is at 70%, so we have headroom” is a conclusion that ignores the queue. Layer bursts on top of a 70% average and the queue is already there. Always keep a queue metric next to the utilization number.
  • Reading load average as CPU demand only — Linux load mixes in I/O waits. When the load is high, suspect both the CPU and the disks.
  • Not knowing the normal values — whether a load of 6 is abnormal depends on whether the baseline was 2 or 5. Recording metrics when things are healthy is what lets you diagnose quickly when they aren’t.
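The last pitfall can be sketched as a comparison against recorded healthy-state samples. Everything here is illustrative: the function name `is_abnormal`, the sample values, and the rule of "mean plus three standard deviations" are assumptions, not a recommended threshold.

```python
from statistics import mean, stdev


def is_abnormal(current, baseline_samples, k=3.0):
    """Flag a reading that sits more than k standard deviations above the baseline mean.

    k=3.0 is an arbitrary illustrative choice; pick whatever fits your own data.
    """
    mu = mean(baseline_samples)
    sigma = stdev(baseline_samples)
    return current > mu + k * sigma


# Hypothetical 1-minute load samples recorded while each host was healthy.
quiet_host = [1.8, 2.1, 2.0, 2.2, 1.9]             # normally around 2
busy_host = [4.8, 5.3, 5.0, 5.6, 4.7, 6.1, 5.2]    # normally around 5

quiet_flag = is_abnormal(6.0, quiet_host)
busy_flag = is_abnormal(6.0, busy_host)

print("load 6 on the quiet host abnormal?", quiet_flag)
print("load 6 on the busy host abnormal?", busy_flag)
```

The same reading of 6 is flagged on the host that normally sits near 2 and passes on the host that normally sits near 5, which is only knowable because the healthy values were recorded beforehand.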

Wrap-up

What we covered:

  • For any resource, ask three things: utilization, saturation, errors. The latency users feel usually comes from saturation — the queue.
  • Read the load average against the core count, and remember that on Linux it’s a combined metric that mixes in I/O waits.
  • Split CPU utilization into its us, sy, wa, st breakdown. High wa moves your attention to storage.
  • The three-question checklist per resource is the starting point of a diagnosis, and recorded normal values are its reference line.

Next: CPU deep dive

The next post, “Hardware Intermediate #2: CPU Deep Dive — Turbo, Throttling, Steal Time,” steps into the first resource: why the clock doesn’t run at the spec-sheet number (turbo and throttling), the shadow of virtualization that st reveals, and the cost of context switching — building the operations layer on top of the concepts from Basics #2.
