Hardware Intermediate #1: Reading Performance Metrics — Turning Slow into Numbers
In Hardware Basics we built the mental model that “slow and expensive almost always come down to one of four resources: CPU, memory, storage, network.” The intermediate series takes that model onto the operations floor. When a server actually slows down, the job is to read the metrics, identify which of the four is blocked, and prescribe a fix at the hardware level.
The series runs 9 posts. It starts with reading metrics (#1), moves through CPU (#2), memory (#3), NUMA (#4), storage performance (#5) and RAID operations (#6), storage networks (#7), and GPU servers (#8), and closes with #9, a walkthrough that diagnoses a slowed server from start to finish.
The subject of this first post is interpretation, not tools. How to launch top is covered in RHEL Advanced #3. This post covers what the numbers on that screen say about what’s happening in the hardware.
Three questions for every resource — utilization, saturation, errors #
Wherever you look among the four resources, the questions are the same three. In performance analysis this frame is called USE (Utilization, Saturation, Errors).
| Question | Meaning | Example |
|---|---|---|
| Utilization | The fraction of time the resource spent working | CPU 80%, disk busy 60% |
| Saturation | The amount of work queued up waiting for the resource | run queue, disk I/O queue |
| Errors | Failures in the resource’s operation | disk I/O errors, packet drops |
Of the three, the one operators miss most often is saturation. Utilization asks “is it busy right now?”; saturation asks “is a queue forming right now?” What creates the latency users feel is the queue — saturation. At 100% utilization with no queue, you’re simply using the resource efficiently; at 70% utilization with a long queue, you already have a bottleneck. With bursty traffic, queues form even when average utilization is low.
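A toy simulation can make the gap between utilization and saturation concrete. The sketch below is an illustration under simplifying assumptions, not a model of any real scheduler: time advances in ticks, a single resource finishes at most `capacity` jobs per tick, and the function name `simulate` and both traffic patterns are invented for this example. Both patterns deliver the same 700 jobs over 100 ticks, so average utilization is 70% in each case.

```python
def simulate(arrivals, capacity):
    """Toy single-resource model: at most `capacity` jobs finish per tick.

    Returns (utilization, average backlog, peak backlog), where backlog is
    the number of jobs still waiting at the end of a tick.
    """
    queue = 0
    served_total = 0
    backlog_samples = []
    for arriving in arrivals:
        queue += arriving
        served = min(queue, capacity)
        queue -= served
        served_total += served
        backlog_samples.append(queue)
    utilization = served_total / (capacity * len(arrivals))
    average_backlog = sum(backlog_samples) / len(backlog_samples)
    return utilization, average_backlog, max(backlog_samples)


# Hypothetical traffic: same total work, different shape.
steady = [7] * 100                 # 7 jobs every tick
bursty = ([70] + [0] * 9) * 10     # 70 jobs at once, then 9 quiet ticks

print(simulate(steady, capacity=10))   # (0.7, 0.0, 0)
print(simulate(bursty, capacity=10))   # (0.7, 21.0, 60)
```

The utilization number is identical in both runs. Only the backlog columns reveal that the bursty pattern keeps jobs waiting, which is why a queue metric belongs next to every utilization metric.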
This frame runs through the whole series. The CPU in #2 and the disks in #5 both come down to repeating the same three questions: how high is utilization, is there a queue, are there errors.
Load average — the CPU queue plus the disk queue #
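The most famous and most misread metric on Linux is the load average: three exponentially damped averages, covering roughly the last 1, 5, and 15 minutes, of the number of tasks that are running, waiting for a CPU, or blocked in uninterruptible sleep (usually waiting on disk I/O).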
$ uptime
14:02:11 up 41 days, 3:17, 1 user, load average: 8.42, 6.10, 4.55
There are two keys to reading it.
- It only means something relative to the core count. If the load is all CPU demand, a load of 8 on an 8-core server is “exactly fully busy”; on a 2-core server it means roughly “6 tasks are standing in line.” The core concept from Basics #2 becomes the baseline here.
- On Linux, the load includes tasks waiting on disk. If the CPU is idle but the load is spiking, the culprit is likely storage, not the CPU. Load average is not a “CPU metric” — it’s the sum of the CPU queue and the I/O queue.
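The slope across the three numbers is information too. If the 1-minute value is higher than the 15-minute value, the queue is growing; if it’s lower, the queue is draining. In the uptime output above, 8.42 > 6.10 > 4.55, so the load has been climbing over the last 15 minutes.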
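The two reading rules and the slope check can be sketched in a few lines. This is a minimal illustration, assuming the Linux `uptime` format shown above; the helper names `parse_load_averages` and `read_load` are hypothetical, and the core count is passed in by hand rather than detected.

```python
import re


def parse_load_averages(uptime_line):
    """Extract the 1-, 5-, and 15-minute load averages from an uptime line."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: " + uptime_line)
    return tuple(float(value) for value in match.groups())


def read_load(loads, cores):
    """Normalize the 1-minute load by core count and compare it to the 15-minute value."""
    load1, _load5, load15 = loads
    if load1 > load15:
        trend = "growing"
    elif load1 < load15:
        trend = "draining"
    else:
        trend = "steady"
    return {"per_core": round(load1 / cores, 2), "trend": trend}


line = "14:02:11 up 41 days, 3:17, 1 user, load average: 8.42, 6.10, 4.55"
loads = parse_load_averages(line)
print(read_load(loads, cores=8))   # {'per_core': 1.05, 'trend': 'growing'}
print(read_load(loads, cores=2))   # {'per_core': 4.21, 'trend': 'growing'}
```

On a live Linux host, `os.getloadavg()` and `os.cpu_count()` from the standard library could supply the inputs instead of a pasted line. A per-core value above 1 still does not say whether the waiting tasks want CPU or disk; that takes the CPU breakdown.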
The breakdown of CPU utilization — not all busy is the same #
CPU utilization isn’t one lump number; it has a breakdown. Depending on where the time went, the prescription changes completely.
| Field | Meaning | What a high value signals |
|---|---|---|
| us (user) | Running application code | Real computation. Optimize the code or add cores |
| sy (system) | Running kernel code | System-call storms, frequent context switching |
| wa (iowait) | Time spent idle because the only work left was waiting on I/O | The bottleneck is storage, not the CPU |
| st (steal) | Time the virtual CPU was ready to run but the hypervisor gave the physical CPU to another guest | CPU contention on a virtual machine. More in #2 |
wa in particular gets misread because of its name. iowait is not time the CPU spent working — it’s time the CPU sat idle because there was nothing to do but wait on I/O. If wa is at 40%, it doesn’t mean “the CPU is busy”; it means “the CPU is idle and the disk can’t keep up,” so your attention should shift to storage in #5.
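For readers who want to see where these percentages come from: on Linux the raw counters live on the aggregate `cpu` line of /proc/stat, as cumulative ticks in the order user, nice, system, idle, iowait, irq, softirq, steal. A percentage is the share of ticks between two samples. The sketch below works under two assumptions: the sample lines are invented for illustration, and it folds nice into us and irq/softirq into sy, whereas top reports those separately. The function names are hypothetical.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    values += [0] * (len(FIELDS) - len(values))  # pad if trailing fields are absent
    return dict(zip(FIELDS, values))


def cpu_breakdown(before_line, after_line):
    """Return us/sy/wa/st/id percentages for the interval between two samples."""
    before = parse_cpu_line(before_line)
    after = parse_cpu_line(after_line)
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")

    def share(*names):
        return round(100.0 * sum(delta[n] for n in names) / total, 1)

    return {
        "us": share("user", "nice"),               # simplification: nice folded in
        "sy": share("system", "irq", "softirq"),   # simplification: irq time folded in
        "wa": share("iowait"),
        "st": share("steal"),
        "id": share("idle"),
    }


# Invented samples, a few seconds apart.
before = "cpu  1000 0 500 8000 400 0 100 0 0 0"
after = "cpu  1200 0 600 8300 800 0 100 0 0 0"
print(cpu_breakdown(before, after))
# {'us': 20.0, 'sy': 10.0, 'wa': 40.0, 'st': 0.0, 'id': 30.0}
```

In this made-up interval only 30% of CPU time went to real work (us + sy), while 40% was idle time spent waiting on I/O, which is the pattern that points to storage rather than the CPU.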
A first checklist for the four resources #
Group the metrics by resource and you get the list of first questions to ask in front of a slowed server.
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | us+sy ratio | load vs core count, run queue | (rare) MCE logs |
| Memory | usage vs available | swap in/out activity | OOM kill records |
| Storage | disk busy % | I/O queue length, rising latency | I/O errors, retries |
| Network | bandwidth usage | send/receive queues, retransmits | packet drops, error counters |
The commands that fill each cell vary by environment (vmstat, iostat, ss, or a monitoring console like CloudWatch in the cloud), but the cells themselves are the same everywhere. Working through this table from top to bottom is the skeleton of the walkthrough in #9.
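One way to keep the walk disciplined is to hold the table as data and report findings in table order. The sketch below is only an illustration of that idea: the cell text is copied from the table above, while `walk_checklist` and the `flagged` set (filled in by the operator after looking at real metrics) are hypothetical names.

```python
QUESTIONS = ("utilization", "saturation", "errors")

# Rows in the same top-to-bottom order as the table.
CHECKLIST = {
    "CPU": ("us+sy ratio", "load vs core count, run queue", "MCE logs (rare)"),
    "Memory": ("usage vs available", "swap in/out activity", "OOM kill records"),
    "Storage": ("disk busy %", "I/O queue length, rising latency", "I/O errors, retries"),
    "Network": ("bandwidth usage", "send/receive queues, retransmits", "packet drops, error counters"),
}


def walk_checklist(flagged):
    """Return the flagged cells in table order, one line per finding."""
    report = []
    for resource, cells in CHECKLIST.items():
        for question, metric in zip(QUESTIONS, cells):
            if (resource, question) in flagged:
                report.append(f"{resource} {question}: check {metric}")
    return report


# Hypothetical findings from one incident.
flagged = {("Storage", "saturation"), ("CPU", "utilization")}
for finding in walk_checklist(flagged):
    print(finding)
# CPU utilization: check us+sy ratio
# Storage saturation: check I/O queue length, rising latency
```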
Common pitfalls #
- Watching utilization and ignoring saturation — “CPU is at 70%, so we have headroom” is a conclusion that ignores the queue. Layer bursts on top of a 70% average and the queue is already there. Always keep a queue metric next to the utilization number.
- Reading load average as CPU demand only — Linux load mixes in I/O waits. When the load is high, suspect both the CPU and the disks.
- Not knowing the normal values — whether a load of 6 is abnormal depends on whether the baseline was 2 or 5. Recording metrics when things are healthy is what lets you diagnose quickly when they aren’t.
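The baseline point can be sketched as a comparison against recorded healthy samples. This is a minimal illustration, not a recommended alerting rule: the function name `compare_to_baseline`, the sample values, and the 1.5x threshold are all assumptions chosen for the example.

```python
from statistics import mean


def compare_to_baseline(current, baseline_samples, factor=1.5):
    """Return (ratio to the healthy mean, whether the ratio exceeds `factor`)."""
    baseline = mean(baseline_samples)
    if baseline <= 0:
        raise ValueError("baseline mean must be positive")
    ratio = current / baseline
    return round(ratio, 2), ratio > factor


# Hypothetical load samples recorded while each server was healthy.
quiet_server = [2.0, 1.5, 2.5]   # healthy mean of 2
busy_server = [5.0, 4.5, 5.5]    # healthy mean of 5

print(compare_to_baseline(6.0, quiet_server))   # (3.0, True)
print(compare_to_baseline(6.0, busy_server))    # (1.2, False)
```

The same load of 6 is three times normal on one server and barely above normal on the other, which is the judgment that is impossible to make without the recorded samples.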
Wrap-up #
What we covered:
- For any resource, ask three things: utilization, saturation, errors. The latency users feel usually comes from saturation — the queue.
- Read the load average against the core count, and remember that on Linux it’s a combined metric that mixes in I/O waits.
- Split CPU utilization into its us, sy, wa, st breakdown. High wa moves your attention to storage.
- The three-question checklist per resource is the starting point of a diagnosis, and recorded normal values are its reference line.
Next — CPU deep dive #
The next post, “Hardware Intermediate #2: CPU Deep Dive — Turbo, Throttling, Steal Time,” steps into the first resource: why the clock doesn’t run at the spec-sheet number (turbo and throttling), the shadow of virtualization that st reveals, and the cost of context switching — building the operations layer on top of the concepts from Basics #2.