Hardware Intermediate #9: Hands-On: Diagnosing a Slow Server — Series Finale

6 min read

With Part 8, all the per-resource pieces are in place. This final post assembles them into a single incident. Starting from the most common report in operations, “the service is slow,” we walk through a diagnosis from the first report to confirming the cause and verifying the fix. The scenario is fictional, but the judgment made at each step comes straight from the earlier parts of this series.

Step 0 — Turning the symptom into numbers #

A report comes in: “Since this afternoon, the API sometimes takes a few seconds.” The first step of diagnosis is not logging into the server but defining the symptom in numbers.

  • What: the API’s 99th-percentile response time, normally 0.3 seconds → intermittently 3-5 seconds since the afternoon
  • Since when: from around 14:00, for tens of seconds at a time, every few minutes
  • What it is not: the error rate is unchanged, and the overall average is only slightly up

Two clues, “intermittent” and “tail latency,” already point in a direction. If a resource were short all the time, the symptom would be constant too. A symptom that comes and goes on a cycle means something periodic (a batch job, a flush, a backup) is behind it. This is exactly why Part 1 said a record of normal values matters. Without knowing “normally 0.3 seconds,” there is nothing to compare against.
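To see why the 99th percentile can jump while the average moves far less, here is a minimal Python sketch. The latency samples are invented for illustration and are not measurements from this incident, and `percentile` is a simple nearest-rank helper, not the definition used by any particular monitoring tool.

```python
import math
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in seconds, 1000 requests each.
normal = [0.2] * 980 + [0.3] * 20    # a quiet afternoon
incident = [0.2] * 985 + [4.0] * 15  # 1.5% of requests hit a stall

for name, data in (("normal", normal), ("incident", incident)):
    mean = statistics.mean(data)
    p99 = percentile(data, 99)
    print(f"{name}: mean={mean:.3f}s p99={p99:.1f}s")
```

In this made-up data, only 15 requests in 1000 are slow. The mean rises from roughly 0.20 to 0.26 seconds, while the 99th percentile goes from 0.3 to 4.0 seconds. That is why the symptom is defined on the tail and not on the average.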

Step 1 — Sweeping the four resources #

Following the checklist from Part 1, we sweep utilization, saturation, and errors across the resources in order.

vmstat 5 (at symptom time)
procs -----------memory------------ ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff    cache   si   so    bi    bo   in   cs us sy id wa st
 2  9  10240 802340 211456 41000000    0    0    52 91240  980 4100  9  6 41 44  0

A single line tells us a lot.

  • CPU: us 9 and sy 6, so the CPU is not busy computing. st is 0, which rules out the steal from Part 2. But wa is 44: for nearly half the interval the CPU sat idle with I/O outstanding.
  • Memory: si/so are 0, so there is no swap traffic. By the criteria from Part 3, this is not memory pressure.
  • Saturation: the b column (tasks in uninterruptible sleep, typically waiting on I/O) is 9. A queue has formed in front of the disk.
  • Writes: bo (blocks written out) is tens of times the usual. Someone suddenly started writing in bulk.

If the load average was high, that too is now explained. As we saw in Part 1, the Linux load average counts tasks blocked in uninterruptible I/O wait as well as runnable ones. Of the four resources, the sweep narrows down to storage saturation.
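The reading of that one vmstat line can be written down as a small rule set. The following is a sketch in Python, not a monitoring tool: the thresholds (`cpu_busy`, `st_limit`, `wa_limit`) are illustrative assumptions, the function names are made up for this example, and the rules cover only what vmstat shows.

```python
FIELDS = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()

# The data line from the sample above, with cache written as a plain integer.
LINE = "2 9 10240 802340 211456 41000000 0 0 52 91240 980 4100 9 6 41 44 0"

def parse_vmstat(line):
    """Map one vmstat data line onto its column names."""
    values = [int(float(v)) for v in line.split()]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    return dict(zip(FIELDS, values))

def sweep(s, cpu_busy=80, st_limit=5, wa_limit=20):
    """Name the resources that look suspicious in one sample (thresholds are assumptions)."""
    suspects = []
    if s["us"] + s["sy"] >= cpu_busy:        # CPU busy with real work
        suspects.append("cpu")
    if s["st"] >= st_limit:                  # hypervisor taking time away
        suspects.append("steal")
    if s["si"] > 0 or s["so"] > 0:           # pages moving to or from swap
        suspects.append("memory")
    if s["wa"] >= wa_limit and s["b"] > 0:   # idle on I/O with blocked tasks queued
        suspects.append("storage")
    return suspects

print(sweep(parse_vmstat(LINE)))
```

For the sample line, only the storage rule fires. A single sample is only a snapshot, so in practice the same check would be repeated across several intervals during and outside the symptom window.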

Step 2 — Zooming in on storage #

iostat -x shows the disk’s state, and then we look for who is writing.

iostat -x 5 (the affected disk only)
Device   r/s    w/s   wkB/s  aqu-sz  w_await  %util
nvme0n1  12.0  3100  364000   28.4     9.2     98.7

%util at 98.7, queue length (aqu-sz) at 28, write latency (w_await) at 9.2ms. Exactly as in Part 5: once the queue builds, latency climbs. Write latency that normally sits around 0.5ms has grown nearly 20x, so every request waiting on this disk — DB commits included — slows down with it. The shape matches the API’s tail latency.
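The iostat numbers can also be cross-checked against each other. By Little's law, the average number of requests in flight is roughly the arrival rate multiplied by the time each request spends waiting and being served. The sketch below applies that to the write columns only, on the assumption that writes dominate (reads are 12 per second here); the helper names are made up for this example.

```python
def expected_queue_depth(ops_per_s, await_ms):
    """Little's law: requests in flight = arrival rate x time per request."""
    return ops_per_s * await_ms / 1000.0

def avg_request_kb(kb_per_s, ops_per_s):
    """Average size of one request in kB."""
    return kb_per_s / ops_per_s

# Values from the iostat sample above.
w_per_s, wkb_per_s, w_await_ms, aqu_sz = 3100, 364000, 9.2, 28.4
baseline_w_await_ms = 0.5  # the "normal" write latency quoted in the text

depth = expected_queue_depth(w_per_s, w_await_ms)
print(f"expected queue depth: {depth:.1f} (iostat reported {aqu_sz})")
print(f"average write size:   {avg_request_kb(wkb_per_s, w_per_s):.0f} kB")
print(f"latency vs. normal:   {w_await_ms / baseline_w_await_ms:.1f}x")
```

The estimate comes out at about 28.5, close to the reported aqu-sz of 28.4, so the queue length and the latency tell the same story. The average write is a little over 100 kB, which fits large sequential writes such as a bulk flush better than small random commits.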

Now for the “who.” The time pattern of the write bursts (every few minutes, for tens of seconds) resembles the dirty-page flush from Part 3. Checking bears this out: a new feature deployed at 14:00 had started writing large logs with buffered I/O, and the kernel was periodically flushing the accumulated dirty pages all at once, saturating the disk. A metric turned the hypothesis into a confirmation: the dirty page count (Dirty in /proc/meminfo) was climbing to several GB right before each flush.
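Watching that metric takes very little code. Here is a minimal sketch that pulls a field out of /proc/meminfo-style text. The `SAMPLE` text is a hypothetical excerpt with invented values, and `meminfo_kb` and `read_dirty_kb` are names made up for this example.

```python
def meminfo_kb(text, field):
    """Return the value in kB of one field from /proc/meminfo-style text."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field:
            return int(rest.split()[0])
    raise KeyError(field)

def read_dirty_kb(path="/proc/meminfo"):
    """Read the current Dirty value from a live Linux system."""
    with open(path) as f:
        return meminfo_kb(f.read(), "Dirty")

# Hypothetical excerpt, as it might look just before a flush.
SAMPLE = """MemTotal:       65806240 kB
Dirty:           3145728 kB
Writeback:         20480 kB
"""

dirty_kb = meminfo_kb(SAMPLE, "Dirty")
print(f"Dirty: {dirty_kb / 1024 / 1024:.1f} GiB")
```

On the affected server, calling `read_dirty_kb()` every few seconds and logging the result next to the iostat output would show whether Dirty climbs and then drops in step with each write burst.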

Step 3 — Remediation, then re-measurement #

Remediations are possible at several layers, and we start with the cheap and certain ones.

  1. Remove the cause (application) — separate the logs so they no longer go to the same disk, or reduce the write volume itself. The root fix.
  2. Adjust the buffering (kernel) — if separation is not feasible right away, lower the dirty ratios so writes flush “a little at a time, more often.” A symptomatic fix that shrinks the size of the burst.
  3. Add resources (hardware) — if write IOPS were chronically short, more disks would be the answer, but this incident is a burst, not a chronic shortage, so adding capacity is overkill.

We apply option 1 and re-measure against the numbers from Step 0. When the 99th percentile is back to 0.3 seconds and wa and the disk queue are at their normal levels, the incident is closed. Because the symptom was defined in numbers, we can close it with “it is fixed” rather than “it seems fixed.”
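The closing check can be made mechanical. This is a sketch under stated assumptions: the 0.3-second baseline and the incident values come from the text, but the baseline wa and queue length, the after-fix readings, and the 20% tolerance are all invented for illustration.

```python
def still_elevated(baseline, current, tolerance=0.2):
    """Return the metrics in `current` that exceed their baseline by more than `tolerance`."""
    return sorted(
        name for name, normal in baseline.items()
        if current[name] > normal * (1 + tolerance)
    )

# p99_s baseline is from Step 0; the wa_pct and aqu_sz baselines are hypothetical.
BASELINE = {"p99_s": 0.3, "wa_pct": 3, "aqu_sz": 0.8}
INCIDENT = {"p99_s": 4.0, "wa_pct": 44, "aqu_sz": 28.4}
AFTER_FIX = {"p99_s": 0.31, "wa_pct": 3, "aqu_sz": 0.9}  # hypothetical re-measurement

print("during incident:", still_elevated(BASELINE, INCIDENT))
print("after fix:      ", still_elevated(BASELINE, AFTER_FIX))
```

An empty list after the fix is what “it is fixed” means in numbers. If any metric is still listed, the incident stays open.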

The general form of a diagnosis #

This was only one scenario, but the procedure is general.

  1. Define the symptom in numbers (what, since when, what it is not).
  2. Sweep the four resources for utilization, saturation, and errors, and narrow down to one resource.
  3. Zoom in on that resource and confirm the who and the why with metrics.
  4. Apply the cheap and certain remediation first, then re-measure against Step 0’s numbers.

With a different ending, the road would be the same. If st had been high, we would branch into Part 2’s migration remedy; if si/so were moving, into Part 3’s memory diagnosis; if it were node imbalance, into Part 4’s NUMA; if it were a severed path, into Part 7’s multipath.

One last principle of tuning, as the conclusion of the series. Kernel parameters are the destination of a diagnosis, not its starting point. The sysctl knobs are tools to reach for only after the cause has been confirmed with metrics — one at a time, with before-and-after measurements. Applying a tuning recipe from the internet without measuring first is not diagnosis; it is a lottery ticket. The concrete knobs are covered in RHEL Advanced #2.

Common pitfalls #

  • Logging into the server first and defining the symptom later — a diagnosis without numbers ends in “it seemed better after a reboot.” Defining the symptom is Step 0.
  • Settling on the first hypothesis — jumping from wa straight to “the disk is getting old” ends with adding capacity, while the cause (the write burst) remains. Confirm the who and the why with metrics.
  • Fixing without re-measuring — without a re-measurement after the fix, you redo the same diagnosis from scratch at the next incident. Keep the before-and-after numbers on record.

Wrap-up — closing the series #

Looking back over the nine posts:

  • Ask every resource about utilization, saturation, and errors; what users feel comes from saturation (Part 1).
  • For CPU, watch the whims of the clock and steal (Part 2); for memory, available, dirty pages, and cgroup limits (Part 3); on multi-socket machines, NUMA (Part 4).
  • Measure storage under matched conditions (Part 5), design RAID for what happens after a disk dies (Part 6), and for disks across the network, secure redundant paths (Part 7).
  • The GPU is a fifth resource, but feeding it is still the four resources’ job (Part 8).
  • And diagnosis always starts from a numeric definition of the symptom and ends with re-measurement (Part 9).

If Hardware Basics was the series that turned “slow” and “expensive” from guesses into things to diagnose, Intermediate was the series that gave you the practical skills to carry out that diagnosis. I hope you never find yourself lost in front of the metrics again. Thank you for following along.
