Hardware Intermediate #5: Measuring Storage Performance — fio, Queue Depth, Inside SSDs
With CPU and memory covered through Part 4, it’s storage’s turn. Basics #4 laid out the three axes of IOPS, throughput, and latency. This post asks an operations question: why does a disk whose catalog says “500K IOPS” deliver less than 50K on my service? The answer lies in the measurement conditions, so this post starts with measuring the disk yourself.
The fine print on catalog numbers #
Storage performance numbers always come with strings attached: block size (4KB or 128KB), pattern (sequential or random), read/write mix, and queue depth. The catalog’s maximum IOPS is usually the number under “4KB random read, queue depth in the dozens or more” — a condition with parallelism cranked to the maximum. If your workload differs from that condition, so does the number. So the meaningful question isn’t “how many IOPS does this disk have” but how many it delivers in the shape of my workload.
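To make the fine print concrete, here is a minimal sketch that converts between IOPS and throughput for a given block size. The function names are illustrative and not part of any tool, and block sizes are taken in KiB (1024 bytes).

```python
KIB = 1024
MIB = 1024 * 1024


def throughput_mib_s(iops: float, block_size_kib: float) -> float:
    """Throughput implied by an IOPS figure at a given block size."""
    return iops * block_size_kib * KIB / MIB


def iops_for_throughput(mib_per_s: float, block_size_kib: float) -> float:
    """IOPS needed to move a given throughput at a given block size."""
    return mib_per_s * MIB / (block_size_kib * KIB)


# The catalog's "500K IOPS" at 4 KiB, expressed as bandwidth.
catalog_bw = throughput_mib_s(500_000, 4)

# The same bandwidth at 128 KiB blocks needs far fewer operations.
large_block_iops = iops_for_throughput(catalog_bw, 128)

print(f"500K IOPS @ 4 KiB   = {catalog_bw:.1f} MiB/s")
print(f"same MiB/s @ 128 KiB = {large_block_iops:.0f} IOPS")
```

The point of the arithmetic is that an IOPS figure without a block size says little: the same bandwidth can correspond to very different operation counts.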
fio — measuring in the shape of your workload #
The standard measurement tool is fio. You specify block size, pattern, and parallelism, and it generates load in exactly the shape you want.
```bash
# 4KB random read, queue depth 1: a condition where latency shows through directly
fio --name=randread --filename=/data/fio.test --size=4G \
    --rw=randread --bs=4k --iodepth=1 --runtime=60 --time_based \
    --ioengine=libaio --direct=1
```

```
read: IOPS=11.2k, BW=43.8MiB/s
lat (usec): avg=88.1, ...
lat percentiles : 99.00th=[ 180], 99.90th=[ 420]
```

Two points about reading it.
- --direct=1 is the key. It bypasses the page cache and measures the disk itself. Without it, the page cache we saw in Part 3 steps in and you end up measuring memory speed.
- Look at percentile latency, not the average. An average of 88µs may look fine, but if the 99.9th percentile is several ms, one I/O in a thousand is slow. Percentiles are what shape user-perceived speed and a database’s tail latency.
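The gap between average and tail can be seen with a small sketch. The latency samples below are synthetic, made up for illustration, and the nearest-rank percentile used here is a simple stand-in. It is not necessarily how fio computes its own percentiles.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Synthetic latencies in microseconds: 998 fast I/Os and 2 slow ones.
latencies_us = [85] * 998 + [3000] * 2

mean_us = sum(latencies_us) / len(latencies_us)
p99_us = percentile(latencies_us, 99)
p999_us = percentile(latencies_us, 99.9)

print(f"mean   = {mean_us:.2f} us")
print(f"p99    = {p99_us} us")
print(f"p99.9  = {p999_us} us")
```

The mean barely moves from the fast case and the 99th percentile does not see the slow I/Os at all, while the 99.9th percentile lands on them. That is why the deeper percentiles are worth reading.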
One caution: fio is real load. Running a write test on a production disk contends with service I/O, and a wrong --filename can overwrite data. Measure against a dedicated file, in quiet hours, and ideally on equivalent non-production hardware.
Queue depth — trading latency for IOPS #
Queue depth (the number of outstanding I/O requests kept in flight against the disk) is the key to the gap between catalog numbers and what you experience. SSDs — NVMe in particular — are internally parallel, so you have to keep multiple requests in flight to get the full performance.
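One way to see this on your own hardware is to repeat the same read test at several queue depths. The sketch below only builds and prints the fio command lines, reusing the flags and the test file path from the command earlier in this post. The helper name and the list of depths are assumptions for illustration, and nothing is executed.

```python
import shlex


def fio_randread_cmd(filename, iodepth, runtime_s=60):
    """Build the argv for a 4KB random-read fio run at one queue depth."""
    return [
        "fio",
        f"--name=randread-qd{iodepth}",
        f"--filename={filename}",
        "--size=4G",
        "--rw=randread",
        "--bs=4k",
        f"--iodepth={iodepth}",
        f"--runtime={runtime_s}",
        "--time_based",
        "--ioengine=libaio",
        "--direct=1",
    ]


# Depths chosen for illustration; adjust to the range your workload uses.
QUEUE_DEPTHS = [1, 8, 32, 128]

commands = [fio_randread_cmd("/data/fio.test", qd) for qd in QUEUE_DEPTHS]

# Print for review; running them is a deliberate, separate step.
for argv in commands:
    print(shlex.join(argv))
```

Printing rather than running keeps the sweep reviewable before any load hits a disk, in line with the caution above about fio being real load.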
| Queue depth | IOPS | Avg latency | Meaning |
|---|---|---|---|
| 1 | 11k | 0.09ms | One at a time. Minimal latency, a fraction of the throughput |
| 8 | 70k | 0.11ms | Parallelism starts to kick in |
| 32 | 200k | 0.16ms | Throughput climbs, latency climbs too |
| 128 | 350k | 0.36ms | Near catalog. Latency is 4x |
(The numbers are an example measurement and vary by hardware.) A pattern emerges: the deeper the queue, the higher the IOPS — and the higher the latency. Utilization versus saturation from Part 1 replays here exactly, because queue depth is the amount of saturation itself. So half the complaints that “the catalog IOPS doesn’t show up” come from comparing a queue-depth-1 serial workload (say, a DB commit doing fsync in a single thread) against catalog conditions. The performance of such a workload is set not by IOPS but by latency at queue depth 1.
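The relationship in the table can be checked with Little's law, which this post does not name but which is the standard identity behind it: requests in flight equal throughput times latency. The sketch below applies it to the example rows and then to a serial workload. The function names are illustrative.

```python
# (queue depth, IOPS, average latency in ms), copied from the example table.
rows = [
    (1, 11_000, 0.09),
    (8, 70_000, 0.11),
    (32, 200_000, 0.16),
    (128, 350_000, 0.36),
]


def implied_queue_depth(iops, avg_latency_ms):
    """Little's law: requests in flight = throughput * latency."""
    return iops * avg_latency_ms / 1000


def max_serial_ops_per_s(latency_ms):
    """Upper bound for a workload that issues one I/O at a time."""
    return 1000 / latency_ms


for queue_depth, iops, latency_ms in rows:
    implied = implied_queue_depth(iops, latency_ms)
    print(f"QD {queue_depth:>3}: IOPS x latency = {implied:.1f}")

# A single-threaded commit loop waiting 0.09 ms per I/O cannot exceed this.
print(f"serial ceiling: {max_serial_ops_per_s(0.09):.0f} ops/s")
```

Each row's product comes out close to its queue depth, which is another way of saying that a queue-depth-1 workload is capped by latency alone, whatever the catalog IOPS figure says.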
Inside SSDs — write amplification and TRIM #
The same SSD can be fast yesterday and slow today. The cause is usually in what the SSD does internally.
- SSD flash cannot be overwritten in place. It has to be erased and rewritten, and the erase unit (block) is much larger than the write unit (page). So the controller writes new data to empty space, marks the old data invalid, and later runs a cleanup (garbage collection) that gathers the valid pages, moves them, and erases the block.
- In this process, the user writes 1 and the device internally writes several times that — the phenomenon called write amplification. The fuller the disk and the more random the writes, the more frequent the cleanup, so amplification grows and write performance drops.
- TRIM is the command by which the filesystem tells the SSD “this region is deleted data.” If the controller knows the invalid pages in advance, cleanup gets lighter. On Linux it usually runs as a periodic fstrim (systemd timer).
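The link between fullness and write amplification can be sketched with a deliberately simplified model. Assume every block picked for cleanup still holds the same fraction of valid pages. Freeing one block's worth of space then requires copying those valid pages first, which gives an amplification of 1 / (1 - valid fraction). Real controllers are more sophisticated (over-provisioning, hot/cold separation), so treat this as an illustration and not as a description of any particular drive.

```python
def write_amplification(valid_fraction):
    """Simplified model: flash writes per host write when every block
    chosen for garbage collection is still `valid_fraction` full of
    valid pages that must be copied before the block is erased."""
    if not 0 <= valid_fraction < 1:
        raise ValueError("valid_fraction must be in [0, 1)")
    return 1 / (1 - valid_fraction)


# Illustrative fractions only; these are not measurements.
for valid in (0.0, 0.5, 0.8, 0.9):
    factor = write_amplification(valid)
    print(f"{valid:.0%} valid at cleanup -> ~{factor:.1f}x writes")
```

The curve is steep near the top: going from half-valid to 90%-valid blocks takes the factor from about 2x to about 10x in this model, which is the intuition behind keeping free space on the drive.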
There are three operational implications. First, for an SSD, not filling it up is itself performance management: free space is the cleanup crew’s working room. Second, it is worth verifying that TRIM actually runs. Third, write-performance measurements have to run long enough to produce real numbers, because a short test captures only the burst performance before cleanup kicks in.
On cloud disks #
On cloud block storage (such as EBS), the same principles show up in a different form. IOPS and throughput are set by the volume type and size, plus the instance’s own limits, so what fio measures is the contracted ceiling, not the disk’s physical performance. It’s common for the disk to be fast while the instance-side limit is what you hit, so you need to read both limit tables — volume and instance. The “reading the spec sheet” exercise from Basics #9 repeats for storage.
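Reading both tables amounts to taking the minimum on each axis. The sketch below shows that with hypothetical limits. The numbers are invented for illustration and do not come from any provider's documentation, and the type and function names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    iops: int
    throughput_mib_s: int


def effective_limits(volume: Limits, instance: Limits) -> Limits:
    """The ceiling you can actually reach is the lower limit on each axis."""
    return Limits(
        iops=min(volume.iops, instance.iops),
        throughput_mib_s=min(volume.throughput_mib_s, instance.throughput_mib_s),
    )


def binding_side(volume: Limits, instance: Limits) -> dict:
    """Report which side is the bottleneck for each axis."""
    return {
        "iops": "volume" if volume.iops <= instance.iops else "instance",
        "throughput": (
            "volume"
            if volume.throughput_mib_s <= instance.throughput_mib_s
            else "instance"
        ),
    }


# Hypothetical numbers, not taken from any real limit table.
volume = Limits(iops=16_000, throughput_mib_s=250)
instance = Limits(iops=12_000, throughput_mib_s=600)

print(effective_limits(volume, instance))
print(binding_side(volume, instance))
```

In this made-up case the instance caps IOPS while the volume caps throughput, so tuning only one side would leave the other limit in place.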
Common pitfalls #
- Measuring with the cache in the loop: without --direct=1, you get page-cache speed. Use --direct=1 when measuring the disk; include the cache only when measuring the service as a whole.
- Judging by average latency: the tail (99th percentile and beyond) sets user experience. A good average with a long tail makes an “occasionally slow” service.
- Trusting numbers from an empty SSD: a new or freshly emptied SSD carries no cleanup burden and runs fast for a while. The numbers reflect operational performance only if they were measured at production-level occupancy, for long enough.
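For the last pitfall, one rough way to judge whether a write test ran long enough is to look for the point after which throughput stops falling. The sketch below does that over a series of per-interval IOPS samples. The samples, window, and tolerance are all hypothetical, and the helper is an illustration, not a feature of fio.

```python
def steady_state_start(iops_samples, window=5, tolerance=0.10):
    """Return the index from which every later sample stays within
    `tolerance` of the mean of the final `window` samples.
    Returns None if there are too few samples, or if even the last
    sample falls outside that band (the run is still drifting)."""
    if len(iops_samples) < window:
        return None
    reference = sum(iops_samples[-window:]) / window
    start = None
    # Walk backwards from the end until a sample leaves the band.
    for index in range(len(iops_samples) - 1, -1, -1):
        if abs(iops_samples[index] - reference) > tolerance * reference:
            break
        start = index
    return start


# Hypothetical per-minute write IOPS in thousands: burst, decline, plateau.
samples_k = [90, 90, 90, 88, 70, 50, 38, 31, 30, 29, 30, 31, 30, 30]

start = steady_state_start(samples_k)
sustained_k = sum(samples_k[start:]) / len(samples_k[start:])

print(f"steady state from sample {start}")
print(f"burst {samples_k[0]}k vs sustained {sustained_k:.1f}k IOPS")
```

A flat series is not proof on its own, because a test that ends during the burst phase also looks flat. The check is only meaningful if the run wrote enough data to force cleanup to start.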
Wrap-up #
The picture from this post:
- Storage numbers are a function of conditions — block size, pattern, queue depth. Measure with fio in the shape of your workload.
- Stacking queue depth raises IOPS and latency together. A serial workload’s performance is set by latency at queue depth 1.
- SSDs change write performance as they fill and as time passes, because of write amplification and cleanup. Free space and TRIM are the management levers.
- On cloud disks you measure the contracted limit, not physical performance. Check both the volume and instance limits.
Next — RAID in operation #
The next post, “Hardware Intermediate #6: RAID in Operation — Rebuild, Scrub, and Backups”, moves from one disk to many. Basics #5 established the concepts of RAID levels; this time it’s the story of what happens after a disk actually dies. Why a rebuild is a dangerous window, what hot spares and scrubs prevent, and why RAID is not a backup — seen through the eyes of an operational incident.