Hardware Intermediate #6: RAID in Operation — Rebuild, Scrub, and Backups
If Part 5 was about the performance of a single disk, this post is about operating many of them. Basics #5 laid out the RAID levels (0/1/5/6/10) as concepts, so here we move to the scenes an operator actually faces. The real test of RAID begins not while every disk is healthy, but after one of them dies.
Degraded: running with one disk down
When one disk dies, the array enters a degraded state: it keeps running with its redundancy reduced (RAID6) or gone entirely (RAID5, or a mirror pair that has lost a member). The service keeps going, but two things change.
- Performance drops. With parity schemes (RAID5/6), every read that lands on the dead disk’s blocks has to be reconstructed from the data and parity on the surviving disks.
- The margin is gone. With RAID5, just one more failure now ends the entire array.
So degraded is not a “still fine” state — it is a state where the countdown has started. Whether the array’s status actually reaches a human through monitoring is the first thing to check in RAID operations.
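As a starting point for that check, here is a minimal sketch of detecting a degraded array on Linux software RAID (md). It assumes the usual `/proc/mdstat` layout, where each array's status line carries a `[expected/active]` member count such as `[3/2]`. The function name `degraded_arrays` and the sample text are made up for illustration; hardware controllers and ZFS report status through their own tools, which this does not cover.

```python
import re

# Matches the member-count field of a status line, e.g. "[3/2]" = 3 expected, 2 active.
_COUNTS = re.compile(r"\[(\d+)/(\d+)\]")

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose active member count is below the expected count."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        counts = _COUNTS.search(line)
        if current and counts:
            expected, active = int(counts.group(1)), int(counts.group(2))
            if active < expected:
                degraded.append(current)
            current = None  # only the first count line after a header belongs to it
    return degraded

# Illustrative sample only; in production, read the text from /proc/mdstat.
SAMPLE = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid5 sdc1[2] sdb1[1]
      1953260544 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]

md1 : active raid1 sde1[1] sdd1[0]
      976630464 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

Detection is only half the job. The result still has to be wired into whatever actually pages a human, which is the part that tends to be missing.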
Rebuild: the most dangerous hours
Swap in a replacement disk and the rebuild begins — recomputing data from the surviving disks to fill the new one. Paradoxically, this recovery window is the most dangerous time in an array’s life.
- It takes long. A rebuild fills the new disk from start to finish, so on tens-of-TB disks it can run past a day. Contending with service I/O makes it longer still.
- The surviving disks get hammered. A rebuild reads every remaining disk end to end. Disks bought in the same batch, with the same mileage on them, take maximum load all at once — so the odds of a second failure landing exactly now are higher than usual.
- You can hit a URE. A URE (Unrecoverable Read Error, a sector the disk ultimately fails to read) is listed in the spec sheet as a worst-case rate such as “1 per 10^14 bits read,” which works out to roughly one error per 12.5 TB. Normally redundancy hides it, but during a RAID5 rebuild there is no redundancy left to hide it. A rebuild reads tens of TB end to end, and meeting a URE even once means the data at that spot is lost.
That last item is the basis of the adage “RAID5 is dangerous in the era of large disks.” The bigger the disk, the more a rebuild has to read and the higher the odds of hitting a URE. So with large disks, the conservative choice becomes RAID6, which keeps two parity blocks per stripe and therefore survives a second failure, or RAID10, which rebuilds quickly because it only copies from the failed disk’s mirror partner.
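To put rough numbers on the time and URE items above, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the array shape (5 x 20 TB in RAID5), the 200 MB/s sustained rate, and the model itself, which treats read errors as independent events at exactly the spec-sheet rate. Since the spec figure is an upper bound, the result is a pessimistic estimate, not a prediction.

```python
import math

def rebuild_hours(disk_tb, mb_per_s):
    """Best-case hours to fill one replacement disk at a sustained write rate."""
    seconds = (disk_tb * 1e12) / (mb_per_s * 1e6)
    return seconds / 3600

def ure_probability(read_tb, bits_per_error=1e14):
    """Chance of at least one URE while reading read_tb terabytes.

    Models errors as independent at the spec-sheet rate (a Poisson approximation).
    """
    expected_errors = (read_tb * 1e12 * 8) / bits_per_error
    return -math.expm1(-expected_errors)  # equals 1 - exp(-expected_errors)

if __name__ == "__main__":
    # Hypothetical array: 5 x 20 TB in RAID5, one disk replaced, 200 MB/s sustained.
    print(f"rebuild: {rebuild_hours(20, 200):.1f} h")
    # The rebuild reads the 4 surviving disks end to end.
    print(f"URE odds at 1e14: {ure_probability(4 * 20):.3f}")
    print(f"URE odds at 1e15: {ure_probability(4 * 20, 1e15):.3f}")
```

The absolute values matter less than the trend: both figures grow with disk size, and the rebuild time is a floor that service I/O only pushes higher.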
Hot spares and scrubs: two habits that reduce incidents
Operations has two standard devices for cutting this risk.
- Hot spare — a standby disk plugged into the array in advance. On failure, the rebuild starts immediately without waiting for a human, shortening the time spent degraded. It is the insurance that covers the hours between a failure at dawn and someone arriving at work.
- Scrub — a periodic read of the entire array to find and repair latent bad sectors early. A URE in a region nothing ever reads stays hidden until rebuild day — surfacing at the worst possible moment — unless a scrub finds it first. mdadm, ZFS, and hardware controllers all have the same concept, usually run on a weekly or monthly schedule.
In short, the hot spare shrinks the time after an incident, and the scrub shrinks the probability of the incident itself. Either one costs about a line of configuration.
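For Linux md specifically, a scrub really is a one-line write: the driver exposes a `sync_action` file per array in sysfs, and writing `check` to it starts a read-only pass over the whole array. The sketch below wraps that in Python. The function names and the `sysfs_root` parameter are made up for illustration (the parameter exists so the code can be exercised against a fake directory tree), and the real write needs root. Many distributions already ship a scheduled job for this, so check for one before adding your own. On ZFS the counterpart is `zpool scrub <pool>`.

```python
from pathlib import Path

def start_md_scrub(array, sysfs_root="/sys/block"):
    """Ask the Linux md driver to start a read-only consistency check on one array."""
    action = Path(sysfs_root) / array / "md" / "sync_action"
    state = action.read_text().strip()
    if state != "idle":
        # A resync, recovery, or earlier check is already running; do not interrupt it.
        return False
    action.write_text("check\n")
    return True

def mismatch_count(array, sysfs_root="/sys/block"):
    """Read the mismatch counter the md driver reports for the array."""
    return int((Path(sysfs_root) / array / "md" / "mismatch_cnt").read_text())

if __name__ == "__main__":
    # Assumes an array named md0 exists and the script runs as root.
    if start_md_scrub("md0"):
        print("scrub started on md0")
    else:
        print("md0 is busy; scrub not started")
```

Refusing to start while the array is busy is a deliberate choice here: a scrub competing with a rebuild would only stretch the dangerous window.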
Write cache and the battery: trading safety for performance
Much of a hardware RAID controller’s write performance comes from the write cache on the controller — write-back mode, which accepts a write into cache and immediately reports completion. It is the same trade-off as with dirty pages in Part 3, and it carries the same weakness. If power dies before the data reaches disk, that write is gone — and since the filesystem or database believed it was “done,” the damage runs deep.
That is why controllers carry a battery or flash backup (BBU/CacheVault), a device that preserves the cache contents through an outage so the controller can finish the writes after boot. There is one operational point to remember: when the battery dies, the controller usually demotes itself to write-through, and write performance falls off a cliff. “The database writes suddenly got slow one day” turning out to be an expired battery is a common case. Battery health belongs on the monitoring list too.
RAID is not a backup: the operations version
We touched on this in Basics #5, but it is worth restating in operational terms. What RAID protects against is disk hardware failure — that one thing.
- A file deleted by mistake is faithfully deleted on every disk at once.
- Ransomware’s encryption, and bad data written by a bug, are replicated just as faithfully.
- A controller failure or a fire takes the entire array in one stroke.
So RAID is an availability device (the service keeps running when a disk dies), and backup is a recovery device (going back to before things went wrong). They are not substitutes — you need both. A checksumming filesystem like ZFS adds integrity on top (detecting and self-healing silent data corruption), but that is not a substitute for backups either.
Common pitfalls
- Degraded alerts never reach a human — the incident that ends with a second failure after weeks of running degraded is caused by the alerting path, not the technology. Check array-status monitoring first.
- Treating the rebuild as recovery already done: the rebuild itself is the peak-risk window. For that window at least, it is worth tightening the backup schedule or postponing change work, as an operational judgment call.
- Running a large RAID5 without scrubs — a latent URE combined with a rebuild is the worst-case scenario. Turn scrubbing on, and for a new array, consider RAID6/10.
Wrap-up
The picture from this post:
- Degraded is a countdown state, and the rebuild is the most dangerous window for three reasons: time, stress on the survivors, and UREs.
- The bigger the disks, the bigger RAID5’s rebuild risk. Hot spares and scrubs are the baseline equipment.
- Write-back cache performance is underwritten by the battery. Put battery health in monitoring too.
- RAID is availability; backup is recovery. Neither replaces the other.
Next: storage networking
So far the disks lived inside the server. In the next post, “Hardware Intermediate #7: Storage Networking — iSCSI, FC, NVMe-oF, Multipath”, the disks leave the server. We will cover iSCSI and FC, which actually wire up the SAN that Basics #5 only saw as a concept, multipath, which builds paths that do not break on top of them, and NVMe-oF for the NVMe era.