Hardware Advanced #4: ZFS Deep Dive — When RAID and the Filesystem Become One

9 min read

Intermediate #6 covered the dark side of running RAID: the rebuild is the most dangerous window, UREs threaten RAID5 on large disks, and without scrubbing, bad sectors stay hidden until the worst possible moment. ZFS, which got only a passing mention in that post, solved most of these problems not with operational technique but with structure. This post lays out what ZFS designed differently, how, and what it asks for in return.

The problem with the traditional stack: split the layers and you split the information

The traditional storage stack is built in layers. At the bottom, a RAID card or mdadm bundles disks into a single block device; above that, a volume manager like LVM carves up the space; on top, a filesystem like ext4 or XFS holds the files. LVM itself belongs to the world of RHEL administration, so we’ll leave it at a name-drop here. The point is that each layer knows nothing about the others.

  • The RAID layer doesn’t know which blocks hold data. So a rebuild copies the entire disk, empty blocks included. An array that’s 10% full rebuilds in the same time as one that’s 100% full.
  • The write hole appears. Updating one RAID5 stripe means writing both the data block and the parity block, and if power dies between the two, the parity is left out of sync with the data. The RAID layer knows nothing about filesystem transactions, so it has no way to tell which writes belong together — and mismatched parity produces corrupt data on the next rebuild.
  • Silent corruption goes uncaught. If a disk returns wrong bits without reporting an error (bit rot), the RAID layer passes them straight up. It holds the answer key — parity — yet never checks it on ordinary reads.

Each layer does its own job faithfully, but the information gap between layers creates structural blind spots.
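The write hole from the second bullet is easy to reproduce with a toy XOR parity, which is the arithmetic RAID5 parity is built on. The sketch below uses made-up one-byte "blocks" (all values and variable names are hypothetical) and is only meant to show the mechanism, not to model a real controller.

```shell
#!/bin/sh
# Toy RAID5 stripe: two data "blocks" and one XOR parity block.
# All values are made-up single bytes, chosen only for illustration.
d1=170                   # 0xAA, data block on disk 1
d2=85                    # 0x55, data block on disk 2
parity=$(( d1 ^ d2 ))    # consistent parity for the stripe

# An update to d1 reaches the disk, but power dies before the
# matching parity write. The parity is now stale.
d1=15                    # 0x0F, new data
# parity=$(( d1 ^ d2 ))  <- this write never happened

# Later, disk 2 fails and is reconstructed from d1 and the stale parity.
rebuilt_d2=$(( d1 ^ parity ))

echo "original d2: $d2"
echo "rebuilt d2:  $rebuilt_d2"
```

Note that the block that comes back wrong is d2, which was never written to during the interrupted update. That is what makes the write hole so unpleasant: the damage lands on data nobody touched.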

The ZFS answer: merge the layers, never overwrite in place

ZFS merged RAID, volume management, and the filesystem into one piece of software. Disks are grouped into vdevs (virtual devices), vdevs are combined into a pool, and filesystems are created on top — and because it’s all one layer, the filesystem knows down at the RAID level which blocks are live and which writes belong to a single transaction. In commands, the many steps of the traditional stack shrink to two lines.

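# Build a RAIDZ2 pool from 6 disks with 2 disks' worth of parity
# (short names like sda are for illustration; stable /dev/disk/by-id/ paths are safer in production)
zpool create tank raidz2 sda sdb sdc sdd sde sdf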

# Create a filesystem on the pool (no separate format or mount setup)
zfs create tank/data

Partitioning, RAID configuration, formatting, and fstab entries — work once spread across separate tools — collapse into pool creation and filesystem creation. The new filesystem is mounted and ready to use immediately.

On top of this comes CoW (copy-on-write). ZFS never overwrites data in place. Modified content is written to fresh blocks, and only after the write completes does the pointer flip to the new blocks. The pointer switch is atomic, so no matter when power dies, the disk always holds a consistent state — either before the switch or after it. Since a half-updated state never exists on disk, the write hole vanishes at the root, and there’s no need to fsck the filesystem after boot.

Checksums and self-healing: verify on every read

ZFS stores the checksum of every block not in the block itself but next to the pointer in the parent block. The checksum is compared on every read, so even if a disk returns bad data without reporting an error, it’s caught on the spot. Because the checksum doesn’t live alongside the data, even corruption that replaces an entire block with the wrong contents gets detected.
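The idea of keeping the checksum away from the data it describes can be sketched with ordinary files. This is a toy model, not how ZFS is implemented: the file name child.blk is hypothetical, and the POSIX cksum utility stands in for ZFS’s own checksum algorithms.

```shell
#!/bin/sh
# Toy model: the "parent" keeps the child's checksum, so the child
# cannot vouch for itself. File names here are hypothetical.
workdir=$(mktemp -d)
printf 'important data\n' > "$workdir/child.blk"

# Checksum recorded at write time and stored apart from the data.
checksum_of() { cksum < "$1" | cut -d' ' -f1; }
parent_sum=$(checksum_of "$workdir/child.blk")

verify() {
  # Succeeds only if the block still matches what the parent recorded.
  [ "$(checksum_of "$1")" = "$parent_sum" ]
}

verify "$workdir/child.blk" && before=ok || before=corrupt

# Simulate a disk silently returning different contents.
printf 'important dat4\n' > "$workdir/child.blk"
verify "$workdir/child.blk" && after=ok || after=corrupt

echo "before: $before, after: $after"
rm -rf "$workdir"
```

Because the reference value lives with the parent, a block that is replaced wholesale still fails the comparison. There is nothing inside the block that could be replaced along with it.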

In a configuration with redundancy (mirrors or RAIDZ), it doesn’t stop at detection. When ZFS finds a copy whose checksum doesn’t match, it reads the correct data from another copy or reconstructs it from parity, serves it, and rewrites the broken copy with the good data. This is self-healing. The “we don’t know which side is right” dilemma of traditional RAID gets a verdict in ZFS as long as one good copy survives, because the checksum acts as the referee. Each mismatch is counted per device in the CKSUM column of zpool status.

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     3

A disk with a nonzero CKSUM count has, at some point, returned bad data without reporting an error. If the number keeps climbing, suspect the cable, then the controller, then the disk, and consider replacement. Corruption the traditional stack would never even have noticed shows up here as a number.
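Watching that column by eye gets old quickly, so here is a minimal sketch of a filter that prints only the devices with a nonzero count. The function name cksum_offenders is hypothetical, and the filter assumes the five-column layout shown above with plain integer counts.

```shell
#!/bin/sh
# Print "device count" for every row whose CKSUM column is nonzero.
# Assumes the five-column layout shown above and plain integer counts.
cksum_offenders() {
  awk 'NF == 5 && $5 ~ /^[0-9]+$/ && $5 > 0 { print $1, $5 }'
}

# Sample input copied from the status output above; on a real system
# the input would come from: zpool status tank
offenders=$(cksum_offenders <<'EOF'
        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     3
EOF
)
echo "$offenders"
```

Piping the output into whatever alerting you already have is the natural next step. The point is that the number reaches a human before it climbs.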

Resilver: a rebuild that copies only data

When you replace a dead disk in ZFS, a resilver runs instead of a rebuild. The difference is more than a name. Because the filesystem knows which blocks are live, a resilver copies only the blocks that actually hold data. If the pool is 30% full, the copy is 30%. For the problem from Intermediate #6 — the longer the rebuild, the wider the window for a second failure or a URE — this answer shrinks the window itself.

Every block copied is also checksum-verified along the way. If a read error turns up mid-resilver, the array doesn’t collapse; ZFS pinpoints exactly which file the affected block belongs to. The blast radius narrows from “a sector somewhere” to “this file.”
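To put rough numbers on the shrinking window, the sketch below compares a full-disk rebuild with a resilver of a 30% full pool. The 20 TB disk and the 150 MB/s sustained rate are hypothetical inputs, and the steady sequential rate is an idealization: a fragmented pool will resilver more slowly than this arithmetic suggests.

```shell
#!/bin/sh
# Rough copy time in whole hours, assuming a steady sequential rate.
# Arguments: disk size in TB, percent of the pool in use, throughput in MB/s.
copy_hours() {
  size_tb=$1; used_pct=$2; mb_per_s=$3
  # 1 TB = 1,000,000 MB; integer arithmetic rounds down.
  seconds=$(( size_tb * 1000000 * used_pct / 100 / mb_per_s ))
  echo $(( seconds / 3600 ))
}

echo "full rebuild:    $(copy_hours 20 100 150) h"
echo "resilver at 30%: $(copy_hours 20 30 150) h"
```

Under these assumptions the exposure drops from about a day and a half to under half a day, which is the whole argument of this section in two lines of output.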

RAIDZ: up to three disks of parity

ZFS’s parity configurations are RAIDZ1/2/3, with one, two, and three disks’ worth of parity respectively. Compared with mdadm’s RAID5/6:

  • RAIDZ1 ≈ RAID5 and RAIDZ2 ≈ RAID6, but with two differences: CoW means no write hole, and resilver is faster.
  • RAIDZ3 carries three disks of parity. mdadm has no equivalent. In an era of multi-dozen-TB disks, it’s the conservative option that survives two more deaths during a days-long resilver.
  • There’s a weakness too. A single RAIDZ vdev delivers random IOPS roughly on par with a single disk. Capacity efficiency calls for RAIDZ; random I/O performance calls for multiple mirror vdevs — that split is the basic formula of ZFS design.

In a table:

| Configuration | Parity | mdadm equivalent | Write hole | Recovery |
|---|---|---|---|---|
| RAIDZ1 | 1 disk | RAID5 | None | resilver (data only) |
| RAIDZ2 | 2 disks | RAID6 | None | resilver (data only) |
| RAIDZ3 | 3 disks | No equivalent | None | resilver (data only) |
| mdadm RAID5/6 | 1–2 disks | - | Present | rebuild (full copy) |
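The parity count translates directly into usable capacity, which is the other half of the RAIDZ versus mirror trade-off. A minimal sketch, assuming six hypothetical 10 TB disks and ignoring padding, metadata, and reserved space (so real pools will report somewhat less):

```shell
#!/bin/sh
# Approximate usable capacity of one RAIDZ vdev in TB.
# Arguments: number of disks, parity level (1, 2 or 3), size per disk in TB.
raidz_usable_tb() {
  disks=$1; parity=$2; disk_tb=$3
  echo $(( (disks - parity) * disk_tb ))
}

# Six hypothetical 10 TB disks under each layout.
for level in 1 2 3; do
  echo "RAIDZ$level: $(raidz_usable_tb 6 "$level" 10) TB usable of 60 TB raw"
done
# Three 2-way mirror vdevs from the same six disks, for comparison.
echo "3x mirror: $(( 3 * 10 )) TB usable of 60 TB raw"
```

With six disks, RAIDZ3 and three mirror pairs land on the same usable space, but they spend it differently: the mirrors buy random IOPS, RAIDZ3 buys tolerance for any three failures.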

The ARC and memory: why ZFS uses so much RAM

Instead of the operating system’s page cache, ZFS uses its own cache, the ARC (Adaptive Replacement Cache). The algorithm tracks both recently used and frequently used blocks, which gives it a strong hit rate, but on Linux it has traditionally been allowed to grow to half of system RAM by default. That’s the source of ZFS’s “memory hog” reputation. To be precise, it uses spare memory as cache and gives it back when other processes need it. The handback isn’t instantaneous, though, so when ZFS shares a machine with an application that manages its own memory (a database, say), the standard move is to cap the ARC.

If RAM is short, you can attach an SSD as a second-level cache, the L2ARC. The catch is that the index pointing to L2ARC blocks lives in RAM, so bolting a large L2ARC onto a RAM-starved machine actually eats into the ARC. Add RAM first. As an aside, the heavyweight “1GB per 1TB” rule of thumb you may have heard matters most when deduplication is on, because the dedup table has to stay in memory to perform well; for general use, ZFS runs comfortably on 8GB or more.
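Capping the ARC on Linux comes down to one module parameter, zfs_arc_max, which takes a byte count. The sketch below only builds the option line. The helper name arc_cap_line and the 8 GiB figure are hypothetical, and other platforms configure this differently.

```shell
#!/bin/sh
# Build the module option line that caps the ARC at a given size in GiB.
# zfs_arc_max takes a byte count.
arc_cap_line() {
  gib=$1
  echo "options zfs zfs_arc_max=$(( gib * 1024 * 1024 * 1024 ))"
}

# Example: cap the ARC at 8 GiB on a hypothetical database host.
arc_cap_line 8
# To persist on Linux, this line would go in /etc/modprobe.d/zfs.conf;
# writing the byte count to /sys/module/zfs/parameters/zfs_arc_max
# applies it at runtime.
```

How large to make the cap depends on what else runs on the machine. A reasonable starting point is total RAM minus what the application is configured to use, minus some headroom for the OS.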

Snapshots and send/recv: the bonus CoW pays out

Under CoW, snapshots are nearly free. Old blocks are never overwritten anyway, so a snapshot is just a single marker that says “don’t delete the pointers from this moment.” Creation is instant, and the space cost is only the changes made afterward. That’s why keeping dozens of hourly snapshots is everyday practice in ZFS operations.

Snapshots can be serialized with zfs send and shipped to another pool or another machine, and incremental sends that transfer only the difference between two snapshots work too.

# Create a snapshot (finishes instantly)
zfs snapshot tank/data@2026-06-15

# Send it to the backup server
zfs send tank/data@2026-06-15 | ssh backup zfs recv pool/data

# From the next day on, take a new snapshot and send only the incremental difference
zfs snapshot tank/data@2026-06-16
zfs send -i tank/data@2026-06-15 tank/data@2026-06-16 | ssh backup zfs recv pool/data

Unlike rsync, which compares file by file, ZFS already knows which blocks changed, so incremental backups are fast. One caveat: snapshots inside the same pool are for recovering from mistakes — they are not backups. If the pool dies, the snapshots die with it, so the conclusion from Intermediate #6 holds here too. A backup starts with the copy you’ve sent to another machine via send/recv.

Compression: lz4 on is the default

ZFS supports transparent block-level compression. lz4 compresses and decompresses so fast that the saved disk I/O usually outweighs the CPU cost, and it has early-abort logic for incompressible data, so it rarely loses you anything. That’s why “just turn on compression=lz4” has been the ZFS community’s long-standing default — and recent OpenZFS enables it out of the box. If you need a higher compression ratio, zstd is available with tunable levels.

Operational cautions: scrubs and pool growth

  • Scrubs are still necessary. Checksum verification only happens on reads, so corruption in data nobody reads stays latent until a scrub runs. Schedule zpool scrub at roughly monthly cadence, and verify the alerting path that gets the results in front of a human.
  • Leave headroom in the pool. CoW always hunts for free blocks to write, so a full pool degrades sharply from fragmentation. Treating 80–90% as the operational ceiling is the convention.
  • Vdev expansion has constraints. Growing a pool by adding vdevs is easy, but slotting one more disk into an existing RAIDZ vdev was impossible for a long time. OpenZFS 2.3 made it possible with raidz expansion, but existing data keeps its old parity ratio, so space efficiency can come out lower than the math suggests. If your plan is to start with a few disks and grow one at a time, choosing the vdev layout carefully up front is still the right answer.
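The headroom rule from the second bullet is the easiest of the three to automate. A minimal sketch, assuming input in the shape that zpool list -H -o name,capacity prints (pool name, a tab, then a percentage); the function name pools_over and the two sample pools are hypothetical.

```shell
#!/bin/sh
# Read "pool<TAB>NN%" lines and report pools at or above a threshold.
# Matches the output shape of: zpool list -H -o name,capacity
pools_over() {
  limit=$1
  awk -v limit="$limit" '{
    pct = $2
    sub(/%$/, "", pct)          # strip the trailing percent sign
    if (pct + 0 >= limit) print $1 " is " $2 " full"
  }'
}

# Sample input for two hypothetical pools.
report=$(printf 'tank\t85%%\nscratch\t42%%\n' | pools_over 80)
echo "$report"
```

On a live system the printf would be replaced by the zpool list command itself, and the report would feed the same alerting path that carries scrub results.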

Wrap-up

The picture this post built:

  • The traditional stack carries structural gaps — the write hole, full-copy rebuilds, silent corruption — because the RAID, volume, and filesystem layers can’t share information.
  • ZFS merged the three layers and eliminated in-place overwrites with CoW, removing the write hole by construction.
  • Every read passes checksum verification, and with redundancy, ZFS self-heals. Resilver copies only live data, shrinking the dangerous window.
  • RAIDZ goes up to three disks of parity, but mirrors win on random IOPS. Budget plenty of memory for the ARC.
  • Snapshots are nearly free, but inside the same pool they aren’t backups. The copy sent to another machine via send/recv is the backup.

Next: data center power

That wraps up the story inside the server. We started at the CPU and worked down through memory, disks, and the filesystem that ties them together, so the next post, “Hardware Advanced #5: Data Center Power,” steps outside the box. We’ll cover how electricity enters and gets distributed in a building packed with hundreds of servers, which gaps the UPS and generators fill, and what the power-efficiency metric PUE actually tells you.
