Hardware
Hardware Advanced #7: Firmware, BMC, and the Lifecycle — The Other Computer Inside Your Server
A look at the BMC, the management computer that stays on independently of the main CPU. It covers remote console and power control, IPMI and Redfish, the firmware stack and update operations, failure prediction with SMART and ECC counters, management-network security, and the lifecycle from warranty expiry to disk disposal — closing out the Hardware Advanced series.
Hardware Advanced #6: Data Center Cooling and Racks — Electricity Always Becomes Heat
Nearly all the power that enters a server comes back out as heat. Starting from the basic airflow contract of front intake and rear exhaust, this post maps out data center cooling end to end: hot/cold aisle containment, rack density and the limits of air cooling, liquid cooling with direct-to-chip (D2C) and immersion, and how ASHRAE temperature guidelines tie into PUE.
Hardware Advanced #5: Datacenter Power — The Real Reason You Can't Rack More Servers
Even with empty slots in the rack, new servers get rejected because the rack's power budget is exhausted. This post walks through the power environment a server lives in, from an operator's point of view: PSU redundancy and A/B feeds, per-rack kW contracts, PDUs and UPS, generators and ATS, PUE, and the power density that GPU servers have driven up.
Hardware Advanced #4: ZFS Deep Dive — When RAID and the Filesystem Become One
ZFS merged RAID, volume management, and the filesystem into a single layer, solving the structural problems of the traditional stack. This post walks through it from an operations point of view: copy-on-write that eliminates the write hole, checksums that verify every read and enable self-healing, resilver that copies only live data, RAIDZ and the ARC, snapshots with send/recv, and lz4 compression.
Hardware Advanced #3: Memory Deep Dive — Page Cache, THP, and Bandwidth
A tour inside the kernel memory machinery: the read and write paths through the page cache, the latency spikes THP creates, explicit hugepages and the TLB, how swappiness is actually implemented along with zswap, and the memory bandwidth bottleneck that keeps throughput flat even when cores sit idle.
Hardware Advanced #2: eBPF Observability — Seeing the Tail the Average Hides
eBPF is a technology for tracing system events directly with small programs that run safely inside the kernel. This post covers reading the latency distributions and tails that averages hide with biolatency and runqlat, a map of the BCC tools, and the overhead caveats for production use.
Hardware Advanced #1: CPU Microarchitecture and perf — Why the Same 100% Isn't the Same
Two CPUs can both read 100% utilization while getting very different amounts of work done. This post uses IPC, cache misses, and branch mispredictions to read the microarchitecture behind the utilization number, and shows how to tell memory stalls from genuine compute saturation in perf stat output.
Hardware Intermediate #9: Hands-On: Diagnosing a Slow Server — Series Finale
A diagnostic walkthrough that starts from a "the service is slow" report and narrows down through the four resources one by one. Define the symptom, check each resource, confirm the hypothesis, apply a fix, and re-measure. We close the Hardware Intermediate series with the principles of tuning.
Hardware Intermediate #8: GPUs and Accelerators — The Fifth Resource of the AI Era
The bottleneck of AI workloads often lies beyond the four resources. This post covers how a GPU works differently from a CPU, the VRAM and HBM that determine model capacity, how to read nvidia-smi, and sharing a GPU with passthrough, vGPU, and MIG, all from an operator's perspective.
Hardware Intermediate #7: Storage Networking — iSCSI, FC, NVMe-oF, Multipath
Once the disk leaves the server, storage becomes a network problem. This post covers the trade-offs between iSCSI and FC, NVMe-oF for the NVMe era, the multipath operations that handle path redundancy, and the connection to cloud block storage.
Hardware Intermediate #6: RAID in Operation — Rebuild, Scrub, and Backups
The real test of RAID begins after a disk dies. This post covers why the rebuild is the most dangerous window, the URE problem that makes RAID5 risky in the era of large disks, what hot spares and scrubs do, the write cache and its battery, and why RAID is not a backup.
Hardware Intermediate #5: Measuring Storage Performance — fio, Queue Depth, Inside SSDs
Catalog IOPS figures only hold under specific conditions. This post covers how to measure under your own workload's conditions with fio, the trade-off between queue depth and latency, and the SSD internals (write amplification and TRIM) that make the same SSD perform differently today than it did yesterday.