Infrastructure
Hardware Advanced #7: Firmware, BMC, and the Lifecycle — The Other Computer Inside Your Server
A look at the BMC, the management computer that stays on independently of the main CPU. It covers remote console and power control, IPMI and Redfish, the firmware stack and update operations, failure prediction with SMART and ECC counters, management-network security, and the lifecycle from warranty expiry to disk disposal — closing out the Hardware Advanced series.
Hardware Advanced #6: Data Center Cooling and Racks — Electricity Always Becomes Heat
Nearly all the power that enters a server comes back out as heat. Starting from the basic airflow contract of front intake and rear exhaust, this post maps out data center cooling end to end: hot/cold aisle containment, rack density and the limits of air cooling, liquid cooling with D2C and immersion, and how ASHRAE temperature guidelines tie into PUE.
Hardware Advanced #5: Datacenter Power — The Real Reason You Can't Rack More Servers
Even with empty slots in the rack, new servers get rejected — because of the power budget. This post walks the power environment a server lives in, from an operator's point of view: PSU redundancy and A/B feeds, per-rack kW contracts, PDUs and UPS, generators and ATS, PUE, and the power density that GPU servers have driven up.
Hardware Advanced #4: ZFS Deep Dive — When RAID and the Filesystem Become One
ZFS merged RAID, volume management, and the filesystem into a single layer, solving the structural problems of the traditional stack. This post walks through it all from an operations point of view: copy-on-write that eliminates the write hole, checksums that verify every read and enable self-healing, resilver that copies only live data, RAIDZ and the ARC, snapshots with send/recv, and lz4 compression.
Hardware Advanced #3: Memory Deep Dive — Page Cache, THP, and Bandwidth
A tour inside the kernel memory machinery: the read and write paths through the page cache, the latency spikes THP creates, explicit hugepages and the TLB, how swappiness is actually implemented along with zswap, and the memory bandwidth bottleneck that keeps throughput flat even when cores sit idle.
Hardware Advanced #2: eBPF Observability — Seeing the Tail the Average Hides
eBPF is a technology for tracing system events directly with small programs that run safely inside the kernel. This post covers reading the latency distributions and tails that averages hide with biolatency and runqlat, a map of the BCC tools, and the overhead caveats for production use.
Hardware Advanced #1: CPU Microarchitecture and perf — Why the Same 100% Isn't the Same
Two CPUs can both read 100% utilization while getting very different amounts of work done. This post uses IPC, cache misses, and branch mispredictions to read the microarchitecture behind the utilization number, and shows how to tell memory stalls from genuine compute saturation in perf stat output.
Hardware Intermediate #9: Hands-On: Diagnosing a Slow Server — Series Finale
A diagnostic walkthrough that starts from a "the service is slow" report and narrows down through the four resources one by one. Define the symptom, check each resource, confirm the hypothesis, apply a fix, and re-measure. We close the Hardware Intermediate series with the principles of tuning.
Hardware Intermediate #8: GPUs and Accelerators — The Fifth Resource of the AI Era
The bottleneck of AI workloads often lies beyond the four resources. This post covers how a GPU works differently from a CPU, the VRAM and HBM that determine model capacity, reading nvidia-smi, and sharing a GPU with passthrough, vGPU, and MIG, all from an operator's perspective.
AWS Certified CloudOps Engineer - Associate (SOA-C03) #15 Full-Scale Multiple-Choice Mock Exam — 50 Questions + Explanations
The final post of the SOA-C03 series. Matched to the real exam's domain weights (monitoring 22%, reliability 22%, deployment automation 22%, networking 18%, security 16%), it presents 50 questions, each with an answer and explanation, so you can find your weak domains. Solve them on the clock, then go back to the relevant domain post to shore up any gaps.
Hardware Intermediate #7: Storage Networking — iSCSI, FC, NVMe-oF, Multipath
Once the disk leaves the server, storage becomes a network problem. This post covers the trade-offs between iSCSI and FC, NVMe-oF for the NVMe era, multipath operations that handle path redundancy, and the connection to cloud block storage.
AWS Certified CloudOps Engineer - Associate (SOA-C03) #14 Exam Tips and Common Operational Scenario Mistakes
The fourteenth post of the SOA-C03 series, a final review right before the exam. It covers the common pitfalls that cut across domains, the keywords that separate similar services, how to read scenario questions, time management strategy, and a final pre-exam checklist.