Hardware Advanced #7: Firmware, BMC, and the Lifecycle — The Other Computer Inside Your Server

As long as the power cable is plugged in, there is one computer inside your server that stays on even when the server itself is off: the BMC (Baseboard Management Controller). Post #6 widened the view to data center power and cooling; this final post comes back inside the server. We’ll cover the BMC and the firmware stack, the signals that precede hardware failure, and the lifecycle of hardware from arrival to disposal — and then close out the series.

The BMC: a computer that stays alive when the power is off

The BMC is a small independent computer mounted on the server’s motherboard. It has its own processor (usually ARM-based), its own memory, its own OS (typically embedded Linux), and its own network port. Because it operates completely separately from the main CPU, the BMC keeps responding even when the OS has panicked, the server won’t boot, or you’ve shut the machine down with the power button. It runs on the standby power the power cable supplies.

That independence is the BMC’s reason for existing. If you operate a single server, you can walk over with a monitor and keyboard — but if you have hundreds of servers in a data center in another city, you need a path into a broken server that lives outside the OS. Vendors call it different names. Dell has iDRAC, HPE has iLO, and Supermicro simply calls it IPMI or the BMC, but the job is the same.

What the BMC can do

  • Remote console: From the boot screen to the BIOS setup screen, you see exactly what a directly attached monitor would show — in your browser. When the OS is dead and SSH is gone, this is the last remaining way in.
  • Power control: Power the server on and off remotely, or force a reset. Short of physical access or a switched PDU, it’s the only way to revive a server whose kernel has frozen solid.
  • Sensor readings: Temperatures, fan speeds, voltages, and power supply status, collected independently of the OS. The power measurements in post #5 and the temperature management in #6 mostly read their values from the BMC.
  • Virtual media: Mount an ISO file from your PC onto the server remotely, as if it were a USB drive. Reinstalling the OS no longer requires a trip to the data center.
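
As a rough sketch of the power-control item above, a thin wrapper around ipmitool’s chassis power subcommand might look like this. The function name bmc_power and the BMC_USER variable are assumptions made for illustration; the -E flag tells ipmitool to read the password from the IPMI_PASSWORD environment variable, which keeps it out of the process list.

```shell
# Run a chassis power action against a remote BMC over IPMI-over-LAN.
# Assumption: BMC_USER is a hypothetical variable holding the BMC account name.
bmc_power() {
  host=$1
  action=$2
  # Reject anything that is not a known chassis power action.
  case "$action" in
    status|on|off|cycle|reset|soft) ;;
    *)
      echo "usage: bmc_power HOST status|on|off|cycle|reset|soft" >&2
      return 2
      ;;
  esac
  ipmitool -I lanplus -H "$host" -U "$BMC_USER" -E chassis power "$action"
}
```

Typical use would be something like bmc_power bmc.example.com status, with IPMI_PASSWORD exported beforehand. The soft action requests an ACPI shutdown from the OS, while off and reset act regardless of what the OS is doing.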

The sensors are one command away.

ipmitool sensor (excerpt)
CPU1 Temp        | 54.000    | degrees C  | ok
System Temp      | 31.000    | degrees C  | ok
FAN1             | 6800.000  | RPM        | ok
12V              | 12.190    | Volts      | ok
PS1 Status       | 0x01      | discrete   | ok
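
Output like this is easy to filter. The following is a minimal sketch, assuming the pipe-separated layout shown above (name, value, unit, status) and the usual threshold status codes (ok, nc, cr, nr). The function name flag_bad_sensors is hypothetical. Discrete sensors are skipped because many BMCs report their state as a bit mask rather than a status word.

```shell
# Print threshold sensors whose status column is anything other than "ok".
# Reads `ipmitool sensor`-style text on stdin: name | value | unit | status | ...
flag_bad_sensors() {
  awk -F'|' '
    NF >= 4 {
      name = $1; unit = $3; status = $4
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      gsub(/^[ \t]+|[ \t]+$/, "", unit)
      gsub(/^[ \t]+|[ \t]+$/, "", status)
      # Discrete sensors often encode state as hex bits; skip them in this sketch.
      if (unit == "discrete") next
      # "na"/"ns" mean no reading is available, which is not a fault by itself.
      if (status != "ok" && status != "na" && status != "ns")
        printf "%s: %s\n", name, status
    }'
}
```

It would be used as ipmitool sensor | flag_bad_sensors, printing nothing when every threshold sensor reports ok.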

IPMI and Redfish

Two generations of protocols for talking to the BMC coexist today.

IPMI (Intelligent Platform Management Interface) is the legacy standard from 1998. With the ipmitool command you control power, query sensors, and read the event log. It’s still heavily used in the field because nearly every server supports it, but it’s a binary protocol that’s awkward to work with, and its aging design has accumulated security vulnerabilities over the years.

Redfish is its REST-based successor. It speaks JSON over HTTPS, so curl and jq are all you need for automation.

Query power state via Redfish (the system ID, 1 here, varies by vendor)
curl -sk -u admin:PASSWORD https://bmc.example.com/redfish/v1/Systems/1 | jq '.PowerState'
"On"

For jobs like collecting firmware versions across hundreds of machines or controlling power in bulk, Redfish is dramatically easier to work with. The current direction is clear: build new automation on Redfish by default, and keep IPMI only as a compatibility layer for older equipment.
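
To make the bulk case concrete, here is a hedged sketch that collects BIOS versions from a list of BMCs. It assumes curl and jq are installed, that BMC_USER and BMC_PASS (hypothetical variable names) hold the credentials, and that each BMC exposes the standard Systems collection. Because system IDs differ between vendors, the sketch discovers them from the collection instead of hard-coding /Systems/1.

```shell
# Fetch one Redfish resource as JSON.
# $1 = BMC host, $2 = resource path such as /redfish/v1/Systems
redfish_get() {
  # -k skips TLS verification, as in the example above; a trusted CA is safer.
  curl -sk -u "${BMC_USER}:${BMC_PASS}" "https://$1$2" </dev/null
}

# Read BMC hostnames from stdin (one per line) and print
# host, system path, and BIOS version as tab-separated columns.
bios_versions() {
  while read -r host; do
    [ -n "$host" ] || continue
    redfish_get "$host" /redfish/v1/Systems |
      jq -r '.Members[]."@odata.id"' |
      while read -r path; do
        version=$(redfish_get "$host" "$path" | jq -r '.BiosVersion // "unknown"')
        printf '%s\t%s\t%s\n' "$host" "$path" "$version"
      done
  done
}
```

A run might look like printf '%s\n' bmc01.example.com bmc02.example.com | bios_versions, where the hostnames are placeholders. The same loop shape works for other properties of the system resource, such as PowerState.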

A map of the firmware stack

A single server carries more firmware than you’d expect.

Firmware           | Where it lives  | What it does
BIOS/UEFI          | Motherboard     | Hardware initialization, bootloader handoff, memory and power settings
BMC firmware       | BMC chip        | All the remote management features above
NIC firmware       | Network card    | Packet processing, offload features
SSD firmware       | Each drive      | Wear leveling, garbage collection, cache policy
RAID card firmware | RAID controller | Array management, cache and battery control

A bug in any one of these produces symptoms the OS cannot explain. SSDs with a particular firmware version dying en masse once cumulative power-on hours cross a threshold, or NIC firmware bugs hanging the card on a specific packet pattern — these are recurring, real-world failure classes. When no amount of staring at the OS and the application yields a cause, remember that there’s another layer of suspects one floor down.

The problem is that firmware updates are easy to postpone. They often require a reboot, there’s the fear that a bad update leaves the machine unbootable, and there’s no immediately visible payoff. That’s why well-run organizations manage firmware as validated bundles, not individual files. They pick a vendor-validated bundle version on roughly a quarterly cadence, apply it first to one or two canary servers and watch for a while, then roll it out in waves piggybacked on kernel updates or scheduled maintenance windows where a reboot is needed anyway. The “we’ll update when something breaks” approach is the riskier one — at the moment something breaks, it forces you to jump across several years of accumulated changes in a single update.
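
The canary-then-waves rollout can be planned with very little tooling. The sketch below only assigns hosts to groups; it does not apply any firmware. The function name plan_waves, the canary count, and the wave size are all illustrative assumptions.

```shell
# Assign hosts to a canary group and then to fixed-size waves.
# Reads hostnames on stdin, one per line.
# $1 = number of canary hosts, $2 = hosts per later wave
plan_waves() {
  awk -v canary="$1" -v size="$2" '
    NF == 0 { next }                      # ignore blank lines
    { n++ }
    n <= canary { printf "canary\t%s\n", $1; next }
    { printf "wave%d\t%s\n", int((n - canary - 1) / size) + 1, $1 }'
}
```

For example, plan_waves 2 20 < hosts.txt (with hosts.txt as a hypothetical inventory file) would mark the first two hosts as canaries and group the rest into waves of twenty, which can then be matched to maintenance windows.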

Failures announce themselves

Hardware looks like it dies suddenly, but it usually sends signals first. There are three places to watch.

SMART is the set of health indicators a drive records about itself. Watch the reallocated sector count, uncorrectable errors, and, on NVMe, the Percentage Used endurance estimate alongside the available spare. A disk whose reallocated sector count has started moving off zero has a statistically much higher chance of failing, which means you can plan the replacement before it dies.
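
For SATA drives, these counters can be pulled out of the attribute table that smartctl -A prints. This is a minimal sketch, assuming the common ten-column layout where the first field is the attribute ID and the tenth is the raw value. The function name smart_warnings and the choice of attributes (5, 197, 198) are assumptions, and some vendors use different IDs or raw formats.

```shell
# Print watched SMART attributes whose raw value is above zero.
# Reads `smartctl -A`-style text on stdin.
smart_warnings() {
  awk '
    # Attribute rows start with a numeric ID; field 10 is RAW_VALUE.
    # 5 = reallocated sectors, 197 = pending sectors, 198 = offline uncorrectable
    $1 ~ /^[0-9]+$/ && ($1 == 5 || $1 == 197 || $1 == 198) && $10 + 0 > 0 {
      printf "%s raw=%s\n", $2, $10
    }'
}
```

It would be used as smartctl -A /dev/sda | smart_warnings, where the device path is a placeholder. NVMe drives report a different log format and are not covered by this sketch.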

ECC error counters are memory’s signal. ECC memory silently corrects single-bit errors, and Linux tallies those corrections through the EDAC subsystem.

Check corrected memory error counts
grep . /sys/devices/system/edac/mc/mc*/ce_count
/sys/devices/system/edac/mc/mc0/ce_count:0
/sys/devices/system/edac/mc/mc1/ce_count:142

A correctable error (CE) is not an outage in itself, but a steadily climbing count is a signal. The counters above are per memory controller; once you have narrowed the errors down to one particular DIMM, swap that module before it graduates to an uncorrectable error (UE) and takes the system down.
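
Narrowing a count down to a module usually means reading the per-DIMM counters one level below the controller. The sketch below assumes a kernel and EDAC driver that expose mc*/dimm*/dimm_ce_count and dimm_label; some drivers use rank* or the older csrow* layout instead. The function name dimm_ce_report is hypothetical, and the optional argument exists only so the function can be pointed at a copy of the tree.

```shell
# List DIMMs with a non-zero corrected-error count.
# $1 = EDAC root directory (defaults to the real sysfs path)
dimm_ce_report() (
  root=${1:-/sys/devices/system/edac/mc}
  for f in "$root"/mc*/dimm*/dimm_ce_count; do
    [ -r "$f" ] || continue              # glob did not match, or file unreadable
    count=$(cat "$f")
    [ "$count" -gt 0 ] 2>/dev/null || continue
    dir=${f%/dimm_ce_count}
    rel=${dir#"$root"/}
    label=$(cat "$dir/dimm_label" 2>/dev/null)
    printf '%s\t%s\t%s\n' "$rel" "${label:-unlabeled}" "$count"
  done
)
```

Running dimm_ce_report from cron and comparing the output between runs is one simple way to see which module’s count is climbing.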

The BMC’s system event log (SEL) holds events the OS never gets to record. A power supply that blinked out and came back, a temperature threshold crossing, a fan stall: all are logged in time order. If a server powered off out of nowhere and the OS logs show nothing, the answer is usually in ipmitool sel list.
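
When the log is long, it helps to cut it down to the day of the incident. This is a small sketch, assuming the pipe-separated layout ipmitool sel list commonly prints (ID, date, time, sensor, event, state). The date is matched as a literal string, so it must be written exactly as the local ipmitool version prints it. The function name sel_on_date is hypothetical.

```shell
# Print SEL entries recorded on one date.
# Reads `ipmitool sel list`-style text on stdin; $1 = date string to match.
sel_on_date() {
  awk -F'|' -v want="$1" '
    NF >= 5 {
      d = $2
      gsub(/^[ \t]+|[ \t]+$/, "", d)    # trim padding around the date column
      if (d == want) print
    }'
}
```

It would be used as ipmitool sel list | sel_on_date 03/14/2024, where the date is a placeholder for the day the server went down.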

BMC security: the most powerful door is the most dangerous one

The BMC sits beneath the OS holding power and console in its hands, so a compromised BMC is equivalent to handing over the entire server. Yet BMC firmware is aging embedded software with frequent vulnerabilities, and its updates get postponed even more than the host’s. Hence two rules.

First, isolate BMC ports on a management network. Connect them only to a network separated from service traffic, physically or by VLAN, and never reachable from the internet. Internet-wide scans still turn up exposed BMC login screens by the tens of thousands.

Second, always change the default credentials. Factory accounts like admin/admin are on public lists. Newer equipment ships with per-unit random passwords, but if your gear is a few years old, check for yourself.

The hardware lifecycle: from arrival to disposal

Servers typically arrive with a 3–5 year warranty. While the warranty is alive, component failures are the vendor’s problem; after it expires, parts sourcing and repair costs are all yours. That makes warranty expiry the natural baseline for replacement reviews.

The replacement decision, though, regularly collides with the intuition of “it still works, why throw it away?” This is where the power story from #5 comes back. CPU performance per watt has improved generation after generation, so one new server often does the work of several five-year-old servers on less power. Add up the electricity bill, the floor-space cost, and the staff time spent firefighting aging equipment, and there comes a point where replacing perfectly functional servers wins on total cost. The replacement decision belongs in that total-cost calculation, not on the depreciation ledger.

The last step is disposal. You can throw away the server, but the disks walk out with your data. Deleting files or formatting does not reliably erase data, so the disposal procedure must spell out disk handling. The steps escalate from a full overwrite, to the drive’s built-in Secure Erase or sanitize command (on self-encrypting SSDs, a cryptographic erase that destroys the encryption key), to physical shredding in strictly regulated environments. At every step, the procedure includes keeping a record of what was done. Security incident reports keep featuring the same pattern: customer data recovered from disks sold secondhand.

Wrap-up: closing the series

One line per post, looking back at all seven:

  • Post #1 opened up CPU microarchitecture and used perf to read where cycles leak inside the core.
  • Post #2 covered eBPF as the way to observe kernel internals without recompiling the kernel.
  • Post #3 dug deeper into memory, chasing behavior beyond the page.
  • Post #4 examined how ZFS takes responsibility for data integrity at the filesystem level.
  • Post #5 turned the view outside the server, following the path electricity takes to reach it and what that power costs.
  • Post #6 followed that electricity back out as heat — cooling, and how to think in units of racks.
  • And post #7 covered the other computer inside the server, and the full lifespan of the hardware.

Hardware Basics was the series that built the concepts of components and metrics; Hardware Intermediate used those concepts to diagnose real servers. The advanced series pushed further in two directions from there, bringing both the microarchitecture beneath the kernel and the facilities above the data center floor into a single field of view. Now, when someone says “the server is slow,” you can lay everything from the instruction pipeline to the coolant piping on the same map and reason about it. That map is everything this series set out to leave you with.
