Hardware Intermediate #2: CPU Deep Dive — Turbo, Throttling, Steal Time

In Part 1 we built the framework for reading metrics; now we go through the resources one at a time. First up is the CPU. Basics #2 laid out the concepts of cores, threads, clock, and cache; this post covers how those parts actually behave in production: a clock that doesn’t run at the spec-sheet number, a vCPU that’s yours yet not entirely yours, and a server that’s slow while its cores sit idle.

The clock is not a fixed value

Even if the spec sheet says “3.0 GHz”, the actual clock swings moment to moment. Two mechanisms are responsible.

  • Turbo boost — raises the clock above base when there’s headroom in power and temperature. The key point is that it’s conditional: with only one or two cores working it climbs high, but with all cores busy the power limit holds it to a lower turbo frequency. This is the explanation for “fast in a single-core test, but per-core performance drops under all-core load.”
  • Thermal throttling — when temperature reaches the limit, the CPU forcibly lowers its clock to protect itself. On servers with weak cooling or in dense racks, it shows up as performance dropping a few minutes after load starts.

So when investigating a CPU performance problem, make it a habit to check the actual clock alongside utilization. If utilization is 100% but the clock sits below base, the bottleneck is most likely power or cooling rather than the amount of work.
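As a rough sketch of that habit on Linux, the helpers below average the per-core clock reported in cpuinfo-formatted text and compare it with a base clock you supply from the spec sheet. The function names avg_mhz and clock_vs_base are illustrative, not part of any tool. The sketch assumes an x86 host where /proc/cpuinfo carries one "cpu MHz" line per logical CPU; on other architectures and on some VMs that field may be missing or may not reflect the real clock.

```shell
# Average the "cpu MHz" values found in cpuinfo-formatted text on stdin.
# Assumption: x86 Linux, one "cpu MHz" line per logical CPU.
avg_mhz() {
  awk -F: '/^cpu MHz/ { sum += $2; n++ }
           END { if (n) printf "%.0f\n", sum / n; else print "n/a" }'
}

# Compare a measured clock with the spec-sheet base clock.
# Both arguments must be whole numbers in MHz.
clock_vs_base() {
  if [ "$1" -lt "$2" ]; then
    echo "below base"
  else
    echo "at or above base"
  fi
}

# Typical use on a live host (3000 stands in for a 3.0 GHz base clock):
#   clock_vs_base "$(avg_mhz < /proc/cpuinfo)" 3000
```

Sampling this a few times while the load is running is more telling than a single reading, because the clock changes from moment to moment.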

On top of this sits one more layer: the OS power policy. If the Linux CPU governor (the setting that decides the policy for adjusting the clock) is set to powersave, the clock can stay low or ramp up slowly even when load arrives; the exact behavior depends on the cpufreq driver in use. Latency-sensitive servers can turn sluggish from this one setting alone. For database or low-latency service machines, consider the performance governor.
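One way to see which governor each CPU is using is to read the cpufreq entries in sysfs. The sketch below counts CPUs per governor. The function name governor_summary is made up for this example, and the optional directory argument exists only so the function can be pointed at a copy of the tree. It assumes the kernel exposes /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor, which is not always the case.

```shell
# Count how many CPUs use each governor.
# Argument: sysfs CPU root (defaults to the live tree).
# Prints lines such as "powersave: 8"; prints nothing if cpufreq is not exposed.
governor_summary() {
  root="${1:-/sys/devices/system/cpu}"
  for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
    if [ -r "$f" ]; then
      cat "$f"
    fi
  done | sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Typical use on a live host:
#   governor_summary
```

If nothing prints, the cpufreq interface is probably not exposed. That is common on cloud VMs, where the host rather than the guest controls the clock.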

Steal time: the time the hypervisor took

Time to pick up the st (steal) we set aside in Part 1’s CPU breakdown. Steal time is the time a virtual machine tried to use the CPU but couldn’t run because the hypervisor gave the physical CPU to another guest. As we saw in Basics #7, a vCPU is a promise of a time-shared slice of a physical core, so when other virtual machines on the same host are busy, your turn comes late.

The operational baseline looks like this.

  • st staying near 0 is normal.
  • st persistently above a few percent means there’s a noisy neighbor on the same host, or the host itself is overcommitted.
  • The key point is that there’s nothing to fix inside your own VM. The remedy is moving the instance to another host (usually a stop/start or redeploy, since an in-place reboot typically stays on the same host) or switching to a more isolated instance type.

On burstable cloud instances (credit-based, like the t family), there’s another cause with similar symptoms. When CPU credits run out, you’re pinned to baseline performance. Looking at st and the credit balance together distinguishes the two cases.
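To put a number on st without a monitoring tool, you can diff two readings of the aggregate cpu line in /proc/stat. The sketch below does that. The function name steal_pct is illustrative, and it assumes the usual Linux field order on that line: user, nice, system, idle, iowait, irq, softirq, steal, then the guest fields.

```shell
# Print steal as a percentage of all CPU time between two samples.
# Arguments: two aggregate "cpu ..." lines from /proc/stat, taken some seconds apart.
steal_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    { total[NR] = 0
      for (i = 2; i <= 9; i++) total[NR] += $i   # user through steal; guest time is already counted in user
      steal[NR] = $9 }
    END {
      dt = total[2] - total[1]
      if (dt > 0) printf "%.1f\n", 100 * (steal[2] - steal[1]) / dt
      else print "0.0"
    }'
}

# Typical use on a live host:
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   steal_pct "$a" "$b"
```

A single high reading says little. What matters against the baseline above is whether the value stays elevated across repeated samples.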

Context switching: the cost of changing jobs

When far more threads than cores are running, the CPU alternates between jobs in fine slices. That switch is a context switch (saving the state of the running task and jumping to another one), and it isn’t free. On top of the direct cost of saving and restoring state, there’s an indirect cost: the data piled up in the cache we saw in Basics #2 is irrelevant to the new task, so cache misses go up.

The symptom shows in Part 1’s breakdown. If us is low but sy is high, and context switches per second are several times the normal rate, the machine is spending its time switching between jobs rather than doing them. Common causes are worker thread counts far beyond core count, and threads repeatedly waking and sleeping under lock contention. The remedy is on the configuration side, not the hardware side. Start by bringing worker counts closer to the core count.
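If you want the context-switch rate without extra tooling, the kernel keeps a running total in the ctxt line of /proc/stat, and vmstat reports the same signal in its cs column. The sketch below turns two readings of that counter into a per-second rate. The function names read_ctxt and ctxt_rate are illustrative, and what counts as "normal" is something you have to establish from your own machine's baseline.

```shell
# Read the cumulative context-switch counter.
# Argument: stat file to read (defaults to /proc/stat).
read_ctxt() {
  awk '/^ctxt/ { print $2 }' "${1:-/proc/stat}"
}

# Context switches per second between two readings.
# Arguments: first counter, second counter, whole seconds between them.
ctxt_rate() {
  echo $(( ($2 - $1) / $3 ))
}

# Typical use on a live host:
#   c1=$(read_ctxt); sleep 5; c2=$(read_ctxt)
#   ctxt_rate "$c1" "$c2" 5
#   nproc    # core count, for comparison with configured worker counts
```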

CPU pinning: fixing the seat to keep the cache

By default the scheduler places tasks on any core and moves them around freely. For most workloads that’s the best behavior, but a process that’s extremely latency-sensitive loses its cache every time it moves. CPU pinning (fixing a process to specific cores) blocks that movement to preserve cache hit rates and latency consistency.

terminal
# Pin process PID 1234 to cores 2 and 3
taskset -cp 2,3 1234

Pinning cuts both ways, though. When the pinned cores are busy, the process can’t borrow other cores even if they’re idle. It’s unnecessary for ordinary services — it’s a tool you reach for only when latency jitter itself is the problem, such as in low-latency network processing or real-time workloads. The topic returns in Part 4 with NUMA, together with memory placement.
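To confirm that a pin like the taskset call above took effect, you can read the process's allowed-CPU list from its status file in /proc. This is a minimal sketch: the function name allowed_cpus is illustrative, and it relies on the Cpus_allowed_list field that Linux writes to /proc/<pid>/status.

```shell
# Print the CPUs a process is allowed to run on.
# Argument: path to a status file, for example /proc/1234/status.
allowed_cpus() {
  awk '/^Cpus_allowed_list:/ { print $2 }' "${1:-/proc/self/status}"
}

# Typical use on a live host, for the PID pinned earlier:
#   allowed_cpus /proc/1234/status
```

The kernel prints the list in range form, so a process pinned to cores 2 and 3 shows up as 2-3.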

Common pitfalls

  • Promising performance based on the spec-sheet clock — the effective clock under all-core load is lower than single-core turbo. Size capacity plans against all-core load.
  • Diagnosing st as your own server’s problem — steal is determined at the host level. No amount of process optimization inside the VM reduces it. Migration is the remedy.
  • Adding threads to gain speed — if the cores are saturated, adding threads only adds context switching. Check Part 1’s saturation metrics before scaling threads.

Wrap-up

The picture from this post:

  • The clock swings with turbo, throttling, and governor policy. Look at the actual clock along with utilization.
  • Steal time is a contention signal at the hypervisor level, and the remedy is migration, not internal tuning.
  • A context-switching storm shows up as rising sy, and the common cause is excessive thread settings.
  • CPU pinning is a tool reserved for workloads where latency consistency matters.

Next: memory deep dive

The next post, “Hardware Intermediate #3: Memory Deep Dive — available, Dirty Pages, Container Limits,” moves on to the second resource. Basics #3 laid out the concepts of the page cache and the OOM Killer; next we step up to the operations layer — reading free output precisely, the dirty pages behind write bursts, and memory limits in the container era.
