Hardware Intermediate #8: GPUs and Accelerators — The Fifth Resource of the AI Era
This series has viewed the world through four resources: CPU, memory, storage, and network. Then AI workloads entered the server room, and a fifth resource became standard equipment: the GPU. This post looks at the GPU through an operator’s eyes — not the depths of its internal architecture, but the metrics, the bottlenecks, and the sharing problems you face when running GPU servers.
What makes a GPU different: a wide, shallow workforce
The CPU of Basics #2 was a small number of cores that handle complex work fast. The GPU is the opposite design. It deploys thousands of cores that only do simple math and applies the same operation to massive amounts of data at once. Matrix multiplication is exactly that kind of work, and deep learning training and inference are essentially giant matrix multiplications repeated — which is why the GPU became the standard equipment of AI.
One corollary matters for operations: a GPU cannot work alone. Preparing data, sending it into GPU memory, and collecting the results is the job of the CPU, memory, and storage. In other words, the bottleneck of a GPU server may not be the GPU. A training server with low GPU utilization where the culprit turns out to be slow data loading (storage) or preprocessing (CPU) is a common story. The diagnostic methods for the four resources remain the foundation on GPU servers too.
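One way to check whether the GPU is being starved is to time the two halves of the training loop separately. The following is a minimal sketch, not a profiler: `measure_input_stall`, `slow_loader`, and `fake_step` are hypothetical names, and the sleeps stand in for real storage reads and GPU work.

```python
import time


def measure_input_stall(batches, train_step):
    """Return (wait_seconds, compute_seconds) for one pass over `batches`."""
    wait = 0.0
    compute = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is the supply side (storage, CPU)
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)     # stand-in for the GPU work
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait, compute


def slow_loader(n, delay):
    # Hypothetical loader that simulates slow storage or preprocessing.
    for i in range(n):
        time.sleep(delay)
        yield i


def fake_step(batch, duration=0.003):
    # Hypothetical training step. A real step must wait for the GPU to finish
    # (for example by synchronizing) before the timer is read, because GPU
    # work is normally launched asynchronously.
    time.sleep(duration)


wait, compute = measure_input_stall(slow_loader(10, 0.03), fake_step)
stall_ratio = wait / (wait + compute)
print(f"waiting on data: {stall_ratio:.0%} of wall time")
```

If most of the wall time lands in the `next(it)` half, the GPU is not the bottleneck, and the diagnosis continues with storage and CPU.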
VRAM: the new memory wall
A GPU carries its own memory, VRAM, and high-performance GPUs use HBM (High Bandwidth Memory: memory stacked on the GPU package to maximize bandwidth). The difference from regular RAM is a trade of capacity for bandwidth. System RAM scales to hundreds of GB or even several TB per server, while VRAM stays in the range of tens of GB to the low hundreds of GB per GPU but, with HBM, delivers bandwidth of several TB/s. That bandwidth is what it takes to keep thousands of cores from starving.
This structure creates the first constraint of AI operations: do the model and its working data fit in VRAM? If a model is bigger than VRAM, you split it across multiple GPUs or shrink it by lowering precision (quantization). In LLM inference, on top of the model’s weights, the context of concurrent requests (the KV cache) also eats VRAM, so the number of concurrent users becomes a function of VRAM capacity. The problem that plays out as “out of memory → swap” in the CPU world appears in the GPU world in a far more uncompromising form: “out of VRAM → immediate OOM error.”
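The arithmetic behind "concurrent users are a function of VRAM" can be sketched as follows. This is a rough estimate under stated assumptions: a standard transformer that stores one key and one value vector per layer and per KV head for every token, a fixed reserve for activations and runtime overhead, and a hypothetical 8-billion-parameter model shape. None of these numbers come from a specific product.

```python
GIB = 1024 ** 3


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    # Keys and values (the factor 2) are kept for every layer and every KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_concurrent_sequences(vram_gib, num_params, bytes_per_param,
                             tokens_per_sequence, per_token_bytes,
                             reserve_fraction=0.10):
    # reserve_fraction is an assumed safety margin for activations,
    # the runtime, and fragmentation; tune it against real measurements.
    usable = vram_gib * GIB * (1 - reserve_fraction)
    weights = num_params * bytes_per_param
    free_for_cache = usable - weights
    if free_for_cache <= 0:
        return 0  # the weights alone do not fit: split across GPUs or quantize
    return int(free_for_cache // (tokens_per_sequence * per_token_bytes))


# Hypothetical model shape: 8e9 parameters, 32 layers, 8 KV heads, head size 128.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

fp16_capacity = max_concurrent_sequences(80, 8e9, 2.0, 4096, per_token)
int4_capacity = max_concurrent_sequences(80, 8e9, 0.5, 4096, per_token)
too_big = max_concurrent_sequences(80, 70e9, 2.0, 4096, per_token)

print("KV cache per token (bytes):", per_token)
print("sequences at FP16 weights:", fp16_capacity)
print("sequences at 4-bit weights:", int4_capacity)
print("70e9 parameters at FP16 on one GPU:", too_big)
```

The two levers from the text show up directly: quantizing the weights frees room for more KV cache, and a model whose weights exceed usable VRAM yields zero capacity until it is split or shrunk.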
nvidia-smi: the GPU checklist
The basic tool of GPU operations is nvidia-smi. We read it by applying the utilization–saturation–errors frame from Part 1 as is.
+-------------------------------+----------------------+
| GPU Name Persistence-M | GPU-Util Memory-Usage |
| 0 H100 80GB On | 92% 71GiB/80GiB |
| Temp 76C Pwr 610W / 700W | |
+-------------------------------+----------------------+
- GPU-Util (utilization): a metric to read with care. It reports the share of the sampling period during which at least one kernel was running, so it can show 100% even when only a fraction of the thousands of cores are working. High utilization doesn’t prove the GPU is being used well; conversely, low utilization on a GPU that is supposed to be busy usually points to a supply bottleneck.
- Memory-Usage (VRAM): the gauge of the first constraint above. Near the limit, the next request can trigger an OOM.
- Temp / Pwr: the throttling of Part 2 exists on GPUs too. Hit the temperature or power limit, and the clock gets cut. In dense GPU servers, cooling is a performance item.
- Errors: the ECC error count and Xid logs (GPU error codes the driver writes to the system log) are the signals of hardware trouble.
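The checklist above can be turned into a small script. The sketch below parses text in the shape produced by `nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw,power.limit --format=csv,noheader,nounits` and flags the items worth a second look. The sample rows are made up, the helper names are hypothetical, and the thresholds are assumptions to adjust per fleet, not vendor recommendations.

```python
import csv
import io

QUERY_FIELDS = ["index", "name", "utilization.gpu", "memory.used",
                "memory.total", "temperature.gpu", "power.draw", "power.limit"]

# Made-up sample in the csv,noheader,nounits shape (memory in MiB, power in W).
SAMPLE = """\
0, NVIDIA H100 80GB HBM3, 92, 72704, 81559, 76, 610.00, 700.00
1, NVIDIA H100 80GB HBM3, 12, 20480, 81559, 54, 180.00, 700.00
"""


def parse_gpu_rows(csv_text):
    rows = []
    for rec in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if rec:  # skip blank lines
            rows.append(dict(zip(QUERY_FIELDS, (v.strip() for v in rec))))
    return rows


def _num(value):
    try:
        return float(value)
    except ValueError:
        return None  # some fields can come back as "[N/A]"


def check_gpu(row, vram_warn=0.85, power_warn=0.95, temp_warn_c=85.0, util_low=30.0):
    # All thresholds are assumed values for illustration.
    findings = []
    util = _num(row["utilization.gpu"])
    used, total = _num(row["memory.used"]), _num(row["memory.total"])
    temp = _num(row["temperature.gpu"])
    draw, limit = _num(row["power.draw"]), _num(row["power.limit"])
    if util is not None and util < util_low:
        findings.append("low GPU-Util: check data loading, preprocessing, network")
    if used is not None and total and used / total >= vram_warn:
        findings.append("VRAM near limit: the next request may hit OOM")
    if temp is not None and temp >= temp_warn_c:
        findings.append("temperature high: clock throttling likely")
    if draw is not None and limit and draw / limit >= power_warn:
        findings.append("power near cap: clock throttling likely")
    return findings


rows = parse_gpu_rows(SAMPLE)
for row in rows:
    print(row["index"], row["name"], check_gpu(row) or "ok")
```

In production the sample text would be replaced by the output of the command itself, for example via `subprocess.run`. Note that a healthy-looking GPU-Util is deliberately not treated as proof of efficiency here; only the low case is flagged.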
Sharing a GPU: passthrough, vGPU, MIG
GPUs are expensive, and not every workload uses a whole GPU. So the virtualization question of Basics #7 repeats with GPUs: who gets a card, and how?
| Method | What it does | Where it fits |
|---|---|---|
| Passthrough | Wires one whole GPU directly to a single virtual machine | Training and other whole-GPU work; minimal performance loss, no sharing |
| vGPU | Time-slices the GPU in software across multiple VMs | VDI, graphics, light sharing |
| MIG | Splits the GPU into independent partitions at the hardware level | Isolated multi-tenancy for inference services |
MIG (Multi-Instance GPU) is the operationally interesting one. It physically divides the compute units, the VRAM, and even the cache, so the noisy-neighbor problem we saw in Part 2 structurally doesn’t exist between partitions. The cloud products that sell one big GPU as if it were seven small inference GPUs sit on this technology. Conversely, for work that burns the whole GPU, like training, passthrough (on the cloud, a dedicated GPU instance) is the default.
Go multi-GPU, and the connection becomes yet another resource. Dedicated interconnects that link GPUs with far greater bandwidth than PCIe (NVLink and the like) and high-speed, RDMA-capable networks between servers (InfiniBand or RoCE) determine the performance of distributed training. The NUMA of Part 4 returns as well: which socket a GPU hangs off changes the cost of moving data between CPU and GPU.
Common pitfalls
- Reading GPU-Util 100% as full throttle — it only means a kernel was running. Read it together with throughput (tokens per second, step time); precise utilization is the realm of profiling tools.
- Buying the GPU and ignoring the supply lines — if data loading, preprocessing, or the network starves it, the expensive GPU idles. When GPU utilization is low, turn your eyes back to the four resources.
- Planning VRAM by the average — exceeding VRAM means an immediate OOM, so plan for the maximum (the peak of concurrent requests), not the average.
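The last pitfall is easy to see with numbers. The sketch below compares VRAM demand at the average and at the peak of concurrent requests. Every value in it is hypothetical: the request trace, the weight size, the per-request cost, and the card size.

```python
from statistics import mean


def vram_need_gib(concurrent_requests, weights_gib, per_request_gib):
    # Weights are a fixed cost; each in-flight request adds its own share
    # (mostly KV cache in LLM inference).
    return weights_gib + concurrent_requests * per_request_gib


# Hypothetical samples of concurrent requests, taken at regular intervals.
samples = [4, 6, 5, 7, 30, 6, 5, 8, 28, 6]
WEIGHTS_GIB = 16.0      # assumed model size in VRAM
PER_REQUEST_GIB = 0.5   # assumed cost of one in-flight request
VRAM_GIB = 24.0         # assumed card size

avg_need = vram_need_gib(mean(samples), WEIGHTS_GIB, PER_REQUEST_GIB)
peak_need = vram_need_gib(max(samples), WEIGHTS_GIB, PER_REQUEST_GIB)
oom_samples = sum(
    vram_need_gib(n, WEIGHTS_GIB, PER_REQUEST_GIB) > VRAM_GIB for n in samples
)

print(f"need at average load: {avg_need:.2f} GiB (fits: {avg_need <= VRAM_GIB})")
print(f"need at peak load:    {peak_need:.2f} GiB (fits: {peak_need <= VRAM_GIB})")
print(f"samples that would OOM: {oom_samples} of {len(samples)}")
```

The average fits comfortably while the two bursts do not, and on a GPU each of those bursts is an OOM error rather than a slowdown. Capping concurrency at the serving layer is the usual alternative to buying for the peak.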
Wrap-up
The picture we built in this post:
- A GPU is a device that runs the same operation across thousands of simple cores at once, and it only works when the supply (CPU, storage, network) keeps up.
- VRAM is the first constraint of AI operations. The model plus the peak of concurrent requests is the basis of capacity planning.
- Read nvidia-smi with the utilization–saturation–errors frame too, but know exactly what GPU-Util means. Temperature and power are the forecast of throttling.
- Sharing is a spectrum of passthrough, vGPU, and MIG; isolation level and use case are the selection criteria.
Next: a hands-on diagnosis walkthrough
All the parts are now on the table. In the final post, “Hardware Intermediate #9: Hands-On: Diagnosing a Slow Server — Series Finale,” we apply the entire series to a single incident. Starting from the report “the service is slow,” we narrow down across the four resources, find the cause, and verify the fix — following an operational diagnosis from start to finish.