Hardware Advanced #2: eBPF Observability — Seeing the Tail the Average Hides

8 min read

The perf we used in #1 was a sampling tool. It snapshotted where the CPU was thousands of times per second and drew a statistical picture of “where the time goes.” eBPF, the subject of this post, takes a different approach. You attach a small program at the point in the kernel where an event happens, and it records every single occurrence directly. If sampling is a landscape photo, eBPF is a traffic log at a checkpoint. The latency of every individual disk I/O, the time every individual task waited before getting a CPU — all of it gets captured.

This difference matters because the problems that really hurt in operations live in the tail, not the average.

What averages hide — distributions and tail latency #

The %wa and load average from Intermediate #1 are a good starting point, but both are averages. Averages flatten the distribution.

Say 990 out of 1,000 disk I/Os took 0.2ms and 10 took 200ms. The average comes out around 2.2ms — looks perfectly fine. But if a single user request triggers dozens of I/Os, a substantial share of requests will step on one of those 10 slow ones. The thing users describe as “it stutters once in a while” is exactly this tail latency — the handful of slow events at the far end of the distribution. Averages and utilization metrics can’t show you this tail. You need the whole distribution: a histogram.

eBPF is the right tool for this job because it can aggregate every event inside the kernel. Shipping data to user space for each event would be expensive, but accumulating a histogram in the kernel and pulling out only the summary at the end keeps the cost low.

eBPF — small programs that run inside the kernel #

eBPF (extended Berkeley Packet Filter) started out as a packet filter, as the name suggests, but today it’s an execution environment for general-purpose mini programs that run safely inside the kernel. Attach a program to a kernel event — a system call entry, a disk I/O completion, a scheduler task switch — and it runs every time that event fires, taking timestamps and bumping counters.

If “running arbitrary code inside the kernel” sounds dangerous, your instincts are right. Doing the same thing with a kernel module means a single bug ends in a kernel panic. What sets eBPF apart is the verifier. Before a program is loaded into the kernel, the verifier inspects every execution path and rejects anything with a loop that might not terminate, a disallowed memory access, or a size beyond the limit. Only programs that pass verification ever run, so an eBPF tracing tool taking down the kernel is structurally ruled out.

You’ll rarely write eBPF by hand. Operators typically work with two layers of tooling: bpftrace for ad-hoc questions, and the BCC tool collection for structured analysis.

bpftrace — asking questions in one line #

bpftrace lets you write eBPF programs as one-line scripts. The shape is “event /condition/ { action }”.

bpftrace one-liner — tracing file opens
# Print, in real time, which process is opening which file on this server
$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
nginx /var/log/nginx/access.log
postgres /var/lib/pgsql/data/base/16384/2619

tracepoint:syscalls:sys_enter_openat is the event to attach to, and what’s inside the braces runs on every event. A histogram is a one-liner in the same syntax.

bpftrace one-liner — read size distribution
# Aggregate the distribution of byte counts returned by read(), inside the kernel
$ sudo bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ { @bytes = hist(args->ret); }'

Hit Ctrl-C and the histogram accumulated so far is printed. This “get only the summary at the end” shape is the basic form of eBPF observation.

biolatency — reading the distribution of block I/O latency #

BCC’s biolatency measures the latency of every individual block device I/O and presents it as a histogram. Here’s a single 10-second aggregation.

biolatency 10 1
$ sudo biolatency 10 1
Tracing block device I/O... Hit Ctrl-C to end.

     usecs               : count     distribution
        64 -> 127        : 9        |*                                       |
       128 -> 255        : 156      |**********************                  |
       256 -> 511        : 274      |****************************************|
       512 -> 1023       : 92       |*************                           |
      1024 -> 2047       : 18       |**                                      |
     32768 -> 65535      : 5        |                                        |
     65536 -> 131071     : 4        |                                        |

The buckets are log-scale, doubling each step. You read it in two passes.

  • Where is the body? Most of the I/O clusters at 128–511us. For an SSD, that’s a normal range.
  • Is there a tail? There are 4 events in the 65–131ms range — I/Os hundreds of times slower than the body. In iostat’s average latency (await) they’d be buried under the body and invisible; in the histogram they show up as a separate, detached peak. The usual suspects for this tail are SSD-internal garbage collection, write cache flushes, and queue contention with other workloads — and that’s where the real analysis begins.

A case that a single average number would have closed as “the disk is fine” turns, in the distribution, into a traceable lead: “990 events are fine, 4 are anomalous.”

runqlat — how long tasks waited for a CPU #

There are servers where CPU utilization sits at 50% and the application is still slow. What you need to look at then is scheduler wait: the time a task that’s ready to run spends in the run queue before being granted a CPU. BCC’s runqlat shows this distribution.

runqlat 10 1
     usecs               : count     distribution
         0 -> 1          : 1306     |**********************                  |
         2 -> 3          : 2342     |****************************************|
         4 -> 7          : 1217     |*********************                   |
      8192 -> 16383      : 41       |                                        |
     16384 -> 32767      : 23       |                                        |

Most tasks get a CPU within a few microseconds, but in the tail more than 60 tasks waited 8–32ms. The classic causes of a tail like this despite low utilization are container CPU limits (once the cgroup quota is exhausted, tasks are forced to wait until the next period), work piling onto specific cores, and short bursts arriving simultaneously. If the load average from Intermediate #1 told you “there is a queue,” runqlat tells you how many milliseconds that queue actually held each task.

A map of the BCC tools #

BCC ships dozens of single-purpose tools in this style. The ones you’ll reach for most often:

ToolWhat it shows
execsnoopEvery newly executed process. Catches storms of short-lived processes
opensnoopEvery file open. Also handy for tracing where config files are read from
biolatency / biosnoopBlock I/O latency histogram / per-event detail
runqlatScheduler wait time distribution
ext4slowerPrints only filesystem operations slower than a threshold (variants exist for XFS and others)
cachestatPage cache hit ratio
tcplifeLifetime and bytes transferred per TCP connection. A per-connection summary
tcpretransWhen TCP retransmissions happen and to whom

There’s a naming convention here. -snoop streams events one by one, -lat aggregates a latency histogram, and -slower filters for only the slow ones. When you meet a new tool, the name tells you roughly what it does.

Overhead and production caveats #

eBPF is lightweight, but it isn’t free. The cost scales with event frequency.

  • Think about event frequency first. Process executions, which execsnoop hooks, run at tens per second — negligible. Scheduler switches, which runqlat hooks, can run at hundreds of thousands per second. On a busy server that can mean a few extra percent of CPU, so remember that at the worst moment of an incident, the measurement itself becomes load.
  • Aggregating tools are cheaper than per-event output. Histograms stay in the kernel and only the summary crosses over. For the same events, biolatency (aggregated) costs less than biosnoop (per-event).
  • Separate diagnostic tools from always-on monitoring. The tools in this post are meant to be switched on when a problem appears and switched off once you have your answer. If you need continuous collection, the right move is to evaluate a dedicated eBPF-based monitoring agent.
  • There are kernel version and privilege prerequisites. You need a reasonably recent kernel (RHEL 8, Ubuntu 18.04 or later) and root. Thanks to the verifier the kernel won’t die, but on older kernels the tools may simply fail to load.

Wrap-up #

The picture we built in this post:

  • Average metrics like %wa and load average flatten the distribution. The tail latency that degrades perceived quality is only visible in a histogram.
  • eBPF is a small program attached to kernel events; the verifier guarantees safety, and in-kernel aggregation keeps the cost low.
  • Use biolatency to separate the body of block I/O from its tail, and runqlat to confirm scheduler wait on servers that are slow despite low utilization.
  • Ask ad-hoc questions with bpftrace one-liners and structured ones with BCC tools, gauging overhead by event frequency and whether the tool aggregates.

Next — memory deep dive #

The next post, “Hardware Advanced #3: Memory Deep Dive,” steps into the second resource. Virtual memory and page tables, the cost of TLB misses, and the tug-of-war between page reclaim and the cache — how memory actually behaves behind the usage graph.

X