Go Advanced #6 Profiling — pprof and benchmark


Sunday, February 8, 2026


After #5 (unsafe and cgo), this post covers a tool of the opposite flavor: measurement.

“Don’t guess; measure.”

Performance problems usually turn up somewhere other than where you suspect. Go’s standard tools are powerful enough to take you from guesswork to measurement quickly.

benchmark — standard tooling #

simple benchmark

// adder_test.go
package adder

import "testing"

func BenchmarkAdd(b *testing.B) {
	for i := 0; i < b.N; i++ {
		Add(1, 2)
	}
}

run

go test -bench=. -benchmem

result

BenchmarkAdd-8    1000000000    0.30 ns/op    0 B/op    0 allocs/op

b.N — auto-tuned to make timing stable
-benchmem — adds allocation info
ns/op — time per single execution
B/op / allocs/op — bytes and count of allocations
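
The same four numbers can also be read in code, which makes the mapping explicit. The following is a minimal sketch, not the post's own code: the body of `Add` is an assumption (the post never shows it), and `testing.Benchmark` is used only so the example runs as a plain program instead of through `go test`.

```go
package main

import (
	"fmt"
	"testing"
)

// Add is a hypothetical stand-in; the post does not show its body.
func Add(a, b int) int { return a + b }

// sink keeps the last result observable outside the loop.
var sink int

// BenchmarkAdd has the same shape as a benchmark in a _test.go file.
func BenchmarkAdd(b *testing.B) {
	b.ReportAllocs() // like -benchmem, but only for this benchmark
	r := 0
	for i := 0; i < b.N; i++ {
		r = Add(i, 2)
	}
	sink = r
}

func main() {
	res := testing.Benchmark(BenchmarkAdd)
	fmt.Println("iterations (b.N):", res.N)
	fmt.Println("ns/op:           ", float64(res.T.Nanoseconds())/float64(res.N))
	fmt.Println("B/op:            ", res.AllocedBytesPerOp())
	fmt.Println("allocs/op:       ", res.AllocsPerOp())
}
```

In a real project the benchmark function would stay in `adder_test.go` and be run with `go test -bench=. -benchmem`; the program form here is only for illustration.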

Benchmark-writing fundamentals #

reset pattern

func BenchmarkParse(b *testing.B) {
	data := loadBigInput()    // heavy setup

	b.ResetTimer()             // measure from here
	for i := 0; i < b.N; i++ {
		Parse(data)
	}
}

ResetTimer excludes the setup cost from the measurement. If setup has to happen on every iteration, pause and resume the timer inside the loop with b.StopTimer and b.StartTimer.
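
A sketch of the in-loop variant, under assumptions: the workload (sorting a fresh copy of a fixed pseudo-random slice) and the name `BenchmarkSortFresh` are invented for illustration. Stopping and starting the timer has a cost of its own, so this pattern fits best when the timed section is substantial.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
	"testing"
)

// BenchmarkSortFresh times only sort.Ints. Restoring the unsorted input is
// setup that must run on every iteration, so it is excluded from the timing.
func BenchmarkSortFresh(b *testing.B) {
	// Fixed seed: the same hypothetical input on every run.
	src := rand.New(rand.NewSource(1)).Perm(1 << 16)
	buf := make([]int, len(src))

	b.ResetTimer() // exclude the one-time setup above
	for i := 0; i < b.N; i++ {
		b.StopTimer()
		copy(buf, src) // per-iteration setup: restore the unsorted input
		b.StartTimer()

		sort.Ints(buf) // only this call is timed
	}
}

func main() {
	// testing.Benchmark runs the function outside of `go test`.
	fmt.Println(testing.Benchmark(BenchmarkSortFresh))
}
```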

Avoiding compiler optimizations #

common pitfall

func BenchmarkSum(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sum(1, 2)    // ✗ if the result is unused, the compiler may eliminate it entirely
	}
}

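Solution: keep the result in a local variable and store it in a package-level variable after the loop, so the compiler cannot treat the call as dead code. A sub-nanosecond result such as 0.30 ns/op is a typical sign that the call was inlined or eliminated.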

preserve the result

var benchResult int

func BenchmarkSum(b *testing.B) {
	var r int
	for i := 0; i < b.N; i++ {
		r = sum(1, 2)
	}
	benchResult = r
}

benchstat — comparing two results #

before/after comparison

go test -bench=. -count=10 > before.txt
# edit code
go test -bench=. -count=10 > after.txt

go install golang.org/x/perf/cmd/benchstat@latest
benchstat before.txt after.txt

              │   before   │            after            │
              │   sec/op   │   sec/op    vs base         │
Parse-8       │ 1.23µ ± 2% │ 0.85µ ± 1%  -30.89% (p=0.000 n=10)

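The p value indicates whether the difference is statistically significant. By default benchstat treats p < 0.05 as significant and prints ~ instead of a percentage when the difference is not. A single run can be fooled by noise, so collect at least 10 runs with -count=10.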

CPU profile #

CPU profile from a benchmark

go test -bench=. -cpuprofile=cpu.out
go tool pprof cpu.out

pprof interactive

(pprof) top
Showing nodes accounting for 1.23s, 87.86% of 1.4s total
      flat  flat%   sum%        cum   cum%
     0.43s 30.71% 30.71%      0.65s 46.43%  parse
     0.31s 22.14% 52.86%      0.31s 22.14%  hash
     ...

(pprof) list parse
(pprof) web      ← call graph in the browser (requires Graphviz)

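Use top to see which functions consume the most time, and list <fn> for line-by-line timing. In the top output, flat is the time spent in the function itself, while cum also includes the functions it calls.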
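
Outside of `go test`, the same kind of profile can be collected with the `runtime/pprof` package. The following is a minimal sketch; `busyWork` and `profileCPU` are hypothetical helpers, and the SHA-256 loop only exists to give the sampler something to record.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"log"
	"os"
	"runtime/pprof"
)

// busyWork is a hypothetical CPU-bound workload.
func busyWork(rounds int) [32]byte {
	sum := sha256.Sum256([]byte("seed"))
	for i := 0; i < rounds; i++ {
		sum = sha256.Sum256(sum[:])
	}
	return sum
}

// profileCPU runs work while a CPU profile is recorded and returns the
// profile in the format that `go tool pprof` reads.
func profileCPU(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	work()
	pprof.StopCPUProfile() // stops sampling and flushes the profile into buf
	return buf.Bytes(), nil
}

func main() {
	data, err := profileCPU(func() { busyWork(2_000_000) })
	if err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("cpu.out", data, 0o644); err != nil {
		log.Fatal(err)
	}
	fmt.Println("wrote cpu.out; inspect with: go tool pprof cpu.out")
}
```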

Memory profile #

go test -bench=. -memprofile=mem.out
go tool pprof -alloc_space mem.out

Two viewpoints:

-alloc_space — cumulative allocations (best shows GC load)
-inuse_space — currently live memory

In hot spots with many allocations, the GC runs frequently and throughput drops. Memory profiles are usually analyzed alongside escape analysis.
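
A heap profile can likewise be written from inside a program. This sketch assumes a made-up workload (`allocate`) that keeps its buffers alive so they appear under inuse_space; `heapProfile` is a hypothetical helper around `pprof.WriteHeapProfile`.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// retained keeps the buffers reachable so they count as live memory.
var retained [][]byte

// allocate is a hypothetical workload: n buffers of size bytes each.
func allocate(n, size int) {
	for i := 0; i < n; i++ {
		retained = append(retained, make([]byte, size))
	}
}

// heapProfile returns the current heap profile in pprof format.
func heapProfile() ([]byte, error) {
	runtime.GC() // bring the allocation statistics up to date
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	allocate(1000, 64<<10)
	data, err := heapProfile()
	if err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("mem.out", data, 0o644); err != nil {
		log.Fatal(err)
	}
	fmt.Println("wrote mem.out; try: go tool pprof -sample_index=alloc_space mem.out")
}
```

The same file serves both viewpoints: pass `-sample_index=inuse_space` to see what is still live.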

Escape analysis #

go build -gcflags='-m' main.go

example output

./main.go:5:9: &User{...} escapes to heap

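This output tells you which values the compiler moved to the heap even though they might have stayed on the stack, which is the starting point for reducing allocations. Passing the flag twice (-gcflags='-m -m') prints more detail about why a value escapes.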
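
The effect of an escape can be checked at run time with `testing.AllocsPerRun`. In this sketch the `User` type and both constructors are hypothetical, chosen to mirror the example output above; `//go:noinline` is there so that inlining does not hide the difference.

```go
package main

import (
	"fmt"
	"testing"
)

// User is a hypothetical type matching the example output above.
type User struct {
	Name string
	Age  int
}

// newUserPtr returns a pointer, so the User must outlive the call and the
// compiler reports "&User{...} escapes to heap".
//
//go:noinline
func newUserPtr(age int) *User {
	return &User{Name: "gopher", Age: age}
}

// newUserVal returns a copy, so no heap allocation is needed.
//
//go:noinline
func newUserVal(age int) User {
	return User{Name: "gopher", Age: age}
}

// Package-level sinks keep the results observable.
var (
	ptrSink *User
	ageSink int
)

func allocsPtr() float64 {
	return testing.AllocsPerRun(1000, func() { ptrSink = newUserPtr(7) })
}

func allocsVal() float64 {
	return testing.AllocsPerRun(1000, func() { ageSink = newUserVal(7).Age })
}

func main() {
	fmt.Println("pointer return, allocs per call:", allocsPtr())
	fmt.Println("value return, allocs per call:  ", allocsVal())
}
```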

Production profiling — net/http/pprof #

pprof endpoint

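package main

import (
	"log"
	"net/http"
	_ "net/http/pprof"    // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	go func() {
		// localhost only, so the profiling endpoints are not reachable from outside
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// the main server runs separately...
}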

real-time profile

# CPU profile for 30 seconds
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

# current memory
go tool pprof http://localhost:6060/debug/pprof/heap

# current goroutines
go tool pprof http://localhost:6060/debug/pprof/goroutine

In production you can profile without stopping the service. Just be careful not to expose port 6060 publicly — keep it internal only.

Finding goroutine leaks #

curl "http://localhost:6060/debug/pprof/goroutine?debug=1"

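This shows the stacks of every currently live goroutine, with identical stacks grouped and counted (use debug=2 for one full stack per goroutine). A count that keeps growing under steady load points to a leak. This is the first tool to reach for when you suspect the leak patterns from Intermediate #3.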

Trace — analysis along a time axis #

If CPU and memory profiles tell you “where,” trace tells you “when.”

collecting a trace

go test -bench=. -trace=trace.out
go tool trace trace.out

A browser opens showing goroutine scheduling, GC events, and system calls along a timeline. This is the right tool for time-axis problems like GC running too frequently or goroutines starving.
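
A trace can also be recorded around a specific piece of code with `runtime/trace`. This is a minimal sketch with hypothetical helpers: `traceWork` wraps `trace.Start` and `trace.Stop`, and `fanOut` is a made-up workload that gives the scheduler something to show.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"os"
	"runtime/trace"
	"sync"
)

// traceWork records an execution trace while work runs and returns the
// raw trace data.
func traceWork(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err
	}
	work()
	trace.Stop() // returns after all trace data has been written to buf
	return buf.Bytes(), nil
}

// fanOut is a hypothetical workload: several goroutines doing CPU work.
func fanOut(workers int) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s := 0
			for j := 0; j < 1_000_000; j++ {
				s += j
			}
			_ = s
		}()
	}
	wg.Wait()
}

func main() {
	data, err := traceWork(func() { fanOut(8) })
	if err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("trace.out", data, 0o644); err != nil {
		log.Fatal(err)
	}
	fmt.Println("wrote trace.out; open with: go tool trace trace.out")
}
```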

Race detector #

You saw it briefly in #2.

go test -race ./...
go run -race main.go

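Keep it on in tests and local runs, and run a separate race build in CI as well. Do not enable it in production: according to the Go documentation, a race-enabled binary typically uses 5-10x the memory and takes 2-20x the execution time.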
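
To see what the detector reports, here is a small sketch with two hypothetical functions: `racyCount` has an unsynchronized counter that `go run -race` should flag, and `safeCount` is the same loop guarded by a mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// racyCount increments a shared int from many goroutines without
// synchronization. The race detector reports a data race here, and the
// result is not guaranteed to equal n.
func racyCount(n int) int {
	var wg sync.WaitGroup
	count := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			count++ // unsynchronized read-modify-write
		}()
	}
	wg.Wait()
	return count
}

// safeCount is the same loop with a mutex guarding the counter.
func safeCount(n int) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		count int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			count++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return count
}

func main() {
	fmt.Println("racy:", racyCount(1000))
	fmt.Println("safe:", safeCount(1000))
}
```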

Measurement workflow #

A typical flow:

Users/metrics report it’s slow
CPU profile → which functions consume time
(If memory is suspected) memory profile → where allocations occur
Write a benchmark for the suspected part — a reproducible measurement
Edit + benchstat to compare — did it actually get faster?
Redeploy and reconfirm via metrics

The key is to verify with measurement at every step. Optimizing on guesses alone often makes things slower.

Common cases #

String concatenation #

+ in a for loop

var s string
for _, p := range parts {
	s += p    // ✗ allocates a new string each time
}

Solution — strings.Builder.

var b strings.Builder
for _, p := range parts {
	b.WriteString(p)
}
s := b.String()
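
In line with the measure-first theme, the difference can be checked rather than assumed. This sketch counts allocations with `testing.AllocsPerRun`; the helper names (`makeParts`, `concatPlus`, `concatBuilder`, `allocsOf`) and the input of 100 short strings are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// makeParts builds n short strings; hypothetical input for the comparison.
func makeParts(n int) []string {
	parts := make([]string, n)
	for i := range parts {
		parts[i] = "item"
	}
	return parts
}

// concatPlus builds the result with += and copies the string on each step.
func concatPlus(parts []string) string {
	var s string
	for _, p := range parts {
		s += p
	}
	return s
}

// concatBuilder appends into a growing buffer instead.
func concatBuilder(parts []string) string {
	var b strings.Builder
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

// sink keeps the result observable.
var sink string

// allocsOf reports the average number of allocations per call.
func allocsOf(concat func([]string) string, parts []string) float64 {
	return testing.AllocsPerRun(100, func() { sink = concat(parts) })
}

func main() {
	parts := makeParts(100)
	fmt.Println("+=      allocs per call:", allocsOf(concatPlus, parts))
	fmt.Println("Builder allocs per call:", allocsOf(concatBuilder, parts))
}
```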

Slice without preallocated capacity #

pre-size capacity

result := make([]int, 0, len(items))    // ✓ pre-size capacity
for _, x := range items {
	result = append(result, transform(x))
}

Starting with make([]int, 0) causes multiple reallocations as the slice grows. Pre-sizing keeps it to a single allocation.
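
The allocation counts can be compared directly. In this sketch `transform` is a hypothetical stand-in (the post does not define it), and `collectGrow` and `collectPresized` are illustrative names for the two variants.

```go
package main

import (
	"fmt"
	"testing"
)

// transform is a hypothetical stand-in; the post does not define it.
func transform(x int) int { return x * 2 }

// collectGrow starts from a nil slice and lets append grow it.
func collectGrow(items []int) []int {
	var result []int
	for _, x := range items {
		result = append(result, transform(x))
	}
	return result
}

// collectPresized reserves the final capacity up front.
func collectPresized(items []int) []int {
	result := make([]int, 0, len(items))
	for _, x := range items {
		result = append(result, transform(x))
	}
	return result
}

// sink keeps the result observable and forces it onto the heap.
var sink []int

// allocsOf reports the average number of allocations per call.
func allocsOf(collect func([]int) []int, items []int) float64 {
	return testing.AllocsPerRun(100, func() { sink = collect(items) })
}

func main() {
	items := make([]int, 1000)
	fmt.Println("grown     allocs per call:", allocsOf(collectGrow, items))
	fmt.Println("pre-sized allocs per call:", allocsOf(collectPresized, items))
}
```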

Map capacity #

m := make(map[string]int, expectedSize)

Maps also benefit from an expected size hint — it reduces rehashing and reallocations.
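
The same check works for maps. This is a rough sketch with hypothetical helpers (`buildMap`, `allocsWithHint`); the exact counts depend on the Go version's map implementation, so only the relative difference is meaningful.

```go
package main

import (
	"fmt"
	"testing"
)

// buildMap inserts n entries into a map created with the given size hint.
func buildMap(n, hint int) map[int]int {
	m := make(map[int]int, hint)
	for i := 0; i < n; i++ {
		m[i] = i
	}
	return m
}

// sink keeps the map observable.
var sink map[int]int

// allocsWithHint reports the average allocations for building one map.
func allocsWithHint(n, hint int) float64 {
	return testing.AllocsPerRun(50, func() { sink = buildMap(n, hint) })
}

func main() {
	const n = 1000
	fmt.Println("no hint   allocs per map:", allocsWithHint(n, 0))
	fmt.Println("with hint allocs per map:", allocsWithHint(n, n))
}
```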

Interface boxing #

cost of interface conversion

x := 1 << 20
var v interface{} = x    // int → interface{} boxing (potentially heap-allocated)

Frequent conversions to interface{} (or any) in hot loops accumulate allocations. Where possible, use concrete types.
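
Whether a conversion allocates can be measured too. In this sketch `box` and `keep` are hypothetical helpers, marked `//go:noinline` so the compiler cannot optimize the conversion away. The current gc runtime may avoid the allocation for some small values, which is an implementation detail, so the example prints both a large and a small value.

```go
package main

import (
	"fmt"
	"testing"
)

// Package-level sinks keep the results observable.
var (
	anySink any
	intSink int
)

// box converts an int to an interface value.
//
//go:noinline
func box(n int) any { return n }

// keep returns the int unchanged, as a concrete type.
//
//go:noinline
func keep(n int) int { return n }

// allocsBoxed reports the average allocations for one int -> any conversion.
func allocsBoxed(n int) float64 {
	return testing.AllocsPerRun(1000, func() { anySink = box(n) })
}

// allocsConcrete reports the same for the concrete-type path.
func allocsConcrete(n int) float64 {
	return testing.AllocsPerRun(1000, func() { intSink = keep(n) })
}

func main() {
	fmt.Println("boxed 1000000, allocs per call:", allocsBoxed(1_000_000))
	fmt.Println("boxed 42,      allocs per call:", allocsBoxed(42))
	fmt.Println("concrete int,  allocs per call:", allocsConcrete(1_000_000))
}
```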

Wrap-up #

What we covered:

benchmark — b.N, -benchmem, ResetTimer, preserve results to avoid optimization
benchstat — statistical comparison of two measurements
CPU profile — top, list, web
Memory profile — -alloc_space is usually more useful
Escape analysis — -gcflags='-m' for heap-allocation reasons
net/http/pprof — real-time profiling in production
trace — time axis, GC, scheduling
race detector — always on in tests
Workflow: metrics → profile → bench → edit → benchstat → reconfirm

The next post (#7 Code Generation) covers another path Go often recommends: how to automate without paying reflect’s cost, using standard tools like go generate and stringer.