Go Advanced #6 Profiling — pprof and benchmark

5 min read

After #5 (unsafe and cgo), this post turns to a tool of the opposite flavor: measurement.

“Don’t guess; measure.”

Performance issues almost always appear somewhere other than where you suspect. Go’s standard tools are powerful enough to move you quickly from guesswork to measurement.

benchmark — standard tooling #

simple benchmark
// adder_test.go
package adder

import "testing"

func BenchmarkAdd(b *testing.B) {
	for i := 0; i < b.N; i++ {
		Add(1, 2)
	}
}
run
go test -bench=. -benchmem
result
BenchmarkAdd-8    1000000000    0.30 ns/op    0 B/op    0 allocs/op
  • b.N — auto-tuned to make timing stable
  • -benchmem — adds allocation info
  • ns/op — time per single execution
  • B/op / allocs/op — bytes and count of allocations
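
To see how b.N and the reported columns relate, the same benchmark can be driven from an ordinary program with testing.Benchmark. This is a minimal sketch: the body of Add is an assumption (the post does not show it), and sink, benchAdd, and measureAdd are names made up for the illustration.

```go
package main

import (
	"fmt"
	"testing"
)

// Add stands in for the function under test; the post does not show its body.
func Add(a, b int) int { return a + b }

// sink keeps each result reachable so the call cannot be optimized away.
var sink int

// benchAdd has the same shape as BenchmarkAdd in adder_test.go.
func benchAdd(b *testing.B) {
	b.ReportAllocs() // per-benchmark equivalent of -benchmem
	for i := 0; i < b.N; i++ {
		sink = Add(i, 2)
	}
}

// measureAdd lets the testing package choose b.N, as `go test -bench` would.
func measureAdd() testing.BenchmarkResult {
	return testing.Benchmark(benchAdd)
}

func main() {
	res := measureAdd()
	fmt.Println("b.N:      ", res.N)
	fmt.Println("ns/op:    ", res.NsPerOp())
	fmt.Println("B/op:     ", res.AllocedBytesPerOp())
	fmt.Println("allocs/op:", res.AllocsPerOp())
}
```

The printed b.N varies from machine to machine, because the framework keeps raising it until the run lasts long enough to time reliably.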

Benchmark-writing fundamentals #

reset pattern
func BenchmarkParse(b *testing.B) {
	data := loadBigInput()    // heavy setup

	b.ResetTimer()             // measure from here
	for i := 0; i < b.N; i++ {
		Parse(data)
	}
}

Setup cost is excluded with ResetTimer. You can also pause and resume timing inside the loop with b.StopTimer and b.StartTimer.
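
As a sketch of the pause/resume variant: sort.Ints sorts in place, so every iteration needs fresh unsorted input, and refilling the buffer should not count toward the result. The helpers refill and scrambled and the input size are assumptions chosen for the illustration.

```go
package main

import (
	"fmt"
	"sort"
	"testing"
)

// refill restores buf to the unsorted input. It is per-iteration setup,
// not the code under test.
func refill(buf, src []int) { copy(buf, src) }

// scrambled returns n pseudo-random ints from a small LCG (hypothetical input).
func scrambled(n int) []int {
	out := make([]int, n)
	x := uint32(1)
	for i := range out {
		x = x*1664525 + 1013904223
		out[i] = int(x >> 8)
	}
	return out
}

func benchSort(b *testing.B) {
	src := scrambled(100_000) // heavy one-time setup
	buf := make([]int, len(src))

	b.ResetTimer() // discard the setup above
	for i := 0; i < b.N; i++ {
		b.StopTimer() // pause while the input is restored
		refill(buf, src)
		b.StartTimer() // resume: only the sort is timed
		sort.Ints(buf)
	}
}

func main() {
	res := testing.Benchmark(benchSort)
	fmt.Println("sort.Ints on 100k ints:", res.NsPerOp(), "ns/op over", res.N, "iterations")
}
```

StopTimer and StartTimer have a cost of their own, so this pattern fits best when the timed work in each iteration is substantial; for very small operations the pausing itself can make the benchmark take much longer to run.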

Avoiding compiler optimizations #

common pitfall
func BenchmarkSum(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sum(1, 2)    // ✗ if the result is unused, the compiler may eliminate it entirely
	}
}

Solution — assign the result to a package-level variable (so the compiler can’t eliminate it).

preserve the result
var benchResult int

func BenchmarkSum(b *testing.B) {
	var r int
	for i := 0; i < b.N; i++ {
		r = sum(1, 2)
	}
	benchResult = r
}

benchstat — comparing two results #

before/after comparison
go test -bench=. -count=10 > before.txt
# edit code
go test -bench=. -count=10 > after.txt

go install golang.org/x/perf/cmd/benchstat@latest
benchstat before.txt after.txt
              │   before   │            after            │
              │   sec/op   │   sec/op    vs base         │
Parse-8       │ 1.23µ ± 2% │ 0.85µ ± 1%  -30.89% (p=0.000 n=10)

The p value indicates whether the difference is statistically significant; benchstat treats p below 0.05 as significant by default. A single run can be fooled by noise, so -count=10 is recommended.

CPU profile #

CPU profile from a benchmark
go test -bench=. -cpuprofile=cpu.out
go tool pprof cpu.out
pprof interactive
(pprof) top
Showing nodes accounting for 1.23s, 87.86% of 1.4s total
      flat  flat%   sum%        cum   cum%
     0.43s 30.71% 30.71%      0.65s 46.43%  parse
     0.31s 22.14% 52.86%      0.31s 22.14%  hash
     ...

(pprof) list parse
(pprof) web      ← call graph in the browser

Use top to see which functions consume the most time, and list <fn> for line-by-line timing.
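
Outside of go test, a standalone program can produce the same kind of file with runtime/pprof. The sketch below assumes a made-up CPU-bound function busy and a helper profileCPU; only pprof.StartCPUProfile and pprof.StopCPUProfile come from the standard library.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime/pprof"
)

// sink keeps the result of busy alive so the loop is not optimized away.
var sink uint64

// busy is hypothetical CPU-bound work that gives the profiler something to sample.
func busy(n int) {
	h := uint64(14695981039346656037)
	for i := 0; i < n; i++ {
		h ^= uint64(i)
		h *= 1099511628211
	}
	sink = h
}

// profileCPU runs work under the CPU profiler and returns the encoded profile.
func profileCPU(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	work()
	pprof.StopCPUProfile() // flushes the profile into buf
	return buf.Bytes(), nil
}

func main() {
	data, err := profileCPU(func() { busy(300_000_000) })
	if err != nil {
		fmt.Fprintln(os.Stderr, "profile:", err)
		os.Exit(1)
	}
	if err := os.WriteFile("cpu.out", data, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "write:", err)
		os.Exit(1)
	}
	fmt.Println("wrote cpu.out; inspect it with: go tool pprof cpu.out")
}
```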

Memory profile #

go test -bench=. -memprofile=mem.out
go tool pprof -alloc_space mem.out

Two viewpoints:

  • -alloc_space — cumulative allocations (best shows GC load)
  • -inuse_space — currently live memory

In hot spots with many allocations, the GC runs frequently and throughput drops. Memory profiles are usually analyzed alongside escape analysis.
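
The difference between the two viewpoints can be illustrated without pprof, using runtime.MemStats: TotalAlloc is cumulative (loosely analogous to alloc_space) and HeapAlloc is what is live (loosely analogous to inuse_space). This is only a sketch; churn and measure are made-up helpers, and the exact numbers depend on the runtime.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink holds the most recent chunk so each allocation really happens.
var sink []byte

// churn allocates n chunks of size bytes, keeping only the latest reachable.
func churn(n, size int) {
	for i := 0; i < n; i++ {
		sink = make([]byte, size)
	}
}

// measure reports the bytes allocated cumulatively while work ran, and the
// bytes still live on the heap after the garbage has been collected.
func measure(work func()) (cumulative, live uint64) {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)
	work()
	sink = nil // drop the last chunk so everything from work is garbage
	runtime.GC()
	runtime.ReadMemStats(&after)
	return after.TotalAlloc - before.TotalAlloc, after.HeapAlloc
}

func main() {
	cum, live := measure(func() { churn(64, 1<<20) })
	fmt.Printf("allocated in total: %d MiB\n", cum>>20)
	fmt.Printf("live after GC:      %d KiB\n", live>>10)
}
```

A function like churn would barely show up in an inuse_space view, yet it is exactly the kind of code that keeps the GC busy, which is why the cumulative view is often the one to start with.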

Escape analysis #

go build -gcflags='-m' main.go
example output
./main.go:5:9: &User{...} escapes to heap

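This tells you which values the compiler moved to the heap when they might otherwise have stayed on the stack, which is the starting point for reducing allocations. Passing -m twice (-gcflags='-m -m') prints more detail about why each value escapes.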
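
The effect of an escape can also be checked at run time with testing.AllocsPerRun. In this sketch the User fields and the two constructors are assumptions made for the illustration; returning a pointer forces the value onto the heap, while returning by value does not.

```go
package main

import (
	"fmt"
	"testing"
)

// User is a hypothetical type; its fields are made up for this sketch.
type User struct {
	ID   int
	Name string
}

// Package-level sinks keep the results reachable.
var (
	ptrSink *User
	valSink User
)

// newUserPtr returns a pointer, so &User{...} escapes to the heap.
//
//go:noinline
func newUserPtr(id int) *User { return &User{ID: id, Name: "u"} }

// newUserVal returns the struct by value, so no heap allocation is needed.
//
//go:noinline
func newUserVal(id int) User { return User{ID: id, Name: "u"} }

// allocsPerCall reports the average heap allocations per call of each constructor.
func allocsPerCall() (ptr, val float64) {
	ptr = testing.AllocsPerRun(1000, func() { ptrSink = newUserPtr(1) })
	val = testing.AllocsPerRun(1000, func() { valSink = newUserVal(1) })
	return ptr, val
}

func main() {
	ptr, val := allocsPerCall()
	fmt.Println("allocs per call, pointer return:", ptr)
	fmt.Println("allocs per call, value return:  ", val)
}
```

Building this file with -gcflags='-m' should report the composite literal in newUserPtr as escaping, which matches the allocation count observed here.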

Production profiling — net/http/pprof #

pprof endpoint
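import (
	"log"
	"net/http"
	_ "net/http/pprof"    // /debug/pprof/* endpoints registered automatically
)

func main() {
	go func() {
		// bind to localhost so the profiler is not reachable from other hosts
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// the main server runs separately...
}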
real-time profile
# CPU profile for 30 seconds
go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'

# current memory
go tool pprof http://localhost:6060/debug/pprof/heap

# current goroutines
go tool pprof http://localhost:6060/debug/pprof/goroutine

In production you can profile without stopping the service. Do not expose port 6060 publicly: an address of ":6060" listens on every interface, so bind to localhost or an internal-only interface, or restrict access at the firewall.
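
One way to keep the profiler separate from the public server is to register the handlers on a dedicated mux instead of relying on the default one. This is a sketch under assumptions: newDebugMux and statusOf are made-up helpers, and the localhost address is just one possible choice.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux registers the pprof handlers on a private mux.
// Note: importing net/http/pprof still registers them on
// http.DefaultServeMux too, so the public server should use its own mux.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index) // also serves heap, goroutine, ...
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// statusOf issues an in-memory GET against h and returns the status code.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	debug := newDebugMux()
	go func() {
		// Debug listener on localhost only.
		log.Println(http.ListenAndServe("localhost:6060", debug))
	}()
	// A real service would start its public server here and block.
	// This sketch only checks the handlers in memory and exits.
	log.Println("GET /debug/pprof/ ->", statusOf(debug, "/debug/pprof/"))
}
```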

Finding goroutine leaks #

curl 'http://localhost:6060/debug/pprof/goroutine?debug=1'

This shows the stacks of every currently live goroutine, grouped by identical stack. A count that keeps growing under steady load points to a leak, which makes this the first tool to reach for when you suspect the leak patterns from Intermediate #3.
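
A deliberately leaky program shows what that growth looks like. In this sketch leak and leakedBy are made-up helpers; runtime.NumGoroutine gives the count, and pprof.Lookup("goroutine") writes the same text the endpoint returns with debug=1.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

// leak starts n goroutines that block on a channel nobody sends to.
// The returned function closes the channel so the goroutines can exit.
func leak(n int) (release func()) {
	stuck := make(chan struct{})
	for i := 0; i < n; i++ {
		go func() { <-stuck }()
	}
	return func() { close(stuck) }
}

// leakedBy reports how many goroutines f left behind.
func leakedBy(f func()) int {
	before := runtime.NumGoroutine()
	f()
	return runtime.NumGoroutine() - before
}

func main() {
	var release func()
	n := leakedBy(func() { release = leak(10) })
	fmt.Println("goroutines left behind:", n)

	// Same text format as /debug/pprof/goroutine?debug=1: identical
	// stacks are grouped, with a count in front of each group.
	pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)

	release()
}
```

In the dump, the blocked goroutines appear as a single group with its count, all parked on the channel receive inside leak.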

Trace — analysis along a time axis #

If CPU and memory profiles tell you “where,” trace tells you “when.”

collecting a trace
go test -bench=. -trace=trace.out
go tool trace trace.out

A browser opens showing goroutine scheduling, GC events, and system calls along a timeline. This is the right tool for time-axis problems like GC running too frequently or goroutines starving.
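
A trace can also be collected from a standalone program with runtime/trace. The sketch below assumes a made-up workload fanOut and a helper traceBytes; trace.Start and trace.Stop are the standard library calls.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime/trace"
	"sync"
)

// fanOut is hypothetical concurrent work: each worker sums 0..999 into
// its own slot, and the slots are added up once all workers are done.
func fanOut(workers int) int {
	var wg sync.WaitGroup
	results := make([]int, workers)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				results[w] += i
			}
		}(w)
	}
	wg.Wait()
	total := 0
	for _, r := range results {
		total += r
	}
	return total
}

// traceBytes runs work with the execution tracer enabled and returns the trace data.
func traceBytes(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err
	}
	work()
	trace.Stop() // returns once all trace data has been written
	return buf.Bytes(), nil
}

func main() {
	data, err := traceBytes(func() { fanOut(8) })
	if err != nil {
		fmt.Fprintln(os.Stderr, "trace:", err)
		os.Exit(1)
	}
	if err := os.WriteFile("trace.out", data, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "write:", err)
		os.Exit(1)
	}
	fmt.Println("wrote trace.out; open it with: go tool trace trace.out")
}
```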

Race detector #

You saw it briefly in #2.

go test -race ./...
go run -race main.go

Keep it on in tests and local runs, and add a separate race build in CI. The Go documentation cites a typical cost of 5–10x memory and 2–20x execution time, so do not enable it in production.
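
For a concrete case, here is a sketch of the kind of bug the detector reports, next to one possible fix. racyCount and safeCount are made-up names; the fix uses atomic.Int64 (Go 1.19 or later), though a sync.Mutex would work as well.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// racyCount increments a shared int from n goroutines without
// synchronization. `go run -race` should report this as a data race.
func racyCount(n int) int {
	var wg sync.WaitGroup
	count := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			count++ // unsynchronized read-modify-write
		}()
	}
	wg.Wait()
	return count
}

// safeCount does the same work with an atomic counter.
func safeCount(n int) int {
	var wg sync.WaitGroup
	var count atomic.Int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			count.Add(1)
		}()
	}
	wg.Wait()
	return int(count.Load())
}

func main() {
	fmt.Println("racy:", racyCount(1000)) // may come out below 1000
	fmt.Println("safe:", safeCount(1000))
}
```

Without -race the racy version may well print the right number on a given run, which is why this class of bug tends to slip through ordinary tests.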

Measurement workflow #

A typical flow:

  1. Users/metrics report it’s slow
  2. CPU profile → which functions consume time
  3. (If memory is suspected) memory profile → where allocations occur
  4. Write a benchmark for the suspected part — a reproducible measurement
  5. Edit + benchstat to compare — did it actually get faster?
  6. Redeploy and reconfirm via metrics

The key is to verify with measurement at every step. Optimizing on guesses alone often makes things slower.

Common cases #

String concatenation #

+ in a for loop
var s string
for _, p := range parts {
	s += p    // ✗ allocates a new string each time
}

Solution — strings.Builder.

var b strings.Builder
for _, p := range parts {
	b.WriteString(p)
}
s := b.String()
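
The difference is easy to measure. This sketch wraps both versions in made-up helpers (concatPlus, concatBuilder, allocCounts) and counts allocations with testing.AllocsPerRun; the call to Grow is an optional extra that is not in the snippet above.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps the result reachable so the work is not optimized away.
var sink string

// concatPlus joins parts with +=, creating a new string on each step.
func concatPlus(parts []string) string {
	var s string
	for _, p := range parts {
		s += p
	}
	return s
}

// concatBuilder joins parts with strings.Builder, sized up front.
func concatBuilder(parts []string) string {
	total := 0
	for _, p := range parts {
		total += len(p)
	}
	var b strings.Builder
	b.Grow(total) // optional: reserve the final size in one step
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

// allocCounts reports average allocations per call for n two-byte parts.
func allocCounts(n int) (plus, builder float64) {
	parts := make([]string, n)
	for i := range parts {
		parts[i] = "ab"
	}
	plus = testing.AllocsPerRun(100, func() { sink = concatPlus(parts) })
	builder = testing.AllocsPerRun(100, func() { sink = concatBuilder(parts) })
	return plus, builder
}

func main() {
	plus, builder := allocCounts(1000)
	fmt.Println("allocs with +=:     ", plus)
	fmt.Println("allocs with Builder:", builder)
}
```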

Slice without preallocated capacity #

pre-size capacity
result := make([]int, 0, len(items))    // ✓ pre-size capacity
for _, x := range items {
	result = append(result, transform(x))
}

Starting with make([]int, 0) causes multiple reallocations as the slice grows. Pre-sizing keeps it to a single allocation.
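
The same kind of check works for slices. In this sketch transform is a stand-in (the post does not define it), and collectGrow, collectPre, and allocCounts are made-up helpers.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the result reachable so the slice must live on the heap.
var sink []int

// transform is a placeholder for the per-item work.
func transform(x int) int { return x * 2 }

// collectGrow starts with no capacity and lets append grow the slice.
func collectGrow(items []int) []int {
	result := make([]int, 0)
	for _, x := range items {
		result = append(result, transform(x))
	}
	return result
}

// collectPre reserves the final capacity before appending.
func collectPre(items []int) []int {
	result := make([]int, 0, len(items))
	for _, x := range items {
		result = append(result, transform(x))
	}
	return result
}

// allocCounts reports average allocations per call for n items.
func allocCounts(n int) (grown, pre float64) {
	items := make([]int, n)
	grown = testing.AllocsPerRun(100, func() { sink = collectGrow(items) })
	pre = testing.AllocsPerRun(100, func() { sink = collectPre(items) })
	return grown, pre
}

func main() {
	grown, pre := allocCounts(1000)
	fmt.Println("allocs without capacity:", grown)
	fmt.Println("allocs with capacity:   ", pre)
}
```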

Map capacity #

m := make(map[string]int, expectedSize)

Maps also benefit from an expected size hint — it reduces rehashing and reallocations.
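
A rough way to see the effect of the hint, again with testing.AllocsPerRun. This sketch uses int keys rather than strings so that key construction does not add allocations of its own; fill and allocCounts are made-up helpers, and the exact counts depend on the Go version's map implementation.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the map reachable so it is not optimized away.
var sink map[int]int

// fill inserts n entries into m and returns it.
func fill(m map[int]int, n int) map[int]int {
	for i := 0; i < n; i++ {
		m[i] = i
	}
	return m
}

// allocCounts reports average allocations per fill, without and with a size hint.
func allocCounts(n int) (unhinted, hinted float64) {
	unhinted = testing.AllocsPerRun(20, func() { sink = fill(make(map[int]int), n) })
	hinted = testing.AllocsPerRun(20, func() { sink = fill(make(map[int]int, n), n) })
	return unhinted, hinted
}

func main() {
	unhinted, hinted := allocCounts(10_000)
	fmt.Println("allocs without hint:", unhinted)
	fmt.Println("allocs with hint:   ", hinted)
}
```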

Interface boxing #

cost of interface conversion
var v interface{} = 42    // int → interface{} boxing (potentially heap-allocated)

Frequent conversions to interface{} (or any) in hot loops accumulate allocations. Where possible, use concrete types.
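
Whether a conversion allocates depends on the value and on the compiler, so it is worth measuring. In this sketch storeBoxed and storeConcrete are made-up helpers, and the value 1<<20 is chosen on purpose: recent Go versions can typically box very small integers such as 42 without allocating.

```go
package main

import (
	"fmt"
	"testing"
)

// Package-level sinks keep the stored values reachable.
var (
	ifaceSink any
	intSink   int
)

// storeBoxed converts its argument to an interface value.
//
//go:noinline
func storeBoxed(v int) { ifaceSink = v }

// storeConcrete keeps the argument as a plain int.
//
//go:noinline
func storeConcrete(v int) { intSink = v }

// boxingAllocs reports average allocations per call for each variant.
func boxingAllocs() (boxed, concrete float64) {
	boxed = testing.AllocsPerRun(1000, func() { storeBoxed(1 << 20) })
	concrete = testing.AllocsPerRun(1000, func() { storeConcrete(1 << 20) })
	return boxed, concrete
}

func main() {
	boxed, concrete := boxingAllocs()
	fmt.Println("allocs per call, via interface:", boxed)
	fmt.Println("allocs per call, concrete int: ", concrete)
}
```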

Wrap-up #

What we covered:

  • benchmark: b.N, -benchmem, ResetTimer, preserve results to avoid optimization
  • benchstat: statistical comparison of two measurements
  • CPU profile: top, list, web
  • Memory profile: -alloc_space is usually more useful
  • Escape analysis: -gcflags='-m' for heap-allocation reasons
  • net/http/pprof: real-time profiling in production
  • trace: time axis, GC, scheduling
  • race detector: always on in tests
  • Workflow: metrics → profile → bench → edit → benchstat → reconfirm

The next post (#7 Code Generation) covers another path Go often recommends: how to automate without paying reflect’s cost, using standard tools like go generate and stringer.
