Chapter 21

Performance — cProfile, py-spy, memory profiling

The toolbox for finding and fixing slow Python code — timeit, cProfile, py-spy, line_profiler, memray, and common optimization patterns.

This is the last chapter of Part 3: performance. When an “it’s slow” report comes in, this chapter gives you the toolbox for measuring where and why it is slow, and for fixing it: timeit, cProfile, py-spy, line_profiler, memray, and common optimization patterns.

This chapter pairs with Chapter 19 GIL and concurrency. If Chapter 19 is about “tool selection by bottleneck kind,” this chapter is about the tools for “measuring where the bottleneck is.” The cycle of measure → classify hotspot → choose tool → re-measure is the standard flow of performance debugging.

First rule — don’t optimize without measuring #

A famous quote

"Premature optimization is the root of all evil." — Donald Knuth

It sounds tiresome every time you read it, but it’s almost always right. Intuition that says “this is probably slow” is wrong more often than not (about 70% of the time, as a rough rule of thumb). Measurement is the first step.

`timeit` — small-unit measurement #

timeit

import timeit

# measure one line
t = timeit.timeit("sum(range(1000))", number=10_000)
print(f"avg {t / 10_000 * 1e6:.2f} μs/call")

# setup code
t = timeit.timeit(
    stmt="d.get('key')",
    setup="d = {'key': 1}",
    number=1_000_000,
)

Good for small-unit comparisons such as “is a list comprehension faster than map?” or “is an f-string faster than +?”.
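A minimal sketch of such a comparison. The helper name `best_time` and the two candidate statements are illustrative assumptions, not part of timeit itself. It uses `timeit.repeat` and keeps the minimum, since the fastest run is usually the one least disturbed by other activity on the machine.

```python
import timeit


def best_time(stmt: str, setup: str = "pass", number: int = 500, repeat: int = 5) -> float:
    """Return the best per-call time in seconds across several repeated runs."""
    runs = timeit.repeat(stmt, setup=setup, number=number, repeat=repeat)
    return min(runs) / number


# Both candidates build the same list of doubled values.
setup = "data = list(range(1_000))"
candidates = {
    "list comprehension": "[x * 2 for x in data]",
    "map + lambda": "list(map(lambda x: x * 2, data))",
}

results = {name: best_time(stmt, setup) for name, stmt in candidates.items()}
for name, seconds in results.items():
    print(f"{name:<20} {seconds * 1e6:8.2f} µs/call")
```

The absolute numbers depend on the machine and Python version, so only compare results measured in the same run.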

CLI works too:

CLI

python -m timeit -s "import json" "json.dumps({'a': 1})"
# 1000000 loops, best of 5: 322 ns per loop

`cProfile` — function-level profiling #

Shows where CPU time goes, per function.

Run cProfile

python -m cProfile -s cumulative myapp.py
# sort by cumulative time

Or in code:

In code

import cProfile
import pstats

with cProfile.Profile() as pr:
    do_work()

stats = pstats.Stats(pr).sort_stats("cumulative")
stats.print_stats(20)    # top 20

Output:

cProfile output

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    2.345    2.345 myapp.py:10(main)
     1000    0.500    0.001    1.800    0.002 myapp.py:50(process_item)
   100000    0.700    0.000    0.700    0.000 myapp.py:80(parse_line)

How to read it:

tottime — time spent directly in that function’s body (excluding child functions)
cumtime — cumulative time of that function + all child functions
ncalls — call count

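Hotspot candidates: functions with high tottime (slow in their own body), and functions with high cumtime but low tottime (the time is in their callees, so look at what they call). In the sample above, parse_line accounts for 0.7 s of its own time across 100,000 calls, so it is the first place to look.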
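To see the difference between the two columns on something small, here is a sketch that profiles a toy workload and captures the report as a string. The functions `parse_line`, `process_items`, and `profile_report` are hypothetical stand-ins. Only `cProfile.Profile` and `pstats.Stats` come from the standard library.

```python
import cProfile
import io
import pstats


def parse_line(line: str) -> int:
    # Leaf function: its time shows up mostly as tottime.
    return sum(int(part) for part in line.split(","))


def process_items(lines: list[str]) -> int:
    # Parent function: low tottime, but cumtime includes every parse_line call.
    return sum(parse_line(line) for line in lines)


def profile_report(sort_key: str, limit: int = 10) -> str:
    """Profile the toy workload and return the pstats report as text."""
    lines = ["1,2,3,4,5"] * 2_000
    with cProfile.Profile() as profiler:
        process_items(lines)
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats(sort_key).print_stats(limit)
    return buffer.getvalue()


report = profile_report("tottime")
print(report)
```

Switching the sort key to `"cumulative"` moves `process_items` toward the top, even though almost none of the time is spent in its own body.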

Visualization — snakeviz #

snakeviz

uv add --dev snakeviz
python -m cProfile -o profile.out myapp.py
uvx snakeviz profile.out

See the function call tree in the browser as an icicle or sunburst chart, similar to a flame graph. More intuitive than text output.

`py-spy` — profiling a running process #

cProfile’s drawback: you have to launch the program under the profiler or modify code to wrap it, and its tracing overhead is significant. When you want to attach to an already running production process, use py-spy.

py-spy

uvx py-spy@latest top --pid 12345
# or start a new process
uvx py-spy@latest record -o flame.svg -- python myapp.py

top mode: real-time per-function CPU usage (like the top command)
record mode: record for a given time, then produce a flame graph SVG

Value of py-spy:

No source modification
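Sampling-based: it reads the target’s memory from a separate process, so overhead is low (on the order of 5 ~ 10%)
C extensions visible too: with --native you can analyze the insides of NumPy as well
Shows GIL usage: --gil counts only threads holding the GIL, --idle also includes idle threads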

A tool for seeing “what’s slow right now” on the spot in production / staging.

`line_profiler` — line-level profiling #

cProfile is function-level. When you want to see which line inside a function is slow, use line_profiler.

line_profiler

uv add --dev line_profiler

Attach the @profile decorator to the target function. No import is needed: kernprof injects profile into builtins when run with -l.

Target function

@profile
def process(items):
    parsed = [parse(x) for x in items]    # per-line time measurement
    filtered = [x for x in parsed if x.valid]
    return filtered

Run

uv run kernprof -l -v myapp.py

Output:

line_profiler output

Line #      Hits         Time  Per Hit  % Time  Line Contents
==============================================================
     2         1     1234567.0  1234567.0   85.3      parsed = [parse(x) for x in items]
     3         1      200000.0   200000.0   13.8      filtered = [x for x in parsed if x.valid]
     4         1       12000.0    12000.0    0.8      return filtered

See at a glance which line of the function takes what share of the time. Useful for detailed optimization. Measurement overhead is large because every line of the decorated function is traced, so use it after the hotspot is narrowed down.

Memory profiling — `memray` #

Memory needs measuring as often as CPU does. Bloomberg’s memray has become a widely used tool for this.

memray

uv add --dev memray
uv run memray run -o output.bin myapp.py   # writes output.bin
uv run memray flamegraph output.bin        # HTML report

Tracks memory leaks, peak memory usage spots, allocation call trees — even native memory.

`tracemalloc` — standard library #

A lighter tool with no extra installation.

tracemalloc

import tracemalloc

tracemalloc.start()

# ... work ...

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
for stat in top_stats[:10]:
    print(stat)

Shows where memory is held most, per line. Good as a light first step.
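When the question is “what did this step allocate?”, comparing two snapshots is often more useful than a single one. A minimal sketch, where `build_cache` is a hypothetical memory-hungry step standing in for real work:

```python
import tracemalloc


def build_cache(n: int) -> dict[int, str]:
    # Hypothetical workload: n distinct strings held in a dict.
    return {i: str(i) * 20 for i in range(n)}


tracemalloc.start()
before = tracemalloc.take_snapshot()

cache = build_cache(10_000)

after = tracemalloc.take_snapshot()
# Must be read before stop(); afterwards the counters are reset.
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Entries are sorted by the absolute size difference, largest first.
diff = after.compare_to(before, "lineno")
for stat in diff[:3]:
    print(stat)
print(f"current={current_bytes / 1024:.0f} KiB, peak={peak_bytes / 1024:.0f} KiB")
```

The top entry should point at the dict comprehension inside `build_cache`. The gap between current and peak hints at temporary allocations that were already freed.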

Common CPython performance pitfalls #

1) Global vs local variables #

Frequently referencing globals inside a function is slow. Receiving them once into a local is faster.

🚫 Repeated global lookup

import math

def process(items):
    return [math.sqrt(x) for x in items]  # math, math.sqrt looked up every time

# ✅ Bind to local
def process(items):
    sqrt = math.sqrt
    return [sqrt(x) for x in items]

A small difference, but meaningful in hot loops.
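Whether the difference matters in a given loop is itself something to measure. A small sketch, with the hypothetical function names `with_global_lookup` and `with_local_binding` mirroring the two versions above:

```python
import math
import timeit


def with_global_lookup(items):
    # Looks up the global `math`, then its attribute `sqrt`, on every iteration.
    return [math.sqrt(x) for x in items]


def with_local_binding(items):
    sqrt = math.sqrt  # one lookup, then a cheap local read per iteration
    return [sqrt(x) for x in items]


items = list(range(10_000))
timings = {}
for fn in (with_global_lookup, with_local_binding):
    timings[fn.__name__] = min(timeit.repeat(lambda: fn(items), number=20, repeat=5))

for name, seconds in timings.items():
    print(f"{name:<20} {seconds / 20 * 1e3:.3f} ms/call")
```

Expect a modest gap rather than a dramatic one, and expect it to vary between Python versions. That is the reason to apply this only in loops a profiler has already flagged.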

2) Accumulating strings with `+=` #

🚫 O(n²)

result = ""
for s in strings:
    result += s    # creates a new string each time

# ✅ join — O(n)
result = "".join(strings)

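Strings are immutable, so += conceptually builds a new object every time, copying everything accumulated so far. CPython can sometimes extend the string in place when nothing else references it, but that is an implementation detail you should not rely on; join is linear by design.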
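A sketch for checking this on your own interpreter. The function names `concat_plus` and `concat_join` are illustrative. Because of the CPython behavior mentioned above, the measured gap may be smaller than the O(n²) versus O(n) analysis suggests.

```python
import timeit


def concat_plus(parts):
    result = ""
    for s in parts:
        result += s  # may copy the accumulated string each time
    return result


def concat_join(parts):
    return "".join(parts)  # computes the total size once, then copies each part once


for n in (1_000, 10_000, 100_000):
    parts = ["x" * 10] * n
    t_plus = min(timeit.repeat(lambda: concat_plus(parts), number=5, repeat=3)) / 5
    t_join = min(timeit.repeat(lambda: concat_join(parts), number=5, repeat=3)) / 5
    print(f"n={n:>7}  +=: {t_plus * 1e3:8.3f} ms   join: {t_join * 1e3:8.3f} ms")
```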

3) Searching a list with `in` #

🚫 list's in — O(n)

if x in big_list:    # compares elements one by one
    ...

# ✅ Convert to set once: average O(1) per lookup
big_set = set(big_list)    # the conversion itself is O(n)
if x in big_set:
    ...

If search frequency is high, switch to set / dict.
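A quick measurement makes the gap concrete. This sketch searches for the last element, which is the worst case for a list. The sizes and names are arbitrary choices for illustration.

```python
import timeit

big_list = list(range(100_000))
big_set = set(big_list)  # one-time O(n) conversion
needle = 99_999          # last element: the list has to scan everything

list_time = timeit.timeit(lambda: needle in big_list, number=100)
set_time = timeit.timeit(lambda: needle in big_set, number=100)

print(f"list: {list_time / 100 * 1e6:10.2f} µs/lookup")
print(f"set:  {set_time / 100 * 1e6:10.2f} µs/lookup")
```

Because building the set costs a full pass over the data, the conversion only pays off when it is done once and followed by many lookups.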

4) Wrong data structures #

Job	Data structure
Push / pop at both ends	`collections.deque` (list’s `pop(0)` is O(n))
Insert while keeping sorted order	`bisect` module
Counting	`collections.Counter`
Priority queue	`heapq`
Dict with defaults	`collections.defaultdict`

Python’s standard library has almost all of these, so don’t reinvent them; import and use them.
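The rows of the table, sketched as one short tour. The sample data is made up for illustration, and all calls are from the standard library.

```python
import bisect
import heapq
from collections import Counter, defaultdict, deque

# deque: O(1) push and pop at both ends
queue = deque([1, 2, 3])
queue.appendleft(0)
queue.append(4)
first = queue.popleft()   # removes 0 from the left
last = queue.pop()        # removes 4 from the right

# bisect: insert while keeping the list sorted
scores = [10, 20, 40]
bisect.insort(scores, 30)

# Counter: counting hashable items
counts = Counter("mississippi")
top_two = counts.most_common(2)

# heapq: priority queue on top of a plain list (smallest item first)
heap = []
for priority, task in [(3, "low"), (1, "urgent"), (2, "normal")]:
    heapq.heappush(heap, (priority, task))
next_task = heapq.heappop(heap)

# defaultdict: missing keys get a default value automatically
groups = defaultdict(list)
for word in ["apple", "avocado", "banana"]:
    groups[word[0]].append(word)

print(first, last, list(queue))
print(scores)
print(top_two)
print(next_task)
print(dict(groups))
```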

NumPy / vectorization #

For numeric computation, NumPy instead of loops is almost always faster.

🚫 Python loop

result = [a[i] * b[i] for i in range(len(a))]

✅ NumPy vectorization

import numpy as np
result = np.array(a) * np.array(b)    # the loop runs in compiled C code, not in the interpreter

Differences of 10x to 100x, sometimes more, are common on large arrays. However, converting lists to arrays has a cost, so it can be slower for small inputs. Measure, then apply.

Caching — `functools.cache` #

A tool seen in Chapter 12 decorator patterns. The most effective optimization for pure functions called repeatedly with the same arguments.

cache

from functools import cache

@cache
def expensive(n: int) -> int:
    ...

The function must be pure, and arguments must be hashable.
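A classic illustration is a recursive Fibonacci, used here only as a stand-in for an expensive pure function. `cache_info()` shows how many calls were actually computed versus served from the cache.

```python
from functools import cache


@cache
def fib(n: int) -> int:
    # Without the cache this recursion makes an exponential number of calls.
    return n if n < 2 else fib(n - 1) + fib(n - 2)


value = fib(30)
info = fib.cache_info()
print(value)
print(info)  # misses = distinct arguments actually computed
```

`@cache` never evicts entries, so for functions with many distinct arguments `functools.lru_cache(maxsize=...)` is the bounded alternative, and `fib.cache_clear()` resets the cache.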

`__slots__` — save instance memory #

A tool seen in Chapter 8 dataclass and __slots__. When you build tens of thousands of objects, this brings the biggest effect.

dataclass(slots=True)

from dataclasses import dataclass

@dataclass(slots=True)
class Point:
    x: float
    y: float

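Typical figures are around 40 ~ 50% memory saved per instance and 10 ~ 25% faster attribute access, but the exact numbers depend on the Python version and the number of fields, so measure on your own classes. slots=True requires Python 3.10 or later.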

Cython / Rust extensions — the last weapon #

When pure Python can’t do it, go down to the C level.

Cython — Python-like syntax compiling to C. Allows gradual conversion.
PyO3 (Rust) — write extension modules in Rust. maturin is the build tool.
mypyc — compile type-hinted Python to C (mypy itself uses this).

Common rule: hotspots only. Don’t migrate the whole codebase; moving the narrow part found by cProfile is the best return on cost.

Other interpreters — a quick check #

PyPy — a separate implementation with a JIT compiler. Long-running pure Python code often runs several times faster. Weakness is C extension compatibility, so it doesn’t fit NumPy / Pandas heavy code.
Free-threaded CPython (Chapter 19 GIL and concurrency) — roughly 5 ~ 10% single-thread loss as of CPython 3.14 (noticeably more on 3.13), big gain in multi-thread.

Depending on the situation, swapping the interpreter itself can be the biggest change.

Performance of async #

When measuring the performance of the async code from Chapter 18 async in depth:

asyncio debug + profile

PYTHONASYNCIODEBUG=1 uvx py-spy@latest record -o async.svg -- python app.py

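py-spy works on async code too, but it samples what is executing on the event loop thread, so it highlights coroutines that hog the loop with CPU work or blocking calls rather than ones suspended at an await. asyncio debug mode complements it by logging callbacks that run longer than 100 ms.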
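The `PYTHONASYNCIODEBUG=1` part of the command enables asyncio’s debug mode, which among other things logs a warning when a single callback or task step holds the event loop longer than `loop.slow_callback_duration` (0.1 seconds by default). A minimal sketch of that behavior, using `asyncio.run(..., debug=True)` instead of the environment variable. The coroutine `handler_with_blocking_call` and the `ListHandler` class are hypothetical.

```python
import asyncio
import logging
import time


class ListHandler(logging.Handler):
    """Collect formatted log messages in a list instead of printing them."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


async def handler_with_blocking_call() -> None:
    time.sleep(0.2)         # blocking call: the event loop cannot run anything else
    await asyncio.sleep(0)  # yields control back to the loop


handler = ListHandler()
asyncio_logger = logging.getLogger("asyncio")
asyncio_logger.addHandler(handler)
try:
    asyncio.run(handler_with_blocking_call(), debug=True)
finally:
    asyncio_logger.removeHandler(handler)

for message in handler.messages:
    print(message)
```

Debug mode adds overhead of its own, so it is better suited to finding blocking calls than to taking final timing numbers.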

In practice — performance debugging flow #

Reproducible benchmark — measurement only matters if the same input gives the same result in the same time
Check total time with the time command: whether it is 1 second or 1 minute decides tool choice
cProfile or py-spy to find hotspots
line_profiler for per-line analysis inside hot functions
Check for common pitfalls — list in, global lookups, string accumulation
Swap data structures — set / deque / Counter, etc.
Vectorization — whether NumPy applies
Caching — repeated calls with same arguments?
C-level extensions — last resort

At each step, re-measure to confirm it actually got faster. “This will be faster” intuition is often wrong.

Exercises #

Use timeit to compare code that accumulates n strings with += against code that joins them with "".join(). Vary n through 10 / 100 / 10000 / 100000 and observe where the difference between O(n²) and O(n) becomes obvious.
Attach cProfile to actual code. (If you don’t have your own code, the simple mathkit module from Chapter 7 works.) Then visualize with snakeviz. Find the function with the largest cumulative time, do one-line optimization (data structure swap, etc.), and re-measure to confirm the effect.
With tracemalloc, find the line in your code that holds the most memory. Measure the same job again with memray and compare the differences in output between the two tools.

In one line: 70% of optimization without measurement is wasted. Toolbox: timeit (micro) / cProfile + snakeviz (function) / py-spy (running) / line_profiler (line) / memray + tracemalloc (memory). Common pitfalls are global lookups, string +=, list in, wrong data structures. Data structures (deque / bisect / Counter / heapq / defaultdict) → vectorization (NumPy) → caching (@cache) → __slots__ → C extensions (Cython / PyO3 / mypyc) → interpreter (PyPy / free-threaded). Re-measure at each step.

Part 3 wrap-up #

Through 7 chapters of Part 3, the depth · concurrency toolbox has been filled.

Magic methods — hooks where objects meet language features
Descriptors — turning attributes into objects
Metaclasses — classes that build classes (usually not used)
Async in depth — event loop, Future / Task, async generator
GIL and concurrency — threading vs multiprocessing vs asyncio + free-threaded
Advanced typing — variance, ParamSpec, TypeIs, overload, Annotated
Performance — measurement tools and optimization patterns

With this, 21 chapters covering Part 1 (intro) → Part 2 (structuring) → Part 3 (depth · concurrency) are complete. Next, Part 4 FastAPI in production is the stage where every tool sharpened so far comes together in one project.

Next chapter #

Next, Chapter 22, starting and setting up FastAPI, opens Part 4, FastAPI in production: 6 chapters + 2 new ones (Chapter 24 Pydantic v2 in depth, Chapter 29 TODO API capstone). It covers Hello FastAPI, OpenAPI auto-generation, and the first project setup with uv.
