Performance — cProfile, py-spy, memory profiling
The toolbox for finding and fixing slow Python code — timeit, cProfile, py-spy, line_profiler, memray, and common optimization patterns.
The last chapter of Part 3 — performance. When you get an “it’s slow” report, this chapter gives you the toolbox for measuring where and how it’s slow, and fixing it: timeit, cProfile, py-spy, line_profiler, memray, and common optimization patterns.
This chapter pairs with Chapter 19 GIL and concurrency. If Chapter 19 is about “tool selection by bottleneck kind,” this chapter covers the tools for “measuring where the bottleneck is.” The cycle of measure → classify hotspot → choose tool → re-measure is the standard flow of performance debugging.
First rule: don’t optimize without measuring
"Premature optimization is the root of all evil." (Donald Knuth)
It sounds tiresome every time you read it, but it’s almost always right. When intuition says “this is probably slow,” it is wrong about 70% of the time. Measurement is the first step.
timeit — small-unit measurement
import timeit
# measure one line
t = timeit.timeit("sum(range(1000))", number=10_000)
print(f"avg {t / 10_000 * 1e6:.2f} μs/call")
# setup code
t = timeit.timeit(
    stmt="d.get('key')",
    setup="d = {'key': 1}",
    number=1_000_000,
)
Good for small-unit comparisons such as “is a list comprehension faster than map” or “is an f-string faster than +”.
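Single timings are noisy, so it can help to repeat the measurement and keep the best run. The following is a minimal sketch of that idea; the two candidate functions (squares_comprehension, squares_map) and the helper best_time are hypothetical names chosen for illustration, not part of the chapter.

```python
import timeit


def squares_comprehension(n: int) -> list[int]:
    """Candidate A (hypothetical): list comprehension."""
    return [i * i for i in range(n)]


def squares_map(n: int) -> list[int]:
    """Candidate B (hypothetical): map with a lambda."""
    return list(map(lambda i: i * i, range(n)))


def best_time(func, n: int = 1_000, number: int = 200, repeat: int = 5) -> float:
    """Return the best per-call time in seconds over several repeats."""
    timings = timeit.repeat(lambda: func(n), number=number, repeat=repeat)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(timings) / number


if __name__ == "__main__":
    for func in (squares_comprehension, squares_map):
        print(f"{func.__name__}: {best_time(func) * 1e6:.2f} μs/call")
```

Taking the minimum rather than the mean is a common convention for micro-benchmarks; the absolute numbers will differ between machines and Python versions.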
CLI works too:
python -m timeit -s "import json" "json.dumps({'a': 1})"
# 1000000 loops, best of 5: 322 ns per loop
cProfile: function-level profiling
Shows where CPU time goes, per function.
python -m cProfile -s cumulative myapp.py
# sort by cumulative time
Or in code:
import cProfile
import pstats
with cProfile.Profile() as pr:
    do_work()
stats = pstats.Stats(pr).sort_stats("cumulative")
stats.print_stats(20)  # top 20
Output:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 2.345 2.345 myapp.py:10(main)
1000 0.500 0.001 1.800 0.002 myapp.py:50(process_item)
100000 0.700 0.000 0.700 0.000 myapp.py:80(parse_line)
How to read it:
- tottime: time spent directly in that function’s body (excluding child functions)
- cumtime: cumulative time of that function plus all the functions it calls
- ncalls: call count
Hotspot candidates: functions with a high tottime, or, when a function has a high cumtime but a low tottime, the functions it calls.
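To see tottime and cumtime side by side on something runnable, here is a small sketch. The function names mirror the sample output above, but their bodies are invented for illustration, and profile_report is a hypothetical helper that captures the report as a string instead of printing it.

```python
import cProfile
import io
import pstats


def parse_line(line: str) -> int:
    """Leaf function: its cost shows up as its own tottime."""
    return sum(ord(ch) for ch in line)


def process_item(item: int) -> int:
    """Calls parse_line, so its cumtime includes the time spent there."""
    return sum(parse_line(f"line-{item}-{i}") for i in range(100))


def main() -> int:
    return sum(process_item(i) for i in range(100))


def profile_report(top: int = 5) -> str:
    """Profile main() and return the top entries sorted by cumulative time."""
    with cProfile.Profile() as pr:
        main()
    buffer = io.StringIO()
    stats = pstats.Stats(pr, stream=buffer).sort_stats("cumulative")
    stats.print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_report())
```

In the report, main should have a large cumtime and a tiny tottime, while the work done inside parse_line appears as tottime further down the call chain.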
Visualization: snakeviz
uv add --dev snakeviz
python -m cProfile -o profile.out myapp.py
uvx snakeviz profile.out
The function call tree opens in the browser as an interactive chart similar to a flame graph, which is more intuitive than the text output.
py-spy — profiling a running process
cProfile’s drawback: the program has to be launched under the profiler or wrapped in code. When you want to attach to a production process that is already running, use py-spy.
uvx py-spy@latest top --pid 12345
# or start a new process
uvx py-spy@latest record -o flame.svg -- python myapp.py
- top mode: real-time per-function CPU usage (like the top command)
- record mode: samples for a period, then produces a flame graph SVG
Value of py-spy:
- No source modification
- Sampling-based, so overhead is very low (5 ~ 10%)
- C extensions visible too (with --native), so you can analyze the insides of NumPy as well
- Shows GIL usage: --gil keeps only samples from threads holding the GIL, and --idle also includes idle threads
A tool for seeing “what’s slow right now” on the spot in production / staging.
line_profiler — line-level profiling
cProfile works at the function level. When you want to see which line inside a function is slow, use line_profiler.
uv add --dev line_profiler
Attach the @profile decorator to the target function (kernprof injects it at run time, so no import is needed).
@profile
def process(items):
    parsed = [parse(x) for x in items]  # per-line time measurement
    filtered = [x for x in parsed if x.valid]
    return filtered
uv run kernprof -l -v myapp.py
Output:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2 1 1234567.0 1234567.0 85.3 parsed = [parse(x) for x in items]
3 1 200000.0 200000.0 13.8 filtered = [x for x in parsed if x.valid]
4 1 12000.0 12000.0 0.8 return filtered
You can see at a glance which line of the function takes what share of the time, which is useful for detailed optimization. The measurement overhead is large because every line is instrumented, so use it after the hotspot has been narrowed down.
Memory profiling — memray
Memory needs measuring about as often as CPU. Bloomberg’s memray has become a widely used tool for this.
uv add --dev memray
uv run memray run -o output.bin myapp.py  # writes the capture to output.bin
uv run memray flamegraph output.bin  # HTML report
It tracks memory leaks, peak memory usage spots, and allocation call trees, including allocations made in native code.
tracemalloc — standard library
A lighter tool with no extra installation.
import tracemalloc
tracemalloc.start()
# ... work ...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
for stat in top_stats[:10]:
    print(stat)
Shows which lines hold the most memory. Good as a light first step.
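When the question is "what grew between two points" rather than "what is large right now", two snapshots can be compared. A minimal sketch, assuming a made-up workload (build_table) and a hypothetical helper (top_growth):

```python
import tracemalloc


def build_table(n: int) -> list[str]:
    """Hypothetical workload: allocate n small strings."""
    return [f"row-{i}" for i in range(n)]


def top_growth(n: int = 50_000, limit: int = 3) -> list[tracemalloc.StatisticDiff]:
    """Return the source lines whose allocations grew most while building the table."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    table = build_table(n)  # keep a reference so the memory is still allocated
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    del table
    # compare_to sorts by the absolute size difference, largest first.
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    for diff in top_growth():
        print(diff)
```

The first entry should point at the list comprehension inside build_table, with a positive size difference.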
Common CPython performance pitfalls
1) Global vs local variables
Frequently referencing globals inside a function is slow. Receiving them once into a local is faster.
import math
# Slow: math and math.sqrt are looked up on every iteration
def process(items):
    return [math.sqrt(x) for x in items]
# ✅ Bind to local
def process(items):
    sqrt = math.sqrt
    return [sqrt(x) for x in items]
A small difference, but meaningful in hot loops.
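Whether the local binding pays off depends on the Python version and the loop, so it is worth timing both variants. A sketch under that assumption; process_global and process_local are renamed copies of the two versions above so that both can live in one file:

```python
import math
import timeit


def process_global(items):
    """Looks up math, then math.sqrt, on every iteration."""
    return [math.sqrt(x) for x in items]


def process_local(items):
    """Binds math.sqrt to a local name once, outside the loop."""
    sqrt = math.sqrt
    return [sqrt(x) for x in items]


if __name__ == "__main__":
    data = list(range(10_000))
    for func in (process_global, process_local):
        best = min(timeit.repeat(lambda: func(data), number=100, repeat=5))
        print(f"{func.__name__}: {best / 100 * 1e6:.1f} μs/call")
```

If the two numbers come out nearly identical on your interpreter, the readability cost of the extra local is probably not worth paying.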
2) Accumulating strings with +=
result = ""
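for s in strings:
    result += s  # may copy the whole string each time
# ✅ join: O(n)
result = "".join(strings)
Strings are immutable, so += conceptually builds a new object every time, which makes the loop O(n²) in the worst case. CPython can sometimes extend the string in place, but that is an implementation detail that should not be relied on; join is linear everywhere.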
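The gap is easy to measure. Below is a minimal benchmark sketch; concat_plus, concat_join, and compare are hypothetical helper names, and the sizes are arbitrary.

```python
import timeit


def concat_plus(parts: list[str]) -> str:
    """Accumulate with +=, one piece at a time."""
    result = ""
    for s in parts:
        result += s
    return result


def concat_join(parts: list[str]) -> str:
    """Build the result in a single pass with join."""
    return "".join(parts)


def compare(n: int, number: int = 20) -> tuple[float, float]:
    """Return (plus_seconds, join_seconds) per call for n short strings."""
    parts = ["x" * 10] * n
    plus = timeit.timeit(lambda: concat_plus(parts), number=number) / number
    join = timeit.timeit(lambda: concat_join(parts), number=number) / number
    return plus, join


if __name__ == "__main__":
    for n in (10, 100, 10_000, 100_000):
        plus, join = compare(n)
        print(f"n={n:>7}: += {plus * 1e6:10.1f} μs   join {join * 1e6:10.1f} μs")
```

How large the gap is depends on the interpreter and on how the accumulated string is referenced, which is one more reason to measure rather than assume.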
3) Searching a list with in
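if x in big_list:  # O(n): compares element by element
    ...
# ✅ Convert to set: O(1) average lookup
big_set = set(big_list)  # one-time O(n) conversion
if x in big_set:
    ...
If lookups are frequent, switch to set / dict. Building the set is itself O(n), so the conversion only pays off when the set is reused for many lookups.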
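A quick way to see the difference is to time the worst case for the list, where the target is the last element. A minimal sketch; membership_times is a hypothetical helper and the sizes are arbitrary.

```python
import timeit


def membership_times(n: int = 100_000, number: int = 100) -> tuple[float, float]:
    """Return (list_seconds, set_seconds) per lookup of the worst-case element."""
    big_list = list(range(n))
    big_set = set(big_list)
    target = n - 1  # last element: the list has to scan everything before it
    list_time = timeit.timeit(lambda: target in big_list, number=number) / number
    set_time = timeit.timeit(lambda: target in big_set, number=number) / number
    return list_time, set_time


if __name__ == "__main__":
    list_time, set_time = membership_times()
    print(f"list: {list_time * 1e6:.2f} μs/lookup, set: {set_time * 1e6:.2f} μs/lookup")
```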
4) Wrong data structures
| Job | Data structure |
|---|---|
| Push / pop at both ends | collections.deque (list’s pop(0) is O(n)) |
| Insert while keeping sorted order | bisect module |
| Counting | collections.Counter |
| Priority queue | heapq |
| Dict with defaults | collections.defaultdict |
Python’s standard library has almost all of these — don’t reinvent, pull and use them.
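For reference, here is one small usage example per row of the table. The data values are invented for illustration.

```python
import bisect
import heapq
from collections import Counter, defaultdict, deque

# deque: O(1) push / pop at both ends
queue = deque([1, 2, 3])
queue.appendleft(0)
first = queue.popleft()  # removes the 0 that was just added
last = queue.pop()       # removes 3 from the right end

# bisect: insert while keeping the list sorted
scores = [10, 20, 40]
bisect.insort(scores, 30)

# Counter: counting
word_counts = Counter("a b a c a b".split())

# heapq: priority queue (smallest item comes out first)
tasks: list[tuple[int, str]] = []
heapq.heappush(tasks, (2, "write tests"))
heapq.heappush(tasks, (1, "fix bug"))
next_task = heapq.heappop(tasks)

# defaultdict: missing keys get a default value (here an empty list)
groups: defaultdict[str, list[str]] = defaultdict(list)
for name in ("apple", "avocado", "banana"):
    groups[name[0]].append(name)

if __name__ == "__main__":
    print(list(queue), scores, word_counts.most_common(1), next_task, dict(groups))
```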
NumPy / vectorization
For numeric computation, NumPy instead of loops is almost always faster.
# Pure-Python loop
result = [a[i] * b[i] for i in range(len(a))]
# NumPy
import numpy as np
result = np.array(a) * np.array(b)  # the loop runs in compiled C code
Differences of 10x to 100x or more are common on large arrays. However, converting lists to arrays has a cost, so NumPy can be slower for small inputs. Measure, then apply.
Caching — functools.cache
A tool seen in Chapter 12 decorator patterns. The most effective optimization for pure functions called repeatedly with the same arguments.
from functools import cache
@cache
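def expensive(n: int) -> int:
    ...
The function must be pure, and its arguments must be hashable. The cache is unbounded, so consider functools.lru_cache with a maxsize when the argument space is large.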
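A concrete case where the effect is easy to see is a naive recursive function. The sketch below uses Fibonacci as an illustrative example (it is not from the chapter) and inspects the cache with cache_info().

```python
from functools import cache


@cache
def fib(n: int) -> int:
    """Naive recursive Fibonacci; exponential without the cache, linear with it."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


if __name__ == "__main__":
    print(fib(80))
    print(fib.cache_info())  # hits / misses show how much work was reused
```

Each distinct argument is computed once (a miss) and every repeated call is served from the cache (a hit); fib.cache_clear() resets it.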
__slots__ — save instance memory
A tool seen in Chapter 8 dataclass and __slots__. When you build tens of thousands of objects, this brings the biggest effect.
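from dataclasses import dataclass
@dataclass(slots=True)  # requires Python 3.10+
class Point:
    x: float
    y: float
Roughly 40 ~ 50% memory saving per instance and a 10 ~ 25% attribute access speed-up; exact figures depend on the Python version and the number of fields.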
Cython / Rust extensions: the last resort
When pure Python can’t do it, go down to the C level.
- Cython: Python-like syntax that compiles to C. Allows gradual conversion.
- PyO3 (Rust): write extension modules in Rust. maturin is the build tool.
- mypyc: compiles type-hinted Python to C extensions (mypy itself is compiled with it).
Common rule: hotspots only. Don’t migrate the whole codebase; moving the narrow part found by cProfile is the best return on cost.
Other interpreters: a quick check
- PyPy: a separate implementation with a JIT compiler. Long-running pure Python code often runs several times faster. Its weakness is C extension compatibility, so it is a poor fit for NumPy / Pandas heavy code.
- Free-threaded CPython (Chapter 19 GIL and concurrency): about 5 ~ 10% single-thread loss (the figure varies by release), big gain for multi-threaded CPU-bound code.
Depending on the situation, swapping the interpreter itself can be the biggest change.
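Performance of async
When measuring the performance of the async code from Chapter 18 async in depth:
PYTHONASYNCIODEBUG=1 uvx py-spy@latest record -o async.svg -- python app.py
py-spy works on async code too: the flame graph shows which coroutines are using CPU time on the event loop thread. PYTHONASYNCIODEBUG=1 turns on asyncio debug mode, which logs callbacks that block the loop for longer than 100 ms by default.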
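The effect of asyncio debug mode can also be observed directly from code. The sketch below turns it on with asyncio.run(..., debug=True) and captures the warning that asyncio logs when a step blocks the loop; blocking_step, ListHandler, and run_with_debug are hypothetical names for illustration.

```python
import asyncio
import logging
import time


class ListHandler(logging.Handler):
    """Collects log messages in memory so they can be inspected afterwards."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


async def blocking_step() -> None:
    # Deliberate mistake: time.sleep blocks the event loop.
    # Non-blocking code would use `await asyncio.sleep(0.2)` instead.
    time.sleep(0.2)


async def main() -> None:
    await blocking_step()


def run_with_debug() -> list[str]:
    """Run main() in asyncio debug mode and return what the asyncio logger emitted."""
    handler = ListHandler()
    logger = logging.getLogger("asyncio")
    logger.addHandler(handler)
    try:
        asyncio.run(main(), debug=True)
    finally:
        logger.removeHandler(handler)
    return handler.messages


if __name__ == "__main__":
    for message in run_with_debug():
        print(message)
```

The threshold for the warning is the loop's slow_callback_duration attribute, which defaults to 0.1 seconds.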
In practice: performance debugging flow
- Reproducible benchmark: measurement only matters if the same input gives the same result in about the same time
- Check total time with time: whether it is 1 second or 1 minute decides the tool choice
- cProfile or py-spy to find hotspots
- line_profiler for per-line analysis inside hot functions
- Check for common pitfalls: list in, global lookups, string accumulation
- Swap data structures: set / deque / Counter, etc.
- Vectorization: whether NumPy applies
- Caching: repeated calls with the same arguments?
- C-level extensions: last resort
At each step, re-measure to confirm it actually got faster. “This will be faster” intuition is often wrong.
Exercises
- Use timeit to compare code that accumulates 10,000 strings with += against code that joins them with "".join(). Vary n through 10 / 100 / 10000 / 100000 and observe where the difference between O(n²) and O(n) becomes obvious.
- Attach cProfile to actual code. (If you don’t have your own code, the simple mathkit module from Chapter 7 works.) Then visualize with snakeviz. Find the function with the largest cumulative time, apply a one-line optimization (a data structure swap, for example), and re-measure to confirm the effect.
- With tracemalloc, find the line in your code that holds the most memory. Measure the same job again with memray and compare the output of the two tools.
In one line: 70% of optimization without measurement is wasted. Toolbox: timeit (micro) / cProfile + snakeviz (function) / py-spy (running) / line_profiler (line) / memray + tracemalloc (memory). Common pitfalls are global lookups, string +=, list in, wrong data structures. Data structures (deque / bisect / Counter / heapq / defaultdict) → vectorization (NumPy) → caching (@cache) → __slots__ → C extensions (Cython / PyO3 / mypyc) → interpreter (PyPy / free-threaded). Re-measure at each step.
Part 3 wrap-up
Over the 7 chapters of Part 3, the depth and concurrency toolbox has been filled.
- Magic methods — hooks where objects meet language features
- Descriptors — turning attributes into objects
- Metaclasses — classes that build classes (usually not used)
- Async in depth — event loop, Future / Task, async generator
- GIL and concurrency — threading vs multiprocessing vs asyncio + free-threaded
- Advanced typing — variance, ParamSpec, TypeIs, overload, Annotated
- Performance — measurement tools and optimization patterns
With this, the 21 chapters covering Part 1 (intro) → Part 2 (structuring) → Part 3 (depth and concurrency) are complete. Next, Part 4 FastAPI in production is the stage where every tool sharpened so far comes together in one project.
Next chapter
Next is Chapter 22, starting and setting up FastAPI, which opens Part 4: FastAPI in production. Part 4 has 6 chapters plus 2 new ones (Chapter 24 Pydantic v2 in depth and Chapter 29 TODO API capstone). Chapter 22 covers Hello FastAPI, OpenAPI auto-generation, and the first project setup with uv.