Modern Python Advanced #7 Performance — cProfile, py-spy, Memory Profiling
The last post of the advanced series — performance. When you get a “this is slow” report, here is the toolbox for measuring where and how it’s slow, and fixing it: timeit, cProfile, py-spy, line_profiler, memray, and common optimization patterns.
First rule — don’t optimize without measuring
"Premature optimization is the root of all evil." — Donald Knuth
It always sounds a little tired to read, but it’s almost always right. When you guess “this part is going to be slow” by intuition, you’re wrong about 70% of the time. Measurement is step one.
timeit — measuring small units
import timeit
# Time a one-liner
t = timeit.timeit("sum(range(1000))", number=10_000)
print(f"average {t / 10_000 * 1e6:.2f} μs/run")
# With setup code
t = timeit.timeit(
    stmt="d.get('key')",
    setup="d = {'key': 1}",
    number=1_000_000,
)
Useful for comparing small units: “is a list comprehension faster than map,” “is an f-string faster than +,” that kind of question.
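As a rough sketch of that kind of comparison, the snippet below times a list comprehension against map using timeit.repeat and keeps the best of several runs, which tends to be less noisy than a single measurement. The helper name best_of and the two statements being compared are illustrative assumptions, not part of the original post.

```python
import timeit


def best_of(stmt: str, setup: str = "pass", number: int = 10_000, repeat: int = 5) -> float:
    """Return the best per-call time in seconds across `repeat` runs."""
    runs = timeit.repeat(stmt=stmt, setup=setup, number=number, repeat=repeat)
    return min(runs) / number


# Both statements build the same list from the same input.
setup = "data = list(range(100))"
comp = best_of("[x * 2 for x in data]", setup=setup)
mapped = best_of("list(map(lambda x: x * 2, data))", setup=setup)

print(f"list comprehension: {comp * 1e6:.2f} μs/call")
print(f"map + lambda:       {mapped * 1e6:.2f} μs/call")
```

The absolute numbers depend on the machine and the Python version, so only the relative difference on your own setup is meaningful.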
It also works from the CLI:
python -m timeit -s "import json" "json.dumps({'a': 1})"
# 1000000 loops, best of 5: 322 ns per loop
cProfile — function-level profiling
Shows where CPU time is spent, per function.
python -m cProfile -s cumulative myapp.py
# sorted by cumulative time
Or from code:
import cProfile
import pstats
with cProfile.Profile() as pr:
    do_work()  # placeholder for the code you want to profile
stats = pstats.Stats(pr).sort_stats("cumulative")
stats.print_stats(20)  # top 20
Output:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 2.345 2.345 myapp.py:10(main)
1000 0.500 0.001 1.800 0.002 myapp.py:50(process_item)
100000 0.700 0.000 0.700 0.000 myapp.py:80(parse_line)
How to read it:
- tottime: time spent directly in the body of that function (excluding child calls)
- cumtime: cumulative time of that function plus all its children
- ncalls: number of calls
Hot-spot candidates: a function with a large tottime is expensive in its own right. A function with a large cumtime but a small tottime is only passing the cost along, so follow it down to the callee that accounts for the time.
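To make the tottime/cumtime distinction concrete, here is a minimal sketch that profiles two made-up functions, parent and leaf, and reads the numbers back through pstats.Stats.get_stats_profile() (available from Python 3.9). The function names and workload sizes are assumptions chosen only for illustration.

```python
import cProfile
import pstats


def leaf(n: int) -> int:
    # Does the real work, so most of the time is its own (tottime).
    total = 0
    for i in range(n):
        total += i * i
    return total


def parent() -> int:
    # Mostly delegates: small tottime, large cumtime.
    total = 0
    for _ in range(200):
        total += leaf(2_000)
    return total


with cProfile.Profile() as pr:
    parent()

profile = pstats.Stats(pr).get_stats_profile()
for name in ("parent", "leaf"):
    fp = profile.func_profiles[name]
    print(f"{name:8s} ncalls={fp.ncalls:>4} tottime={fp.tottime:.3f} cumtime={fp.cumtime:.3f}")
```

In the printed rows, parent should show a cumtime close to the total run time but almost no tottime, while leaf carries the tottime. That is the pattern to look for when deciding which function to open up next.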
Visualization — snakeviz
uv add --dev snakeviz
python -m cProfile -o profile.out myapp.py
uvx snakeviz profile.out
You see the call tree as a flame-graph-like view in the browser. Much more intuitive than text output.
py-spy — profiling running processes
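cProfile’s downside: the program has to be started under the profiler, or the code has to be modified to wrap it, and its tracing adds noticeable overhead. When you want to attach to a process that is already running in production, py-spy is the answer.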
uvx py-spy@latest top --pid 12345
# or start a new process
uvx py-spy@latest record -o flame.svg -- python myapp.py
top mode: real-time per-function CPU usage (like the top command).
record mode: record for a duration and emit a flame graph SVG.
Why py-spy is valuable:
- No source modification needed
- Sampling-based — very low overhead (5~10%)
- Shows C extensions — with the --native flag it can analyze NumPy internals, etc.
- GIL hold time is shown too; the --idle option adds idle threads to the analysis
A tool for seeing “what’s slow right now” in production / staging on the fly.
line_profiler — line-level profiling
cProfile is per-function. line_profiler is for when you want to see which line inside a function is slow.
uv add --dev line_profiler
Attach @profile (injected by line_profiler) to the target function.
@profile
def process(items):
    parsed = [parse(x) for x in items]  # measure each line
    filtered = [x for x in parsed if x.valid]
    return filtered
uv run kernprof -l -v myapp.py
Output:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2 1 1234567.0 1234567.0 85.3 parsed = [parse(x) for x in items]
3 1 200000.0 200000.0 13.8 filtered = [x for x in parsed if x.valid]
4 1 12000.0 12000.0 0.8 return filtered
You see at a glance which line in the function dominates. Very useful for detailed optimization. Note that the measurement overhead is high (it inserts instrumentation), so use it after you’ve narrowed down the hot spot.
Memory profiling — memray
You should measure memory as often as you measure CPU. Bloomberg’s memray is the go-to tool for this.
uv add --dev memray
uv run memray run -o output.bin myapp.py # writes the capture to output.bin
uv run memray flamegraph output.bin # HTML report
Memory leak tracing, peak usage location, the allocation call tree: it even tracks native memory.
tracemalloc — standard library
A lighter-weight tool that requires no extra install.
import tracemalloc
tracemalloc.start()
# ... work ...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
for stat in top_stats[:10]:
    print(stat)
It shows you, line by line, where memory is most heavily held at that point in time. A nice lightweight first step.
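A single snapshot shows what is held right now. To see what grew during a piece of work, two snapshots can be diffed with Snapshot.compare_to. The sketch below assumes a made-up workload (a list named retained) purely to have something to measure.

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Hypothetical workload: keep a large list alive so it shows up in the diff.
retained = [str(i) for i in range(200_000)]

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Entries are sorted by the size of the change; a positive size_diff means
# memory allocated between the two snapshots that is still alive.
diff = after.compare_to(before, "lineno")
for stat in diff[:3]:
    print(stat)
```

The top entry should point at the line that builds retained. This is often enough to confirm or rule out a suspected leak before reaching for memray.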
Common CPython performance pitfalls
1) Global vs local variables
Frequently referencing globals inside a function is slow. Capturing them to a local once is faster.
import math
# ❌ Global lookup on every iteration
def process(items):
    return [math.sqrt(x) for x in items]  # math, math.sqrt looked up each time
# ✅ Bind to local
def process(items):
    sqrt = math.sqrt
    return [sqrt(x) for x in items]
A small difference, but meaningful in hot loops.
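Whether the local binding pays off on a given interpreter is something to measure rather than assume. A minimal sketch follows; the function names with_global_lookup and with_local_binding are illustrative.

```python
import math
import timeit


def with_global_lookup(items):
    # Resolves the global `math` and its attribute `sqrt` on every iteration.
    return [math.sqrt(x) for x in items]


def with_local_binding(items):
    sqrt = math.sqrt  # resolved once, then read as a fast local
    return [sqrt(x) for x in items]


items = list(range(10_000))

# The two versions must agree before their timings are worth comparing.
assert with_global_lookup(items) == with_local_binding(items)

for fn in (with_global_lookup, with_local_binding):
    best = min(timeit.repeat(lambda: fn(items), number=100, repeat=5))
    print(f"{fn.__name__}: {best / 100 * 1e3:.3f} ms per call")
```

The gap varies between CPython versions, so treat the output as a data point for your own environment.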
2) Accumulating strings with +=
result = ""
for s in strings:
    result += s  # creates a new string every time
# ✅ join — O(n)
result = "".join(strings)
Strings are immutable, so += can create a new object and copy everything accumulated so far on each iteration, which is O(n²) in the worst case. The bigger the string, the slower it gets.
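When the pieces are produced inside a loop rather than sitting in a list already, the same idea applies: collect the pieces in a list and join once at the end. A small sketch, using a hypothetical render_rows helper:

```python
def render_rows(rows):
    """Build one string from many pieces with a single join at the end."""
    parts = []
    for name, value in rows:
        parts.append(f"{name}={value}")  # appending to a list is cheap
    return "\n".join(parts)  # one pass, one final string


print(render_rows([("a", 1), ("b", 2)]))
# a=1
# b=2
```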
3) Searching a list with in
if x in big_list:  # compares every element, O(n)
    ...
# ✅ Convert to set — average O(1) per lookup
big_set = set(big_list)  # one-time O(n) conversion
if x in big_set:
    ...
If lookup frequency is high, switch to set/dict.
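A quick way to see the difference is to time the same membership test against both containers. The sketch below uses an arbitrary list of 100,000 integers and looks for the last element, which is the worst case for the list.

```python
import timeit

big_list = list(range(100_000))
big_set = set(big_list)  # one-time O(n) conversion
needle = 99_999  # last element: the list has to scan everything to find it

list_time = min(timeit.repeat(lambda: needle in big_list, number=100, repeat=3))
set_time = min(timeit.repeat(lambda: needle in big_set, number=100, repeat=3))

print(f"list: {list_time / 100 * 1e6:.1f} μs per lookup")
print(f"set:  {set_time / 100 * 1e6:.3f} μs per lookup")
```

The conversion itself costs one pass over the list, so it only pays off when the set is reused for many lookups.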
4) Wrong data structure
| Job | Data structure |
|---|---|
| push/pop on both ends | collections.deque (list’s pop(0) is O(n)) |
| Insert while keeping sorted | bisect module |
| Counting | collections.Counter |
| Priority queue | heapq |
| dict with default | collections.defaultdict |
Almost everything is in Python’s standard library — don’t build it, use it.
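For reference, each row of the table can be sketched in a few lines of standard-library code. The sample data is made up for illustration.

```python
import bisect
import heapq
from collections import Counter, defaultdict, deque

# deque: O(1) push/pop on both ends
queue = deque([1, 2, 3])
queue.appendleft(0)
first = queue.popleft()  # 0

# bisect: insert while keeping the list sorted
scores = [10, 20, 40]
bisect.insort(scores, 30)  # [10, 20, 30, 40]

# Counter: counting
top = Counter("banana").most_common(1)  # [('a', 3)]

# heapq: priority queue, smallest item comes out first
tasks = [(3, "low"), (1, "urgent"), (2, "normal")]
heapq.heapify(tasks)
next_task = heapq.heappop(tasks)  # (1, 'urgent')

# defaultdict: dict with a default value for missing keys
groups = defaultdict(list)
for word in ["apple", "avocado", "banana"]:
    groups[word[0]].append(word)

print(first, scores, top, next_task, dict(groups))
```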
NumPy / vectorization
For numerical computation, NumPy instead of loops is almost always faster.
result = [a[i] * b[i] for i in range(len(a))]
import numpy as np
result = np.array(a) * np.array(b)  # the loop runs in compiled C code
Differences of one to two orders of magnitude are common once the data already lives in NumPy arrays. That said, there is a data conversion cost, so for small arrays it can be slower. Measure first, then apply.
Caching — functools.cache
The tool you saw in Intermediate #5. The most effective optimization for pure functions called repeatedly with the same arguments.
from functools import cache
@cache
def expensive(n: int) -> int:
    ...
The function must be pure, and arguments must be hashable.
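A small sketch of the effect, using a recursive Fibonacci function as a stand-in for an expensive pure function. cache_info() reports how many calls were served from the cache.

```python
from functools import cache


@cache
def fib(n: int) -> int:
    # Pure function of a hashable argument, so it is safe to cache.
    return n if n < 2 else fib(n - 1) + fib(n - 2)


print(fib(30))  # 832040
info = fib.cache_info()
print(info)  # each n from 0 to 30 is computed once; repeats are cache hits
```

Without the decorator this function makes over a million calls for n = 30. With it, each distinct argument is computed exactly once. Note that @cache is unbounded; functools.lru_cache(maxsize=...) is the bounded variant when the argument space is large.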
__slots__ — saving instance memory
What you saw in Intermediate #1. When you create tens of thousands of objects, this gives the biggest win.
from dataclasses import dataclass
@dataclass(slots=True)  # slots=True requires Python 3.10+
class Point:
    x: float
    y: float
Roughly 40~50% memory savings per instance, 10~25% faster attribute access.
Cython / Rust extensions — the last weapon
When pure Python isn’t enough, drop to the C level.
- Cython — Python-like syntax compiled to C. Allows incremental conversion.
- PyO3 (Rust) — Write extension modules in Rust. maturin is the build tool.
- mypyc — Compiles type-hinted Python to C (mypy itself uses this approach).
The common rule: target only the hot spots. Don’t rewrite everything — moving just the narrow spots that cProfile identified gives the best cost-to-benefit ratio.
Other interpreters — worth a look
- PyPy — A separate implementation with a JIT compiler. Pure Python code is often 5~10x faster. Weak C-extension compatibility, so it doesn’t fit NumPy/Pandas-heavy code.
- Free-threaded CPython (#5) — 5~10% loss on single-thread, big wins on multi-thread.
Depending on the situation, switching the interpreter itself can be the single biggest win.
Async performance
When measuring async code from #4:
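PYTHONASYNCIODEBUG=1 uvx py-spy@latest record -o async.svg -- python app.py
py-spy works on async code too. Because it samples the call stack, it shows where the event-loop thread is actually spending time, which exposes blocking calls made inside coroutines. A coroutine that is merely suspended at an await is not on the stack and will not show up. PYTHONASYNCIODEBUG=1 turns on asyncio debug mode, which logs callbacks that take longer than 100 ms.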
In practice — performance debugging flow
- A reproducible benchmark — same input, same result, same time, otherwise measurement is meaningless
- Check overall time with time — pick tools based on whether it’s 1 second or 1 minute
- Find hot spots with cProfile or py-spy
- Use line_profiler for line-level analysis of the hot function
- Check common pitfalls — list in, global lookups, string accumulation
- Change data structures — set/deque/Counter, etc.
- Vectorize — apply NumPy where possible
- Caching — same-arg repeated calls?
- C-level extensions — last resort
At each step, measure again to confirm it actually got faster. The “this will be faster” intuition is often wrong.
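The "measure again" step can be sketched as a tiny harness: fix the input, check that the optimized version returns the same result, then time both. Everything named here (measure, dedupe_slow, dedupe_fast, the sample data) is a hypothetical example, not code from the series.

```python
import timeit


def measure(fn, *args, number=10, repeat=3):
    """Best per-call time in seconds over `repeat` runs of `number` calls."""
    return min(timeit.repeat(lambda: fn(*args), number=number, repeat=repeat)) / number


def dedupe_slow(items):
    # Baseline: membership test against a list, O(n) per lookup.
    out = []
    for x in items:
        if x not in out:
            out.append(x)
    return out


def dedupe_fast(items):
    # Candidate: dict keys are unique and keep first-seen order.
    return list(dict.fromkeys(items))


data = [i % 200 for i in range(2_000)]  # fixed input, so runs are comparable

# 1. The optimization must not change the result.
assert dedupe_slow(data) == dedupe_fast(data)

# 2. Measure both versions on the same input.
slow_t = measure(dedupe_slow, data)
fast_t = measure(dedupe_fast, data)
print(f"before: {slow_t * 1e3:.3f} ms, after: {fast_t * 1e3:.3f} ms")
```

Keeping the correctness check next to the timing makes it harder to "speed up" a function by quietly changing what it returns.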
Wrap-up + series retrospective
The toolbox covered in this post:
- timeit — small-unit measurement
- cProfile + snakeviz — function-level profile
- py-spy — running process, low overhead
- line_profiler — line level
- memray + tracemalloc — memory
- Frequent pitfalls — global lookups, string +=, list in, wrong data structures
- Data structures: deque, bisect, Counter, heapq, defaultdict
- NumPy vectorization, functools.cache, __slots__
- Last resorts: Cython, PyO3, mypyc, PyPy, free-threaded CPython
- Flow: measure → hot spots → data structures/algorithms → vectorize → cache → extend → measure again
Series retrospective
In 7 posts, the Modern Python Advanced toolkit is filled in.
- Magic methods — hooks where objects meet the language
- Descriptors — turning attributes into objects
- Metaclasses — classes that make classes (usually you don’t use them)
- Async deep dive — event loop, Future/Task, async generator
- GIL and concurrency — threading vs multiprocessing vs asyncio + free-threaded
- Advanced typing — variance, ParamSpec, TypeIs, overload, Annotated
- Performance — measurement tools and optimization patterns
That completes the 21 posts of Modern Python Basics → Intermediate → Advanced. The next series is Modern Python in Practice — Building APIs with FastAPI (6 posts). The place where every tool you’ve sharpened so far comes together in one project.
- Setup and start — Hello FastAPI, automatic OpenAPI generation
- Routing, Pydantic models, dependency injection
- DB integration — SQLAlchemy 2.x + Alembic
- Authentication — OAuth2 password flow + JWT
- Async and background work
- Testing and deployment — pytest, Docker, Railway/Fly