Modern Python Advanced #7 Performance — cProfile, py-spy, Memory Profiling

The last post of the advanced series — performance. When you get a “this is slow” report, here is the toolbox for measuring where and how it’s slow, and fixing it: timeit, cProfile, py-spy, line_profiler, memray, and common optimization patterns.

First rule — don’t optimize without measuring #

A famous quote
"Premature optimization is the root of all evil." — Donald Knuth

It can sound a little tired by now, but it’s almost always right. Intuition about “this part is going to be slow” is wrong more often than not, so measurement is step one.

timeit — measuring small units #

timeit
import timeit

# Time a one-liner
t = timeit.timeit("sum(range(1000))", number=10_000)
print(f"average {t / 10_000 * 1e6:.2f} μs/run")

# With setup code
t = timeit.timeit(
    stmt="d.get('key')",
    setup="d = {'key': 1}",
    number=1_000_000,
)

Useful for comparing small units — “is a list comprehension faster than map,” “is an f-string faster than +,” that kind of question.

It also works from the CLI:

CLI
python -m timeit -s "import json" "json.dumps({'a': 1})"
# 1000000 loops, best of 5: 322 ns per loop
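As a rough sketch of that kind of comparison, the snippet below times a list comprehension against map with a lambda. The statements and the `data` list are made up for illustration; `timeit.repeat` runs several rounds, and taking the minimum is the usual way to reduce noise from other processes.

```python
import timeit

# Two ways to square 1,000 numbers; both build the same list.
setup = "data = list(range(1000))"
candidates = {
    "list comprehension": "[x * x for x in data]",
    "map + lambda": "list(map(lambda x: x * x, data))",
}

NUMBER = 500  # executions per round

results = {}
for label, stmt in candidates.items():
    # repeat() returns one total time per round; keep the fastest round.
    totals = timeit.repeat(stmt, setup=setup, number=NUMBER, repeat=5)
    results[label] = min(totals) / NUMBER
    print(f"{label:>20}: {results[label] * 1e6:.2f} μs/run")
```

The absolute numbers depend on your machine and Python version, so only compare results from the same run.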

cProfile — function-level profiling #

Shows where CPU time is spent, per function.

Running cProfile
python -m cProfile -s cumulative myapp.py
# sorted by cumulative time

Or from code:

From code
import cProfile
import pstats

with cProfile.Profile() as pr:
    do_work()

stats = pstats.Stats(pr).sort_stats("cumulative")
stats.print_stats(20)    # top 20

Output:

cProfile output
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    2.345    2.345 myapp.py:10(main)
     1000    0.500    0.001    1.800    0.002 myapp.py:50(process_item)
   100000    0.700    0.000    0.700    0.000 myapp.py:80(parse_line)

How to read it:

  • tottime — time spent directly in the body of that function (excluding child calls)
  • cumtime — cumulative time of that function plus all its children
  • ncalls — number of calls

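Hot-spot candidates: functions with a large tottime (the body itself is expensive), and functions with a large cumtime but a small tottime (the cost is in something they call, so follow the calls downward). In the sample above, parse_line is where 0.7 s is actually spent.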
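To see the tottime/cumtime distinction on something you can run, here is a minimal sketch with a toy workload. The function names mirror the sample output but the bodies are invented; the `report` helper is also just an assumption for illustration, writing the pstats table into a string so you can sort the same profile two ways.

```python
import cProfile
import io
import pstats


def parse_line(line: str) -> int:
    # Does the actual work: split the line and add up the numbers.
    return sum(int(part) for part in line.split(","))


def process_item(lines: list[str]) -> int:
    # Mostly delegates to parse_line, so its cumtime far exceeds its tottime.
    return sum(parse_line(line) for line in lines)


def main() -> int:
    lines = ["1,2,3,4,5"] * 200
    return sum(process_item(lines) for _ in range(50))


with cProfile.Profile() as pr:
    total = main()


def report(sort_key: str, limit: int = 8) -> str:
    """Return the profile table as text, sorted by the given pstats key."""
    buffer = io.StringIO()
    pstats.Stats(pr, stream=buffer).sort_stats(sort_key).print_stats(limit)
    return buffer.getvalue()


print(report("tottime"))      # where time is spent directly
print(report("cumulative"))   # which call chains are expensive overall
```

Sorting by tottime points at the code doing the work; sorting by cumulative shows the path from main down to it.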

Visualization — snakeviz #

snakeviz
uv add --dev snakeviz
python -m cProfile -o profile.out myapp.py
uvx snakeviz profile.out

You see the call tree as a flame-graph-like view in the browser. Much more intuitive than text output.

py-spy — profiling running processes #

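cProfile’s downsides: you have to start the program under the profiler (or wrap the code yourself), and its deterministic tracing adds noticeable overhead. When you want to attach to a production process that is already running, py-spy is the answer.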

py-spy
uvx py-spy@latest top --pid 12345
# or start a new process
uvx py-spy@latest record -o flame.svg -- python myapp.py

top mode: real-time per-function CPU usage (like the top command). record mode: record for a duration and emit a flame graph SVG.

Why py-spy is valuable:

  • No source modification needed
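  • Sampling-based, so overhead is very low (on the order of 5 to 10%)
  • Can show C extensions: with --native it also samples native frames, so you can look inside NumPy and similar libraries
  • GIL activity is visible too: top displays the GIL percentage, --gil keeps only samples that hold the GIL, and --idle includes idle threads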

A tool for seeing “what’s slow right now” in production / staging on the fly.

line_profiler — line-level profiling #

cProfile is per-function. line_profiler is for when you want to see which line inside a function is slow.

line_profiler
uv add --dev line_profiler

Attach @profile to the target function. kernprof injects the name at run time, so no import is needed, but running the script without kernprof raises a NameError.

Target function
@profile
def process(items):
    parsed = [parse(x) for x in items]    # measure each line
    filtered = [x for x in parsed if x.valid]
    return filtered

Run
uv run kernprof -l -v myapp.py

Output:

line_profiler output
Line #      Hits         Time  Per Hit  % Time  Line Contents
==============================================================
     2         1     1234567.0  1234567.0   85.3      parsed = [parse(x) for x in items]
     3         1      200000.0   200000.0   13.8      filtered = [x for x in parsed if x.valid]
     4         1       12000.0    12000.0    0.8      return filtered

You see at a glance which line in the function dominates. Very useful for detailed optimization. Note that the measurement overhead is high (it inserts instrumentation), so use it after you’ve narrowed down the hot spot.

Memory profiling — memray #

You should measure memory as often as you measure CPU. Bloomberg’s memray is the go-to tool for this.

memray
uv add --dev memray
uv run memray run -o output.bin myapp.py   # writes output.bin
uv run memray flamegraph output.bin  # HTML report

It covers memory leak tracing, peak usage location, and the allocation call tree, and it tracks allocations made by native code as well (add --native to memray run to see native stack frames).

tracemalloc — standard library #

A lighter-weight tool that requires no extra install.

tracemalloc
import tracemalloc

tracemalloc.start()

# ... work ...

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
for stat in top_stats[:10]:
    print(stat)

It shows you, line by line, where memory is most heavily held at that point in time. A nice lightweight first step.
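When you suspect a leak, a single snapshot is often less useful than the difference between two. A minimal sketch, assuming a made-up workload (`leaky_cache`) that keeps about 5 MB alive between the snapshots:

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Hypothetical workload: a cache that keeps growing and is never cleared.
leaky_cache = [bytes(1_000) for _ in range(5_000)]

after = tracemalloc.take_snapshot()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# compare_to() returns per-line differences, biggest change first.
growth = after.compare_to(before, "lineno")
for stat in growth[:3]:
    print(stat)
print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
```

The first entry should point at the line that builds `leaky_cache`. In a real service you would take the snapshots some minutes apart and look for lines whose size keeps climbing.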

Common CPython performance pitfalls #

1) Global vs local variables #

Frequently referencing globals inside a function is slow. Capturing them to a local once is faster.

🚫 Repeated global lookups
import math

def process(items):
    return [math.sqrt(x) for x in items]  # math, math.sqrt looked up each time

# ✅ Bind to local
def process(items):
    sqrt = math.sqrt
    return [sqrt(x) for x in items]

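A small difference, and smaller still on CPython 3.11+, where the specializing interpreter caches global and attribute lookups. It can still be meaningful in hot loops, so measure.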
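Since the gain varies by Python version, it’s worth measuring on your own interpreter. A minimal sketch using timeit; the function names `with_global` and `with_local` are just labels for the two variants above.

```python
import math
import timeit


def with_global(items):
    # Looks up the global `math`, then its `sqrt` attribute, for every element.
    return [math.sqrt(x) for x in items]


def with_local(items):
    sqrt = math.sqrt  # one lookup up front, then a fast local variable
    return [sqrt(x) for x in items]


items = list(range(1, 10_001))
for fn in (with_global, with_local):
    best = min(timeit.repeat(lambda: fn(items), number=50, repeat=3))
    print(f"{fn.__name__}: {best / 50 * 1e6:.0f} μs/call")
```

If the two numbers are within noise of each other, keep the more readable version.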

2) Accumulating strings with += #

🚫 O(n²)
result = ""
for s in strings:
    result += s    # creates a new string every time

# ✅ join — O(n)
result = "".join(strings)

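Strings are immutable, so += may build a new string and copy everything accumulated so far each time. CPython can sometimes extend the string in place when nothing else references it, but that is an implementation detail you shouldn’t rely on; join is reliably linear.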
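A quick way to check this on your own interpreter is to time the variants side by side. The sketch below adds io.StringIO as a third option, which is handy when the pieces arrive incrementally instead of sitting in a list; the input data is made up.

```python
import io
import timeit

strings = [f"line {i}\n" for i in range(20_000)]


def concat_plus() -> str:
    result = ""
    for s in strings:
        result += s
    return result


def concat_join() -> str:
    return "".join(strings)


def concat_stringio() -> str:
    # Append to an in-memory buffer, then read the whole text once.
    buffer = io.StringIO()
    for s in strings:
        buffer.write(s)
    return buffer.getvalue()


for fn in (concat_plus, concat_join, concat_stringio):
    best = min(timeit.repeat(fn, number=20, repeat=3))
    print(f"{fn.__name__}: {best / 20 * 1e3:.2f} ms/call")
```

On CPython the += version may look better than O(n²) suggests because of the in-place optimization; other interpreters and less trivial loops won’t necessarily get that.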

3) Searching a list with in #

🚫 in on a list — O(n)
if x in big_list:    # compares element by element
    ...

# ✅ Convert to a set once (O(n)), then each lookup is O(1) on average
big_set = set(big_list)
if x in big_set:
    ...

If lookup frequency is high, switch to set/dict.
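To get a feel for the gap, here is a small measurement sketch. It searches for a value that isn’t present, which is the worst case for the list because every element gets compared. The sizes are arbitrary.

```python
import timeit

big_list = list(range(100_000))
big_set = set(big_list)   # one-time O(n) conversion
missing = -1              # not present, so the list scan visits every element

list_time = timeit.timeit(lambda: missing in big_list, number=200)
set_time = timeit.timeit(lambda: missing in big_set, number=200)

print(f"list: {list_time / 200 * 1e6:.1f} μs/lookup")
print(f"set:  {set_time / 200 * 1e6:.3f} μs/lookup")
```

The conversion itself costs one pass over the list, so it only pays off when you look up more than once.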

4) Wrong data structure #

Job                            Data structure
push/pop on both ends          collections.deque (list’s pop(0) is O(n))
Insert while keeping sorted    bisect module
Counting                       collections.Counter
Priority queue                 heapq
dict with default              collections.defaultdict

Almost everything is in Python’s standard library — don’t build it, use it.
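For reference, here is a compact sketch with one typical call for each of these structures. The sample values (scores, task names, words) are invented.

```python
import bisect
import heapq
from collections import Counter, defaultdict, deque

# deque: O(1) push/pop on both ends
queue = deque([1, 2, 3])
queue.appendleft(0)
first = queue.popleft()
last = queue.pop()

# bisect: insert into an already sorted list, keeping it sorted
scores = [10, 20, 40]
bisect.insort(scores, 30)

# Counter: count hashable items
counts = Counter("abracadabra")
top = counts.most_common(1)

# heapq: a plain list used as a min-heap, smallest tuple pops first
tasks = []
heapq.heappush(tasks, (2, "write docs"))
heapq.heappush(tasks, (1, "fix bug"))
heapq.heappush(tasks, (3, "refactor"))
priority, name = heapq.heappop(tasks)

# defaultdict: missing keys get a default value on first access
groups = defaultdict(list)
for word in ["apple", "avocado", "banana"]:
    groups[word[0]].append(word)

print(first, last, scores, top, (priority, name), dict(groups))
```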

NumPy / vectorization #

For numerical computation, NumPy instead of loops is almost always faster.

🚫 Python loop
result = [a[i] * b[i] for i in range(len(a))]

✅ NumPy vectorized
import numpy as np
result = np.array(a) * np.array(b)    # the loop runs in compiled C, not in the interpreter

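Order-of-magnitude differences are common once the data already lives in NumPy arrays. That said, converting Python lists to arrays has a cost (np.array(a) above pays it on every call), so for small inputs it can be slower. Measure first, then apply.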

Caching — functools.cache #

The tool you saw in Intermediate #5. The most effective optimization for pure functions called repeatedly with the same arguments.

cache
from functools import cache

@cache
def expensive(n: int) -> int:
    ...

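The function must be pure, and arguments must be hashable. The cache is unbounded, so when the argument space is large, use functools.lru_cache(maxsize=...) instead.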
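A small runnable illustration, using the classic recursive Fibonacci as a stand-in for an expensive pure function. `cache_info()` tells you whether the cache is actually being hit, which is worth checking before assuming the decorator helped.

```python
from functools import cache


@cache
def fib(n: int) -> int:
    # Without the cache this recursion repeats the same subproblems
    # exponentially often; with it, each n is computed once.
    return n if n < 2 else fib(n - 1) + fib(n - 2)


value = fib(30)
info = fib.cache_info()
print(value)
print(info)

fib.cache_clear()  # drop cached results, e.g. between benchmark runs
```

If `hits` stays near zero in your real workload, the arguments rarely repeat and the cache is only costing memory.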

__slots__ — saving instance memory #

What you saw in Intermediate #1. When you create tens of thousands of objects, this gives the biggest win.

dataclass(slots=True)
from dataclasses import dataclass

@dataclass(slots=True)
class Point:
    x: float
    y: float

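Typical results are roughly 40 to 50% memory savings per instance and 10 to 25% faster attribute access, though the exact numbers depend on the Python version and the number of fields.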

Cython / Rust extensions — the last weapon #

When pure Python isn’t enough, drop down to compiled code.

  • Cython — Python-like syntax compiled to C. Allows incremental conversion.
  • PyO3 (Rust) — Write extension modules in Rust. maturin is the build tool.
  • mypyc: compiles type-hinted Python modules to C extension modules (mypy itself is compiled this way).

The common rule: target only the hot spots. Don’t rewrite everything — moving just the narrow spots that cProfile identified gives the best cost-to-benefit ratio.

Other interpreters — worth a look #

  • PyPy: a separate implementation with a JIT compiler. Long-running pure Python code is often several times faster. C-extension compatibility is weak, so it doesn’t fit NumPy/Pandas-heavy code.
  • Free-threaded CPython (#5): roughly 5 to 10% slower on single-threaded code as of CPython 3.14 (the 3.13 build loses noticeably more), with big wins on CPU-bound multi-threaded code.

Depending on the situation, switching the interpreter itself can be the single biggest win.

Async performance #

When measuring async code from #4:

asyncio debug + profile
PYTHONASYNCIODEBUG=1 uvx py-spy@latest record -o async.svg -- python app.py

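py-spy works on async code too, with one caveat: it samples thread stacks, so it shows which coroutine is actually running on the event loop thread, that is, the code hogging the loop. Coroutines suspended at an await are not on the stack and don’t appear. asyncio debug mode complements this by logging any callback that runs longer than 100 ms.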
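The debug-mode side can be tried without any external tool. In the sketch below, the two handler coroutines are hypothetical: one blocks the loop with time.sleep, the other waits politely with asyncio.sleep. A small logging handler (also made up for this example) collects what asyncio reports.

```python
import asyncio
import logging
import time


class ListHandler(logging.Handler):
    """Collect log messages in a list so they can be inspected afterwards."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
logging.getLogger("asyncio").addHandler(handler)


async def blocking_handler() -> None:
    time.sleep(0.2)           # blocks the whole event loop (the bug)


async def polite_handler() -> None:
    await asyncio.sleep(0.2)  # yields to the loop while waiting


async def main() -> None:
    await asyncio.gather(blocking_handler(), polite_handler())


# debug=True has the same effect as PYTHONASYNCIODEBUG=1 for this loop.
asyncio.run(main(), debug=True)

slow = [m for m in handler.messages if m.startswith("Executing")]
for message in slow:
    print(message)
```

Only the blocking step should be reported; the time spent suspended in asyncio.sleep doesn’t count against the loop.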

In practice — performance debugging flow #

  1. A reproducible benchmark — same input, same result, same time, otherwise measurement is meaningless
  2. Check overall time with time — pick tools based on whether it’s 1 second or 1 minute
  3. Find hot spots with cProfile or py-spy
  4. Use line_profiler for line-level analysis of the hot function
  5. Check common pitfalls — list in, global lookups, string accumulation
  6. Change data structures — set/deque/Counter, etc.
  7. Vectorize — apply NumPy where possible
  8. Caching — same-arg repeated calls?
  9. C-level extensions — last resort

At each step, measure again to confirm it actually got faster. The “this will be faster” intuition is often wrong.
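Steps 1 and the final “measure again” can be sketched as a tiny harness: a seeded input generator so every run sees the same data, and a timing loop that reports the minimum and the median. Everything here (`make_input`, `bench`, the two dedupe functions) is a made-up example, not a standard API.

```python
import random
import statistics
import time


def make_input(n: int = 2_000, seed: int = 42) -> list[int]:
    """Same seed, same data: the benchmark input is reproducible."""
    rng = random.Random(seed)
    return [rng.randrange(1_000) for _ in range(n)]


def bench(fn, data, rounds: int = 5) -> tuple[float, float]:
    """Return (min, median) wall-clock seconds over several rounds."""
    timings = []
    for _ in range(rounds):
        payload = list(data)  # fresh copy so no round sees mutated input
        start = time.perf_counter()
        fn(payload)
        timings.append(time.perf_counter() - start)
    return min(timings), statistics.median(timings)


def dedupe_baseline(items):
    seen = []
    for x in items:
        if x not in seen:     # O(n) list scan per element
            seen.append(x)
    return seen


def dedupe_candidate(items):
    return list(dict.fromkeys(items))  # dicts preserve insertion order


data = make_input()
# Check the result first: a faster function that returns something else is a bug.
same_result = dedupe_baseline(data) == dedupe_candidate(data)

for fn in (dedupe_baseline, dedupe_candidate):
    best, median = bench(fn, data)
    print(f"{fn.__name__}: min {best * 1e3:.2f} ms, median {median * 1e3:.2f} ms")
```

Comparing the outputs before comparing the timings keeps an “optimization” from quietly changing behavior.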

Wrap-up + series retrospective #

The toolbox covered in this post:

  • timeit — small-unit measurement
  • cProfile + snakeviz — function-level profile
  • py-spy — running process, low overhead
  • line_profiler — line level
  • memray + tracemalloc — memory
  • Frequent pitfalls — global lookups, string +=, list in, wrong data structures
  • Data structures: deque, bisect, Counter, heapq, defaultdict
  • NumPy vectorization, functools.cache, __slots__
  • Last resorts: Cython, PyO3, mypyc, PyPy, free-threaded CPython
  • Flow: measure → hot spots → data structures/algorithms → vectorize → cache → extend → measure again

Series retrospective #

With these 7 posts, the Modern Python Advanced toolkit is complete.

  • Magic methods — hooks where objects meet the language
  • Descriptors — turning attributes into objects
  • Metaclasses — classes that make classes (usually you don’t use them)
  • Async deep dive — event loop, Future/Task, async generator
  • GIL and concurrency — threading vs multiprocessing vs asyncio + free-threaded
  • Advanced typing — variance, ParamSpec, TypeIs, overload, Annotated
  • Performance — measurement tools and optimization patterns

That completes the 21 posts of Modern Python Basics → Intermediate → Advanced. The next series is Modern Python in Practice: Building APIs with FastAPI (6 posts). It is where every tool you’ve sharpened so far comes together in one project.

  1. Setup and start — Hello FastAPI, automatic OpenAPI generation
  2. Routing, Pydantic models, dependency injection
  3. DB integration — SQLAlchemy 2.x + Alembic
  4. Authentication — OAuth2 password flow + JWT
  5. Async and background work
  6. Testing and deployment — pytest, Docker, Railway/Fly