Modern Python Advanced #7 Performance — cProfile, py-spy, Memory Profiling

Thursday, April 30, 2026

The last post of the advanced series — performance. When you get a “this is slow” report, here is the toolbox for measuring where and how it’s slow, and fixing it: timeit, cProfile, py-spy, line_profiler, memray, and common optimization patterns.

First rule — don’t optimize without measuring #

A famous quote

"Premature optimization is the root of all evil." — Donald Knuth

It can sound a little tired, but it’s almost always right. Intuition about which part “is going to be slow” is wrong more often than not. Measurement is step one.

`timeit` — measuring small units #

timeit

import timeit

# Time a one-liner
t = timeit.timeit("sum(range(1000))", number=10_000)
print(f"average {t / 10_000 * 1e6:.2f} μs/run")

# With setup code
t = timeit.timeit(
    stmt="d.get('key')",
    setup="d = {'key': 1}",
    number=1_000_000,
)

Useful for comparing small units — “is a list comprehension faster than map,” “is an f-string faster than +,” that kind of question.

It also works from the CLI:

CLI

python -m timeit -s "import json" "json.dumps({'a': 1})"
# 1000000 loops, best of 5: 322 ns per loop
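A single timeit run can be noisy, so one common approach is to repeat the measurement and keep the lowest figure. The sketch below does that for the "list comprehension vs map" question; the helper name best_per_call and the sample statements are illustrative assumptions, not part of timeit itself.

```python
import timeit


def best_per_call(stmt: str, setup: str = "pass",
                  number: int = 2_000, repeat: int = 5) -> float:
    """Return the lowest time per call (in seconds) across several repeats."""
    totals = timeit.repeat(stmt=stmt, setup=setup, number=number, repeat=repeat)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(totals) / number


setup = "data = list(range(1000))"
comprehension = best_per_call("[x * 2 for x in data]", setup=setup)
mapped = best_per_call("list(map(lambda x: x * 2, data))", setup=setup)

print(f"comprehension: {comprehension * 1e6:.2f} μs/call")
print(f"map + lambda:  {mapped * 1e6:.2f} μs/call")
```

The absolute numbers depend on the machine and Python version, so only compare results taken in the same session.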

`cProfile` — function-level profiling #

Shows where CPU time is spent, per function.

Running cProfile

python -m cProfile -s cumulative myapp.py
# sorted by cumulative time

Or from code:

From code

import cProfile
import pstats

with cProfile.Profile() as pr:
    do_work()

stats = pstats.Stats(pr).sort_stats("cumulative")
stats.print_stats(20)    # top 20

Output:

cProfile output

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    2.345    2.345 myapp.py:10(main)
     1000    0.500    0.001    1.800    0.002 myapp.py:50(process_item)
   100000    0.700    0.000    0.700    0.000 myapp.py:80(parse_line)

How to read it:

tottime — time spent directly in the body of that function (excluding child calls)
cumtime — cumulative time of that function plus all its children
ncalls — number of calls

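Hot-spot candidates: functions with a large tottime (the body itself is slow), and functions with a large cumtime but small tottime (the cost sits in something they call, so follow the call chain down).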
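To see the tottime/cumtime distinction on something concrete, here is a minimal sketch that profiles a toy workload and prints the same run sorted both ways. The functions main, process_item, and parse_line are stand-ins that mirror the names in the sample output above; profile_report is a hypothetical helper.

```python
import cProfile
import io
import pstats


def parse_line(line: str) -> int:
    return sum(int(part) for part in line.split(","))


def process_item(lines: list[str]) -> int:
    return sum(parse_line(line) for line in lines)


def main() -> int:
    lines = ["1,2,3,4,5"] * 2_000
    return sum(process_item(lines) for _ in range(10))


def profile_report(sort_key: str, limit: int = 5) -> str:
    """Profile main() and return the top entries sorted by sort_key."""
    with cProfile.Profile() as profiler:
        main()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats(sort_key).print_stats(limit)
    return buffer.getvalue()


print(profile_report("tottime"))      # where the time is actually spent
print(profile_report("cumulative"))   # which call chains lead there
```

Sorted by cumulative time, main sits at the top even though it does almost no work itself; sorted by tottime, the parsing code rises instead.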

Visualization — snakeviz #

snakeviz

uv add --dev snakeviz
python -m cProfile -o profile.out myapp.py
uvx snakeviz profile.out

You see the call tree as a flame-graph-like view in the browser. Much more intuitive than text output.

`py-spy` — profiling running processes #

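cProfile’s downside: you have to start the program under the profiler (or wrap code in it), and its deterministic tracing adds noticeable overhead. When you want to attach to a process that is already running in production, py-spy is the answer.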

py-spy

uvx py-spy@latest top --pid 12345
# or start a new process
uvx py-spy@latest record -o flame.svg -- python myapp.py

top mode: real-time per-function CPU usage (like the top command). record mode: record for a duration and emit a flame graph SVG.

Why py-spy is valuable:

No source modification needed
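Sampling-based: it reads the target’s state from a separate process, so overhead is typically low enough for production use
Shows C extensions: with the --native flag (on platforms that support it), native frames from NumPy internals and the like appear in the stacks
GIL-aware: --gil keeps only samples from threads holding the GIL, and --idle includes idle threads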

A tool for seeing “what’s slow right now” in production / staging on the fly.

`line_profiler` — line-level profiling #

cProfile is per-function. line_profiler answers the next question: which line inside a function is slow?

line_profiler

uv add --dev line_profiler

Attach @profile to the target function. kernprof injects the name at run time, so no import is needed, but the script raises NameError if you run it without kernprof.

Target function

@profile
def process(items):
    parsed = [parse(x) for x in items]    # measure each line
    filtered = [x for x in parsed if x.valid]
    return filtered

Run

uv run kernprof -l -v myapp.py

Output:

line_profiler output

Line #      Hits         Time  Per Hit  % Time  Line Contents
==============================================================
     2         1     1234567.0  1234567.0   85.3      parsed = [parse(x) for x in items]
     3         1      200000.0   200000.0   13.8      filtered = [x for x in parsed if x.valid]
     4         1       12000.0    12000.0    0.8      return filtered

You see at a glance which line in the function dominates. Very useful for detailed optimization. Note that the measurement overhead is high (it inserts instrumentation), so use it after you’ve narrowed down the hot spot.

Memory profiling — `memray` #

You should measure memory as often as you measure CPU. Bloomberg’s memray is the go-to tool for this.

memray

uv add --dev memray
uv run memray run -o output.bin myapp.py   # write the capture to output.bin
uv run memray flamegraph output.bin         # HTML report

It covers leak tracing, peak-usage location, and the allocation call tree, and it tracks allocations made by native code as well as Python objects.

`tracemalloc` — standard library #

A lighter-weight tool that requires no extra install.

tracemalloc

import tracemalloc

tracemalloc.start()

# ... work ...

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
for stat in top_stats[:10]:
    print(stat)

It shows you, line by line, where memory is most heavily held at that point in time. A nice lightweight first step.
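A single snapshot shows what is held right now; to find what grew, you can diff two snapshots with compare_to. A minimal sketch, where leaky_cache is a hypothetical function standing in for code that keeps references alive:

```python
import tracemalloc


def leaky_cache(store: list, count: int) -> None:
    """Hypothetical leak: appends buffers that are never released."""
    for _ in range(count):
        store.append(bytearray(10_000))


tracemalloc.start()
store: list = []

before = tracemalloc.take_snapshot()
leaky_cache(store, 500)
after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Entries are sorted by the size of the change, largest first.
growth = after.compare_to(before, "lineno")
for stat in growth[:3]:
    print(stat)
```

The first entry should point at the line that allocates the bytearray, with a size difference of roughly 5 MB.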

Common CPython performance pitfalls #

1) Global vs local variables #

Frequently referencing globals inside a function is slow. Capturing them to a local once is faster.

🚫 Repeated global lookups

import math

def process(items):
    return [math.sqrt(x) for x in items]  # math, math.sqrt looked up each time

# ✅ Bind to local
def process(items):
    sqrt = math.sqrt
    return [sqrt(x) for x in items]

A small difference, but meaningful in hot loops.

2) Accumulating strings with `+=` #

🚫 O(n²)

result = ""
for s in strings:
    result += s    # creates a new string every time

# ✅ join — O(n)
result = "".join(strings)

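Strings are immutable, so += may build a new object and copy everything accumulated so far on each iteration. CPython can sometimes resize the string in place, but that is an implementation detail you shouldn’t rely on; join is linear regardless.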
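When the pieces are produced inside a loop rather than already sitting in a list, two common patterns avoid repeated concatenation: collect into a list and join once, or write into an io.StringIO buffer. A minimal sketch; the function names and the upper() transformation are only placeholders for real per-item work.

```python
import io
from typing import Iterable


def build_with_join(parts: Iterable[str]) -> str:
    """Collect the pieces in a list, then join them in one pass."""
    pieces = []
    for part in parts:
        pieces.append(part.upper())
    return "".join(pieces)


def build_with_stringio(parts: Iterable[str]) -> str:
    """Write the pieces into an in-memory text buffer."""
    buffer = io.StringIO()
    for part in parts:
        buffer.write(part.upper())
    return buffer.getvalue()


print(build_with_join(["a", "b", "c"]))
print(build_with_stringio(["a", "b", "c"]))
```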

3) Searching a list with `in` #

🚫 in on a list — O(n)

if x in big_list:    # compares element by element
    ...

# ✅ Convert to set once: O(1) average per lookup
big_set = set(big_list)
if x in big_set:
    ...

If lookup frequency is high, switch to set/dict.
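Building the set is itself an O(n) pass, so the conversion only pays off when it is done once and reused for many lookups. The sketch below measures both sides of that trade; the sizes and the missing value are arbitrary choices for illustration.

```python
import timeit

big_list = list(range(100_000))
big_set = set(big_list)
missing = -1  # worst case for the list: every element gets compared

# 100 membership tests against each container
list_time = timeit.timeit(lambda: missing in big_list, number=100)
set_time = timeit.timeit(lambda: missing in big_set, number=100)

# One-off cost of building the set from the list
build_time = timeit.timeit(lambda: set(big_list), number=1)

print(f"list lookups: {list_time * 1e3:.2f} ms")
print(f"set lookups:  {set_time * 1e3:.4f} ms")
print(f"set build:    {build_time * 1e3:.2f} ms")
```

If the build time is larger than the lookup time you save, as it would be for a single lookup, keep the list.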

4) Wrong data structure #

Job	Data structure
push/pop on both ends	`collections.deque` (list’s `pop(0)` is O(n))
Insert while keeping sorted	`bisect` module
Counting	`collections.Counter`
Priority queue	`heapq`
dict with default	`collections.defaultdict`

Almost everything is in Python’s standard library — don’t build it, use it.
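For reference, here is each structure from the table in its most basic form. The data is made up for illustration.

```python
import bisect
import heapq
from collections import Counter, defaultdict, deque

# deque: O(1) push/pop on both ends
queue = deque([1, 2, 3])
queue.appendleft(0)
first = queue.popleft()   # 0
last = queue.pop()        # 3

# bisect: insert while keeping the list sorted
scores = [10, 20, 40]
bisect.insort(scores, 30)  # scores is now [10, 20, 30, 40]

# Counter: counting hashable items
counts = Counter("mississippi")
top_two = counts.most_common(2)  # [('i', 4), ('s', 4)]

# heapq: priority queue on top of a plain list (smallest item first)
tasks = [(3, "write report"), (1, "fix outage"), (2, "review PR")]
heapq.heapify(tasks)
most_urgent = heapq.heappop(tasks)  # (1, 'fix outage')

# defaultdict: missing keys get a default value on first access
by_length = defaultdict(list)
for word in ["a", "bb", "cc", "d"]:
    by_length[len(word)].append(word)

print(first, last, scores, top_two, most_urgent, dict(by_length))
```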

NumPy / vectorization #

For numerical computation, NumPy array operations are almost always faster than Python loops.

🚫 Python loop

result = [a[i] * b[i] for i in range(len(a))]

✅ NumPy vectorized

import numpy as np
result = np.array(a) * np.array(b)    # the loop runs in compiled C code

Speedups of one to two orders of magnitude are common on large arrays. That said, converting data to arrays has a cost, so for small inputs it can be slower. Measure first, then apply.

Caching — `functools.cache` #

The tool you saw in Intermediate #5. The most effective optimization for pure functions called repeatedly with the same arguments.

cache

from functools import cache

@cache
def expensive(n: int) -> int:
    ...

The function must be pure, and arguments must be hashable.
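A quick way to confirm the cache is doing anything is cache_info(), which reports hits and misses. The sketch below uses a recursive Fibonacci as a stand-in for an expensive pure function, and shows lru_cache(maxsize=...) as the bounded alternative, since cache itself never evicts entries.

```python
from functools import cache, lru_cache


@cache
def fib(n: int) -> int:
    """Naive recursion; the cache turns it from exponential into linear."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


@lru_cache(maxsize=128)
def square(n: int) -> int:
    """Bounded cache: least recently used entries are evicted beyond 128."""
    return n * n


print(fib(30))            # 832040
print(fib.cache_info())   # hits, misses, and current size

try:
    fib([1, 2])           # a list is not hashable, so it cannot be a cache key
except TypeError as exc:
    print(f"unhashable argument: {exc}")
```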

`__slots__` — saving instance memory #

What you saw in Intermediate #1. When you create tens of thousands of objects, this gives the biggest win.

dataclass(slots=True)

from dataclasses import dataclass

@dataclass(slots=True)
class Point:
    x: float
    y: float

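Figures around 40~50% memory savings per instance and 10~25% faster attribute access are typical, but the exact numbers depend on the CPython version and the number of fields, so measure on your own classes.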

Cython / Rust extensions — the last weapon #

When pure Python isn’t enough, drop to the C level.

Cython — Python-like syntax compiled to C. Allows incremental conversion.
PyO3 (Rust) — Write extension modules in Rust. maturin is the build tool.
mypyc — Compiles type-hinted Python to C (mypy itself uses this approach).

The common rule: target only the hot spots. Don’t rewrite everything — moving just the narrow spots that cProfile identified gives the best cost-to-benefit ratio.

Other interpreters — worth a look #

PyPy: a separate implementation with a JIT compiler. Long-running pure-Python code is often several times faster. C extensions go through a slower compatibility layer, so it’s a poor fit for NumPy/Pandas-heavy code.
Free-threaded CPython (#5): some loss on single-threaded code (roughly 5~10%, depending on the version), big wins on CPU-bound multi-threaded code.

Depending on the situation, switching the interpreter itself can be the single biggest win.

Async performance #

When measuring async code from #4:

asyncio debug + profile

PYTHONASYNCIODEBUG=1 uvx py-spy@latest record -o async.svg -- python app.py

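py-spy works on async code too, but as a sampler it only sees what is currently executing on the event-loop thread. That makes it good at catching synchronous calls that stall the loop; a coroutine suspended at an await is not on the stack, so it won’t show up. asyncio debug mode fills part of that gap by logging any callback that runs longer than 100 ms.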
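The debug mode that PYTHONASYNCIODEBUG=1 switches on can also be enabled from code, and its slow-callback warnings go to the "asyncio" logger. The sketch below captures them for a coroutine that blocks the loop with time.sleep; ListHandler, blocking_step, and run_with_debug are hypothetical names for this example.

```python
import asyncio
import logging
import time


class ListHandler(logging.Handler):
    """Collect formatted log messages in a list for inspection."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


async def blocking_step() -> None:
    time.sleep(0.2)  # synchronous sleep: the event loop is stalled meanwhile


async def main() -> None:
    await blocking_step()


def run_with_debug() -> list[str]:
    """Run main() in asyncio debug mode and return the logged warnings."""
    handler = ListHandler()
    logger = logging.getLogger("asyncio")
    logger.addHandler(handler)
    try:
        asyncio.run(main(), debug=True)
    finally:
        logger.removeHandler(handler)
    return handler.messages


for message in run_with_debug():
    print(message)
```

The threshold is loop.slow_callback_duration, 0.1 seconds by default; replacing time.sleep with await asyncio.sleep makes the warning go away.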

In practice — performance debugging flow #

A reproducible benchmark — same input, same result, same time, otherwise measurement is meaningless
Check overall time with the shell’s time command, then pick tools based on whether it’s 1 second or 1 minute
Find hot spots with cProfile or py-spy
Use line_profiler for line-level analysis of the hot function
Check common pitfalls — list in, global lookups, string accumulation
Change data structures — set/deque/Counter, etc.
Vectorize — apply NumPy where possible
Caching — same-arg repeated calls?
C-level extensions — last resort

At each step, measure again to confirm it actually got faster. The “this will be faster” intuition is often wrong.
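The "measure again" step can be made mechanical with a small helper that first checks the optimized version returns the same result and only then compares timings. A minimal sketch; compare, dedupe_slow, and dedupe_fast are hypothetical names, and the workload is a toy.

```python
import timeit
from typing import Any, Callable


def compare(baseline: Callable[..., Any], candidate: Callable[..., Any],
            *args: Any, number: int = 20, repeat: int = 3) -> float:
    """Verify both functions agree, then return baseline time / candidate time."""
    if baseline(*args) != candidate(*args):
        raise AssertionError("candidate changed the result")
    base = min(timeit.repeat(lambda: baseline(*args), number=number, repeat=repeat))
    cand = min(timeit.repeat(lambda: candidate(*args), number=number, repeat=repeat))
    return base / cand


def dedupe_slow(items: list[int]) -> list[int]:
    seen: list[int] = []
    for item in items:
        if item not in seen:      # list membership: O(n) per check
            seen.append(item)
    return seen


def dedupe_fast(items: list[int]) -> list[int]:
    return list(dict.fromkeys(items))  # dicts preserve insertion order


data = [i % 200 for i in range(2_000)]
speedup = compare(dedupe_slow, dedupe_fast, data)
print(f"speedup: {speedup:.1f}x")
```

A ratio near or below 1 means the change did not help on this input, whatever intuition said.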

Wrap-up + series retrospective #

The toolbox covered in this post:

timeit — small-unit measurement
cProfile + snakeviz — function-level profile
py-spy — running process, low overhead
line_profiler — line level
memray + tracemalloc — memory
Frequent pitfalls — global lookups, string +=, list in, wrong data structures
Data structures: deque, bisect, Counter, heapq, defaultdict
NumPy vectorization, functools.cache, __slots__
Last resorts: Cython, PyO3, mypyc, PyPy, free-threaded CPython
Flow: measure → hot spots → data structures/algorithms → vectorize → cache → extend → measure again

Series retrospective #

With these 7 posts, the Modern Python Advanced toolkit is complete.

Magic methods — hooks where objects meet the language
Descriptors — turning attributes into objects
Metaclasses — classes that make classes (usually you don’t use them)
Async deep dive — event loop, Future/Task, async generator
GIL and concurrency — threading vs multiprocessing vs asyncio + free-threaded
Advanced typing — variance, ParamSpec, TypeIs, overload, Annotated
Performance — measurement tools and optimization patterns

That completes the 21 posts of Modern Python Basics → Intermediate → Advanced. The next series is Modern Python in Practice — Building APIs with FastAPI (6 posts). The place where every tool you’ve sharpened so far comes together in one project.

Setup and start — Hello FastAPI, automatic OpenAPI generation
Routing, Pydantic models, dependency injection
DB integration — SQLAlchemy 2.x + Alembic
Authentication — OAuth2 password flow + JWT
Async and background work
Testing and deployment — pytest, Docker, Railway/Fly
