Chapter 19

GIL and concurrency — threading vs multiprocessing vs asyncio

What the GIL actually is, how to divide work between threading, multiprocessing, and asyncio, and the free-threaded builds of Python 3.13 ~ 3.14 (PEP 703/779), all in one place.

In Chapter 18 async in depth we noted briefly that “CPU-bound problems aren’t solved by asyncio.” That’s the topic of this chapter. It covers the GIL, the division between threading / multiprocessing / asyncio, and the free-threaded build brought by Python 3.13 ~ 3.14.

This chapter pairs with Chapter 21 performance. If Chapter 21 is about “measuring where your code’s bottleneck is,” this chapter is the decision guide for choosing a tool based on whether the bottleneck is CPU or I/O.

GIL — Global Interpreter Lock #

CPython (the standard Python implementation) has a global lock called the GIL. Only one thread can execute Python bytecode at a time.

Why does it exist? #

It exists to make CPython's object reference counting safe. Without a lock, threads updating reference counts concurrently would corrupt the counts and, with them, memory. The GIL was introduced for simplicity, single-thread performance, and C extension compatibility, and it has been kept for over 30 years.

Consequence — CPU-bound multithreading is meaningless #

🚫 Multi-threaded but not faster
import threading

def cpu_heavy(n):
    total = 0
    for i in range(n):
        total += i ** 2
    return total

threads = [threading.Thread(target=cpu_heavy, args=(10_000_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

Running the same job 4 times on 4 threads takes about as long as running it 4 times in a row. Because of the GIL the threads effectively execute serially: even on an 8-core machine, only one core's worth of Python bytecode runs at any moment.

When the GIL is released — I/O #

Luckily, the GIL is released during I/O operations. socket.recv, time.sleep, file reads, DB queries, etc. let other threads proceed concurrently. That’s why I/O-bound multithreading is meaningful, while CPU-bound isn’t.

C extensions like NumPy and Pandas also release the GIL during heavy computations, so numeric work can get some multi-thread benefit. It varies by library.
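A minimal sketch of the effect, using time.sleep as a stand-in for a blocking I/O call. The names fake_io and run_threads and the 0.2-second delay are illustrative assumptions, not part of any library. Four threads that each block for 0.2 s should finish in roughly 0.2 s of wall time rather than 0.8 s, because the waiting threads do not hold the GIL.

```python
import threading
import time

def fake_io(delay):
    # time.sleep releases the GIL, much like a blocking socket or file read
    time.sleep(delay)

def run_threads(count, delay):
    """Run `count` blocking calls in parallel threads and return elapsed seconds."""
    threads = [threading.Thread(target=fake_io, args=(delay,)) for _ in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

elapsed = run_threads(4, 0.2)
print(f"4 threads x 0.2 s of blocking took {elapsed:.2f} s")
```

Swapping fake_io for a pure-Python loop would be expected to remove the overlap on a GIL build, which is the contrast this section describes.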

Division of the three tools #

| Tool | Suitable case | Core utilization |
| --- | --- | --- |
| asyncio | I/O-bound, thousands ~ tens of thousands concurrent | 1 |
| threading | I/O-bound, synchronous libraries, low concurrency | 1 |
| multiprocessing | CPU-bound | N |
| concurrent.futures | Unified interface for both modes | depends on mode |

threading — concurrency for sync code #

threading basics
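import threading

import requests

# Placeholder URLs for illustration; substitute your own
urls = ["https://example.com/a", "https://example.com/b"]

def fetch(url):
    # Blocking call; the GIL is released while waiting on the network
    response = requests.get(url)
    return response.text  # note: threading.Thread discards the return value

threads = []
for url in urls:
    t = threading.Thread(target=fetch, args=(url,))
    t.start()
    threads.append(t)

# Wait for every thread to finish
for t in threads:
    t.join()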

Pros:

  • Use synchronous libraries as-is
  • Minimal change to existing code
  • Fast enough for I/O-bound work

Cons:

  • Going up to thousands of concurrent threads incurs thread costs (memory, context switching)
  • Frequent locks / shared state make debugging hard
  • CPU-bound is meaningless because of the GIL

Lock, RLock, Semaphore, Event #

Protect shared state
import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    with lock:  # only one thread at a time runs the read-modify-write
        counter += 1

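+= is not atomic: it is a load, an add, and a store, and another thread can run between those steps. Without a lock that is a race condition. Async code avoids this particular problem as long as there is no await between the read and the write, because only one coroutine runs at a time. Threads get no such guarantee.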
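To see the lock doing its job, here is a small sketch that hammers one counter from 8 threads. The helper name increment_many and the iteration counts are assumptions chosen for illustration. With the lock in place the final total is exact; without it, the total may or may not come up short depending on Python version and timing, which is what makes this class of bug hard to reproduce.

```python
import threading

counter = 0
lock = threading.Lock()

def increment_many(times):
    global counter
    for _ in range(times):
        # The lock turns load -> add -> store into one critical section
        with lock:
            counter += 1

threads = [threading.Thread(target=increment_many, args=(10_000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 80000: every increment is preserved
```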

concurrent.futures.ThreadPoolExecutor — a more convenient interface #

ThreadPoolExecutor
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))

You rarely create Thread objects directly. ThreadPoolExecutor is the standard tool for this job.
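pool.map re-raises a worker's exception when its result is reached during iteration, so the remaining results are not collected. When each job can fail independently, submit plus as_completed is a common alternative. The sketch below assumes a hypothetical fake_fetch that sleeps instead of doing network I/O, and example.com URLs that are never contacted.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_fetch(url):
    # Hypothetical stand-in for requests.get(url).text: sleeps instead of using the network
    time.sleep(0.05)
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return f"body of {url}"

urls = ["https://example.com/a", "https://example.com/bad", "https://example.com/c"]
results, errors = {}, {}

with ThreadPoolExecutor(max_workers=3) as pool:
    # Map each future back to its URL so results can be attributed
    future_to_url = {pool.submit(fake_fetch, url): url for url in urls}
    for future in as_completed(future_to_url):
        url = future_to_url[future]
        try:
            results[url] = future.result()  # re-raises the worker's exception, if any
        except ValueError as exc:
            errors[url] = str(exc)

print(sorted(results))
print(errors)
```

One failing URL ends up in errors while the other two still produce results.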

multiprocessing — real parallelism #

CPU-bound work goes to separate processes. Each process has its own GIL, so real parallel execution happens.

multiprocessing basics
from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n):
    return sum(i ** 2 for i in range(n))

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(cpu_heavy, [10**7] * 8))

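On a machine with at least 8 cores, the 8 processes run on 8 cores at once, and the speed-up over serial execution can approach 8x, minus process start-up and result-transfer overhead.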

Cost — process creation and IPC #

  • Process creation is more expensive than threads
  • Data transfer goes through serialization / deserialization (pickle) — large data transfer is expensive
  • Debugging becomes harder (process isolation)

So batch up heavy computation before sending it. Sending small jobs frequently makes IPC cost exceed computation.
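One way to batch, sketched under the assumption that the job is a sum over a numeric range: send each worker a (start, stop) pair instead of individual numbers, so each IPC message carries two integers and triggers a large amount of work. The helper names make_ranges, sum_squares_range, and parallel_sum_squares are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor

def make_ranges(n, batches):
    """Split range(n) into at most `batches` contiguous (start, stop) pairs."""
    step = max(1, -(-n // batches))  # ceiling division, never zero
    return [(start, min(start + step, n)) for start in range(0, n, step)]

def sum_squares_range(bounds):
    # Each task receives two integers and does a large chunk of work with them
    start, stop = bounds
    return sum(i ** 2 for i in range(start, stop))

def parallel_sum_squares(n, workers=4):
    """Sum i**2 for i in range(n) using one coarse task per worker process."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_squares_range, make_ranges(n, workers)))
```

Call parallel_sum_squares(10**7) from inside an if __name__ == "__main__": guard. When the inputs are many small items rather than a range, the chunksize argument of ProcessPoolExecutor.map groups them into fewer messages in a similar spirit.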

if __name__ == "__main__": — required #

Entry-point guard required
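from concurrent.futures import ProcessPoolExecutor

def worker(x): ...

if __name__ == "__main__":
    # Only the parent process runs this block; children just import the module
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(worker, [1, 2, 3]))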

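When child processes are created with the spawn or forkserver start method (spawn is the default on Windows and macOS), multiprocessing re-imports the main module in each child. Without the guard, a top-level call like pool.map(...) would run again in every child and try to start more processes. In practice CPython usually aborts with an error rather than recursing forever, but either way the program does not work. The guard is required for portable code.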

Shared state — Queue, Manager, shared_memory #

Sharing data between processes is tricky.

Shared queue
from multiprocessing import Queue, Process

def worker(q):
    q.put("hello from worker")

if __name__ == "__main__":
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    print(q.get())   # hello from worker
    p.join()

  • multiprocessing.Queue: process-safe queue
  • multiprocessing.Manager: share dict / list via proxies
  • multiprocessing.shared_memory (3.8+): share large numpy arrays etc. without copy
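A minimal shared_memory sketch using only the standard library. To keep it self-contained, both the "producer" and the "consumer" attachment happen in one process here; in real use the second SharedMemory(name=...) call would run in a child process that was handed the block's name. The 16-byte size and the b"hello" payload are arbitrary assumptions.

```python
from multiprocessing import shared_memory

# Producer side: create a named block and write into it
shm = shared_memory.SharedMemory(create=True, size=16)
shm.buf[:5] = b"hello"

# Consumer side: attach to the same block by name.
# In real use this runs in another process that was given shm.name.
view = shared_memory.SharedMemory(name=shm.name)
received = bytes(view.buf[:5])
print(received)  # b'hello'

# Every attachment closes its own handle; exactly one side unlinks the block
view.close()
shm.close()
shm.unlink()
```

Only the block's name crosses the process boundary, so the payload itself is never pickled.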

asyncio — once more #

The async from Chapter 14 / Chapter 18 is single-thread + cooperative yield.

| Item | asyncio |
| --- | --- |
| Strength | tens of thousands concurrent, low memory, explicit yield points |
| Weakness | needs async libraries, no effect on CPU-bound work |
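As a rough illustration of the strength row, the sketch below runs 1000 simulated requests on one thread. fake_request is a hypothetical stand-in that awaits asyncio.sleep instead of touching the network; the 0.1-second delay is an assumption.

```python
import asyncio
import time

async def fake_request(i):
    # Explicit yield point: the event loop runs other coroutines during the wait
    await asyncio.sleep(0.1)
    return i

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_request(i) for i in range(1000)))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(len(results), f"{elapsed:.2f} s")
```

All 1000 waits overlap, so the total is close to a single 0.1 s wait rather than 100 s, and gather returns the results in the order the coroutines were passed in.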

Mixing asyncio and threading #

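Use asyncio.to_thread (Python 3.9+) to call a blocking sync function from async code without freezing the event loop. The function runs in a worker thread while the coroutine awaits its result.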

Mixing
import asyncio

import requests

async def fetch(url):
    # requests.get blocks, so run it in a worker thread and await the result
    return await asyncio.to_thread(requests.get, url)
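A network-free sketch of why this matters, assuming a hypothetical blocking_call that sleeps for 0.3 s in place of a real synchronous library call. A ticker coroutine keeps counting while the blocking call runs in a worker thread, which it could not do if the call ran directly on the event loop.

```python
import asyncio
import time

def blocking_call():
    # Hypothetical stand-in for a synchronous library call such as requests.get
    time.sleep(0.3)
    return "done"

async def ticker(ticks):
    # Keeps ticking only while the event loop is free to schedule it
    while True:
        await asyncio.sleep(0.05)
        ticks.append(1)

async def main():
    ticks = []
    tick_task = asyncio.create_task(ticker(ticks))
    result = await asyncio.to_thread(blocking_call)
    tick_task.cancel()
    return result, len(ticks)

result, tick_count = asyncio.run(main())
print(result, tick_count)
```

Replacing the to_thread line with a direct blocking_call() would be expected to leave tick_count at 0, since the loop never gets a turn.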

Mixing asyncio and multiprocessing #

async + process pool
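import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n):
    return sum(i ** 2 for i in range(n))

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        result = await loop.run_in_executor(pool, cpu_heavy, 10_000_000)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())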

Send CPU-bound work to a separate process, await the result asynchronously.

Free-threaded — the big change in Python 3.13 ~ 3.14 #

PEP 703 introduced a build without the GIL. Experimental in 3.13, promoted to officially supported in 3.14 via PEP 779.

What changes #

  • The GIL goes away, so CPU-bound multithreading can actually run in parallel
  • Existing code that already splits its work across threads can scale with core count; code that was never threaded does not speed up by itself
  • A new concurrency model (single process + real multi-threading) becomes possible

Cost — single-thread performance #

Part of the GIL's appeal was that its simplicity kept single-threaded code fast. Removing it costs some single-thread performance: as of 3.14, roughly 5 ~ 10%, depending on platform and compiler. The overhead is expected to keep shrinking over time.

How to use it #

Using the free-threaded build
# install the free-threaded build with uv
uv python install 3.14t
# (t stands for free-threaded build)

# make the project use it
uv init my-app --python 3.14t
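Once installed, it can be useful to confirm from inside Python which build is running and whether the GIL is actually off. The sketch below uses sysconfig's Py_GIL_DISABLED build variable and sys._is_gil_enabled(), which is available from 3.13; the helper name free_threading_status is an assumption. On a free-threaded build the GIL can still be switched back on at runtime, for example when an incompatible extension module is imported, so the two values are reported separately.

```python
import sys
import sysconfig

def free_threading_status():
    """Return (build_supports_free_threading, gil_currently_enabled)."""
    # Py_GIL_DISABLED is 1 on free-threaded builds, 0 or None otherwise
    build_flag = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on 3.13+; older versions always have the GIL
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    return build_flag, gil_enabled

build_flag, gil_enabled = free_threading_status()
print(f"free-threaded build: {build_flag}, GIL enabled right now: {gil_enabled}")
```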

Library compatibility #

Many C extension libraries have code that assumes the GIL, so compatibility migration is underway. NumPy, PyTorch, Pillow and other major libraries have either completed or are in the middle of free-threaded compatibility work. For new projects it’s worth trying free-threaded from the start, but projects with legacy dependencies must verify compatibility.

sub-interpreter — PEP 734 (3.14) #

Another direction for avoiding the GIL. A model where multiple interpreters live inside one process, each with its own GIL.

sub-interpreter (3.14+)
from concurrent.futures import InterpreterPoolExecutor

with InterpreterPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(cpu_heavy, [10**7] * 4))

A middle ground — lighter than multiprocessing and safer in compatibility than free-threaded. New as of 3.14, but likely to settle as an important option for CPU-bound concurrency.

Decision guide — one table #

| Job | First try |
| --- | --- |
| 1000 HTTP requests concurrently | asyncio (httpx) |
| 50 HTTP requests, sync library in use | ThreadPoolExecutor |
| Heavy numeric computation across 8 cores | ProcessPoolExecutor |
| Heavy numpy computation that releases the GIL | ThreadPoolExecutor (or numpy’s own parallelism) |
| New project, simultaneous CPU + I/O usage | 3.14 free-threaded + threading |
| Stable isolation, large shared data | multiprocessing + shared_memory |
| Multiple CPU-bound jobs in a single process | InterpreterPoolExecutor (3.14+) |

Common traps #

1) Mixing time.sleep and asyncio.sleep #

🚫 Sync sleep inside async
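import asyncio
import time

async def fetch(url):
    time.sleep(1)    # blocks the event loop thread: no other coroutine runs for 1 s
    ...

Seen in Chapter 14. time.sleep does release the GIL, but the event loop lives on this one thread, so every coroutine stalls anyway. Inside a coroutine, always use await asyncio.sleep(1) instead.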

2) print isn’t thread-safe #

thread + print
def worker(i):
    print(f"start {i}")
    do_work()
    print(f"end {i}")

When multiple threads print concurrently, lines may interleave. The logging module is thread-safe — use that. Chapter 31 logging and observability covers production logging setup.
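A small sketch of the logging alternative. It writes to an in-memory io.StringIO so the output can be inspected; in a real program the handler would point at stderr or a file. The logger name worker_demo and the thread names are hypothetical.

```python
import io
import logging
import threading

stream = io.StringIO()  # stands in for a log file or stderr
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(threadName)s %(message)s"))

logger = logging.getLogger("worker_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

def worker(i):
    logger.info("start %d", i)
    logger.info("end %d", i)

threads = [threading.Thread(target=worker, args=(i,), name=f"t{i}") for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

lines = stream.getvalue().splitlines()
print(len(lines))  # 8 complete lines, one per log call
```

The order of lines across threads still varies from run to run, but each line arrives whole because the handler serializes writes with its own lock.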

3) Frequently sending NumPy arrays via multiprocessing #

🚫 Pickling a big array every time
with ProcessPoolExecutor() as pool:
    pool.map(process, [big_numpy_array] * 100)

Each task pickles the array, sends it to the child, and the child unpickles it. For large arrays this IPC cost can exceed the computation itself. Consider shared_memory or numpy’s own parallelism (BLAS / LAPACK multi-threading).
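A rough way to see the per-task cost is to compare pickled sizes. The sketch below uses a plain list of a million integers as a stand-in for a large array, and a (start, stop) tuple as a hypothetical descriptor that tells a worker which slice of already-shared data to process.

```python
import pickle

big_data = list(range(1_000_000))  # stand-in for a large array
descriptor = (0, 1_000_000)        # "process items 0..1_000_000" as two integers

# Bytes that would cross the process boundary for one task
big_payload = len(pickle.dumps(big_data))
small_payload = len(pickle.dumps(descriptor))

print(f"full data:  {big_payload:,} bytes per task")
print(f"descriptor: {small_payload} bytes per task")
```

Multiply the first number by 100 tasks and the transfer alone is hundreds of megabytes, while the descriptor stays at a few dozen bytes per task.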

4) Deadlock #

Acquiring multiple locks in different orders causes deadlock.

🚫 Deadlock
def t1():
    with lock_a:
        with lock_b:    # deadlock if t2 holds b and waits for a
            ...

def t2():
    with lock_b:
        with lock_a:
            ...

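Rule: keep lock acquisition order consistent across all code. threading.RLock does not help here: it only lets the same thread re-acquire a lock it already holds, and two threads taking two locks in opposite orders still deadlock.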
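One way to enforce a consistent order without relying on every caller's discipline is to sort the locks before acquiring them. This is a sketch of that idea, not the only approach; the balances dictionary, transfer, and worker are hypothetical names, and sorting by id() is an assumption that works because the lock objects live for the whole program.

```python
import threading
from contextlib import ExitStack

lock_a = threading.Lock()
lock_b = threading.Lock()
balances = {"a": 100, "b": 100}

def transfer(src, dst, locks, amount):
    with ExitStack() as stack:
        # Sort by id() so every thread acquires in the same global order,
        # no matter which order the caller listed the locks in
        for lock in sorted(locks, key=id):
            stack.enter_context(lock)
        balances[src] -= amount
        balances[dst] += amount

def worker(src, dst, locks):
    for _ in range(1000):
        transfer(src, dst, locks, 1)

# The two threads name the locks in opposite orders
t1 = threading.Thread(target=worker, args=("a", "b", (lock_a, lock_b)))
t2 = threading.Thread(target=worker, args=("b", "a", (lock_b, lock_a)))
for t in (t1, t2):
    t.start()
for t in (t1, t2):
    t.join()

print(balances)  # {'a': 100, 'b': 100}
```

ExitStack releases the locks in reverse order when the with block exits, including when an exception is raised inside it.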

In practice — how to start #

Recommended steps
1. Get it working in synchronous code first
2. Measure where the bottleneck is (covered in Chapter 21 performance)
3. Bottleneck I/O-bound → asyncio or ThreadPoolExecutor
4. Bottleneck CPU-bound → ProcessPoolExecutor or free-threaded
5. If none of the above works, swap the library itself for a faster one (Cython, Rust extension)

The warning against premature optimization applies here too. If synchronous code is fast enough, skipping concurrency is simpler and safer.
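For the measuring step, a quick first check (not a substitute for the profiling in Chapter 21) is to compare CPU time with wall-clock time for the suspect function. A ratio near 1 suggests CPU-bound work; a ratio near 0 suggests the code is mostly waiting. The helper cpu_fraction and the two sample workloads are assumptions for illustration.

```python
import time

def cpu_fraction(func, *args):
    """Return CPU time divided by wall time for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of the whole process, all threads
    func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall

def waits(seconds):
    time.sleep(seconds)  # stand-in for network or disk waits

def computes(n):
    return sum(i ** 2 for i in range(n))

io_ratio = cpu_fraction(waits, 0.2)
cpu_ratio = cpu_fraction(computes, 2_000_000)
print(f"waits: {io_ratio:.2f}, computes: {cpu_ratio:.2f}")
```

A low ratio points toward asyncio or ThreadPoolExecutor; a high ratio points toward ProcessPoolExecutor or a free-threaded build.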

Exercises #

  1. Apply a cpu_heavy(n) function (e.g., sum(i ** 2 for i in range(n))) to the same 8 inputs in three ways: (1) synchronous serial, (2) ThreadPoolExecutor(8), (3) ProcessPoolExecutor(8), and measure times. The effect of the GIL becomes visible directly.
  2. Fetch 100 URLs with the synchronous requests.get(url) library in two ways: (1) synchronous serial, (2) ThreadPoolExecutor(20), and compare. Verify that threads are effective for I/O-bound work.
  3. Write the same job with httpx.AsyncClient + asyncio.gather and compare with (2) above. As concurrency goes up to 1000, hypothesize which model holds up better and measure.

In one line: The GIL is CPython’s global lock — one thread at a time runs bytecode. The GIL is released during I/O so threading is effective; CPU-bound goes to multiprocessing / 3.14 free-threaded. asyncio is single-thread, tens of thousands concurrent, can’t solve CPU-bound. Division: ThreadPoolExecutor (I/O sync) / ProcessPoolExecutor (CPU) / asyncio (large-scale I/O async) / 3.14+ free-threaded (CPU concurrency, watch compatibility). Start sync → measure → pick by bottleneck.

Next chapter #

Next, Chapter 20 advanced typing picks up where Chapter 9 typing left off: Variance, ParamSpec, Self, TypeGuard / TypeIs, overload, and the other harder parts of typing.
