GIL and concurrency — threading vs multiprocessing vs asyncio
What the GIL is, how threading, multiprocessing, and asyncio divide the work, and the free-threaded builds of Python 3.13 and 3.14 (PEP 703 / PEP 779), all in one place.
In Chapter 18 async in depth we noted briefly that “CPU-bound problems aren’t solved by asyncio.” That’s the topic of this chapter. It covers the GIL, the division between threading / multiprocessing / asyncio, and the free-threaded build brought by Python 3.13 ~ 3.14.
This chapter pairs with Chapter 21 performance. If Chapter 21 is about “measuring where your code’s bottleneck is,” this chapter is the decision guide for choosing a tool based on whether the bottleneck is CPU or I/O.
GIL — Global Interpreter Lock #
CPython (the standard Python implementation) has a global lock called the GIL. Only one thread can execute Python bytecode at a time.
Why does it exist? #
It keeps object reference counting inside CPython safe. Without a lock, threads updating the same object’s reference count at the same time could corrupt the count and break memory management. The GIL was introduced for its simplicity, single-thread performance, and C extension compatibility, and it has been kept for over 30 years.
Consequence — CPU-bound multithreading is meaningless #
    import threading

    def cpu_heavy(n):
        total = 0
        for i in range(n):
            total += i ** 2
        return total

    threads = [threading.Thread(target=cpu_heavy, args=(10_000_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Doing the same job 4 times with 4 threads doesn’t make it 4 times faster. Because of the GIL it runs effectively serially. Even with 8 CPU cores, only 1 is used.
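One way to check this on your own machine is to time the same workload serially and with threads. The sketch below assumes a standard (GIL-enabled) CPython build; the helper names run_serial, run_threaded, and timed are illustrative, and the exact numbers depend on the hardware.

```python
import threading
import time


def cpu_heavy(n):
    """Pure-Python CPU-bound loop: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i ** 2
    return total


def run_serial(n, jobs):
    """Run the job `jobs` times, one after another."""
    return [cpu_heavy(n) for _ in range(jobs)]


def run_threaded(n, jobs):
    """Run the job `jobs` times, one thread per job."""
    results = [None] * jobs

    def work(slot):
        results[slot] = cpu_heavy(n)

    threads = [threading.Thread(target=work, args=(i,)) for i in range(jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, serial = timed(run_serial, 500_000, 4)
    _, threaded = timed(run_threaded, 500_000, 4)
    print(f"serial:   {serial:.2f}s")
    print(f"threaded: {threaded:.2f}s")
```

On a GIL-enabled build the two timings should come out close to each other, which is the point: the threads take turns instead of running in parallel.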
When the GIL is released — I/O #
Luckily, the GIL is released during I/O operations. socket.recv, time.sleep, file reads, DB queries, etc. let other threads proceed concurrently. That’s why I/O-bound multithreading is meaningful, while CPU-bound isn’t.
C extensions like NumPy and Pandas also release the GIL during heavy computations, so numeric work can get some multi-thread benefit. It varies by library.
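The effect is easy to observe with time.sleep standing in for a blocking I/O call. This is a minimal sketch; fake_io and run_in_threads are hypothetical names, and real network or disk calls behave the same way only to the extent that they release the GIL while waiting.

```python
import threading
import time


def fake_io(delay):
    """Stand-in for a blocking I/O call; time.sleep releases the GIL."""
    time.sleep(delay)


def run_in_threads(delay, count):
    """Start `count` threads that each wait `delay` seconds; return wall time."""
    threads = [threading.Thread(target=fake_io, args=(delay,)) for _ in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


elapsed = run_in_threads(0.2, 4)
print(f"4 waits of 0.2s took {elapsed:.2f}s in total")
```

Four waits of 0.2 seconds finish in roughly 0.2 seconds of wall time rather than 0.8, because the waits overlap.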
Division of the three tools #
| Tool | Suitable case | Core utilization |
|---|---|---|
| asyncio | I/O-bound, thousands ~ tens of thousands concurrent | 1 |
| threading | I/O-bound, synchronous libraries, low concurrency | 1 |
| multiprocessing | CPU-bound | N |
| concurrent.futures | Unified interface for both modes | depends on mode |
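threading: concurrency for sync code #

    import threading

    import requests

    # Hypothetical URL list for illustration; substitute your own.
    urls = ["https://example.com/a", "https://example.com/b"]

    def fetch(url):
        response = requests.get(url)
        return response.text

    threads = []
    for url in urls:
        # Note: Thread discards fetch's return value.
        t = threading.Thread(target=fetch, args=(url,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

Pros: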
- Use synchronous libraries as-is
- Minimal change to existing code
- Fast enough for I/O-bound work
Cons:
- Going up to thousands of concurrent threads incurs thread costs (memory, context switching)
- Frequent locks / shared state make debugging hard
- CPU-bound is meaningless because of the GIL
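Lock, RLock, Semaphore, Event #

    import threading

    counter = 0
    lock = threading.Lock()

    def increment():
        global counter
        with lock:
            counter += 1

+= is not atomic (load → add → store). Without a lock, two threads can interleave those steps and lose an update: a race condition. Async code avoids this particular problem as long as no await sits in the middle of the update, because only one coroutine runs at a time and switches happen only at await points. Threads can be switched at any moment.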
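To see the lock doing its job, the same idea can be wrapped in a small class and hammered from several threads. A minimal sketch; the Counter class and the hammer helper are illustrative names, not part of the standard library.

```python
import threading


class Counter:
    """Counter whose increments are protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The lock makes load -> add -> store one indivisible step.
        with self._lock:
            self.value += 1


def hammer(counter, threads=8, per_thread=10_000):
    """Increment `counter` from several threads and return the final value."""

    def work():
        for _ in range(per_thread):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.value


print(hammer(Counter()))  # 80000
```

With the lock in place the final value always equals threads × per_thread. Removing the with block may or may not produce a visibly wrong total on a given interpreter version, which is exactly what makes such bugs hard to catch.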
concurrent.futures.ThreadPoolExecutor: a more convenient interface #

    from concurrent.futures import ThreadPoolExecutor

    with ThreadPoolExecutor(max_workers=10) as pool:
        results = list(pool.map(fetch, urls))

You rarely create Thread objects directly. ThreadPoolExecutor is the standard tool for this job.
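When individual tasks can fail, submit plus as_completed gives per-task error handling, whereas iterating over pool.map re-raises the first exception it reaches. The sketch below uses a hypothetical slow_square function in place of a real network call so that it runs anywhere.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_square(x):
    """Hypothetical stand-in for a blocking call; fails for negative input."""
    time.sleep(0.05)
    if x < 0:
        raise ValueError(f"negative input: {x}")
    return x * x


def run_all(inputs, max_workers=4):
    """Return (results, errors) dicts keyed by input value."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the input that produced it.
        futures = {pool.submit(slow_square, x): x for x in inputs}
        for future in as_completed(futures):
            x = futures[future]
            try:
                results[x] = future.result()
            except ValueError as exc:
                errors[x] = str(exc)
    return results, errors


results, errors = run_all([1, 2, -3, 4])
print(results, errors)
```

One failing input ends up in errors while the other results are still collected.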
multiprocessing: real parallelism #

CPU-bound work goes to separate processes. Each process has its own GIL, so real parallel execution happens.

    from concurrent.futures import ProcessPoolExecutor

    def cpu_heavy(n):
        return sum(i ** 2 for i in range(n))

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=8) as pool:
            results = list(pool.map(cpu_heavy, [10**7] * 8))

8 processes use 8 cores concurrently. On a machine with at least 8 free cores, the speed-up approaches 8x, minus process start-up and IPC overhead.
Cost — process creation and IPC #
- Process creation is more expensive than threads
- Data transfer goes through serialization / deserialization (pickle) — large data transfer is expensive
- Debugging becomes harder (process isolation)
So batch up heavy computation before sending it. Sending small jobs frequently makes IPC cost exceed computation.
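The batching advice can be sketched as follows: instead of one task per item, each task carries a whole slice of the input. The helper names chunked, sum_of_squares, and parallel_sum_of_squares, as well as the batch size, are assumptions made for this example.

```python
from concurrent.futures import ProcessPoolExecutor


def chunked(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def sum_of_squares(batch):
    """One task handles one whole batch, so pickling is paid once per batch."""
    return sum(x * x for x in batch)


def parallel_sum_of_squares(items, batch_size=50_000, workers=4):
    """Send a few large batches to worker processes and combine the partial sums."""
    batches = chunked(items, batch_size)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, batches))


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(200_000))))
```

For simple cases, ProcessPoolExecutor.map also accepts a chunksize argument that groups items into batches for you.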
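if __name__ == "__main__": required #

    from concurrent.futures import ProcessPoolExecutor

    def worker(x): ...

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:
            pool.map(worker, [1, 2, 3])

With the spawn start method (the default on Windows and macOS), multiprocessing re-imports the main module when creating child processes. If calls like pool.map(...) also ran in the child, each child would try to start children of its own; CPython detects this and raises a RuntimeError instead of recursing forever. The guard is required for portable code.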
Shared state: Queue, Manager, shared_memory #

Sharing data between processes is tricky.

    from multiprocessing import Queue, Process

    def worker(q):
        q.put("hello from worker")

    if __name__ == "__main__":
        q = Queue()
        p = Process(target=worker, args=(q,))
        p.start()
        print(q.get())  # hello from worker
        p.join()

- multiprocessing.Queue: process-safe queue
- multiprocessing.Manager: share dict / list via proxies
- multiprocessing.shared_memory (3.8+): share large numpy arrays etc. without copy
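As a rough illustration of shared_memory, the sketch below creates a named block and attaches to it a second time by name, which is what a worker process would do. To stay self-contained it uses plain bytes and a single process; the payload is made up for the example.

```python
from multiprocessing import shared_memory

payload = b"hello from shared memory"

# Create a named block and copy the payload into it.
owner = shared_memory.SharedMemory(create=True, size=len(payload))
owner.buf[:len(payload)] = payload

# Attach to the same block by name, as a worker process would.
reader = shared_memory.SharedMemory(name=owner.name)
seen = bytes(reader.buf[:len(payload)])
print(seen)

# Every handle must be closed; the creator also unlinks the block.
reader.close()
owner.close()
owner.unlink()
```

Only the block’s name has to travel between processes; the data itself is never pickled. The price is manual lifetime management: forgetting close or unlink leaks the block.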
asyncio: once more #

The async model from Chapter 14 / Chapter 18 runs on a single thread and relies on cooperative yielding.
| Item | asyncio |
|---|---|
| Strength | tens of thousands concurrent, low memory, explicit yield points |
| Weakness | needs async libraries, no effect on CPU-bound work |
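A small experiment makes the single-thread claim concrete: many coroutines wait concurrently, yet all of them report the same OS thread. A minimal sketch; job and main are illustrative names.

```python
import asyncio
import threading


async def job(i, delay):
    """Wait without blocking, then report which OS thread ran this coroutine."""
    await asyncio.sleep(delay)  # explicit yield point
    return i, threading.get_ident()


async def main(count=100, delay=0.1):
    """Run `count` jobs concurrently on the event loop."""
    return await asyncio.gather(*(job(i, delay) for i in range(count)))


results = asyncio.run(main())
thread_ids = {tid for _, tid in results}
print(len(results), "jobs ran on", len(thread_ids), "thread(s)")
```

The 100 waits overlap, so the whole run takes about as long as one wait, and no additional threads are involved.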
Mixing asyncio and threading #
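Use asyncio.to_thread (Python 3.9+) to call a blocking sync function from async code without blocking the event loop.

    import asyncio

    import requests

    async def fetch(url):
        # requests.get blocks, so run it in a worker thread
        return await asyncio.to_thread(requests.get, url)

Mixing asyncio and multiprocessing #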
    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    async def main():
        loop = asyncio.get_running_loop()
        with ProcessPoolExecutor() as pool:
            # cpu_heavy is the module-level function from the multiprocessing section
            result = await loop.run_in_executor(pool, cpu_heavy, 10_000_000)
        return result

Send CPU-bound work to a separate process and await the result without blocking the event loop.
Free-threaded — the big change in Python 3.13 ~ 3.14 #
PEP 703 introduced a build without the GIL. Experimental in 3.13, promoted to officially supported in 3.14 via PEP 779.
What changes #
- The GIL goes away — CPU-bound multithreading actually works
- Existing code that already splits its work across threads can scale with the number of cores, as long as the threads do not contend heavily on shared state
- A new concurrency model (single process + real multi-threading) becomes possible
Cost — single-thread performance #
The GIL’s simplicity also helped single-thread performance, so removing it costs some of that performance: about 5 ~ 10% as of 3.14. The overhead is expected to keep shrinking over time.
How to use it #
    # install the free-threaded build with uv
    # (the t suffix stands for the free-threaded build)
    uv python install 3.14t

    # make the project use it
    uv init my-app --python 3.14t

Library compatibility #
Many C extension libraries have code that assumes the GIL, so compatibility migration is underway. NumPy, PyTorch, Pillow and other major libraries have either completed or are in the middle of free-threaded compatibility work. For new projects it’s worth trying free-threaded from the start, but projects with legacy dependencies must verify compatibility.
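One practical check while verifying compatibility is to ask the running interpreter whether it is a free-threaded build and whether the GIL is actually off. CPython’s documentation describes that importing a C extension not marked as free-threading-safe can re-enable the GIL at runtime, so the two answers can differ. The sketch below assumes CPython; gil_status is a hypothetical helper, and sys._is_gil_enabled only exists on 3.13 and later, hence the fallback.

```python
import sys
import sysconfig


def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is active."""
    # Py_GIL_DISABLED is 1 on free-threaded builds, otherwise 0 or None.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on 3.13+; older versions always have a GIL.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = probe() if probe is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}


print(gil_status())
```

Calling this after importing the project’s dependencies shows whether any of them switched the GIL back on.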
sub-interpreter — PEP 734 (3.14) #
Another direction for avoiding the GIL. A model where multiple interpreters live inside one process, each with its own GIL.
    from concurrent.futures import InterpreterPoolExecutor

    # cpu_heavy is the function from the multiprocessing section
    with InterpreterPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(cpu_heavy, [10**7] * 4))

A middle ground: lighter than multiprocessing and safer in compatibility than free-threaded. New as of 3.14, but likely to settle as an important option for CPU-bound concurrency.
Decision guide — one table #
| Job | First try |
|---|---|
| 1000 HTTP requests concurrently | asyncio (httpx) |
| 50 HTTP requests, sync library in use | ThreadPoolExecutor |
| Heavy numeric computation across 8 cores | ProcessPoolExecutor |
| Heavy numpy computation that releases the GIL | ThreadPoolExecutor (or numpy’s own parallelism) |
| New project, simultaneous CPU + I/O usage | 3.14 free-threaded + threading |
| Stable isolation, large shared data | multiprocessing + shared_memory |
| Multiple CPU-bound jobs in a single process | InterpreterPoolExecutor (3.14+) |
Common traps #
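1) Mixing time.sleep and asyncio.sleep #

    import time

    async def fetch(url):
        time.sleep(1)  # blocks the event loop thread: no other coroutine runs for 1 second
        ...

Seen in Chapter 14. time.sleep does release the GIL, but the event loop lives on the sleeping thread, so everything scheduled on it stalls. Inside a coroutine, always use await asyncio.sleep(...).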
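The difference shows up clearly when two tasks run side by side. In this sketch, blocking_task, cooperative_task, and run_pair are illustrative names; the 0.2 second delay is arbitrary.

```python
import asyncio
import time


async def blocking_task(delay):
    time.sleep(delay)  # blocks the whole event loop thread


async def cooperative_task(delay):
    await asyncio.sleep(delay)  # yields to the event loop while waiting


async def run_pair(task, delay):
    """Run two copies of `task` concurrently and return the wall time."""
    start = time.perf_counter()
    await asyncio.gather(task(delay), task(delay))
    return time.perf_counter() - start


blocking = asyncio.run(run_pair(blocking_task, 0.2))
cooperative = asyncio.run(run_pair(cooperative_task, 0.2))
print(f"time.sleep:    {blocking:.2f}s")
print(f"asyncio.sleep: {cooperative:.2f}s")
```

The time.sleep pair takes about twice the delay because the two waits run back to back; the asyncio.sleep pair takes about one delay because the waits overlap.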
2) print isn’t thread-safe #
    def worker(i):
        print(f"start {i}")
        do_work()
        print(f"end {i}")

When multiple threads print concurrently, lines may interleave. The logging module is thread-safe, so use that. Chapter 31 logging and observability covers production logging setup.
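A rough sketch of the logging alternative: several threads log through one logger, and every record arrives whole. ListHandler is a hypothetical handler that collects lines in memory so the result can be inspected; in real code a StreamHandler or FileHandler would take its place.

```python
import logging
import threading


class ListHandler(logging.Handler):
    """Hypothetical handler that keeps formatted records in a list."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        # Handler.handle() calls emit() while holding the handler's lock.
        self.lines.append(self.format(record))


def run_workers(count):
    """Log a start and an end message from `count` threads; return all lines."""
    logger = logging.getLogger("worker-demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = ListHandler()
    handler.setFormatter(logging.Formatter("%(threadName)s %(message)s"))
    logger.addHandler(handler)

    def worker(i):
        logger.info("start %d", i)
        logger.info("end %d", i)

    threads = [
        threading.Thread(target=worker, args=(i,), name=f"w{i}")
        for i in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    logger.removeHandler(handler)
    return handler.lines


for line in run_workers(3):
    print(line)
```

The order of lines across threads still varies from run to run, but each line is complete and carries its thread name.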
3) Frequently sending NumPy arrays via multiprocessing #
    # process and big_numpy_array are placeholders for your own function and data
    with ProcessPoolExecutor() as pool:
        pool.map(process, [big_numpy_array] * 100)

Every task pickles the array, sends it to a child, and the child unpickles it, so IPC cost can exceed computation cost. Consider shared_memory or numpy’s own parallelism (BLAS / LAPACK multi-threading).
4) Deadlock #
Acquiring multiple locks in different orders causes deadlock.
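    import threading

    lock_a = threading.Lock()
    lock_b = threading.Lock()

    def t1():
        with lock_a:
            with lock_b:  # deadlock if t2 holds b and waits for a
                ...

    def t2():
        with lock_b:
            with lock_a:
                ...

Rule: keep lock acquisition order consistent across all code. threading.RLock does not fix this case: a re-entrant lock only lets the same thread re-acquire a lock it already holds, and it does nothing for two threads waiting on each other’s locks.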
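One common way to enforce a consistent order is to sort the locks by some stable key before acquiring them. The sketch below assumes two distinct accounts per transfer and uses id() as the ordering key; Account, transfer, and run_transfers are made-up names for illustration.

```python
import threading


class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    """Move money between two accounts, always locking in one global order."""
    if src is dst:
        return  # nothing to move, and locking the same Lock twice would hang
    # Sorting by id() gives every thread the same acquisition order,
    # no matter which account is the source.
    first, second = sorted((src, dst), key=id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount


def run_transfers(rounds=10_000):
    """Transfer in both directions at once; return the final balances."""
    a, b = Account(1000), Account(1000)

    def a_to_b():
        for _ in range(rounds):
            transfer(a, b, 1)

    def b_to_a():
        for _ in range(rounds):
            transfer(b, a, 1)

    threads = [
        threading.Thread(target=a_to_b, daemon=True),
        threading.Thread(target=b_to_a, daemon=True),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=30)
    return a.balance, b.balance


print(run_transfers())  # (1000, 1000)
```

Because both directions lock the same account first, neither thread can hold one lock while waiting for the other.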
In practice — how to start #
1. Get it working in synchronous code first
2. Measure where the bottleneck is (covered in Chapter 21 [performance](./performance/))
3. Bottleneck I/O-bound → asyncio or ThreadPoolExecutor
4. Bottleneck CPU-bound → ProcessPoolExecutor or free-threaded
5. If none of the above works, swap the library itself for a faster one (Cython, Rust extension)

The warning against premature optimization applies here too. If synchronous code is fast enough, avoiding concurrency is simpler and safer.
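For step 2, a quick first look before reaching for a full profiler is to compare wall-clock time with CPU time for a single call: if the two are close, the code is mostly computing; if CPU time is a small fraction, it is mostly waiting. This is a rough heuristic only. profile_call and classify are hypothetical helpers, and the 0.5 threshold is an arbitrary assumption.

```python
import time


def profile_call(func, *args):
    """Return (wall_seconds, cpu_seconds) for one call of func(*args)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def classify(wall, cpu, threshold=0.5):
    """Rough label: mostly computing or mostly waiting."""
    if wall <= 0:
        return "unknown"
    return "cpu-bound" if cpu / wall >= threshold else "io-bound"


def busy(n):
    """CPU-bound stand-in."""
    sum(i * i for i in range(n))


def waiting(seconds):
    """I/O-bound stand-in; sleeping uses almost no CPU time."""
    time.sleep(seconds)


print(classify(*profile_call(busy, 500_000)))
print(classify(*profile_call(waiting, 0.2)))
```

time.process_time counts CPU time for the whole process, so the ratio is only meaningful when the measured call is the main thing running.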
Exercises #
- Apply a cpu_heavy(n) function (e.g., sum(i ** 2 for i in range(n))) to the same 8 inputs in three ways: (1) synchronous serial, (2) ThreadPoolExecutor(8), (3) ProcessPoolExecutor(8), and measure times. The effect of the GIL becomes visible directly.
- Fetch 100 URLs with the synchronous requests.get(url) library in two ways: (1) synchronous serial, (2) ThreadPoolExecutor(20), and compare. Verify that threads are effective for I/O-bound work.
- Write the same job with httpx.AsyncClient + asyncio.gather and compare with (2) above. As concurrency goes up to 1000, hypothesize which model holds up better and measure.
In one line: The GIL is CPython’s global lock, so only one thread at a time runs bytecode. The GIL is released during I/O so threading is effective; CPU-bound goes to multiprocessing / 3.14 free-threaded. asyncio is single-thread, tens of thousands concurrent, can’t solve CPU-bound. Division: ThreadPoolExecutor (I/O sync) / ProcessPoolExecutor (CPU) / asyncio (large-scale I/O async) / 3.14+ free-threaded (CPU concurrency, watch compatibility). Start sync → measure → pick by bottleneck.
Next chapter #
Next, Chapter 20 advanced typing picks up where Chapter 9 typing left off and covers the harder parts in earnest: variance, ParamSpec, Self, TypeGuard / TypeIs, and overload.