Modern Python Advanced #5: GIL and concurrency — threading vs multiprocessing vs asyncio
In #4 Async in depth we briefly noted that “CPU-bound work isn’t solved by asyncio.” This is that post. Here we cover what the GIL is, the division of labor between threading/multiprocessing/asyncio, and the free-threaded build introduced in Python 3.13–3.14.
GIL: Global Interpreter Lock
CPython (the standard Python implementation) has a global lock called the GIL. Only one thread can execute Python bytecode at a time.
Why it exists
The GIL exists to keep CPython's reference counting safe. Without a lock, two threads updating the same object's reference count at once could corrupt the counter and, with it, memory. A single global lock was the simple choice: it kept single-threaded code fast and C extensions easy to write, and it has stayed in place for over 30 years.
Result: multithreading doesn't speed up CPU-bound work

    import threading

    def cpu_heavy(n):
        # Pure-Python arithmetic: the running thread holds the GIL throughout
        total = 0
        for i in range(n):
            total += i ** 2
        return total

    threads = [threading.Thread(target=cpu_heavy, args=(10_000_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Running the same job in four threads takes about as long as running it four times in a row, not a quarter of the time. The GIL forces effectively serial execution: even on an 8-core machine, only about one core's worth of Python bytecode runs at any moment.
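The claim above is easy to measure. Here is a minimal sketch that times the same four jobs serially and in threads; the helper names run_serial and run_threaded are made up for this illustration, and n is kept small so it finishes quickly. On a regular (GIL) build the two timings should come out roughly equal; the exact numbers depend on your machine.

```python
import threading
import time

def cpu_heavy(n):
    total = 0
    for i in range(n):
        total += i ** 2
    return total

def run_serial(n, jobs):
    """Run the job `jobs` times, one after another."""
    start = time.perf_counter()
    results = [cpu_heavy(n) for _ in range(jobs)]
    return results, time.perf_counter() - start

def run_threaded(n, jobs):
    """Run the job once per thread, `jobs` threads in total."""
    results = [None] * jobs

    def worker(slot):
        results[slot] = cpu_heavy(n)  # each thread writes only its own slot

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(jobs)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start

serial_results, serial_s = run_serial(200_000, 4)
threaded_results, threaded_s = run_threaded(200_000, 4)
print(f"serial:   {serial_s:.3f}s")
print(f"threaded: {threaded_s:.3f}s")
```

Both runs produce the same results; only the wall-clock time is worth comparing.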
Where the GIL releases: I/O
Fortunately, the GIL is released during I/O operations. socket.recv, time.sleep, file reads, DB queries — other threads can progress concurrently. So multithreading is meaningful for I/O-bound work but not for CPU-bound.
C extensions like NumPy and Pandas often release the GIL during heavy computations, so numerical work gets some multithreading benefit. Varies by library.
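The I/O case can be checked the same way. In this sketch, time.sleep stands in for a blocking call such as a socket read (fake_io and timed_threads are hypothetical names). Because the sleeping thread releases the GIL, four 0.2-second waits should overlap instead of adding up.

```python
import threading
import time

def fake_io(seconds):
    # Stand-in for a blocking I/O call; the GIL is released while waiting
    time.sleep(seconds)

def timed_threads(count, seconds):
    """Start `count` threads that each wait `seconds`; return total wall time."""
    threads = [threading.Thread(target=fake_io, args=(seconds,)) for _ in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

elapsed = timed_threads(4, 0.2)
print(f"4 waits of 0.2s took {elapsed:.2f}s in total")
```

Run serially, the same four waits would take at least 0.8 seconds.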
The role of the three tools
| Tool | Fits | Cores used |
|---|---|---|
| asyncio | I/O-bound, thousands–tens of thousands concurrent | 1 |
| threading | I/O-bound, sync libraries, low concurrency | 1 |
| multiprocessing | CPU-bound | N |
| concurrent.futures | unified interface for thread and process pools | depends |
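threading: concurrency for sync code

    import threading

    import requests  # third-party sync HTTP library

    urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs

    def fetch(url):
        response = requests.get(url)
        return response.text  # note: a bare Thread discards this return value

    threads = []
    for url in urls:
        t = threading.Thread(target=fetch, args=(url,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

Pros: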
- Use sync libraries as-is
- Minimal change to existing code
- Fast enough for I/O-bound
Cons:
- At thousands of concurrent tasks, the threads themselves become the cost (per-thread memory, context switches)
- Hard to debug with frequent locks/shared state
- CPU-bound is meaningless due to the GIL
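Lock, RLock, Semaphore, Event

    import threading

    counter = 0
    lock = threading.Lock()

    def increment():
        global counter
        with lock:  # only one thread at a time runs the read-modify-write
            counter += 1

The += operator isn't atomic: it is a load, an add, and a store, and another thread can run between those steps. Without a lock, race conditions occur. asyncio code switches tasks only at an await, so a plain counter += 1 can't be interrupted there, but threads can be preempted at any point.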
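The heading also names Semaphore and Event, which the Lock example doesn't show. Here is a small sketch of both, under assumed names (MAX_PARALLEL, slots, start_signal): an Event holds all workers at a starting line, and a Semaphore caps how many of them are inside the guarded section at once.

```python
import threading
import time

MAX_PARALLEL = 3                            # illustrative limit
slots = threading.Semaphore(MAX_PARALLEL)   # at most 3 holders at a time
start_signal = threading.Event()            # workers wait until this is set
state_lock = threading.Lock()               # protects the two counters below
active = 0
peak = 0

def worker():
    global active, peak
    start_signal.wait()          # block until the main thread calls set()
    with slots:                  # acquire one slot; released on exit
        with state_lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)         # pretend to do rate-limited work
        with state_lock:
            active -= 1

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
start_signal.set()               # release all ten workers at once
for t in threads:
    t.join()
print(f"peak concurrency: {peak} (limit {MAX_PARALLEL})")
```

The recorded peak never exceeds the semaphore's limit, however many threads are started.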
concurrent.futures.ThreadPoolExecutor: easier interface

    from concurrent.futures import ThreadPoolExecutor

    with ThreadPoolExecutor(max_workers=10) as pool:
        # map() yields results in the same order as urls
        results = list(pool.map(fetch, urls))

You rarely build Thread objects directly. ThreadPoolExecutor is the standard answer.
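pool.map re-raises the first exception when you iterate the results, so one failing URL hides the rest. When you want per-task error handling, submit plus as_completed is the usual alternative. A sketch follows, with fake_fetch as a made-up stand-in for a real network call so that it runs offline:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_fetch(url):
    """Hypothetical stand-in for a network request."""
    time.sleep(0.01)
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return f"body of {url}"

def fetch_all(urls, max_workers=4):
    ok, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which future belongs to which URL
        future_to_url = {pool.submit(fake_fetch, u): u for u in urls}
        for future in as_completed(future_to_url):  # yields in completion order
            url = future_to_url[future]
            try:
                ok[url] = future.result()   # re-raises the worker's exception
            except ValueError as exc:
                failed[url] = str(exc)
    return ok, failed

ok, failed = fetch_all([
    "https://example.com/a",
    "https://example.com/bad",
    "https://example.com/b",
])
print(sorted(ok), sorted(failed))
```

Results arrive in completion order, which is why they are keyed by URL rather than collected into a list.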
multiprocessing: true parallelism
For CPU-bound work, send the work to separate processes. Each process has its own interpreter and its own GIL, so they run truly in parallel.

    from concurrent.futures import ProcessPoolExecutor

    def cpu_heavy(n):
        return sum(i ** 2 for i in range(n))

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=8) as pool:
            results = list(pool.map(cpu_heavy, [10**7] * 8))

Eight processes can use eight cores at once. On a machine with at least eight free cores, expect a speedup approaching 8x, minus process startup and result-transfer overhead.
Cost: process creation and IPC
- Process creation is more expensive than threads
- Data transfer goes through serialization/deserialization (pickle) — large data is expensive
- Debugging gets harder (process isolation)
So batching heavy computations is what makes it effective. Sending small jobs frequently lets IPC cost exceed the actual work.
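There are two common ways to batch, sketched below under assumed helper names (square, sum_squares, split_range). One keeps the tiny tasks but lets map ship them in groups via its chunksize argument; the other reshapes the work so each task is a whole sub-range.

```python
from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

def sum_squares(bounds):
    """One big task: sum of squares over the half-open range [start, stop)."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))

def split_range(n, parts):
    """Split range(n) into at most `parts` contiguous (start, stop) pairs."""
    step = max(1, -(-n // parts))  # ceiling division, at least 1
    return [(lo, min(lo + step, n)) for lo in range(0, n, step)]

if __name__ == "__main__":
    n = 1_000_000
    with ProcessPoolExecutor(max_workers=4) as pool:
        # Option 1: many tiny tasks, sent to workers 1,000 at a time
        squares = list(pool.map(square, range(10_000), chunksize=1_000))
        # Option 2: four large tasks, one sub-range each
        total = sum(pool.map(sum_squares, split_range(n, 4)))
    print(len(squares), total)
```

Which option wins depends on how much data each task needs; the second sends only two integers per task, which is about as cheap as IPC gets.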
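The if __name__ == "__main__": guard is required

    from concurrent.futures import ProcessPoolExecutor

    def worker(x): ...

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:
            pool.map(worker, [1, 2, 3])

With the spawn and forkserver start methods, multiprocessing re-imports the main module in each child process. spawn is the default on Windows and macOS, and Python 3.14 made forkserver the default on Linux. Without the guard, each child would reach the pool-creating code again during that import and try to start children of its own; CPython typically stops this with a RuntimeError rather than letting it recurse. The guard is required.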
Shared state: Queue, Manager, shared_memory
Sharing data across processes is tricky.

    from multiprocessing import Process, Queue

    def worker(q):
        q.put("hello from worker")

    if __name__ == "__main__":
        q = Queue()
        p = Process(target=worker, args=(q,))
        p.start()
        print(q.get())  # hello from worker
        p.join()

- multiprocessing.Queue: a process-safe queue
- multiprocessing.Manager: share dicts/lists via proxies
- multiprocessing.shared_memory (3.8+): share large arrays like numpy without copying
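Of the three options in the list, shared_memory is the least obvious to use, so here is a minimal stdlib-only sketch. It shares four raw bytes rather than a numpy array, and double_in_place is a hypothetical worker name. The parent creates the block, the child attaches to it by name and modifies it in place, and nothing is pickled except the name and a length.

```python
from multiprocessing import Process, shared_memory

def double_in_place(name, count):
    """Attach to an existing block by name and double its first `count` bytes."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        for i in range(count):
            shm.buf[i] = (shm.buf[i] * 2) % 256
    finally:
        shm.close()  # detach only; the creator is responsible for unlink()

if __name__ == "__main__":
    data = bytes([1, 2, 3, 4])
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    try:
        shm.buf[:len(data)] = data
        p = Process(target=double_in_place, args=(shm.name, len(data)))
        p.start()
        p.join()
        # The parent sees the child's writes without any copy back
        print(list(shm.buf[:len(data)]))
    finally:
        shm.close()
        shm.unlink()  # free the block once every process is done with it
```

The close/unlink split matters: every process that attaches calls close, but only one process, usually the creator, calls unlink.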
asyncio: once more
The async model from #4 is a single thread plus cooperative yielding: a coroutine runs until it reaches an await, then hands control back to the event loop.
| Aspect | asyncio |
|---|---|
| Strength | tens of thousands of concurrency, low memory, explicit yield points |
| Weakness | needs async libraries; no help for CPU-bound |
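Mixing asyncio with threading
Use asyncio.to_thread (3.9+) to call a blocking sync function from async code without freezing the event loop.

    import asyncio

    import requests  # third-party sync HTTP library

    async def fetch(url):
        # requests.get runs in a worker thread; the event loop stays free
        return await asyncio.to_thread(requests.get, url)

Mixing asyncio with multiprocessing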

    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    def cpu_heavy(n):
        return sum(i ** 2 for i in range(n))

    async def main():
        loop = asyncio.get_running_loop()
        with ProcessPoolExecutor() as pool:
            return await loop.run_in_executor(pool, cpu_heavy, 10_000_000)

    if __name__ == "__main__":
        print(asyncio.run(main()))

Send CPU-bound work to a separate process and await the result asynchronously.
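Both offloading patterns in this section return awaitables, so they compose with asyncio.gather. Here is a sketch of the thread case, where blocking_call is a made-up function that sleeps in place of a real sync library call. Three 0.2-second calls should finish in roughly 0.2 seconds total rather than 0.6.

```python
import asyncio
import time

def blocking_call(tag):
    """Hypothetical sync function; time.sleep stands in for blocking I/O."""
    time.sleep(0.2)
    return f"done {tag}"

async def main():
    start = time.perf_counter()
    # Each call runs in its own worker thread; gather keeps argument order
    results = await asyncio.gather(
        *(asyncio.to_thread(blocking_call, i) for i in range(3))
    )
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

The same gather call works with loop.run_in_executor futures when the work is CPU-bound and goes to a process pool instead.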
Free-threaded: the big change in Python 3.13–3.14
PEP 703 introduced a GIL-less build. Experimental in 3.13, officially supported via PEP 779 in 3.14.
What changes
- The GIL is gone, so CPU-bound multithreading actually runs in parallel
- Existing sync code running in many threads can scale with core count, as long as the threads aren't contending for the same locks or shared objects
- New concurrency models (single process + true multithreading) become possible
Cost: single-thread performance
The GIL also kept single-thread execution simple and fast. Removing it requires finer-grained synchronization inside the interpreter, which costs some single-thread speed: roughly 5–10% as of 3.14, depending on platform and compiler, and the gap has been shrinking with each release.
How to use

    # Install the free-threaded build with uv (the "t" suffix selects it)
    uv python install 3.14t
    # Have your project use it
    uv init my-app --python 3.14t

Library compatibility
Many C-extension libraries assume the GIL, so compatibility migration is in progress. Major libraries like NumPy, PyTorch, and Pillow have done or are doing free-threaded compatibility. For new projects, free-threaded is worth trying first; for projects with legacy dependencies, compatibility checks are essential.
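One practical compatibility check is to confirm at runtime whether the GIL is really off. A free-threaded interpreter can re-enable the GIL, for example when it loads an extension module that hasn't declared free-threading support. The sketch below uses two real hooks, sysconfig's Py_GIL_DISABLED variable and sys._is_gil_enabled() (added in 3.13); the wrapper function names are made up for illustration.

```python
import sys
import sysconfig

def build_supports_free_threading():
    """True if this interpreter was compiled as a free-threaded build."""
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

def gil_is_enabled():
    """True if the GIL is active right now in this process."""
    check = getattr(sys, "_is_gil_enabled", None)  # absent before 3.13
    return True if check is None else bool(check())

print("free-threaded build:", build_supports_free_threading())
print("GIL currently enabled:", gil_is_enabled())
```

On a 3.14t interpreter the first line should print True; if the second also prints True, something in the process turned the GIL back on.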
Sub-interpreter: PEP 734 (3.14)
Another direction for avoiding the GIL. Run multiple interpreters in one process, each with its own GIL.

    from concurrent.futures import InterpreterPoolExecutor  # new in 3.14

    # cpu_heavy is the function from the multiprocessing example
    with InterpreterPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(cpu_heavy, [10**7] * 4))

This is a middle ground: lighter than multiprocessing, and safer for compatibility than free-threaded. It is still new in 3.14, but likely to become an important option for CPU-bound concurrency.
Decision guide: one table
| Job | First try |
|---|---|
| 1000 concurrent HTTP requests | asyncio (httpx) |
| 50 HTTP requests, using a sync library | ThreadPoolExecutor |
| Heavy numerical work using 8 cores | ProcessPoolExecutor |
| Heavy numpy work that releases the GIL | ThreadPoolExecutor (or numpy’s own parallelism) |
| New project, mix CPU + I/O | 3.14 free-threaded + threading |
| Stable isolation, share large data | multiprocessing + shared_memory |
| Multiple CPU-bound in one process | InterpreterPoolExecutor (3.14+) |
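Common pitfalls
1) Mixing time.sleep with asyncio.sleep

    import time

    async def fetch(url):
        time.sleep(1)  # blocks the event-loop thread: every other coroutine stalls
        ...

We saw this in Intermediate #7. time.sleep does release the GIL, but that doesn't help here, because the event loop runs all coroutines on the one thread that is now asleep. In async code, always use await asyncio.sleep(...) instead.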
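For contrast, here is a sketch of the correct version, with job as a hypothetical coroutine. Each await asyncio.sleep hands control back to the event loop, so three 0.2-second waits overlap on a single thread.

```python
import asyncio
import time

async def job(tag, delay):
    await asyncio.sleep(delay)  # yields to the event loop while waiting
    return tag

async def main():
    start = time.perf_counter()
    tags = await asyncio.gather(job("a", 0.2), job("b", 0.2), job("c", 0.2))
    return tags, time.perf_counter() - start

tags, elapsed = asyncio.run(main())
print(tags, f"{elapsed:.2f}s")
```

Swap the await asyncio.sleep(delay) for time.sleep(delay) and the total goes to about 0.6 seconds, because each coroutine then holds the loop for its full wait.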
2) print isn't thread-safe

    def worker(i):
        print(f"start {i}")
        do_work()
        print(f"end {i}")

When threads print at the same time, output from different threads may interleave, sometimes within a single line. The logging module is thread-safe, so use it instead.
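A sketch of the logging version follows. It writes to an in-memory stream only so the output can be inspected afterwards; a real program would log to stderr or a file. The logger name worker_demo and the thread names are made up. Each handler serializes its writes with an internal lock, so every record comes out as one intact line.

```python
import io
import logging
import threading

stream = io.StringIO()  # in-memory sink so the output can be examined
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(threadName)s %(message)s"))

logger = logging.getLogger("worker_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False                   # keep output out of the root logger
logger.addHandler(handler)

def worker(i):
    logger.info("start %d", i)
    logger.info("end %d", i)

threads = [
    threading.Thread(target=worker, args=(i,), name=f"w{i}") for i in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

lines = stream.getvalue().splitlines()
print(len(lines), "complete lines")
```

The order of lines across threads is still nondeterministic; what logging guarantees is that no line is split by another thread's output.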
3) Sending NumPy arrays through multiprocessing often

    with ProcessPoolExecutor() as pool:
        # each of the 100 tasks pickles and ships its own copy of the array
        pool.map(process, [big_numpy_array] * 100)

Each task serializes the array (pickle), sends it to a child, and the child deserializes it. For large arrays that IPC cost can exceed the computation itself. Consider shared_memory or numpy's own parallelism (BLAS/LAPACK multithreading).
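4) Deadlocks
Acquiring multiple locks in different orders causes deadlocks.

    import threading

    lock_a = threading.Lock()
    lock_b = threading.Lock()

    def t1():
        with lock_a:
            with lock_b:  # if t2 holds b and waits for a, deadlock
                ...

    def t2():
        with lock_b:
            with lock_a:
                ...

Rule: keep lock acquisition order consistent across the codebase. threading.RLock doesn't help with this: it only lets the same thread re-acquire a lock it already holds, which addresses self-deadlock in nested calls, not two threads taking two locks in opposite orders.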
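One way to enforce a consistent order when the two locks are only known at runtime is to sort them by some stable key before acquiring. The sketch below uses a hypothetical Account class and orders by id(); any total order that every caller agrees on would work equally well.

```python
import threading

class Account:
    """Hypothetical example type: a balance guarded by its own lock."""
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    if src is dst:
        return  # nothing to move; also avoids locking the same Lock twice
    # Lock in one global order regardless of transfer direction
    first, second = sorted((src, dst), key=id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

def many_transfers(src, dst, times):
    for _ in range(times):
        transfer(src, dst, 1)

a, b = Account(1000), Account(1000)
t1 = threading.Thread(target=many_transfers, args=(a, b, 10_000))
t2 = threading.Thread(target=many_transfers, args=(b, a, 10_000))
t1.start()
t2.start()
t1.join()
t2.join()
print(a.balance, b.balance)
```

With the sorted acquisition, both threads always take the same lock first, so the circular wait from the t1/t2 example cannot form.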
Practical advice: how to start
1. Get the sync code working first
2. Measure the bottleneck (covered in #7)
3. I/O-bound bottleneck → asyncio or ThreadPoolExecutor
4. CPU-bound bottleneck → ProcessPoolExecutor or free-threaded
5. None of those? → faster library/Cython/Rust extension
The warning about premature optimization applies here too. If sync code is fast enough, not using concurrency is the right answer.
Wrap-up
What this post covered:
- GIL: CPython's global lock; only one thread executes bytecode at a time
- I/O releases the GIL, so multithreading helps I/O-bound work but not CPU-bound work
- asyncio (single thread, large-scale concurrency), threading (I/O + sync code), multiprocessing (CPU-bound)
- ThreadPoolExecutor / ProcessPoolExecutor in concurrent.futures are the standard
- multiprocessing has IPC cost, so bundle heavy work; the if __name__ == "__main__": guard is required
- Free-threaded (3.13–3.14, PEP 703/779): GIL-less build; CPU concurrency actually works
- Sub-interpreter (3.14, PEP 734): multiple interpreters in one process; new option
- Pitfalls: sync sleep, print concurrency, IPC cost, deadlocks
- Start sync → measure → choose by bottleneck
In the next post (#6 Advanced typing) we cover the next step from Intermediate #2 — the harder parts of typing: Variance, ParamSpec, Self, TypeGuard/TypeIs, and overload.