Modern Python Advanced #5: GIL and concurrency — threading vs multiprocessing vs asyncio

Tuesday, April 28, 2026

In #4 Async in depth we briefly noted that “CPU-bound work isn’t solved by asyncio.” This is that post. Here we cover what the GIL is, the division of labor between threading/multiprocessing/asyncio, and the free-threaded build introduced in Python 3.13–3.14.

GIL — Global Interpreter Lock #

CPython (the standard Python implementation) has a global lock called the GIL. Only one thread can execute Python bytecode at a time.

Why it exists #

The GIL exists to keep CPython’s reference counting safe. Every object carries a reference count, and without a lock two threads updating the same count at once could corrupt it, freeing memory that is still in use or leaking it. A single global lock was the simplest fix: it kept single-threaded code fast and made C extensions easy to write, which is why it has survived for over 30 years.

Result — multithreading is meaningless for CPU-bound work #

🚫 Multi-threaded but no speedup

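import threading
import time

def cpu_heavy(n):
    total = 0
    for i in range(n):
        total += i ** 2
    return total

start = time.perf_counter()
threads = [threading.Thread(target=cpu_heavy, args=(10_000_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# On a standard (GIL) build this takes about as long as four sequential calls.
print(f"elapsed: {time.perf_counter() - start:.2f} s")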

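Running the same job on four threads is not four times faster than running it four times in a row. The GIL forces the threads to take turns, so execution is effectively serial: even on an 8-core machine, only about one core’s worth of Python bytecode runs at any moment.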

Where the GIL releases — I/O #

Fortunately, CPython releases the GIL while a thread is blocked on I/O. During socket.recv, time.sleep, a file read, or a DB query, other threads can make progress. So multithreading pays off for I/O-bound work, but not for CPU-bound work.

C extensions like NumPy and Pandas often release the GIL during heavy computations, so numerical work can get some benefit from multithreading. How much varies by library and by operation.
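To make the I/O case concrete, here is a minimal sketch that uses time.sleep as a stand-in for a blocking network or file call. The names fake_io and run_threads are illustrative, not from any library. Because the sleeping threads release the GIL, four waits overlap instead of adding up.

```python
import threading
import time


def fake_io(delay: float) -> None:
    # time.sleep stands in for a blocking socket or file call;
    # the GIL is released while the thread waits.
    time.sleep(delay)


def run_threads(count: int, delay: float) -> float:
    """Run `count` blocking waits in parallel threads and return elapsed seconds."""
    threads = [threading.Thread(target=fake_io, args=(delay,)) for _ in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = run_threads(count=4, delay=0.2)
    # The four 0.2 s waits overlap, so this lands near 0.2 s rather than 0.8 s.
    print(f"4 threads, 0.2 s each: {elapsed:.2f} s")
```

Swapping fake_io for the cpu_heavy loop from the previous section should show the opposite result on a standard build: the elapsed time grows with the number of threads.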

The role of the three tools #

Tool	Fits	Cores used
`asyncio`	I/O-bound, thousands–tens of thousands concurrent	1
`threading`	I/O-bound, sync libraries, low concurrency	1
`multiprocessing`	CPU-bound	N
`concurrent.futures`	unified interface for both modes	depends

`threading` — concurrency for sync code #

threading basics

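import threading
import requests

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs
results = {}

def fetch(url):
    response = requests.get(url, timeout=10)
    results[url] = response.text  # Thread discards return values, so store the result

threads = []
for url in urls:
    t = threading.Thread(target=fetch, args=(url,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()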

Pros:

Use sync libraries as-is
Minimal change to existing code
Fast enough for I/O-bound

Cons:

At thousands of concurrent tasks, the threads themselves become the cost (memory, context switches)
Frequent locking and shared state make bugs hard to reproduce and debug
No help for CPU-bound work because of the GIL

`Lock`, `RLock`, `Semaphore`, `Event` #

Protecting shared state

import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    with lock:
        counter += 1

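counter += 1 isn’t atomic: it is a load, an add, and a store, and another thread can run between any two of those steps. Without a lock, updates get lost. Async code avoids this particular race as long as no await sits between the read and the write, because only one coroutine runs at a time; threads can be switched at any point.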
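The heading also names Semaphore and Event, which the snippet above does not show. As a rough sketch (the function run and its parameters are made up for illustration): an Event holds all workers at a starting line, a Semaphore caps how many run the guarded section at once, and a Lock protects the shared counters.

```python
import threading
import time


def run(n_threads: int = 8, max_concurrent: int = 2) -> tuple[int, int]:
    """Return (tasks completed, peak number of workers inside the guarded section)."""
    slots = threading.Semaphore(max_concurrent)   # caps concurrent workers
    start_signal = threading.Event()              # one-shot "go" flag
    state_lock = threading.Lock()                 # protects the stats dict
    stats = {"active": 0, "peak": 0, "done": 0}

    def worker() -> None:
        start_signal.wait()                       # block until the main thread says go
        with slots:                               # at most max_concurrent pass at once
            with state_lock:
                stats["active"] += 1
                stats["peak"] = max(stats["peak"], stats["active"])
            time.sleep(0.01)                      # stand-in for real work
            with state_lock:
                stats["active"] -= 1
                stats["done"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    start_signal.set()                            # release all waiting workers together
    for t in threads:
        t.join()
    return stats["done"], stats["peak"]


if __name__ == "__main__":
    done, peak = run()
    print(f"done={done}, peak concurrency={peak}")
```

A Semaphore used this way is a common pattern for rate-limiting access to a scarce resource such as a connection pool; RLock would only be needed if the same thread had to re-enter state_lock.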

`concurrent.futures.ThreadPoolExecutor` — easier interface #

ThreadPoolExecutor

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))

You rarely build Thread objects directly. ThreadPoolExecutor is the standard answer.
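pool.map returns results in input order and re-raises a task's exception when you reach that result, which ends the iteration. When each task may fail independently, submit plus as_completed lets you handle results one by one. A minimal sketch, where risky_task and run_all are hypothetical names:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def risky_task(n: int) -> int:
    # Hypothetical task: fails on negative input so the error path is visible.
    if n < 0:
        raise ValueError(f"negative input: {n}")
    return n * n


def run_all(inputs: list[int]) -> tuple[dict[int, int], dict[int, str]]:
    """Run every input through the pool; collect successes and failures separately."""
    results: dict[int, int] = {}
    errors: dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        future_to_input = {pool.submit(risky_task, n): n for n in inputs}
        for future in as_completed(future_to_input):   # yields futures as they finish
            n = future_to_input[future]
            try:
                results[n] = future.result()           # re-raises the task's exception
            except ValueError as exc:
                errors[n] = str(exc)
    return results, errors


if __name__ == "__main__":
    ok, failed = run_all([1, 2, -3, 4])
    print(ok, failed)
```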

`multiprocessing` — true parallelism #

For CPU-bound work, send to separate processes. Each process has its own GIL, enabling true parallel execution.

multiprocessing basics

from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n):
    return sum(i ** 2 for i in range(n))

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(cpu_heavy, [10**7] * 8))

On a machine with at least eight cores, the eight processes run at the same time, so the speedup approaches 8x, minus process startup and IPC overhead.

Cost — process creation and IPC #

Process creation is more expensive than threads
Data transfer goes through serialization/deserialization (pickle) — large data is expensive
Debugging gets harder (process isolation)

So batching heavy computations is what makes it effective. Sending small jobs frequently lets IPC cost exceed the actual work.
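Two common ways to batch are sketched below, assuming a trivially cheap per-item function: let the executor group items with the chunksize argument of map, or group them yourself and return one small result per chunk. The helper names (square, chunked, sum_squares) and the chunk size of 5,000 are illustrative choices, not tuned values.

```python
from concurrent.futures import ProcessPoolExecutor


def square(n: int) -> int:
    return n * n


def chunked(items: list[int], size: int) -> list[list[int]]:
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def sum_squares(chunk: list[int]) -> int:
    # One task handles a whole chunk, so one pickle round trip covers many items.
    return sum(n * n for n in chunk)


if __name__ == "__main__":
    numbers = list(range(100_000))
    with ProcessPoolExecutor() as pool:
        # Option 1: the executor ships items to workers in batches of 5,000.
        squares = list(pool.map(square, numbers, chunksize=5_000))
        # Option 2: batch by hand and send back a single number per chunk.
        total = sum(pool.map(sum_squares, chunked(numbers, 5_000)))
    print(len(squares), total)
```

Option 2 also shrinks the return traffic, which matters when the per-item results are large.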

`if __name__ == "__main__":` — required #

Entry-point guard required

from concurrent.futures import ProcessPoolExecutor

def worker(x): ...

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        pool.map(worker, [1, 2, 3])

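With the spawn start method (the default on Windows and macOS), each child process re-imports the main module. Without the guard, the child would run pool.map(...) again and try to start children of its own; CPython typically aborts this with a RuntimeError rather than recursing forever. The guard makes sure only the parent runs that code.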

Shared state — `Queue`, `Manager`, `shared_memory` #

Sharing data across processes is tricky.

Shared queue

from multiprocessing import Queue, Process

def worker(q):
    q.put("hello from worker")

if __name__ == "__main__":
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    print(q.get())   # hello from worker
    p.join()

multiprocessing.Queue — a process-safe queue
multiprocessing.Manager — share dicts/lists via proxies
multiprocessing.shared_memory (3.8+) — share large arrays like numpy without copying
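As a small illustration of the shared_memory option, the sketch below creates a named block, then attaches to it by name and reads the bytes back. Both steps run in one process here to keep the example short; in real use the reader would be a worker process that receives only the block name and length. write_block and read_block are hypothetical helpers.

```python
from multiprocessing import shared_memory


def write_block(payload: bytes) -> shared_memory.SharedMemory:
    """Create a shared block and copy payload into it; the caller must close and unlink."""
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    return shm


def read_block(name: str, length: int) -> bytes:
    """Attach to an existing block by name and copy `length` bytes out of it."""
    view = shared_memory.SharedMemory(name=name)
    try:
        # The block may be rounded up to a page size, so read an explicit length.
        return bytes(view.buf[:length])
    finally:
        view.close()          # detach only; the owner still holds the block


if __name__ == "__main__":
    data = b"large payload stand-in"
    owner = write_block(data)
    try:
        print(read_block(owner.name, len(data)))
    finally:
        owner.close()
        owner.unlink()        # free the block once every user is done
```

Only the short name string crosses the process boundary, which is what makes this cheaper than pickling the payload for every task.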

`asyncio` — once more #

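The asyncio model from #4 is a single thread plus cooperative yielding: a coroutine runs until it reaches an await, then hands control back to the event loop.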

Aspect	asyncio
Strength	tens of thousands of concurrency, low memory, explicit yield points
Weakness	needs async libraries; no help for CPU-bound

Mixing asyncio with threading #

Use asyncio.to_thread (3.9+) to call a blocking sync function from async code without freezing the event loop.

Mixing

import asyncio
import requests

async def fetch(url):
    # requests.get runs in a worker thread; the event loop keeps serving other tasks
    return await asyncio.to_thread(requests.get, url)
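To see the effect without a network, here is a sketch that swaps the HTTP call for time.sleep. blocking_call and fetch_all are illustrative names. Three blocking calls run in the default thread pool at the same time, so the total is close to one delay rather than three.

```python
import asyncio
import time


def blocking_call(delay: float) -> float:
    # Stand-in for a sync library call such as requests.get.
    time.sleep(delay)
    return delay


async def fetch_all(delays: list[float]) -> list[float]:
    # Each blocking call runs in the default thread pool; the event loop stays free.
    return await asyncio.gather(*(asyncio.to_thread(blocking_call, d) for d in delays))


if __name__ == "__main__":
    start = time.perf_counter()
    print(asyncio.run(fetch_all([0.2, 0.2, 0.2])))
    print(f"elapsed: {time.perf_counter() - start:.2f} s")
```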

Mixing asyncio with multiprocessing #

Async + process pool

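import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n):
    return sum(i ** 2 for i in range(n))

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        result = await loop.run_in_executor(pool, cpu_heavy, 10_000_000)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())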

Send CPU-bound work to a separate process; await the result asynchronously.

Free-threaded — the big change in Python 3.13–3.14 #

PEP 703 introduced a GIL-less build. Experimental in 3.13, officially supported via PEP 779 in 3.14.

What changes #

The GIL is disabled, so CPU-bound multithreading actually runs in parallel
Thread-based code can scale with core count without moving to processes, provided shared state is properly locked
New concurrency models (single process + true multithreading) become possible

Cost — single-thread performance #

The GIL also kept single-threaded execution simple and fast. Removing it costs some single-thread speed: as of 3.14 the overhead is roughly 5–10%, depending on platform, and it has been shrinking with each release.

How to use #

Use the free-threaded build

# Install free-threaded build with uv
uv python install 3.14t
# (t = free-threaded build)

# Have your project use it
uv init my-app --python 3.14t
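After installing, it is worth confirming which build you are actually running, because a free-threaded interpreter can still turn the GIL back on at runtime (for example via PYTHON_GIL=1, or when it imports an extension module that does not declare free-threading support). A minimal check, with helper names of my own choosing:

```python
import sys
import sysconfig


def build_supports_free_threading() -> bool:
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None elsewhere.
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


def gil_is_enabled() -> bool:
    # sys._is_gil_enabled() exists from 3.13; older versions always have the GIL.
    checker = getattr(sys, "_is_gil_enabled", None)
    return True if checker is None else bool(checker())


if __name__ == "__main__":
    print("free-threaded build:", build_supports_free_threading())
    print("GIL enabled right now:", gil_is_enabled())
```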

Library compatibility #

Many C-extension libraries assume the GIL, so compatibility migration is in progress. Major libraries like NumPy, PyTorch, and Pillow have done or are doing free-threaded compatibility. For new projects, free-threaded is worth trying first; for projects with legacy dependencies, compatibility checks are essential.

Sub-interpreter — PEP 734 (3.14) #

Sub-interpreters are another way around the GIL: run multiple interpreters in one process, each with its own GIL.

sub-interpreter (3.14+)

from concurrent.futures import InterpreterPoolExecutor

with InterpreterPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(cpu_heavy, [10**7] * 4))

Lighter than multiprocessing and safer for compatibility than free-threaded — a middle ground. Still new in 3.14, but likely to become an important option for CPU-bound concurrency.

Decision guide — one table #

Job	First try
1000 concurrent HTTP requests	`asyncio` (httpx)
50 HTTP requests, using a sync library	`ThreadPoolExecutor`
Heavy numerical work using 8 cores	`ProcessPoolExecutor`
Heavy numpy work that releases the GIL	`ThreadPoolExecutor` (or numpy’s own parallelism)
New project, mix CPU + I/O	3.14 free-threaded + threading
Stable isolation, share large data	`multiprocessing` + `shared_memory`
Multiple CPU-bound in one process	`InterpreterPoolExecutor` (3.14+)

Common pitfalls #

1) Mixing `time.sleep` with `asyncio.sleep` #

🚫 sync sleep inside async

import time

async def fetch(url):
    time.sleep(1)    # blocks the thread running the event loop; every other coroutine stalls
    ...

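We saw this in Intermediate #7. Note that time.sleep does release the GIL; the problem is that it blocks the one thread the event loop runs on. Inside a coroutine, always use await asyncio.sleep(...) instead.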
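The difference is easy to measure. In this sketch (polite_wait, blocking_wait, and timed are illustrative names), three coroutines each wait 0.1 s. With asyncio.sleep the waits overlap; with time.sleep they run one after another because the loop cannot switch tasks.

```python
import asyncio
import time


async def polite_wait(delay: float) -> None:
    await asyncio.sleep(delay)      # yields to the event loop while waiting


async def blocking_wait(delay: float) -> None:
    time.sleep(delay)               # blocks the loop thread; nothing else runs


async def timed(coro_fn, count: int, delay: float) -> float:
    """Run `count` copies of coro_fn concurrently and return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(coro_fn(delay) for _ in range(count)))
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"asyncio.sleep: {asyncio.run(timed(polite_wait, 3, 0.1)):.2f} s")
    print(f"time.sleep:    {asyncio.run(timed(blocking_wait, 3, 0.1)):.2f} s")
```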

2) `print` isn’t thread-safe #

thread + print

def worker(i):
    print(f"start {i}")
    do_work()
    print(f"end {i}")

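When several threads call print at the same time, their output can interleave, for example two messages on one line followed by two newlines. The logging module is thread-safe and writes each record as a unit, so prefer it in threaded code.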
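A logging version of the worker above might look like the following sketch. The logger name, thread names, and format string are arbitrary choices for the example; adding %(threadName)s to the format is a cheap way to see which thread wrote each line.

```python
import logging
import threading

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(threadName)s %(message)s",
)
log = logging.getLogger("worker-demo")   # logger name is illustrative


def worker(i: int) -> None:
    log.info("start %d", i)
    # ... real work would go here ...
    log.info("end %d", i)


def run(n: int = 4) -> None:
    threads = [
        threading.Thread(target=worker, args=(i,), name=f"w{i}") for i in range(n)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    run()
```

Records from different threads can still arrive in any order, but each one is emitted as a complete line.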

3) Sending NumPy arrays through multiprocessing often #

🚫 Pickling big arrays every time

with ProcessPoolExecutor() as pool:
    pool.map(process, [big_numpy_array] * 100)

Each task pickles the array, sends it to a child, and the child unpickles it. For large arrays, that IPC cost can exceed the computation itself. Consider shared_memory, or NumPy’s own parallelism (multithreaded BLAS/LAPACK).
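You can estimate the per-task transfer cost before reaching for a profiler by pickling the argument yourself. In this sketch a 10 MB bytes object stands in for the array, and "block-name" is a made-up shared-memory name; the point is only the size gap between shipping the data and shipping a reference to it.

```python
import pickle

big_payload = bytes(10_000_000)               # stand-in for a large array, about 10 MB
task_by_value = (big_payload, 0, 1_000)       # ships the whole payload with the task
task_by_reference = ("block-name", 0, 1_000)  # hypothetical block name plus a slice


def pickled_size(obj: object) -> int:
    """Bytes that would cross the process boundary for this task argument."""
    return len(pickle.dumps(obj))


if __name__ == "__main__":
    print("by value:    ", pickled_size(task_by_value), "bytes")
    print("by reference:", pickled_size(task_by_reference), "bytes")
```

With 100 tasks, the by-value version moves roughly a gigabyte through pipes, while the by-reference version moves a few kilobytes.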

4) Deadlocks #

Acquiring multiple locks in different orders causes deadlocks.

🚫 Deadlock

import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def t1():
    with lock_a:
        with lock_b:    # if t2 holds b and waits for a, deadlock
            ...

def t2():
    with lock_b:
        with lock_a:
            ...

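Rule: keep the lock acquisition order consistent across the codebase. Note that threading.RLock does not help here: it only lets the same thread re-acquire a lock it already holds, which avoids self-deadlock in reentrant code, not ordering deadlocks between two threads.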
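One way to enforce a consistent order is to route every multi-lock acquisition through a single helper. The sketch below sorts locks by id() so that all callers agree on the order regardless of how they list the locks. acquire_in_order is a hypothetical helper, and sorting by id is just one possible ordering; an explicit rank per lock works as well.

```python
import threading
from contextlib import ExitStack, contextmanager


@contextmanager
def acquire_in_order(*locks):
    """Acquire locks in one global order (by id), however the caller lists them."""
    with ExitStack() as stack:
        for lock in sorted(locks, key=id):
            stack.enter_context(lock)    # acquired here, released when the stack exits
        yield


lock_a = threading.Lock()
lock_b = threading.Lock()
completed: list[str] = []


def t1() -> None:
    with acquire_in_order(lock_a, lock_b):
        completed.append("t1")


def t2() -> None:
    with acquire_in_order(lock_b, lock_a):   # reversed on purpose; still safe
        completed.append("t2")


def run(rounds: int = 100) -> int:
    """Start `rounds` pairs of t1/t2 threads and return how many finished."""
    threads = []
    for _ in range(rounds):
        threads += [threading.Thread(target=t1), threading.Thread(target=t2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(completed)


if __name__ == "__main__":
    print(run(), "threads finished without deadlock")
```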

Practical advice — how to start #

Step-by-step

1. Get the sync code working first
2. Measure the bottleneck (covered in #7)
3. I/O-bound bottleneck → asyncio or ThreadPoolExecutor
4. CPU-bound bottleneck → ProcessPoolExecutor or free-threaded
5. None of those? → faster library/Cython/Rust extension

Premature optimization applies here too. If sync code is fast enough, not using concurrency is the right answer.

Wrap-up #

What this post covered:

GIL — CPython’s global lock; only one thread executes bytecode at a time
I/O releases the GIL — multithreading is meaningful; CPU-bound isn’t
asyncio (single thread, large-scale concurrency), threading (I/O + sync code), multiprocessing (CPU-bound)
ThreadPoolExecutor / ProcessPoolExecutor in concurrent.futures are the standard
multiprocessing has IPC cost — bundle heavy work; if __name__ == "__main__": guard required
Free-threaded (3.13–3.14, PEP 703/779) — GIL-less build; CPU concurrency actually works
Sub-interpreter (3.14, PEP 734) — multiple interpreters in one process; new option
Pitfalls: sync sleep, print concurrency, IPC cost, deadlocks
Start sync → measure → choose by bottleneck

In the next post (#6 Advanced typing) we cover the next step from Intermediate #2 — the harder parts of typing: Variance, ParamSpec, Self, TypeGuard/TypeIs, and overload.
