Modern Python Advanced #5: GIL and concurrency — threading vs multiprocessing vs asyncio

In #4 Async in depth we briefly noted that “CPU-bound work isn’t solved by asyncio.” This is that post. Here we cover what the GIL is, the division of labor between threading/multiprocessing/asyncio, and the free-threaded build introduced in Python 3.13–3.14.

GIL — Global Interpreter Lock #

CPython (the standard Python implementation) has a global lock called the GIL. Only one thread can execute Python bytecode at a time.

Why it exists #

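The GIL exists to keep CPython’s reference counting safe. Every object carries a reference count, and if two threads updated the same count at the same time without a lock, the count could be corrupted and memory could be freed too early or leaked. A single global lock was the simplest fix: it kept single-threaded code fast and made C extensions easy to write, which is why it has survived for over 30 years.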

Result — multithreading is meaningless for CPU-bound work #

🚫 Multi-threaded but no speedup
import threading

def cpu_heavy(n):
    total = 0
    for i in range(n):
        total += i ** 2
    return total

threads = [threading.Thread(target=cpu_heavy, args=(10_000_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

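Running the same job four times on four threads is not 4x faster; it takes about as long as running the four jobs back to back. The GIL forces the threads to take turns, so even on an 8-core machine only about one core’s worth of Python bytecode runs at any moment.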
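To see the effect on your own machine, a minimal timing sketch is enough. The function names run_sequential and run_threaded and the problem size are illustrative choices, not part of any library, and the exact numbers will vary by machine and Python build.

```python
import threading
import time

def cpu_heavy(n):
    total = 0
    for i in range(n):
        total += i ** 2
    return total

def run_sequential(n, jobs):
    # Run the jobs one after another on the main thread.
    start = time.perf_counter()
    for _ in range(jobs):
        cpu_heavy(n)
    return time.perf_counter() - start

def run_threaded(n, jobs):
    # Run the same jobs on separate threads.
    threads = [threading.Thread(target=cpu_heavy, args=(n,)) for _ in range(jobs)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    n, jobs = 2_000_000, 4  # arbitrary sizes chosen for illustration
    print(f"sequential: {run_sequential(n, jobs):.2f}s")
    print(f"threaded:   {run_threaded(n, jobs):.2f}s")
```

On a standard (GIL) build the two timings should come out in the same ballpark; on a free-threaded build the threaded run may be noticeably faster.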

Where the GIL releases — I/O #

Fortunately, the GIL is released during blocking I/O. While one thread waits in socket.recv, time.sleep, a file read, or a DB query, other threads can make progress. So multithreading pays off for I/O-bound work but not for CPU-bound work.
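A small sketch makes the overlap visible. Here time.sleep stands in for a blocking network or disk call, and fake_io and run_threads are hypothetical helper names used only for this illustration.

```python
import threading
import time

def fake_io(seconds):
    # time.sleep stands in for a blocking I/O call; it releases the GIL while waiting.
    time.sleep(seconds)

def run_threads(count, seconds):
    threads = [threading.Thread(target=fake_io, args=(seconds,)) for _ in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    elapsed = run_threads(4, 0.5)
    print(f"4 waits of 0.5s took {elapsed:.2f}s")
```

Because the waits overlap, the total should land near the length of one wait rather than the sum of all four.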

C extensions such as NumPy and pandas often release the GIL during heavy computations, so numerical work can get some benefit from multithreading. How much varies by library and by operation.

The role of the three tools #

| Tool | Fits | Cores used |
| --- | --- | --- |
| asyncio | I/O-bound, thousands to tens of thousands of concurrent tasks | 1 |
| threading | I/O-bound, sync libraries, low concurrency | 1 |
| multiprocessing | CPU-bound | N |
| concurrent.futures | unified interface for both modes | depends |

threading — concurrency for sync code #

threading basics
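import threading
import requests  # third-party: pip install requests

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs

def fetch(url):
    response = requests.get(url, timeout=10)
    return response.text  # note: a bare Thread discards the return value

# Start one thread per URL
threads = []
for url in urls:
    t = threading.Thread(target=fetch, args=(url,))
    t.start()
    threads.append(t)

# Wait for all of them to finish
for t in threads:
    t.join()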

Pros:

  • Use sync libraries as-is
  • Minimal change to existing code
  • Fast enough for I/O-bound

Cons:

  • With thousands of concurrent tasks, the threads themselves become the cost (memory per thread, context switches)
  • Locks and shared state make bugs hard to reproduce and debug
  • No speedup for CPU-bound work because of the GIL

Lock, RLock, Semaphore, Event #

Protecting shared state
import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    with lock:
        counter += 1

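+= isn’t atomic: it is a load, an add, and a store, and another thread can run between any two of those steps. Without a lock, updates get lost. Asyncio code avoids this particular race because only one coroutine runs at a time and switches happen only at await, but threads can be switched at any point.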
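The heading also names RLock, Semaphore, and Event. The following is a minimal sketch of what each is for; the task functions, the limit of 2, and run_demo are invented for illustration only.

```python
import threading

# RLock: the same thread may re-acquire a lock it already holds.
rlock = threading.RLock()

def inner():
    with rlock:
        return "ok"

def outer():
    with rlock:
        return inner()  # with a plain Lock this nested acquire would block forever

# Semaphore: at most `limit` threads inside the guarded section at once.
limit = 2
slots = threading.Semaphore(limit)
stats_lock = threading.Lock()
active = 0
peak = 0

def limited_task():
    global active, peak
    with slots:
        with stats_lock:
            active += 1
            peak = max(peak, active)
        # ... the rate-limited work would go here ...
        with stats_lock:
            active -= 1

# Event: one thread waits until another thread signals it.
ready = threading.Event()
received = []

def waiter():
    if ready.wait(timeout=5):  # returns True once set() has been called
        received.append("go")

def run_demo():
    workers = [threading.Thread(target=limited_task) for _ in range(8)]
    workers.append(threading.Thread(target=waiter))
    for t in workers:
        t.start()
    ready.set()
    for t in workers:
        t.join()
    return outer(), peak, received
```

In run_demo the recorded peak never exceeds the semaphore limit, and the waiter proceeds only after ready.set() is called.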

concurrent.futures.ThreadPoolExecutor — easier interface #

ThreadPoolExecutor
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))

You rarely build Thread objects directly. ThreadPoolExecutor is the standard answer.
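pool.map re-raises a task’s exception when you iterate to that result, which ends the loop early. When some tasks may fail, a common alternative is submit plus as_completed. This is a sketch under assumptions: the fetch below is a hypothetical stand-in that fails on purpose for URLs containing "bad", and fetch_all is an illustrative helper name.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # Hypothetical stand-in for a real blocking request such as requests.get(url).text
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return f"body of {url}"

def fetch_all(urls, max_workers=10):
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it was created for.
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                errors[url] = exc  # keep going; one failure doesn't stop the rest
    return results, errors
```

Results arrive in completion order rather than submission order, which is why they are keyed by URL here.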

multiprocessing — true parallelism #

For CPU-bound work, send the work to separate processes. Each process has its own interpreter and its own GIL, so they run truly in parallel.

multiprocessing basics
from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n):
    return sum(i ** 2 for i in range(n))

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(cpu_heavy, [10**7] * 8))

On a machine with at least eight cores, the eight processes run on eight cores simultaneously. Expect a speedup close to 8x, minus process startup and IPC overhead.

Cost — process creation and IPC #

  • Process creation is more expensive than threads
  • Data transfer goes through serialization/deserialization (pickle) — large data is expensive
  • Debugging gets harder (process isolation)

So batching heavy computations is what makes it effective. Sending small jobs frequently lets IPC cost exceed the actual work.
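One way to batch is to hand each worker a range instead of a single item, so each task carries only two integers across the process boundary. This is a minimal sketch; sum_squares, make_ranges, and parallel_sum_squares are illustrative names, and the sizes are arbitrary.

```python
from concurrent.futures import ProcessPoolExecutor

def sum_squares(bounds):
    # One task = one whole range, so only two ints are pickled per task.
    start, stop = bounds
    return sum(i * i for i in range(start, stop))

def make_ranges(n, parts):
    step = max(1, -(-n // parts))  # ceiling division, at least 1
    return [(lo, min(lo + step, n)) for lo in range(0, n, step)]

def parallel_sum_squares(n, workers=4):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_squares, make_ranges(n, workers)))

if __name__ == "__main__":
    print(parallel_sum_squares(1_000_000))
```

When the input really is a long list of small items, the chunksize argument of ProcessPoolExecutor.map serves a similar purpose by sending items to workers in groups.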

if __name__ == "__main__": — required #

Entry-point guard required
from concurrent.futures import ProcessPoolExecutor

def worker(x): ...

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        pool.map(worker, [1, 2, 3])

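With the spawn start method (the default on Windows and macOS) and with forkserver, each child process re-imports the main module. Without the guard, that import would reach the pool-creation code again and every child would try to start children of its own; in practice Python aborts with a RuntimeError. With fork the module is not re-imported, but the guard keeps the script portable, so treat it as required.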

Shared state — Queue, Manager, shared_memory #

Sharing data across processes is tricky.

Shared queue
from multiprocessing import Queue, Process

def worker(q):
    q.put("hello from worker")

if __name__ == "__main__":
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    print(q.get())   # hello from worker
    p.join()

  • multiprocessing.Queue: a process-safe queue
  • multiprocessing.Manager: share dicts/lists via proxies
  • multiprocessing.shared_memory (3.8+): share large arrays like numpy without copying
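For shared_memory, a standard-library-only sketch looks like this. It stores a few 64-bit integers with struct instead of a NumPy array to stay dependency-free; COUNT, fill_squares, and read_values are illustrative names.

```python
import struct
from multiprocessing import Process, shared_memory

COUNT = 5  # number of 64-bit integers stored in the block (illustrative)

def fill_squares(name, count):
    # Attach to an existing block by name and write into it in place.
    shm = shared_memory.SharedMemory(name=name)
    try:
        struct.pack_into(f"{count}q", shm.buf, 0, *(i * i for i in range(count)))
    finally:
        shm.close()  # detach; the creator is responsible for unlink()

def read_values(shm, count):
    return list(struct.unpack_from(f"{count}q", shm.buf, 0))

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=COUNT * 8)
    try:
        p = Process(target=fill_squares, args=(shm.name, COUNT))
        p.start()
        p.join()
        print(read_values(shm, COUNT))
    finally:
        shm.close()
        shm.unlink()  # release the block exactly once, from the creator
```

Only the block’s name crosses the process boundary; the data itself is written and read in place. With NumPy, the usual approach is to wrap shm.buf in an ndarray instead of using struct.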

asyncio — once more #

The asyncio model from #4 is a single thread plus cooperative yielding at await points.

| Aspect | asyncio |
| --- | --- |
| Strength | tens of thousands of concurrent tasks, low memory, explicit yield points |
| Weakness | needs async libraries; no help for CPU-bound work |

Mixing asyncio with threading #

asyncio.to_thread (3.9+) runs a sync function in a worker thread so that it doesn’t block the event loop.

Mixing
import asyncio
import requests  # third-party: pip install requests

async def fetch(url):
    return await asyncio.to_thread(requests.get, url)
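To check that several blocking calls really overlap, combine to_thread with asyncio.gather. In this sketch time.sleep plays the role of the sync library call, and blocking_call is a hypothetical stand-in.

```python
import asyncio
import time

def blocking_call(label, seconds):
    # Stand-in for a synchronous library call such as requests.get
    time.sleep(seconds)
    return label

async def main():
    start = time.perf_counter()
    # Each call runs in its own worker thread; the event loop stays free.
    results = await asyncio.gather(
        asyncio.to_thread(blocking_call, "a", 0.3),
        asyncio.to_thread(blocking_call, "b", 0.3),
        asyncio.to_thread(blocking_call, "c", 0.3),
    )
    return results, time.perf_counter() - start

if __name__ == "__main__":
    print(asyncio.run(main()))
```

gather returns the results in the order the awaitables were passed, regardless of which thread finishes first.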

Mixing asyncio with multiprocessing #

Async + process pool
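import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n):
    return sum(i ** 2 for i in range(n))

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        result = await loop.run_in_executor(pool, cpu_heavy, 10_000_000)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())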

Send CPU-bound work to a separate process; await the result asynchronously.

Free-threaded — the big change in Python 3.13–3.14 #

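PEP 703 introduced a build of CPython without the GIL. It shipped as experimental in 3.13 and became officially supported in 3.14 via PEP 779. It is still a separate, optional build; the default interpreter keeps the GIL.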

What changes #

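  • The GIL is gone, so threads can run Python bytecode on several cores at once and CPU-bound multithreading actually pays off
  • Existing threaded code can scale with core count without a rewrite, as long as it does not contend on shared objects or locks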
  • New concurrency models (single process + true multithreading) become possible

Cost — single-thread performance #

The GIL also kept the single-threaded interpreter simple and fast. Removing it costs some single-thread speed: as of 3.14 the overhead is roughly 5–10%, depending on platform and workload, and it has been shrinking over time.

How to use #

Use the free-threaded build
# Install free-threaded build with uv
uv python install 3.14t
# (t = free-threaded build)

# Have your project use it
uv init my-app --python 3.14t
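Once the interpreter is installed, it is worth confirming from inside Python which build is running and whether the GIL is actually off. This is a small sketch; gil_status is an illustrative helper, and the fallback for interpreters older than 3.13 is an assumption that the GIL is enabled there.

```python
import sys
import sysconfig

def gil_status():
    # Py_GIL_DISABLED is 1 on free-threaded builds; 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on 3.13+; assume the GIL is on when it is missing.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = bool(check()) if check is not None else True
    return free_threaded_build, gil_enabled

if __name__ == "__main__":
    build, enabled = gil_status()
    print(f"free-threaded build: {build}, GIL enabled: {enabled}")
```

The two values can differ: on a free-threaded build the GIL can still be switched back on at runtime, for example when an extension module that has not declared free-threading support is imported.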

Library compatibility #

Many C-extension libraries assume the GIL, so migration is still in progress. Major libraries such as NumPy, PyTorch, and Pillow already support free-threaded builds or are actively working on it. For new projects, the free-threaded build is worth trying first; for projects with legacy dependencies, check the compatibility of every C extension before switching.

Sub-interpreter — PEP 734 (3.14) #

Another direction for avoiding the GIL. Run multiple interpreters in one process, each with its own GIL.

sub-interpreter (3.14+)
from concurrent.futures import InterpreterPoolExecutor

with InterpreterPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(cpu_heavy, [10**7] * 4))

Lighter than multiprocessing and safer for compatibility than free-threaded — a middle ground. Still new in 3.14, but likely to become an important option for CPU-bound concurrency.

Decision guide — one table #

| Job | First try |
| --- | --- |
| 1000 concurrent HTTP requests | asyncio (httpx) |
| 50 HTTP requests, using a sync library | ThreadPoolExecutor |
| Heavy numerical work using 8 cores | ProcessPoolExecutor |
| Heavy numpy work that releases the GIL | ThreadPoolExecutor (or numpy’s own parallelism) |
| New project, mix CPU + I/O | 3.14 free-threaded + threading |
| Stable isolation, share large data | multiprocessing + shared_memory |
| Multiple CPU-bound in one process | InterpreterPoolExecutor (3.14+) |

Common pitfalls #

1) Mixing time.sleep with asyncio.sleep #

🚫 sync sleep inside async
import time

async def fetch(url):
    time.sleep(1)    # blocks the thread, so the whole event loop freezes
    ...

We saw this in Intermediate #7. Inside a coroutine, always use await asyncio.sleep(...) instead.
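For contrast, here is a minimal sketch of the non-blocking version. The coroutine body is a placeholder that only sleeps; in real code the await would be a network call.

```python
import asyncio
import time

async def fetch(label):
    await asyncio.sleep(0.3)  # yields to the event loop while waiting
    return label

async def main():
    start = time.perf_counter()
    # All five coroutines wait at the same time on a single thread.
    results = await asyncio.gather(*(fetch(i) for i in range(5)))
    return results, time.perf_counter() - start

if __name__ == "__main__":
    print(asyncio.run(main()))
```

With asyncio.sleep the five waits overlap; swapping in time.sleep would make them run one after another, because each call would hold the only thread.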

2) print isn’t thread-safe #

thread + print
def worker(i):
    print(f"start {i}")
    do_work()
    print(f"end {i}")

When threads print at the same time, their output can interleave, for example two messages on one line followed by two newlines. The logging module serializes writes per handler, so use it instead.
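A sketch of the same worker using logging follows. The logger name, the format string, and the run helper are arbitrary choices for illustration.

```python
import logging
import threading

logging.basicConfig(level=logging.INFO, format="%(threadName)s %(message)s")
log = logging.getLogger("worker")  # logger name is arbitrary

def worker(i):
    log.info("start %s", i)
    # ... the real work would go here ...
    log.info("end %s", i)

def run(count):
    threads = [
        threading.Thread(target=worker, args=(i,), name=f"t{i}")
        for i in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    run(3)
```

Each record is emitted as a whole line, and %(threadName)s shows which thread wrote it. The order of lines across threads is still not deterministic.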

3) Sending NumPy arrays through multiprocessing often #

🚫 Pickling big arrays every time
with ProcessPoolExecutor() as pool:
    pool.map(process, [big_numpy_array] * 100)

Every task pickles the array, sends it to a child, and the child unpickles it. For large arrays that IPC cost can exceed the computation itself. Consider shared_memory or NumPy’s own parallelism (multithreaded BLAS/LAPACK).
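Another option, when the data is read-only, is to hand it to each worker once through the pool’s initializer and send only small index ranges per task. This is a sketch under assumptions: a plain list stands in for the big array, and init_worker, chunk_sum, and total are illustrative names.

```python
from concurrent.futures import ProcessPoolExecutor

_data = None  # per-process global, filled in by the initializer

def init_worker(data):
    # Runs once in each worker process when the pool starts it.
    global _data
    _data = data

def chunk_sum(bounds):
    # Each task carries only two ints; the data is already in the worker.
    start, stop = bounds
    return sum(_data[start:stop])

def total(data, workers=4):
    step = max(1, -(-len(data) // workers))  # ceiling division, at least 1
    ranges = [(lo, lo + step) for lo in range(0, len(data), step)]
    with ProcessPoolExecutor(
        max_workers=workers, initializer=init_worker, initargs=(data,)
    ) as pool:
        return sum(pool.map(chunk_sum, ranges))

if __name__ == "__main__":
    print(total(list(range(1_000_000))))
```

The data is transferred at most once per worker rather than once per task. It is still copied into every worker, so for very large arrays shared_memory remains the better fit.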

4) Deadlocks #

Acquiring multiple locks in different orders causes deadlocks.

🚫 Deadlock
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def t1():
    with lock_a:
        with lock_b:    # if t2 holds b and waits for a, deadlock
            ...

def t2():
    with lock_b:
        with lock_a:
            ...

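Rule: keep the lock acquisition order consistent across the codebase. Note that threading.RLock does not help here: it only lets the same thread re-acquire a lock it already holds, and does nothing about two threads taking two locks in opposite orders.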
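One way to enforce a consistent order is to sort the locks by a stable key before acquiring them. The sketch below uses id() as that key; the Account class and the transfer scenario are invented for illustration.

```python
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Always lock in the same global order (here: by id()), whatever the direction.
    first, second = sorted((src, dst), key=id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

def run_demo(rounds=1000):
    a, b = Account(100), Account(100)
    # Two threads transfer in opposite directions at the same time.
    t1 = threading.Thread(target=lambda: [transfer(a, b, 1) for _ in range(rounds)])
    t2 = threading.Thread(target=lambda: [transfer(b, a, 1) for _ in range(rounds)])
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return a.balance, b.balance
```

Both threads end up taking the two locks in the same order, so neither can hold one lock while waiting for the other.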

Practical advice — how to start #

Step-by-step
1. Get the sync code working first
2. Measure the bottleneck (covered in #7)
3. I/O-bound bottleneck → asyncio or ThreadPoolExecutor
4. CPU-bound bottleneck → ProcessPoolExecutor or free-threaded
5. None of those? → faster library/Cython/Rust extension

Premature optimization applies here too. If sync code is fast enough, not using concurrency is the right answer.

Wrap-up #

What this post covered:

  • GIL — CPython’s global lock; only one thread executes bytecode at a time
  • Blocking I/O releases the GIL, so multithreading helps I/O-bound work but not CPU-bound work
  • asyncio (single thread, large-scale concurrency), threading (I/O + sync code), multiprocessing (CPU-bound)
  • ThreadPoolExecutor / ProcessPoolExecutor in concurrent.futures are the standard
  • multiprocessing has IPC cost — bundle heavy work; if __name__ == "__main__": guard required
  • Free-threaded (3.13–3.14, PEP 703/779) — GIL-less build; CPU concurrency actually works
  • Sub-interpreter (3.14, PEP 734) — multiple interpreters in one process; new option
  • Pitfalls: sync sleep, print concurrency, IPC cost, deadlocks
  • Start sync → measure → choose by bottleneck

In the next post (#6 Advanced typing) we cover the next step from Intermediate #2 — the harder parts of typing: Variance, ParamSpec, Self, TypeGuard/TypeIs, and overload.
