01 — async def, the event loop, and why def blocks everything

In short: FastAPI runs on an event loop. If your handler is def (sync), it runs in a threadpool and you only need to avoid sharing state. If it is async def and you make a blocking call (such as running your model on CPU), you stall the entire loop. The rule: offload inference to a thread, or use def.

The two handler types

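FastAPI accepts both kinds of handler. The two are shown on the same path for comparison; a real app would register only one of them: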

@app.post("/correct")
def correct_sync(req): ...        # sync — runs in threadpool

@app.post("/correct")
async def correct_async(req): ...  # async — runs on event loop

What's the difference?

Sync (def) — automatic threadpool

When you define a sync handler, FastAPI runs it in a thread from anyio's default threadpool (40 threads by default). The event loop hands each request to a worker thread and is immediately free to accept the next one.

Pro: No risk of blocking the event loop. Con: Bounded by threadpool size + GIL contention for CPU-bound work.

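For our Mini-GPT (CPU, ~80 ms per request), sync is fine. Throughput is bounded by roughly min(threadpool_size, gil_concurrency) / T_correct, where T_correct is the time one correction takes and gil_concurrency is the number of requests that can make progress at the same time under the GIL. With 40 threads, and with NumPy releasing the GIL during its heavy calls, you can get reasonable throughput.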
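To make the bound above concrete, here is a back-of-envelope sketch. It assumes a deliberately simple model that is not taken from the labs: each request takes T_correct seconds, a fraction gil_fraction of that time is spent holding the GIL (so gil_concurrency is about 1 / gil_fraction), and no more requests than there are cores can compute at once. The function name and the value gil_fraction=0.5 are hypothetical, chosen only for illustration.

```python
def throughput_upper_bound(pool_size, t_correct_s, gil_fraction, cores):
    """Rough upper bound (requests/second) for a threadpool-served handler."""
    # At most pool_size requests are in flight at once.
    thread_bound = pool_size / t_correct_s
    # At most `cores` requests can use a CPU at the same instant.
    core_bound = cores / t_correct_s
    # Only one thread holds the GIL at a time, so GIL-holding time per
    # second of wall clock cannot exceed one second in total.
    if gil_fraction == 0:
        gil_bound = float("inf")
    else:
        gil_bound = 1.0 / (gil_fraction * t_correct_s)
    return min(thread_bound, core_bound, gil_bound)


# Assumed inputs: 40 threads, 80 ms per request, half the time under the GIL.
bound = throughput_upper_bound(pool_size=40, t_correct_s=0.080,
                               gil_fraction=0.5, cores=4)
print(f"upper bound: {bound:.0f} req/s")
```

With these assumed inputs the thread term allows 500 req/s and the core term 50 req/s, but the GIL term caps the total at 25 req/s. The point of the sketch is that the threadpool size is rarely the limiting factor for a small CPU model.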

Async (async def) — runs on the event loop

When you define an async handler, FastAPI runs it directly on the event loop. If anywhere in the handler you do a blocking call without yielding, you stall the loop for every other request.

@app.post("/correct")
async def correct_async(req):
    correction = agent.correct(req.sentence)  # ❌ BLOCKING — model is CPU-bound NumPy
    return ...

This is the single most common FastAPI mistake. The code "looks async" but secretly blocks the loop for 80 ms per request. Concurrency is effectively 1.
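The stall can be observed without FastAPI at all. The sketch below uses only the standard library: fake_inference is a hypothetical stand-in that sleeps for 80 ms instead of running a model, and a background "heartbeat" task records how long the loop takes to get back to it. Note that time.sleep releases the GIL, so this shows only the loop-blocking effect, not CPU contention.

```python
import asyncio
import time


def fake_inference(seconds: float = 0.08) -> str:
    """Hypothetical stand-in for a blocking model call (sleeps, no real work)."""
    time.sleep(seconds)
    return "corrected"


async def blocking_handler() -> str:
    # Blocking call made directly on the event loop: no other task can run
    # until it returns.
    return fake_inference()


async def heartbeat(gaps: list, stop: asyncio.Event) -> None:
    """Wake up every ~5 ms and record the time since the previous wake-up."""
    last = time.perf_counter()
    while not stop.is_set():
        await asyncio.sleep(0.005)
        now = time.perf_counter()
        gaps.append(now - last)
        last = now


async def main() -> float:
    gaps, stop = [], asyncio.Event()
    ticker = asyncio.create_task(heartbeat(gaps, stop))
    await asyncio.sleep(0.02)   # let the heartbeat tick a few times
    await blocking_handler()    # the loop is stalled for the whole call
    await asyncio.sleep(0.02)
    stop.set()
    await ticker
    return max(gaps)


worst_gap = asyncio.run(main())
print(f"worst heartbeat gap: {worst_gap * 1000:.0f} ms")
```

The heartbeat asks to be woken every 5 ms, yet its worst gap is at least as long as the blocking call. Every other request sharing the loop experiences the same delay.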

The fix: offload to a thread

import anyio

@app.post("/correct")
async def correct_async(req):
    correction = await anyio.to_thread.run_sync(agent.correct, req.sentence)
    return ...

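anyio.to_thread.run_sync moves the call to a worker thread, so the event loop stays free to serve other requests. Starlette's run_in_threadpool (re-exported as fastapi.concurrency.run_in_threadpool) is a thin wrapper over the same anyio call. The standard library's asyncio.to_thread also works, but it is not a FastAPI feature: it uses asyncio's default executor rather than anyio's threadpool.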
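The effect of offloading can be sketched with the standard library alone. The example below uses asyncio.to_thread in place of anyio.to_thread.run_sync so that it runs without extra dependencies. fake_inference is again a hypothetical 80 ms stand-in that sleeps rather than computes, so it illustrates loop blocking but not GIL contention.

```python
import asyncio
import time


def fake_inference(seconds: float = 0.08) -> str:
    """Hypothetical stand-in for a blocking model call."""
    time.sleep(seconds)
    return "corrected"


async def handle_inline() -> str:
    # Pitfall: blocking call on the event loop, requests run one at a time.
    return fake_inference()


async def handle_offloaded() -> str:
    # Fix: run the blocking call in a worker thread and await the result.
    return await asyncio.to_thread(fake_inference)


async def serve(handler, n_requests: int = 10) -> float:
    """Fire n_requests concurrently and return the total wall-clock time."""
    start = time.perf_counter()
    await asyncio.gather(*(handler() for _ in range(n_requests)))
    return time.perf_counter() - start


inline_s = asyncio.run(serve(handle_inline))
offloaded_s = asyncio.run(serve(handle_offloaded))
print(f"inline: {inline_s:.2f} s, offloaded: {offloaded_s:.2f} s")
```

Ten inline requests take at least ten times the 80 ms call, because they run strictly one after another. The offloaded version overlaps them across worker threads, so it should finish in a small multiple of a single call.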

Rule of thumb

  • I/O bound (network, disk, DB): use async def with native async clients, e.g. await session.get(...) on an aiohttp ClientSession, await redis.get(...), etc. Native async libraries cooperate with the loop.
  • CPU bound (model inference, image processing): use def and let FastAPI's threadpool handle offloading. Or use async def + explicit to_thread.

For our agent, def is the right default. Lab 01 will load-test both and show that — if you avoid the async def + blocking pitfall — both behave similarly.

What about uvicorn --workers?

Running uvicorn app:app --workers 4 spawns 4 separate processes, each with its own event loop and threadpool. This gets around the GIL: model inference can run on 4 CPU cores in parallel.

Caveat: Each worker loads its own copy of the model weights, so memory use is 4× that of a single process. For our 103,680-param Mini-GPT this is trivial (4 × 400 KB ≈ 1.6 MB), but for a 7B model (~14 GB in fp16), 4 workers would need 56 GB just for weights. This is why the batching schedulers in labs 02-03 are single-process: one process holds a single copy of the weights, and all of its threads share that copy (and the GIL).
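The memory arithmetic is easy to check. This sketch assumes 4 bytes per parameter (fp32) for the Mini-GPT, which is what the ~400 KB figure implies, and 2 bytes per parameter (fp16) for the 7B model. The helper name weights_bytes is made up for this example.

```python
def weights_bytes(n_params: int, bytes_per_param: int, workers: int = 1) -> int:
    """Memory needed for model weights when every worker holds its own copy."""
    return n_params * bytes_per_param * workers


# Mini-GPT: 103,680 parameters, fp32 assumed, 4 uvicorn workers.
mini_gpt = weights_bytes(103_680, bytes_per_param=4, workers=4)
# 7B model: fp16, 4 uvicorn workers.
big_model = weights_bytes(7_000_000_000, bytes_per_param=2, workers=4)

print(f"Mini-GPT, 4 workers: {mini_gpt / 1e6:.2f} MB")
print(f"7B model, 4 workers: {big_model / 1e9:.0f} GB")
```

This counts weights only. Activations, the Python interpreter, and library overhead come on top in every worker.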

For phase 33, we use --workers 1 so the scheduler can see all in-flight requests in one process.

The GIL gotcha for our workload

The CPython GIL (Global Interpreter Lock) prevents two threads from executing Python bytecode simultaneously. NumPy releases the GIL during heavy operations (matrix multiplies, BLAS calls). So our matmul-heavy forward passes can run in parallel across threads.

But: the agent loop's Python orchestration (if/else, dict lookups, json.loads) does not release the GIL. So adding more threads gives diminishing returns once the Python overhead matches the NumPy work.

For our Mini-GPT (which has more Python overhead than NumPy work, because \(d = 64\) is tiny), the GIL does matter. Threadpool concurrency saturates quickly. This is one of the motivations for batching: amortize the Python overhead across many requests.
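A quick way to see this saturation is to time a pure-Python loop on one thread and on several. The sketch below is a stand-in for the agent's orchestration code, not the agent itself: python_overhead is a hypothetical function that never releases the GIL. On a standard CPython build the two timings should come out roughly equal, while a free-threaded build may behave differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def python_overhead(n: int = 200_000) -> int:
    """Pure-Python loop: holds the GIL for its whole duration."""
    total = 0
    for i in range(n):
        total += i % 7
    return total


def timed(n_threads: int, n_tasks: int = 8):
    """Run n_tasks copies of python_overhead on a pool of n_threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(lambda _: python_overhead(), range(n_tasks)))
    return time.perf_counter() - start, results


one_thread_s, results_1 = timed(1)
four_threads_s, results_4 = timed(4)
print(f"1 thread: {one_thread_s:.3f} s, 4 threads: {four_threads_s:.3f} s")
```

The total amount of work is the same in both runs, and only the number of threads changes. If the loop were replaced by large NumPy matrix multiplies, the four-thread run could overlap them, which is the contrast this section describes.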

Performance: sync vs async on our workload

Expected from lab 01 on a 4-core CPU with --workers 1:

| Handler | Concurrent clients | p50 latency | p95 latency | Throughput |
|---|---|---|---|---|
| def (threadpool) | 50 | ~250 ms | ~600 ms | ~30 req/s |
| async def + blocking | 50 | ~2000 ms | ~3500 ms | ~12 req/s |
| async def + to_thread | 50 | ~270 ms | ~620 ms | ~30 req/s |

The middle row is the pitfall. The top and bottom are equivalent (modulo a small overhead for to_thread).

What this file does NOT cover

  • Multi-process model parallelism. Phase 35.
  • Async DB drivers, async queue clients. Not used in this phase.
  • uvloop and other event-loop implementations. Marginal speedup; skipped.

Next: 02-static-vs-continuous-batching.md