01 — `async def`, the event loop, and why `def` blocks everything¶

FastAPI runs on an event loop. If your handler is def (sync), it runs in a threadpool and all you need to do is avoid sharing state. If it is async def and you make a blocking call (such as running your model on CPU), you stall the entire loop. The rule: offload inference to a thread, or use def.

The two handler types¶

FastAPI accepts both:

@app.post("/correct")
def correct_sync(req): ...        # sync — runs in threadpool

@app.post("/correct")
async def correct_async(req): ...  # async — runs on event loop

What's the difference?

Sync (`def`) — automatic threadpool¶

When you define a sync handler, FastAPI runs it in a thread from anyio's default threadpool (40 threads by default). The event loop hands each request to a worker thread and is immediately free to accept the next one.

Pro: No risk of blocking the event loop. Con: Bounded by threadpool size + GIL contention for CPU-bound work.

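For our Mini-GPT (CPU, ~80 ms per request), sync is fine. Throughput is bounded by roughly min(threadpool_size, gil_concurrency) / T_correct, where T_correct is the time one request takes and gil_concurrency is the number of threads that can make progress at the same time. With 40 threads and NumPy releasing the GIL during its heavy calls, you can get reasonable throughput.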
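To make the "bounded by threadpool size" point concrete, here is a minimal standard-library sketch. It does not use FastAPI or anyio: a `ThreadPoolExecutor` stands in for anyio's thread limiter, and `fake_correct` is a hypothetical stand-in for `agent.correct` that sleeps for 50 ms. `time.sleep` releases the GIL, so this only models the part of the workload that can actually run in parallel.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def fake_correct(sentence: str) -> str:
    """Hypothetical stand-in for agent.correct: blocks its thread for 50 ms."""
    time.sleep(0.05)
    return sentence.upper()


async def serve_burst(n_requests: int, pool_size: int) -> float:
    """Run n_requests blocking calls on pool_size threads; return elapsed seconds."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        start = time.perf_counter()
        # Each call is handed to the pool; the event loop only waits on futures.
        await asyncio.gather(
            *(loop.run_in_executor(pool, fake_correct, f"req {i}")
              for i in range(n_requests))
        )
        return time.perf_counter() - start


if __name__ == "__main__":
    for size in (2, 4, 8):
        elapsed = asyncio.run(serve_burst(n_requests=8, pool_size=size))
        print(f"pool_size={size}: 8 requests in {elapsed * 1000:.0f} ms")
```

With 8 requests and 2 threads the work runs in 4 waves of 50 ms each. Once the pool is at least as large as the burst, everything finishes in about one service time.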

Async (`async def`) — runs on the event loop¶

When you define an async handler, FastAPI runs it directly on the event loop. If anywhere in the handler you do a blocking call without yielding, you stall the loop for every other request.

@app.post("/correct")
async def correct_async(req):
    correction = agent.correct(req.sentence)  # ❌ BLOCKING — model is CPU-bound NumPy
    return ...

This is the single most common FastAPI mistake. The code "looks async" but secretly blocks the loop for 80 ms per request. Concurrency is effectively 1.
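One way to see the stall directly is to run a heartbeat task next to the handlers and measure how long it goes silent. The sketch below uses only the standard library; `fake_correct`, `blocking_handler`, and `heartbeat` are illustrative names, and the 80 ms sleep is an assumed stand-in for the model call.

```python
import asyncio
import time


def fake_correct(sentence: str) -> str:
    """Hypothetical stand-in for agent.correct: blocks the calling thread for 80 ms."""
    time.sleep(0.08)
    return sentence


async def blocking_handler(sentence: str) -> str:
    # Same shape as the pitfall: a blocking call with no await in front of it.
    return fake_correct(sentence)


async def heartbeat(gaps: list, stop: asyncio.Event) -> None:
    """Try to wake every 5 ms and record the real gap between wake-ups."""
    last = time.perf_counter()
    while not stop.is_set():
        await asyncio.sleep(0.005)
        now = time.perf_counter()
        gaps.append(now - last)
        last = now


async def max_loop_gap(n_requests: int) -> float:
    """Return the longest time (seconds) the heartbeat was unable to run."""
    gaps: list = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(gaps, stop))
    await asyncio.sleep(0.02)  # let the heartbeat tick a few times first
    await asyncio.gather(*(blocking_handler(f"req {i}") for i in range(n_requests)))
    stop.set()
    await beat
    return max(gaps)


if __name__ == "__main__":
    gap = asyncio.run(max_loop_gap(n_requests=3))
    print(f"longest heartbeat gap: {gap * 1000:.0f} ms (target was 5 ms)")
```

The heartbeat asks to run every 5 ms, but while the blocking handlers hold the loop it cannot run at all, so the longest gap grows with the number of queued requests.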

The fix: offload to a thread¶

import anyio

@app.post("/correct")
async def correct_async(req):
    correction = await anyio.to_thread.run_sync(agent.correct, req.sentence)
    return ...

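anyio.to_thread.run_sync moves the call to a worker thread, so the event loop stays free while the model runs. Two alternatives do the same job: run_in_threadpool from fastapi.concurrency (a re-export of Starlette's helper, which wraps anyio.to_thread.run_sync) and the standard library's asyncio.to_thread (which uses the loop's default executor rather than anyio's thread limiter).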
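The effect of offloading can be checked without a web server. This sketch compares a blocking coroutine with an offloaded one under the same burst. It uses the standard library's `asyncio.to_thread` (Python 3.9+) as a stand-in for `anyio.to_thread.run_sync` so that it runs without extra dependencies; `fake_correct` and both handler names are hypothetical.

```python
import asyncio
import time


def fake_correct(sentence: str) -> str:
    """Hypothetical stand-in for agent.correct: blocks its thread for 80 ms."""
    time.sleep(0.08)
    return sentence.capitalize()


async def handler_blocking(sentence: str) -> str:
    # Pitfall: the blocking call runs on the event loop thread.
    return fake_correct(sentence)


async def handler_offloaded(sentence: str) -> str:
    # Fix: the blocking call runs on a worker thread; the loop keeps going.
    return await asyncio.to_thread(fake_correct, sentence)


async def time_burst(handler, n_requests: int) -> float:
    """Fire n_requests at once and return the total elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(handler(f"request {i}") for i in range(n_requests)))
    return time.perf_counter() - start


if __name__ == "__main__":
    blocked = asyncio.run(time_burst(handler_blocking, 5))
    offloaded = asyncio.run(time_burst(handler_offloaded, 5))
    print(f"blocking:  {blocked * 1000:.0f} ms for 5 requests")
    print(f"offloaded: {offloaded * 1000:.0f} ms for 5 requests")
```

The blocking version takes about five service times because the calls run one after another. The offloaded version overlaps them, subject to the caveats about the GIL for real Python-heavy work.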

Rule of thumb¶

I/O bound (network, disk, DB): use async def with native async clients, e.g. await session.get(...) on an aiohttp.ClientSession or await redis.get(...) on a redis.asyncio client. Native async libraries cooperate with the loop.
CPU bound (model inference, image processing): use def and let FastAPI's threadpool handle offloading. Or use async def with an explicit to_thread call.

For our agent, def is the right default. Lab 01 will load-test both and show that — if you avoid the async def + blocking pitfall — both behave similarly.

What about `uvicorn --workers`?¶

Running uvicorn app:app --workers 4 spawns 4 separate processes, each with its own event loop and threadpool. This gets around the GIL: model inference can run on 4 CPU cores in parallel.

Caveat: Each worker loads its own copy of the model weights, so memory use is 4×. For our 103,680-param Mini-GPT this is trivial (4 × 400 KB ≈ 1.6 MB, at 4 bytes per parameter), but for a 7B model (~14 GB in fp16), 4 workers would need 56 GB just for weights. This is why the batching schedulers in labs 02-03 are single-process: one process holds a single copy of the weights, and every thread shares it under the GIL.
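The memory figures above are simple multiplication, sketched here. The function name is illustrative, and the 4 bytes per parameter for Mini-GPT is an assumption (fp32) inferred from the ~400 KB figure.

```python
def weights_bytes(n_params: int, bytes_per_param: int, workers: int = 1) -> int:
    """Total bytes of model weights when every worker holds its own copy."""
    return n_params * bytes_per_param * workers


if __name__ == "__main__":
    # Mini-GPT: 103,680 params, assumed fp32 (4 bytes each), 4 workers.
    mini = weights_bytes(103_680, 4, workers=4)
    # 7B model in fp16 (2 bytes each), 4 workers.
    big = weights_bytes(7_000_000_000, 2, workers=4)
    print(f"Mini-GPT x4: {mini / 1e6:.2f} MB")  # 1.66 MB
    print(f"7B fp16 x4:  {big / 1e9:.0f} GB")   # 56 GB
```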

For phase 33, we use --workers 1 so the scheduler can see all in-flight requests in one process.

The GIL gotcha for our workload¶

The CPython GIL (Global Interpreter Lock) prevents two threads from executing Python bytecode simultaneously. NumPy releases the GIL during heavy operations (matrix multiplies, BLAS calls). So our matmul-heavy forward passes can run in parallel across threads.

But: the agent loop's Python orchestration (if/else, dict lookups, json.loads) does not release the GIL. So adding more threads gives diminishing returns once the Python overhead matches the NumPy work.

For our Mini-GPT (which has more Python overhead than NumPy work, because \(d = 64\) is tiny), the GIL does matter. Threadpool concurrency saturates quickly. This is one of the motivations for batching: amortize the Python overhead across many requests.
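A deliberately crude model shows why the threadpool saturates. Assume a fraction `gil_fraction` of each request is Python code that holds the GIL and therefore runs one thread at a time, while the rest is NumPy work that can spread across cores. Under that assumption the speedup over a single thread cannot exceed `1 / gil_fraction`. The function name and the fractions below are hypothetical, not measurements from the lab.

```python
def thread_speedup(n_threads: int, gil_fraction: float, n_cores: int) -> float:
    """Upper bound on speedup over one thread under a simplified GIL model.

    gil_fraction: share of per-request time spent holding the GIL (0 < f <= 1).
    The bound is the smallest of: thread count, core count, and 1 / gil_fraction.
    """
    if not 0.0 < gil_fraction <= 1.0:
        raise ValueError("gil_fraction must be in (0, 1]")
    return min(n_threads, n_cores, 1.0 / gil_fraction)


if __name__ == "__main__":
    for fraction in (0.25, 0.5, 0.75):
        bound = thread_speedup(n_threads=40, gil_fraction=fraction, n_cores=4)
        print(f"GIL-held fraction {fraction:.2f}: at most {bound:.2f}x with 40 threads")
```

In this model, once Python overhead is half of the request time, 40 threads buy at most a 2× speedup. Batching attacks the problem from the other side by shrinking the Python share per request.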

Performance: sync vs async on our workload¶

Expected from lab 01 on a 4-core CPU with --workers 1:

| Handler | 50 concurrent clients | p50 latency | p95 latency | Throughput |
| --- | --- | --- | --- | --- |
| `def` (threadpool) | 50 reqs | ~250 ms | ~600 ms | ~30 req/s |
| `async def` + blocking | 50 reqs | ~2000 ms | ~3500 ms | ~12 req/s |
| `async def` + `to_thread` | 50 reqs | ~270 ms | ~620 ms | ~30 req/s |

The middle row is the pitfall. The top and bottom are equivalent (modulo a small overhead for to_thread).
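The middle row can be approximated on paper. If all 50 requests arrive at once and the loop serves them strictly one at a time at 80 ms each, request k finishes after k × 80 ms. The sketch below works through that arithmetic under those assumptions (simultaneous arrival, zero overhead, nearest-rank percentiles); the helper names are illustrative.

```python
import math


def serialized_latencies_ms(n_requests: int, service_ms: int) -> list:
    """Completion time of each request when they are served one at a time."""
    return [service_ms * (k + 1) for k in range(n_requests)]


def nearest_rank(sorted_values: list, pct: float):
    """Nearest-rank percentile of an already sorted list."""
    rank = math.ceil(pct / 100 * len(sorted_values))
    return sorted_values[rank - 1]


if __name__ == "__main__":
    latencies = serialized_latencies_ms(n_requests=50, service_ms=80)
    throughput = len(latencies) / (latencies[-1] / 1000)
    print(f"p50 = {nearest_rank(latencies, 50)} ms")   # 2000 ms
    print(f"p95 = {nearest_rank(latencies, 95)} ms")   # 3840 ms
    print(f"throughput = {throughput:.1f} req/s")      # 12.5 req/s
```

The p50 and throughput land on the table's figures; the p95 estimate comes out somewhat above the table's rounded ~3500 ms, which is within what one would expect from approximate numbers.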

What this file does NOT cover¶

Multi-process model parallelism. Phase 35.
Async DB drivers, async queue clients. Not used in this phase.
uvloop and other event-loop implementations. Marginal speedup; skipped.

Next: 02-static-vs-continuous-batching.md
