00 — Why HTTP serving is not just while True: agent.correct()

Your grammar tutor is a Python function. Serving it over HTTP introduces a whole new world: concurrency, queues, tail latency, batching. This phase is about what sits between the client and agent.correct().

The shape of an inference request

A user types "He goed to school" into a frontend somewhere. From that moment to "they see the correction":

client → DNS → load balancer → reverse proxy → FastAPI process → handler → agent → model → response (reverse)
                                                       └─ this phase is here

Phase 33 is about everything that happens after FastAPI receives the request and before the response is constructed.

The single-request case (Phase 32 baseline)

In Phase 32 we built agent.correct(sentence) -> Correction. Calling it once takes some wall-clock time \(T\). For our Mini-GPT on a CPU, \(T\) might be 200-500 ms for a short sentence (4-8 tokens to generate) when decoding without a KV cache; the cache added in Phase 22 brings that down.

If only one request ever existed at a time, we'd write:

@app.post("/correct")
def correct(req: CorrectRequest) -> CorrectResponse:
    correction = agent.correct(req.sentence)
    return CorrectResponse(corrected=correction.text, explanation=correction.why)

Done. Deploy it, and you're an "MLOps engineer."

The N-concurrent-requests case

Now imagine 50 users send requests in the same second.

Naive sync: the FastAPI worker serves them one at a time. Request 50 waits for requests 1-49 to complete. If each takes 300 ms, request 50 waits 49 × 300 ms = 14.7 seconds before it starts and finishes at the 15-second mark. Tail latency: terrible.
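The queueing arithmetic for the one-at-a-time case can be written out in a few lines. This is a minimal sketch, assuming all 50 requests arrive at time zero and each takes exactly 300 ms; `fifo_latencies` is a hypothetical helper name, not something from the course code.

```python
def fifo_latencies(n_requests: int, service_s: float) -> list[float]:
    """Completion time of each request when a single worker serves them in arrival order."""
    # Request k (1-indexed) waits for the k-1 requests ahead of it, then runs itself.
    return [k * service_s for k in range(1, n_requests + 1)]


SERVICE_S = 0.300  # assumed per-request time, taken from the 300 ms example
latencies = fifo_latencies(50, SERVICE_S)

first_done = latencies[0]             # the lucky first request
last_done = latencies[-1]             # the unlucky 50th request
last_waited = last_done - SERVICE_S   # time the 50th spent queued before starting
```

The first request is done in 0.3 s and the last in 15 s: the same service, a 50× difference in what the user experiences. That spread is the tail-latency problem the rest of this phase attacks.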

Async + thread offload: the event loop interleaves, but each model call still occupies one CPU thread. If you have 8 threads, throughput is bounded at \(8 / T\) req/s. Tail latency improves but you're still doing 50 separate forward passes.
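A minimal sketch of thread offload, under stated assumptions: `fake_correct` is a hypothetical stand-in for agent.correct() that sleeps for 20 ms, and sleeping releases the GIL the way a model running in native code is assumed to. The pool size of 8 mirrors the example above.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def fake_correct(sentence: str) -> str:
    """Hypothetical stand-in for agent.correct(); the sleep mimics model time."""
    time.sleep(0.02)
    return sentence.upper()


async def serve_all(sentences: list[str], n_threads: int) -> tuple[list[str], float]:
    """Offload every call to a thread pool and await them all from the event loop."""
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # The event loop stays free while the worker threads run the blocking calls.
        futures = [loop.run_in_executor(pool, fake_correct, s) for s in sentences]
        results = await asyncio.gather(*futures)
    return list(results), time.perf_counter() - start


# 16 requests on 8 threads need two "waves" of 20 ms each.
results, elapsed = asyncio.run(serve_all([f"s{i}" for i in range(16)], n_threads=8))
```

With 16 requests and 8 threads the run cannot finish in under two service times, which is the \(8 / T\) bound in miniature: adding requests beyond the thread count only adds waves.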

Static batching: "Wait up to 100 ms, then batch whatever I have and run them together." Throughput goes up (one batched forward pass amortises the model's fixed cost). But: the slowest request in the batch determines the latency of all of them. Fast requests pay for slow ones.
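The "wait up to a window, then take whatever arrived" step might look like the sketch below. It assumes requests land on an `asyncio.Queue`; `collect_batch`, `batch_latency_steps`, and the parameter names are illustrative, not the lab's actual API.

```python
import asyncio


async def collect_batch(queue: asyncio.Queue, max_batch: int, max_wait_s: float) -> list:
    """Block for the first item, then gather more until the window closes or the batch is full."""
    batch = [await queue.get()]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # window closed with the batch only partly full
    return batch


def batch_latency_steps(tokens_needed: list[int]) -> int:
    """In a static batch, everyone is released when the longest request finishes."""
    return max(tokens_needed)


async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    for sentence in ["a", "b", "c"]:
        queue.put_nowait(sentence)
    return await collect_batch(queue, max_batch=8, max_wait_s=0.05)


batch = asyncio.run(demo())
```

Here three queued sentences are batched together once the 50 ms window expires. `batch_latency_steps([2, 4, 8])` is 8: the two-token request is held for eight decode steps, which is the "fast requests pay for slow ones" cost.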

Continuous batching: at each token-decoding step, pick the in-flight requests that need the next token and run one step for all of them. When a request emits EOS, it leaves the batch immediately — the others keep going. New requests can join the batch between steps. This is the production sweet spot.
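The step loop described above can be simulated without any model at all. This is a toy sketch, not the lab 03 scheduler: `Request`, `run_continuous`, and `tokens_left` are hypothetical names, and it assumes each request's length is known up front, which a real decoder only discovers when EOS appears.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    tokens_left: int  # decode steps until EOS (known in advance only in this toy)


def run_continuous(waiting: deque, max_batch: int) -> dict[str, int]:
    """Return the decode step at which each request finished."""
    active: list[Request] = []
    finished_at: dict[str, int] = {}
    step = 0
    while waiting or active:
        # Between steps, admit waiting requests into any free slots.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        step += 1
        # One batched decode step: every active request produces one token.
        for req in active:
            req.tokens_left -= 1
            if req.tokens_left == 0:
                finished_at[req.name] = step
        # Requests that emitted EOS leave immediately; the rest keep going.
        active = [req for req in active if req.tokens_left > 0]
    return finished_at


pending = deque([Request("a", 2), Request("b", 6), Request("c", 3), Request("d", 2)])
finished = run_continuous(pending, max_batch=2)
```

With two slots, `a` finishes at step 2 and hands its slot to `c` while `b` is still decoding; everything is done by step 7. Static batches of two over the same requests would hold `a` until step 6 (waiting on `b`) and finish the second batch at step 9.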

This phase walks through all four approaches.

What you'll feel

By the end of lab 03 you will have typed the continuous-batching scheduler — not used one. The scheduler is ~150 lines of Python. Once you've written it, words like "in-flight batching," "iteration-level scheduling," and "PagedAttention's slot manager" stop being magic.

You'll also discover (by load-testing your own service) that the bottleneck moves. Without batching, the bottleneck is the model's per-request cost. With batching, the bottleneck becomes the scheduler's overhead — and if you batch too aggressively, the queue itself becomes the bottleneck.

Why CPU-only is fine here

The Mini-GPT is small enough that a CPU forward pass takes ~10-50 ms. That's comparable to one network round-trip. Continuous batching shows clear wins even at that scale because the scheduling logic is what we're measuring, not raw FLOPs.

For a 70B-parameter LLM on an H100, the same scheduling logic applies; only the constants change (GPU forward = 30 ms, batch size = 256, KV memory per request matters more). The shape of the problem and the shape of the code are the same.

The single most important number to internalize

For our Mini-GPT generating \(\ell \approx 4\) tokens per correction on CPU at ~50 tokens/s (~20 ms per decode step with KV cache from Phase 22), one correction takes:

\[T_\text{correct} \approx 4 \text{ tokens} \times 20 \text{ ms/token} = 80 \text{ ms}.\]

Without batching, throughput is bounded at \(1 / T_\text{correct} = 12.5\) req/s. With static batching of 8 requests, and assuming a batched decode step costs about the same as a single-request step, throughput rises to ~\(8 / T_\text{correct} = 100\) req/s (modulo padding waste). With continuous batching, throughput approaches that same batched bound, but with much better tail latency. These are order-of-magnitude figures: what you should expect before you measure.
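The back-of-envelope numbers are easy to reproduce and re-run with your own measurements. A minimal sketch: `throughput_bound` is a hypothetical helper, and it bakes in the idealised assumption that a batched decode step takes as long as a single-request step.

```python
def throughput_bound(tokens_per_request: int, step_s: float, batch_size: int = 1) -> float:
    """Upper bound on requests/second, assuming a batched step costs the same as a single one."""
    t_correct = tokens_per_request * step_s  # seconds for one correction
    return batch_size / t_correct


TOKENS = 4      # tokens generated per correction
STEP_S = 0.020  # ~20 ms per decode step (about 50 tokens/s)

t_correct_ms = TOKENS * STEP_S * 1000
unbatched_rps = throughput_bound(TOKENS, STEP_S)
batched_rps = throughput_bound(TOKENS, STEP_S, batch_size=8)
```

Swapping in your measured step time and typical output length gives the ceiling to compare load-test results against. Real numbers will land below it, since padding, scheduling overhead, and batched steps that are slower than single ones all eat into the bound.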

What this file does NOT cover

  • The event loop mechanics. Next file.
  • The batching math itself. File 02.
  • Queue sizing. File 03.

Next: 01-async-and-the-event-loop.md