02 - Static vs continuous batching: the core of modern LLM serving

Static batching is "gather N requests, run the whole batch". Continuous batching is "run one decode step for every request that is still alive; when one emits EOS, it leaves; new requests enter between steps". The scheduling axis changes from the request to the token step.

The model's forward pass amortises fixed cost

A single forward pass through the Mini-GPT on a sequence of length \(T\) does \(O(T \cdot d^2)\) matmul work (plus an \(O(T^2 \cdot d)\) attention term that stays small for short sequences). With batch dimension \(B\), that is, \(B\) sequences processed in parallel, the work is \(O(B \cdot T \cdot d^2)\), but the Python overhead (the orchestration, the slot-setting, the cache lookups) is essentially constant in \(B\).

So: at low \(B\), you're Python-bound; at high \(B\), you're matmul-bound; the throughput-vs-\(B\) curve saturates somewhere in between.

This is the whole motivation for batching. One forward pass on \(B\) requests is much cheaper than \(B\) forward passes on 1 request each.
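The saturation argument can be made concrete with a toy cost model. This is a minimal sketch, not a measurement: `OVERHEAD_MS` and `PER_SEQ_MS` are made-up constants, and treating matmul time as exactly linear in \(B\) is a simplifying assumption.

```python
# Toy cost model: one forward pass costs a fixed overhead plus a per-sequence term.
# Both constants are illustrative assumptions, not numbers measured on the Mini-GPT.
OVERHEAD_MS = 5.0   # hypothetical Python/orchestration cost per forward pass
PER_SEQ_MS = 0.5    # hypothetical matmul cost per sequence in the batch


def step_time_ms(batch_size: int) -> float:
    """Time for one forward pass over `batch_size` sequences."""
    return OVERHEAD_MS + PER_SEQ_MS * batch_size


def throughput_seq_per_s(batch_size: int) -> float:
    """Sequences processed per second at a given batch size."""
    return batch_size / step_time_ms(batch_size) * 1000.0


for b in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"B={b:>3}  step={step_time_ms(b):6.1f} ms  "
          f"throughput={throughput_seq_per_s(b):7.1f} seq/s")
```

Under this model throughput rises steeply while the fixed overhead dominates, then flattens towards the ceiling of `1000 / PER_SEQ_MS` sequences per second once the per-sequence term dominates.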

Static batching

The naive approach.

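import time
from queue import Empty

def static_batch_loop():
    while True:
        batch = []
        deadline = time.monotonic() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH and (remaining := deadline - time.monotonic()) > 0:
            try:
                batch.append(queue.get(timeout=remaining))  # wait at most until the deadline
            except Empty:
                break                                       # deadline hit with no new request
        if batch:
            run_full_inference(batch)   # generate until ALL requests in batch finish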

The scheduler collects up to MAX_BATCH requests (or until MAX_WAIT_MS elapses), then runs the entire generation loop on that batch.

The problem: requests in the same batch have different generation lengths. If request A wants 3 tokens and request B wants 30, the batch runs for 30 steps. Request A finishes its 3 tokens but has to wait while B finishes its 30 — and the batch's response is sent at the end.

This is the tail-latency disaster of static batching. Fast requests pay for slow ones.

Illustration:

Request A (3 tokens): [t0][t1][t2][wait][wait][wait]...[wait]   end-time: 30 steps
Request B (30 tokens):[t0][t1][t2][t3 ][t4 ][t5 ]...[t29 ]      end-time: 30 steps

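Every request in a static batch waits for the longest member, so its latency is the batch's maximum generation length × per-step time. p95 latency is therefore set by the longest requests, not by what a typical request needs. Not good.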
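The "fast requests pay for slow ones" effect is easy to compute. The sketch below assumes a fixed per-step time (`step_ms` is a made-up value) and measures each request from the moment its batch starts, so it ignores time spent waiting in the queue; `static_latencies_ms` is a hypothetical helper, not part of the lab code.

```python
def static_latencies_ms(gen_lengths, max_batch, step_ms=10.0):
    """Latency of each request under static batching, measured from batch start.

    Requests are grouped in arrival order into batches of `max_batch`;
    every member of a batch is released only when the longest one finishes.
    """
    latencies = []
    for start in range(0, len(gen_lengths), max_batch):
        batch = gen_lengths[start:start + max_batch]
        batch_steps = max(batch)  # the slowest request sets the pace
        latencies.extend([batch_steps * step_ms] * len(batch))
    return latencies


# Request A (3 tokens) shares a batch with request B (30 tokens).
print(static_latencies_ms([3, 30, 4, 5], max_batch=2))
# [300.0, 300.0, 50.0, 50.0]
```

The 3-token request is charged the full 300 ms of its 30-token neighbour, while the two similar-length requests in the second batch lose almost nothing.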

Continuous batching

The change: schedule at the token-step level, not the request level.

def continuous_batch_loop():
    in_flight: list[Request] = []
    while True:
        # Admit new requests up to capacity
        while len(in_flight) < MAX_INFLIGHT and not queue.empty():
            in_flight.append(queue.get_nowait())

        if not in_flight:
            time.sleep(IDLE_MS / 1000)
            continue

        # Run ONE decode step for all in-flight requests
        step_outputs = model.forward_one_step(in_flight)   # batched

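        for req, out_token in zip(in_flight, step_outputs):
            req.append(out_token)
            if out_token == EOS or req.length >= req.max_tokens:
                req.send_response()   # reply immediately; do not wait for the rest of the batch
                req.done = True       # mark finished so the filter below drops it
        in_flight = [r for r in in_flight if not r.done]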

Now:

Request A (3 tokens): [t0][t1][t2][DONE]                         end-time: 3 steps
Request B (30 tokens):[t0][t1][t2][t3 ][t4 ][t5 ]...[t29 ][DONE] end-time: 30 steps

Request A's response is sent at step 3, not step 30. Tail latency drops dramatically for short requests.

This is the technique used by vLLM, TGI, and TensorRT-LLM (served through Triton, where it is called in-flight batching), among other modern inference servers. The name "continuous batching" comes from the fact that the batch continuously changes: requests can join and leave on every step.
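The A/B timeline can be reproduced without a real model. The following is a minimal simulation sketch: `Request`, `fake_forward_one_step`, and `run` are hypothetical stand-ins (a request simply "wants" a fixed number of tokens, and the fake step returns dummy token ids), so only the scheduling behaviour is being illustrated.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Request:
    name: str
    target_len: int                       # tokens this fake request will generate
    tokens: list = field(default_factory=list)
    finished_at_step: Optional[int] = None


def fake_forward_one_step(in_flight):
    """Stand-in for a batched decode step: one dummy token id per request."""
    return [len(req.tokens) for req in in_flight]


def run(requests, max_inflight=8):
    """Token-step scheduling loop; returns the step at which each request finished."""
    pending = deque(requests)
    in_flight = []
    step = 0
    while pending or in_flight:
        # Admit waiting requests between steps, up to capacity.
        while pending and len(in_flight) < max_inflight:
            in_flight.append(pending.popleft())
        step += 1
        for req, token in zip(in_flight, fake_forward_one_step(in_flight)):
            req.tokens.append(token)
            if len(req.tokens) >= req.target_len:
                req.finished_at_step = step   # response would be sent here
        in_flight = [r for r in in_flight if r.finished_at_step is None]
    return {r.name: r.finished_at_step for r in requests}


print(run([Request("A", 3), Request("B", 30)]))
# {'A': 3, 'B': 30}
```

With `max_inflight=1` the same two requests finish at steps 3 and 33, because B cannot be admitted until A leaves.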

What makes continuous batching tricky?

1. Per-request KV cache. Each request has its own KV cache. The forward pass must read from \(B\) different caches and write to \(B\) different caches. This is what PagedAttention (Phase 27) addresses at scale: managing many variable-length KV caches without massive memory waste.
2. Variable sequence lengths in a batch. When requests A (current length 5) and B (current length 12) are in the same batch step, the attention has to handle different sequence lengths. Padding or masking is involved. (We sidestep this in lab 03 by only having the new-token query attend to the cached K/V, which is naturally per-request.)
3. Admission control. When to admit a new request? If your in-flight set is at capacity, the new request waits. If you over-admit, latency tanks for everyone.
4. Prefill vs decode. A request's first forward pass (the prefill on the prompt) is much more expensive than each subsequent decode step, so mixing prefill into a decode batch can stall the decodes already in flight. Production systems often schedule prefill and decode separately; we'll mention this in lab 04's survey.
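The padding-and-masking idea for variable sequence lengths can be sketched in plain Python. This is an illustration only, with dummy scores and hypothetical helper names (`key_mask`, `masked_softmax`); it is not how lab 03 handles the problem, since the lab sidesteps it with per-request caches.

```python
import math


def key_mask(cache_lengths):
    """One row per request: True for real cached positions, False for padding."""
    max_len = max(cache_lengths)
    return [[pos < n for pos in range(max_len)] for n in cache_lengths]


def masked_softmax(scores, mask):
    """Softmax over one request's scores, giving padded positions zero weight."""
    peak = max(s for s, keep in zip(scores, mask) if keep)  # for numerical stability
    exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


mask = key_mask([5, 12])     # request A has 5 cached tokens, request B has 12
scores_a = [0.0] * 12        # dummy attention scores for A, padded to length 12
weights_a = masked_softmax(scores_a, mask[0])
print(weights_a)
```

Request A's attention weights are spread over its 5 real positions only; the 7 padded positions contribute nothing, so sharing a batch with a longer request does not change A's output.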

For Phase 33 we'll handle items 1 and 2 trivially (small Mini-GPT, per-request cache as a dict), and item 3 minimally (FIFO admission up to a cap).
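One way the "per-request cache as a dict" and "FIFO admission up to a cap" choices might fit together is sketched below. `FifoScheduler` and its method names are hypothetical, and the cached entries are placeholder tuples rather than real tensors; the lab's actual code may be organised differently.

```python
from collections import deque


class FifoScheduler:
    """Sketch: FIFO admission up to a cap, with one KV cache per in-flight request."""

    def __init__(self, max_inflight):
        self.max_inflight = max_inflight
        self.waiting = deque()   # request ids in arrival order
        self.kv_cache = {}       # request id -> list of (key, value) entries, one per token

    def submit(self, req_id):
        self.waiting.append(req_id)

    def admit(self):
        """Move waiting requests into flight until the cap is reached."""
        admitted = []
        while self.waiting and len(self.kv_cache) < self.max_inflight:
            req_id = self.waiting.popleft()
            self.kv_cache[req_id] = []   # each request starts with its own empty cache
            admitted.append(req_id)
        return admitted

    def append_kv(self, req_id, key, value):
        """Record the K/V produced for this request's newest token."""
        self.kv_cache[req_id].append((key, value))

    def retire(self, req_id):
        """Drop a finished request and free its cache, opening a slot."""
        del self.kv_cache[req_id]


sched = FifoScheduler(max_inflight=2)
for name in ("a", "b", "c"):
    sched.submit(name)
print(sched.admit())   # ['a', 'b']
sched.retire("a")
print(sched.admit())   # ['c']
```

Because each cache is a separate list, requests of different lengths never need to be padded against each other; the cost is that memory is managed per request, which is the waste PagedAttention targets at scale.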

The expected gain

For a load with mixed generation lengths (some short corrections, some longer explanations), continuous batching's p95 latency should drop by ~30-70% vs static batching at the same average throughput. That's the figure the lab will measure.

For a load where all requests have the same length, continuous batching and static batching are equivalent — no fast requests waiting on slow ones.
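Both claims can be checked on a toy workload. The sketch below makes several simplifying assumptions: all requests arrive at step 0, every step costs the same regardless of batch size, prefill is ignored, and latency is counted in steps. The helper names and the nearest-rank p95 definition are choices made for this illustration, and the resulting numbers say nothing about the figure the lab will measure.

```python
import heapq
import math


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


def static_finish_steps(lengths, cap):
    """Batches of `cap` run back to back; each ends when its longest member ends."""
    finish, clock = [], 0
    for i in range(0, len(lengths), cap):
        batch = lengths[i:i + cap]
        clock += max(batch)
        finish.extend([clock] * len(batch))
    return finish


def continuous_finish_steps(lengths, cap):
    """Each request takes the earliest-free slot and leaves as soon as it is done."""
    free_at = [0] * cap          # step at which each slot becomes free
    heapq.heapify(free_at)
    finish = []
    for n in lengths:
        start = heapq.heappop(free_at)
        finish.append(start + n)
        heapq.heappush(free_at, start + n)
    return finish


mixed = [30] + [3] * 9           # one long request ahead of nine short ones
print(p95(static_finish_steps(mixed, cap=2)))       # 42
print(p95(continuous_finish_steps(mixed, cap=2)))   # 30

same = [5] * 4                   # identical lengths: the two schedules coincide
print(static_finish_steps(same, cap=2) == continuous_finish_steps(same, cap=2))  # True
```

In the mixed case the short requests stream through the second slot while the long one occupies the first, so none of them waits behind a full 30-step batch.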

A counterintuitive observation

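Continuous batching doesn't necessarily increase throughput. The useful work (the FLOPs for tokens that are actually returned) is the same, and when the server is not saturated, requests/s simply tracks the arrival rate under either scheme. What changes is the distribution of latencies: the long tail shrinks.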

If you measure only requests/s (throughput), you might see no improvement. If you measure p95 latency, you'll see a clear win. This is why the DoD specifies p95.

What this file does NOT cover

PagedAttention. Phase 27.
Speculative decoding inside the scheduler. Phase 36.
Disaggregated prefill/decode. Out of scope (advanced production tech).

Next: 03-littles-law-and-capacity.md