04 — Distributed inference: TP for serving and disaggregated prefill/decode¶
Inference is not training. There is no backward pass, no gradients, no optimizer. What there is: a KV cache that grows with every token, concurrent requests with different shapes, and the well-known imbalance between prefill (compute-bound) and decode (memory-bound). Parallelizing for serving follows different rules.
Distributed training and distributed inference have different bottlenecks. This page maps the differences and covers two patterns: tensor-parallel inference (the one we touch in lab 02) and disaggregated prefill/decode (the current frontier; conceptual coverage only).
Why inference is different from training¶
In training:
- Forward + backward + optimizer per step.
- Every step has the same shape (after padding to max).
- You can batch many sequences together → high arithmetic intensity → compute-bound.
- Comm is once per step (DDP all-reduce) or several times (ZeRO/TP), but predictable.
In inference:
- Forward only. No gradients, no optimizer state, no Adam momentum.
- Requests arrive with different prompt lengths and different output lengths → hard to batch perfectly.
- Prefill (process the prompt) is compute-bound (large matrix-matrix mults).
- Decode (one token at a time after prefill) is memory-bound — load a row of weights, do one matrix-vector mult, repeat.
- The two stages have opposite hardware profiles. A single GPU running both is over- and under-utilized at different moments.
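The prefill/decode contrast can be made concrete with a rough arithmetic-intensity estimate. The sketch below counts FLOPs per byte moved for a single weight matmul; the function name, the 2048-token prompt, the 4096-wide layer, and fp16 storage are illustrative assumptions, and the byte count ignores caching and kernel fusion.

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int, bytes_per_el: int = 2) -> float:
    """FLOPs per byte moved for one [tokens, d_in] x [d_in, d_out] matmul."""
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per MAC
    bytes_moved = bytes_per_el * (
        tokens * d_in      # read activations
        + d_in * d_out     # read weights
        + tokens * d_out   # write outputs
    )
    return flops / bytes_moved


d = 4096  # assumed layer width
prefill = arithmetic_intensity(tokens=2048, d_in=d, d_out=d)  # whole prompt at once
decode = arithmetic_intensity(tokens=1, d_in=d, d_out=d)      # one new token

print(f"prefill: {prefill:.0f} FLOPs/byte")
print(f"decode:  {decode:.2f} FLOPs/byte")
```

Under these assumptions prefill does on the order of a thousand FLOPs per byte while decode does about one, which is why the same weights keep the compute units busy in one stage and the memory bus busy in the other.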
Consequences:
- Inference clusters need different optimization targets. Per-request latency (TTFT, time to first token; ITL, inter-token latency) matters more than per-GPU throughput.
- Sharding strategies that work for training may not work for inference, and vice versa.
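- The same model size that needs 16-GPU TP for training might fit on 1 GPU for inference: with no gradients and no optimizer state, parameter-related memory drops about 4× (fp32 weights only, versus fp32 weights, gradients, and two Adam moments).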
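The 4× figure follows from simple per-parameter bookkeeping. This is a sketch that assumes fp32 weights, fp32 gradients, and Adam with two fp32 moment buffers; the 7B parameter count is hypothetical, and activations and KV cache are deliberately left out.

```python
def bytes_per_param_training(weight_bytes: int = 4, grad_bytes: int = 4, adam_bytes: int = 8) -> int:
    """Weights + gradients + Adam first and second moments (assumed all fp32)."""
    return weight_bytes + grad_bytes + adam_bytes


def bytes_per_param_inference(weight_bytes: int = 4) -> int:
    """Inference keeps only the weights."""
    return weight_bytes


n_params = 7_000_000_000  # hypothetical model size
train_gb = n_params * bytes_per_param_training() / 1e9
infer_gb = n_params * bytes_per_param_inference() / 1e9
ratio = train_gb / infer_gb

print(f"training:  {train_gb:.0f} GB")
print(f"inference: {infer_gb:.0f} GB")
print(f"reduction: {ratio:.0f}x")
```

With mixed-precision training or lower-precision inference weights the ratio changes, so treat 4× as the fp32 baseline and not a universal constant.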
Tensor-parallel inference¶
The TP scheme from theory/02-parallelism-flavors.md works for inference too: shard each layer's weights, run forward with intra-layer all-reduces.
What changes from training¶
- No backward pass → no need for the second batch of all-reduces.
- KV cache is sharded along the same axis as the Q/K/V projections: each TP worker keeps n_heads / N heads' worth of KV.
- All-reduce happens on the residual stream after the attention output projection and after the MLP down-projection: same pattern as training, half the volume.
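The all-reduce after the MLP down-projection can be simulated without any GPU. The sketch below is plain Python on tiny hand-picked matrices, not the lab 02 code: it assumes the usual column-split up-projection and row-split down-projection, and `all_reduce_sum` is a stand-in for the real collective.

```python
def matmul(a, b):
    """Plain-Python matrix multiply for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def split_cols(m, n):
    """Split a matrix into n equal column blocks (one per worker)."""
    width = len(m[0]) // n
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(n)]


def split_rows(m, n):
    """Split a matrix into n equal row blocks (one per worker)."""
    height = len(m) // n
    return [m[i * height:(i + 1) * height] for i in range(n)]


def all_reduce_sum(partials):
    """Stand-in for the collective: element-wise sum of every worker's partial output."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1.0, -2.0], [0.5, 3.0]]                                  # [tokens, d_model]
w_up = [[1.0, 0.0, -1.0, 2.0], [0.5, 1.0, 1.0, -1.0]]          # [d_model, d_ff]
w_down = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [1.0, 1.0]]     # [d_ff, d_model]

# Unsharded reference forward pass.
reference = matmul(relu(matmul(x, w_up)), w_down)

# Two simulated TP workers: each holds a column block of w_up and the
# matching row block of w_down, and produces a partial residual update.
n_workers = 2
up_shards = split_cols(w_up, n_workers)
down_shards = split_rows(w_down, n_workers)
partials = [matmul(relu(matmul(x, up)), down) for up, down in zip(up_shards, down_shards)]
tp_output = all_reduce_sum(partials)

print(tp_output == reference)
```

The element-wise activation is what lets each worker run up-projection, ReLU, and down-projection locally; only the final partial sums need to cross the interconnect.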
When is TP inference worth it?¶
| Situation | Use TP inference? |
|---|---|
| Model fits on one GPU, single user | No. Single-GPU is simpler and faster (no comm). |
| Model fits on one GPU, many users | Maybe. TP cuts per-request latency at the cost of higher per-token comm. |
| Model is too big for one GPU | Yes. TP is the standard option. |
| Tight TTFT requirement | TP can help — splits prefill compute across workers. |
| Tight tokens/sec/$ requirement | Single-GPU + continuous batching usually wins (no comm). |
For the grammar tutor: the model fits on a calculator, so production-wise TP inference is wasteful. Lab 02 runs it anyway because the only way to see the all-reduce pattern is to write the all-reduce. Expected result: 2-GPU TP is slower than 1-GPU for this tiny model. That's the lesson. For a model this small, the all-reduce cost outweighs the compute saved by sharding; you'd never deploy this way.
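A toy cost model shows why the tiny model loses and a large one wins. Every number here is a made-up assumption for illustration (compute time, layer count, a fixed 20 µs per all-reduce, perfect compute scaling); it is not a measurement from lab 02.

```python
def tp_step_time_us(compute_us: float, n_workers: int, n_layers: int, allreduce_us: float) -> float:
    """Forward time under TP: compute splits evenly, comm adds a fixed cost.

    Assumes two all-reduces per transformer layer (after the attention output
    projection and after the MLP down-projection) and no overlap of
    communication with compute.
    """
    comm_us = 0.0 if n_workers == 1 else 2 * n_layers * allreduce_us
    return compute_us / n_workers + comm_us


# Hypothetical tiny model: little compute, so the all-reduces dominate.
tiny_1gpu = tp_step_time_us(compute_us=200, n_workers=1, n_layers=4, allreduce_us=20)
tiny_2gpu = tp_step_time_us(compute_us=200, n_workers=2, n_layers=4, allreduce_us=20)

# Hypothetical large model: compute dominates, so splitting it pays off.
large_1gpu = tp_step_time_us(compute_us=40_000, n_workers=1, n_layers=32, allreduce_us=20)
large_2gpu = tp_step_time_us(compute_us=40_000, n_workers=2, n_layers=32, allreduce_us=20)

print(f"tiny:  1 GPU {tiny_1gpu:.0f} us, 2 GPU {tiny_2gpu:.0f} us")
print(f"large: 1 GPU {large_1gpu:.0f} us, 2 GPU {large_2gpu:.0f} us")
```

The crossover point depends entirely on the real interconnect latency and kernel times, which is what lab 02 measures.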
The forward-looking scenario: a hypothetical 600k-form grammar tutor (5 languages × 20 verbs × 5 tenses × 3 persons × paraphrases) with \(d_{\text{model}} = 4096\) has an embedding table of about 2.5 billion parameters, which on a memory-constrained GPU could need TP-sharding just to fit. That's the point at which lab 02's pattern becomes economically right.
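The size of that hypothetical embedding table is easy to check. The sketch assumes one embedding row per form; whether the result actually forces sharding depends on the GPU's memory and on what else (remaining weights, KV cache) has to fit beside it.

```python
languages, verbs, tenses, persons = 5, 20, 5, 3
base_forms = languages * verbs * tenses * persons   # forms before paraphrasing
target_forms = 600_000
paraphrases_per_form = target_forms // base_forms   # paraphrases implied by the 600k figure

d_model = 4096
embedding_params = target_forms * d_model           # assumes one row per form

print(f"base forms:           {base_forms}")
print(f"paraphrases per form: {paraphrases_per_form}")
print(f"embedding parameters: {embedding_params:,}")
for name, nbytes in [("fp16", 2), ("fp32", 4)]:
    print(f"  {name}: {embedding_params * nbytes / 1e9:.1f} GB")
```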
Disaggregated prefill/decode¶
A frontier serving pattern (Splitwise, DistServe, Mooncake): separate workers handle prefill vs decode, instead of one worker doing both.
Why disaggregate¶
Prefill and decode have different optimal hardware:
- Prefill: compute-bound. Wants high FLOPs/$. A100 or H100 with high SM count is great. Latency sensitive (this is the TTFT).
- Decode: memory-bound. Wants high memory bandwidth/$. H100 has both compute and bandwidth; an accelerator with less compute but good memory bandwidth, even a Gaudi, can be fine. Latency sensitive too (this is the ITL).
In a co-located cluster, where every worker runs both stages, prefill capacity occasionally idles (no new prompts) while decode capacity thrashes (many concurrent users mid-generation). The arithmetic for a co-located cluster says you provision every worker for the peak of either stage, which is wasteful.
Disaggregating means: provision a small prefill pool sized for prompt-arrival rate, and a larger decode pool sized for concurrent generation. KV cache transfers from prefill worker to decode worker once per request (RDMA or NVLink). The transfer is a one-time cost amortized over the entire decode.
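The one-time hand-off cost can be estimated from the KV cache size. The formula (K and V tensors per layer, per KV head, per token) is standard; the model dimensions, fp16 storage, and the 50 GB/s effective link bandwidth below are hypothetical placeholders, not figures from any of the cited systems.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_el: int = 2) -> int:
    """KV cache size for one request; the factor 2 is one K plus one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_el


def transfer_ms(n_bytes: int, link_gb_per_s: float) -> float:
    """Time to move n_bytes over a link with the given effective bandwidth."""
    return n_bytes / (link_gb_per_s * 1e9) * 1e3


# Hypothetical model shape and a 2048-token prompt.
kv_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=2048)
handoff = transfer_ms(kv_bytes, link_gb_per_s=50.0)  # assumed effective bandwidth

print(f"KV cache: {kv_bytes / 2**30:.2f} GiB")
print(f"hand-off: {handoff:.1f} ms")
```

A hand-off in the tens of milliseconds is small next to a decode that streams hundreds of tokens, which is the amortization argument made above.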
Pattern sketch¶
Client → Router → Prefill pool (small, compute-tuned)
                       │  (KV cache hand-off, RDMA)
                       ▼
                  Decode pool (larger, bandwidth-tuned)
                       │  (token stream)
                       ▼
                  Client (streaming)
The router decides when to forward; the KV cache transfer is the new latency component to measure (added to TTFT).
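The request lifecycle in the sketch can be modeled in a few lines. This is a toy illustration only: the `Worker`, `Request`, and `route` names are hypothetical, the least-loaded placement policy is an assumption, and nothing here performs a real KV transfer.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    active: int = 0  # requests currently assigned to this worker


@dataclass
class Request:
    req_id: int
    prompt_tokens: int
    trace: list = field(default_factory=list)


def least_loaded(pool):
    """Pick the worker with the fewest active requests (first one wins ties)."""
    return min(pool, key=lambda w: w.active)


def route(request, prefill_pool, decode_pool):
    """Walk one request through prefill, KV hand-off, and decode placement."""
    p = least_loaded(prefill_pool)
    p.active += 1
    request.trace.append(f"prefill on {p.name}")
    p.active -= 1  # prefill is a single pass; the worker frees up right away

    d = least_loaded(decode_pool)
    d.active += 1  # the decode worker holds the request until generation ends
    request.trace.append(f"KV hand-off {p.name} -> {d.name}")
    request.trace.append(f"decode on {d.name}")
    return d


prefill_pool = [Worker("p0")]
decode_pool = [Worker("d0"), Worker("d1")]
requests = [Request(i, prompt_tokens=128) for i in range(3)]
for r in requests:
    route(r, prefill_pool, decode_pool)

print(requests[0].trace)
print([(w.name, w.active) for w in decode_pool])
```

Even in this toy form the asymmetry is visible: the prefill worker is released after one pass, while decode workers accumulate long-lived requests.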
Why not in this curriculum¶
Implementing disaggregated prefill/decode requires:
- Two separately scaled GPU pools (≥ 4 GPUs).
- RDMA-capable interconnect for KV transfer.
- A request router that knows the request lifecycle.
Total system cost: $5+/hr. Phase 35 budget: $5 total. The math says read papers, don't implement. Lab 03 includes a brief annotated reading of the DistServe paper.
Speculative decoding¶
A different optimization, mentioned here because it's often deployed alongside TP/disaggregated inference.
A small draft model proposes the next \(k\) tokens cheaply; the target (real) model verifies all \(k\) in a single forward pass and accepts the longest correct prefix. Net: instead of one token per forward, you typically get 2–5 tokens per forward on average, depending on how often the draft's guesses are accepted.
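The accept-the-longest-prefix step can be sketched for the greedy case. This is a simplified illustration, assuming both models decode greedily so that "correct" means an exact token match; sampling-based variants use a probabilistic acceptance rule instead. The second function assumes each draft token is accepted independently with a fixed probability, which is a modeling convenience.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Accept draft tokens until the first mismatch with the target model.

    target_tokens has len(draft_tokens) + 1 entries: the target's own greedy
    choice at each drafted position, plus one position past the draft.
    """
    accepted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            break
        accepted.append(drafted)
    # The target's token at the first mismatch (or just past a fully
    # accepted draft) comes from the same forward pass, so it is kept too.
    accepted.append(target_tokens[len(accepted)])
    return accepted


def expected_tokens_per_forward(accept_rate: float, k: int) -> float:
    """Expected tokens per target forward under independent acceptance."""
    if accept_rate == 1.0:
        return float(k + 1)
    return (1 - accept_rate ** (k + 1)) / (1 - accept_rate)


partial = verify_greedy(draft_tokens=[5, 7, 9], target_tokens=[5, 7, 2, 4])
full = verify_greedy(draft_tokens=[5, 7, 9], target_tokens=[5, 7, 9, 4])
estimate = expected_tokens_per_forward(accept_rate=0.8, k=4)

print(partial, full)
print(f"{estimate:.2f} tokens per forward")
```

Every target forward yields at least one token, so a poor draft model costs extra draft compute but never produces wrong output.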
For the grammar tutor: the draft model would need to be even tinier than the already-tiny tutor. Probably an n-gram model from Phase 14. Conceptually doable, practically not worth the implementation cost.
Phase 36 covers Medusa / EAGLE / Lookahead in more depth (those are speculative-decoding-architecture variants).
Distributed inference checklist¶
When designing a distributed inference setup:
- Does the model fit on one GPU? If yes, start single-GPU.
- Are you latency-sensitive (TTFT/ITL)? If yes, consider TP within a node.
- Are you throughput-sensitive (tokens/sec/$)? Single-GPU + continuous batching first; add TP only if memory forces.
- Is your KV cache > 50% of memory? Look at PagedAttention (vLLM) before adding more GPUs.
- Are prefill and decode loads imbalanced? Disaggregated prefill/decode — but only at sufficient scale.
For the grammar tutor: the answers are "yes, no, no, no, no", so a single GPU (or CPU) suffices. Lab 02 violates the checklist intentionally to teach.
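The checklist reads naturally as a small decision function. The sketch below is one possible encoding: the function name, argument names, and the KV-cache fraction used for the grammar tutor are hypothetical, and the fallback for a model that does not fit one GPU is taken from the table in the TP section.

```python
def recommend(fits_one_gpu: bool, latency_sensitive: bool, throughput_sensitive: bool,
              kv_cache_fraction: float, prefill_decode_imbalanced_at_scale: bool) -> list:
    """Return the checklist's suggestions, in order, for one deployment."""
    steps = []
    if fits_one_gpu:
        steps.append("start single-GPU")
    else:
        steps.append("use TP to fit the model")
    if latency_sensitive:
        steps.append("consider TP within a node")
    if throughput_sensitive:
        steps.append("single-GPU + continuous batching first; add TP only if memory forces")
    if kv_cache_fraction > 0.5:
        steps.append("look at PagedAttention (vLLM) before adding more GPUs")
    if prefill_decode_imbalanced_at_scale:
        steps.append("consider disaggregated prefill/decode")
    return steps


# Grammar tutor: fits on one GPU, no pressure on any other axis.
tutor_plan = recommend(
    fits_one_gpu=True,
    latency_sensitive=False,
    throughput_sensitive=False,
    kv_cache_fraction=0.1,  # assumed; any value at or below 0.5 gives the same plan
    prefill_decode_imbalanced_at_scale=False,
)
print(tutor_plan)
```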
What this phase does NOT cover¶
- Implementing disaggregated prefill/decode. Read-only.
- Speculative decoding implementation (Medusa, EAGLE, Lookahead). Phase 36's territory; vocabulary only here.
- Cross-region / cross-AZ inference routing. Production-ops territory; Phase 38.
- Inference autoscaling on tokens/sec. Phase 38.
- KV-cache offloading across workers (CXL-memory pools, etc.). Frontier; vocabulary only.
Next: lab/00-cloud-budget-and-tooling.md.