
04 - Distributed inference: TP for serving and disaggregated prefill/decode

Inference is not training. There is no backward pass, no gradients, no optimizer. What there is: a KV cache that grows with every token, concurrent requests with different shapes, and the well-known imbalance between prefill (compute-bound) and decode (memory-bound). Parallelizing for serving follows different rules.

Distributed training and distributed inference have different bottlenecks. This page maps the differences and covers two patterns: tensor-parallel inference (the one we touch in lab 02) and disaggregated prefill/decode (the current frontier; conceptual coverage only).

Why inference is different from training

In training:

Forward + backward + optimizer per step.
Every step has the same shape (after padding to max).
You can batch many sequences together → high arithmetic intensity → compute-bound.
Comm is once per step (DDP all-reduce) or several times (ZeRO/TP), but predictable.

In inference:

Forward only. No gradients, no optimizer state, no Adam momentum.
Requests arrive with different prompt lengths and different output lengths → hard to batch perfectly.
Prefill (process the prompt) is compute-bound (large matrix-matrix mults).
Decode (one token at a time after prefill) is memory-bound: every step streams the full weight matrices from memory to do what is, per sequence, a single matrix-vector mult, then repeats.
The two stages have opposite hardware profiles. A single GPU running both is over- and under-utilized at different moments.
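The compute-bound vs memory-bound split can be made concrete with arithmetic intensity (FLOPs per byte moved) for one linear layer. This is a minimal sketch: the function name, the layer size, and the `RIDGE_FLOPS_PER_BYTE` threshold are illustrative assumptions, not measurements of any particular GPU.

```python
def arithmetic_intensity(n_tokens, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte for y = x @ W, with x of shape (n_tokens, d_in)."""
    flops = 2 * n_tokens * d_in * d_out  # one multiply and one add per weight per token
    # Bytes moved: the weight matrix, the input activations, the output activations.
    bytes_moved = (d_in * d_out + n_tokens * d_in + n_tokens * d_out) * bytes_per_value
    return flops / bytes_moved

# Hypothetical hardware ridge point: below it, memory bandwidth is the limit.
RIDGE_FLOPS_PER_BYTE = 200.0

decode_ai = arithmetic_intensity(n_tokens=1, d_in=4096, d_out=4096)
prefill_ai = arithmetic_intensity(n_tokens=2048, d_in=4096, d_out=4096)

for name, ai in [("decode", decode_ai), ("prefill", prefill_ai)]:
    bound = "compute-bound" if ai > RIDGE_FLOPS_PER_BYTE else "memory-bound"
    print(f"{name}: {ai:.1f} FLOPs/byte -> {bound}")
```

With a single token, the weight matrix dominates the bytes moved and intensity stays near 1 FLOP/byte; with a 2048-token prompt the same weights are reused across all tokens, which is why prefill can keep the compute units busy.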

Consequences:

Inference clusters need different optimization targets. Per-request latency (time to first token, TTFT, and inter-token latency, ITL) usually matters more than per-GPU throughput.
Sharding strategies that work for training may not work for inference, and vice versa.
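The same model size that needs 16-GPU TP for training might fit on 1 GPU for inference. With fp32 Adam, training holds weights, gradients, and two optimizer moment buffers (4× the weights); inference holds the weights alone, and the activations saved for the backward pass disappear as well.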
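The 4× figure is back-of-the-envelope arithmetic. A minimal sketch, assuming every tensor uses the same precision and the optimizer is Adam; the function names and the 7B parameter count are illustrative, and activation memory and KV cache are deliberately ignored.

```python
def training_bytes(n_params, bytes_per_value=4):
    """Weights + gradients + Adam first and second moments, all at one precision."""
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value
    adam_moments = 2 * n_params * bytes_per_value
    return weights + gradients + adam_moments

def inference_bytes(n_params, bytes_per_value=4):
    """Weights only: no gradients, no optimizer state."""
    return n_params * bytes_per_value

n_params = 7_000_000_000  # hypothetical model size
ratio = training_bytes(n_params) / inference_bytes(n_params)

print(f"training:  {training_bytes(n_params) / 1e9:.0f} GB")
print(f"inference: {inference_bytes(n_params) / 1e9:.0f} GB")
print(f"ratio:     {ratio:.1f}x")
```

Mixed-precision training or quantized inference weights change the ratio, so treat 4× as the same-precision baseline rather than a universal constant.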

Tensor-parallel inference

The TP scheme from theory/02-parallelism-flavors.md works for inference too: shard each layer's weights, run forward with intra-layer all-reduces.

What changes from training

No backward pass → no need for the second batch of all-reduces.
KV cache is sharded along the same axis as Q/K/V projections — each TP worker keeps its n_heads / N heads' worth of KV.
All-reduce happens on the residual stream after the attention output projection and after the MLP down-projection. This is the same pattern as the training forward pass; total all-reduce volume is half of training's because there is no backward pass.
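The MLP half of that pattern can be simulated on one machine. The sketch below uses plain Python lists so it runs without dependencies: the up-projection is split by columns, the down-projection by rows, and a sum over the per-worker partial outputs stands in for the all-reduce. All names are illustrative, ReLU is an assumed activation, and no real communication library is involved.

```python
import math
import random

def matmul(a, b):
    """Plain-list matrix product: a is (m, k), b is (k, n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(v, 0.0) for v in row] for row in m]

def split_columns(m, n_workers):
    width = len(m[0]) // n_workers
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(n_workers)]

def split_rows(m, n_workers):
    height = len(m) // n_workers
    return [m[i * height:(i + 1) * height] for i in range(n_workers)]

def all_reduce_sum(partials):
    """Stand-in for the collective: elementwise sum of each worker's partial output."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def mlp_reference(x, w_up, w_down):
    return matmul(relu(matmul(x, w_up)), w_down)

def mlp_tensor_parallel(x, w_up, w_down, n_workers):
    up_shards = split_columns(w_up, n_workers)   # column-parallel up-projection
    down_shards = split_rows(w_down, n_workers)  # row-parallel down-projection
    partials = [matmul(relu(matmul(x, up)), down)
                for up, down in zip(up_shards, down_shards)]
    return all_reduce_sum(partials)              # one all-reduce per MLP block

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_model, d_ff, n_workers = 8, 32, 2
x = rand_matrix(3, d_model)
w_up = rand_matrix(d_model, d_ff)
w_down = rand_matrix(d_ff, d_model)

out_ref = mlp_reference(x, w_up, w_down)
out_tp = mlp_tensor_parallel(x, w_up, w_down, n_workers)
outputs_match = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
    for row_ref, row_tp in zip(out_ref, out_tp)
    for a, b in zip(row_ref, row_tp)
)
print(outputs_match)
```

The column split works because ReLU is elementwise, so each worker can apply it to its own slice without talking to the others. Only the final sum needs communication.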

When is TP inference worth it?

Situation	Use TP inference?
Model fits on one GPU, single user	No. Single-GPU is simpler and faster (no comm).
Model fits on one GPU, many users	Maybe. TP cuts per-request latency at the cost of higher per-token comm.
Model is too big for one GPU	Yes. TP is the standard option.
Tight TTFT requirement	TP can help — splits prefill compute across workers.
Tight tokens/sec/$ requirement	Single-GPU + continuous batching usually wins (no comm).

For the grammar tutor: the model fits on a calculator, so production-wise TP inference is wasteful. Lab 02 runs it anyway because the only way to see the all-reduce pattern is to write the all-reduce. Expected result: 2-GPU TP is slower than 1-GPU for this tiny model. That's the lesson: communication time exceeds the compute time that sharding saves, so you'd never deploy this way.
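A toy cost model shows where the crossover comes from. This is a sketch under strong assumptions: compute divides perfectly across workers, each layer pays two all-reduces of fixed latency, and nothing overlaps. The function name and every number are hypothetical, chosen only to illustrate the shape of the trade-off.

```python
def tp_forward_seconds(compute_s, n_workers, n_layers, allreduce_s):
    """Forward time under TP: compute / N plus two all-reduces per layer."""
    if n_workers == 1:
        return compute_s  # no communication on a single GPU
    return compute_s / n_workers + 2 * n_layers * allreduce_s

# Tiny model: little compute to split, so the fixed comm cost dominates.
tiny_single = tp_forward_seconds(0.0002, n_workers=1, n_layers=4, allreduce_s=0.00003)
tiny_tp = tp_forward_seconds(0.0002, n_workers=2, n_layers=4, allreduce_s=0.00003)

# Large model: enough compute per layer that halving it pays for the comm.
large_single = tp_forward_seconds(0.040, n_workers=1, n_layers=32, allreduce_s=0.00005)
large_tp = tp_forward_seconds(0.040, n_workers=2, n_layers=32, allreduce_s=0.00005)

print("tiny model, TP faster? ", tiny_tp < tiny_single)
print("large model, TP faster?", large_tp < large_single)
```

Measured timings on real hardware are the only trustworthy version of this comparison; the model just explains why the tiny model loses.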

The forward-looking scenario: a hypothetical 600k-form grammar tutor (5 languages × 20 verbs × 5 tenses × 3 persons × paraphrases) with $d_{\text{model}} = 4096$ has an embedding table of 600k × 4096 ≈ 2.5B parameters, roughly 9.8 GB in fp32. On a small GPU that table would need TP-sharding just to fit. That's the point at which lab 02's pattern becomes economically justified.

Disaggregated prefill/decode

A frontier serving pattern (Splitwise, DistServe, Mooncake): separate workers handle prefill vs decode, instead of one worker doing both.

Why disaggregate

Prefill and decode have different optimal hardware:

Prefill: compute-bound. Wants high FLOPs/$. A100 or H100 with high SM count is great. Latency sensitive (this is the TTFT).
Decode: memory-bound. Wants high memory bandwidth/$. H100 (Hopper) has both compute and bandwidth; other accelerators with strong memory bandwidth, such as a Gaudi, can be fine. Latency sensitive too (this is the ITL).

In a single shared cluster, prefill capacity occasionally idles (no new prompts) while decode capacity thrashes (many concurrent users mid-generation). Because every worker must handle both phases, a co-located cluster has to be provisioned for the peak of either, which is wasteful.

Disaggregating means: provision a small prefill pool sized for prompt-arrival rate, and a larger decode pool sized for concurrent generation. KV cache transfers from prefill worker to decode worker once per request (RDMA or NVLink). The transfer is a one-time cost amortized over the entire decode.
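The two pool sizes follow from Little's law (average work in flight = arrival rate × time in system). A minimal sizing sketch: the function names, the 70% utilization target, and all traffic numbers are hypothetical, and it assumes each prefill worker handles one prompt at a time while each decode worker batches a fixed number of sequences.

```python
import math

def prefill_workers(arrival_rate_rps, prefill_seconds, target_utilization=0.7):
    """Workers needed so the average prefill load stays under the utilization target."""
    busy_workers = arrival_rate_rps * prefill_seconds  # Little's law
    return math.ceil(busy_workers / target_utilization)

def decode_workers(arrival_rate_rps, output_tokens, itl_seconds, slots_per_worker):
    """Workers needed to hold every in-flight generation in a batch slot."""
    decode_duration = output_tokens * itl_seconds
    concurrent_requests = arrival_rate_rps * decode_duration  # Little's law
    return math.ceil(concurrent_requests / slots_per_worker)

n_prefill = prefill_workers(arrival_rate_rps=20, prefill_seconds=0.1)
n_decode = decode_workers(arrival_rate_rps=20, output_tokens=300,
                          itl_seconds=0.03, slots_per_worker=16)

print(n_prefill)  # 3
print(n_decode)   # 12
```

With these made-up numbers the decode pool is four times the prefill pool, because each request spends about 9 seconds decoding but only 0.1 seconds in prefill.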

Pattern sketch

Client → Router → Prefill pool (small, compute-tuned)
                       │ (KV cache hand-off, RDMA)
                       ▼
                  Decode pool (larger, bandwidth-tuned)
                       │ (token stream)
                       ▼
                  Client (streaming)

The router decides when to forward; the KV cache transfer is the new latency component to measure (added to TTFT).
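The size of that hand-off can be estimated before measuring it. A minimal sketch, assuming an fp16 KV cache and a hypothetical model shape and link bandwidth; none of these numbers come from a specific model or interconnect.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys and values for every layer, KV head, and prompt position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def transfer_seconds(n_bytes, link_gbytes_per_s):
    """Ideal transfer time, ignoring setup latency and protocol overhead."""
    return n_bytes / (link_gbytes_per_s * 1e9)

# Hypothetical model shape and a 2048-token prompt.
cache_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=2048)
# Hypothetical 25 GB/s link between the prefill and decode workers.
handoff_s = transfer_seconds(cache_bytes, link_gbytes_per_s=25)
# Spread the one-time cost over a hypothetical 300-token generation.
per_token_s = handoff_s / 300

print(f"KV cache: {cache_bytes / 2**20:.0f} MiB")
print(f"hand-off: {handoff_s * 1e3:.1f} ms")
print(f"amortized per output token: {per_token_s * 1e3:.3f} ms")
```

The cache grows linearly with prompt length, so long prompts are where the hand-off is most likely to show up in latency measurements.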

Why not in this curriculum

Implementing disaggregated prefill/decode requires:

Two separately scaled GPU pools (≥ 4 GPUs).
RDMA-capable interconnect for KV transfer.
A request router that knows the request lifecycle.

Total system cost: $5+/hr. Phase 35 budget: $5 total. The math says read papers, don't implement. Lab 03 includes a brief annotated reading of the DistServe paper.

Speculative decoding

A different optimization, mentioned here because it's often deployed alongside TP/disaggregated inference.

A small draft model proposes the next $k$ tokens cheaply; the target (real) model verifies all $k$ in a single forward pass and accepts the longest prefix that matches what it would have generated itself. Net: instead of one token per target forward pass, you often get 2–5, depending on how often the draft agrees with the target.
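The accept-the-longest-prefix step can be sketched for the greedy case. This is an illustration, not a full implementation: both models are replaced by pre-computed token lists, `verify_draft` and `expected_tokens` are hypothetical names, and the expected-value formula assumes each draft token is accepted independently with the same probability `alpha`, which is a simplification.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification.

    target_tokens[i] is the target model's own choice at position i, given the
    prompt plus draft_tokens[:i]. It has one more entry than draft_tokens: the
    token the target would emit after the full draft.
    """
    accepted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            break  # later target outputs were conditioned on a wrong token
        accepted.append(drafted)
    # The target's token at the first mismatch (or after the whole draft) is
    # already computed by the same forward pass, so it is kept as well.
    accepted.append(target_tokens[len(accepted)])
    return accepted

def expected_tokens(alpha, k):
    """Expected tokens per target forward pass with i.i.d. acceptance rate alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

draft = ["the", "cat", "sat", "on"]
target = ["the", "cat", "lay", "on", "a"]
partial = verify_draft(draft, target)
full = verify_draft(draft, ["the", "cat", "sat", "on", "a"])

print(partial)  # ['the', 'cat', 'lay']
print(full)     # ['the', 'cat', 'sat', 'on', 'a']
print(round(expected_tokens(alpha=0.8, k=4), 2))
```

Even when the very first draft token is rejected, the pass still yields one token, so this greedy variant never produces fewer tokens per target forward pass than ordinary decoding. The extra cost is the draft model's own compute.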

For the grammar tutor: the draft model would need to be even tinier than the already-tiny tutor. Probably an n-gram model from Phase 14. Conceptually doable, practically not worth the implementation cost.

Phase 36 covers Medusa / EAGLE / Lookahead in more depth (those are speculative-decoding-architecture variants).

Distributed inference checklist

When designing a distributed inference setup:

Does the model fit on one GPU? If yes, start single-GPU.
Are you latency-sensitive (TTFT/ITL)? If yes, consider TP within a node.
Are you throughput-sensitive (tokens/sec/$)? Single-GPU + continuous batching first; add TP only if memory forces it.
Is your KV cache > 50% of memory? Look at PagedAttention (vLLM) before adding more GPUs.
Are prefill and decode loads imbalanced? Disaggregated prefill/decode — but only at sufficient scale.
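The checklist reads naturally as a small decision function. The sketch below is only a mnemonic: `recommend` and its parameter names are hypothetical, the 50% KV-cache threshold is the one from the checklist, and real sizing decisions need measurements.

```python
def recommend(fits_on_one_gpu, latency_sensitive, throughput_sensitive,
              kv_cache_fraction, prefill_decode_imbalanced, at_scale):
    """Walk the checklist in order and collect the suggestions that apply."""
    steps = []
    if fits_on_one_gpu:
        steps.append("start single-GPU")
    else:
        steps.append("use TP: memory forces it")
    if latency_sensitive:
        steps.append("consider TP within a node")
    if throughput_sensitive and fits_on_one_gpu:
        steps.append("continuous batching before TP")
    if kv_cache_fraction > 0.5:
        steps.append("look at PagedAttention before adding GPUs")
    if prefill_decode_imbalanced and at_scale:
        steps.append("consider disaggregated prefill/decode")
    return steps

plan = recommend(fits_on_one_gpu=True, latency_sensitive=False,
                 throughput_sensitive=True, kv_cache_fraction=0.6,
                 prefill_decode_imbalanced=False, at_scale=False)
for step in plan:
    print("-", step)
```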

For the grammar tutor: the answers are "yes, no, no, no, no", so a single GPU (or CPU) suffices. Lab 02 violates the checklist intentionally to teach.

What this phase does NOT cover

Implementing disaggregated prefill/decode. Read-only.
Speculative decoding implementation (Medusa, EAGLE, Lookahead). Phase 36's territory; vocabulary only here.
Cross-region / cross-AZ inference routing. Production-ops territory; Phase 38.
Inference autoscaling on tokens/sec. Phase 38.
KV-cache offloading across workers (CXL-memory pools, etc.). Frontier; vocabulary only.

Next: lab/00-cloud-budget-and-tooling.md.