Skip to content

English · Español

02 — Tensor parallel, pipeline parallel, sequence parallel, expert parallel

🇪🇸 La familia "el modelo está partido entre las GPUs". Tensor parallel parte dentro de una capa (la matriz de pesos se reparte por filas o por columnas). Pipeline parallel parte entre capas (capa 1 en GPU0, capa 2 en GPU1). Sequence parallel parte la dimensión de secuencia. Expert parallel parte expertos de MoE. Cada uno minimiza un coste distinto.

Where the data-parallel family in theory/01-data-parallel-and-zero.md replicates (or shards-then-gathers) the model across workers, the model-parallel family splits the model itself. Different workers compute different parts of the same forward pass.

We cover four flavors: tensor parallel (TP), pipeline parallel (PP), sequence parallel, expert parallel. Of these, TP and PP are the ones Borja needs to read fluently. Sequence parallel and expert parallel get conceptual coverage.


Tensor parallel (TP) — Megatron-style

A single weight matrix is sharded across \(N\) workers. The two natural splits for a Linear(in_features=I, out_features=O) layer:

Column-parallel linear

The output dimension is split: each worker holds \(O/N\) output columns of \(W\). Computation:

  • Input: the same activation \(x \in \mathbb{R}^{B \times I}\) is replicated on every worker (so no comm before this layer).
  • Local compute: worker \(i\) computes \(y_i = x W_i \in \mathbb{R}^{B \times O/N}\) where \(W_i\) is the worker's column shard.
  • Output: each worker holds a different column shard of \(y \in \mathbb{R}^{B \times O}\). If the next layer is row-parallel, no comm here — the next layer expects this split.

Row-parallel linear

The input dimension is split: each worker holds \(I/N\) input rows of \(W\). Computation:

  • Input: the activation \(x\) must be pre-split — each worker holds \(x_i \in \mathbb{R}^{B \times I/N}\) (this matches what a column-parallel layer's output looks like).
  • Local compute: worker \(i\) computes \(y_i = x_i W_i \in \mathbb{R}^{B \times O}\) — full output dim, but only a partial sum.
  • Output: an all-reduce across workers produces the full \(y\).

The MLP pattern

A transformer MLP is two linears with an activation in between: Linear(d_model, d_ff) → GELU → Linear(d_ff, d_model). The Megatron pattern:

  • Layer 1: column-parallel — output is [B, d_ff/N] per worker.
  • GELU: element-wise, no comm.
  • Layer 2: row-parallel — input is [B, d_ff/N] per worker, output is full [B, d_model] after one all-reduce.

Total comm: one all-reduce per MLP. Volume: \(B \cdot d_{\text{model}} \cdot 4\) bytes (fp32) or \(\cdot 2\) (fp16). For a transformer with \(L\) layers and one MLP per layer, that's \(L\) all-reduces per forward pass (plus another \(L\) for backward).

The attention pattern

Q, K, V projections are column-parallel (output dim split: each worker gets n_heads / N heads). The output projection Linear(d_model, d_model) is row-parallel. Same pattern as MLP: 2 all-reduces per layer (forward + backward).

Sharding the embedding table

For the grammar tutor specifically: with ~600 forms (English + Spanish), the embedding table is tiny. But the pattern matters for the future-grew variant. The embedding nn.Embedding(vocab_size, d_model) shards along vocab_size:

  • Worker \(i\) holds embeddings for token IDs \([i \cdot V/N, (i+1) \cdot V/N)\).
  • A lookup: each worker gathers its slice, then all-reduce to combine.

For the grammar tutor at vocab=600, this is comically wasteful — the all-reduce overhead dwarfs the lookup itself. The lab does this anyway, as a teaching exercise, with the explicit note "this is overkill for our task; we're doing it to see the shape." Real workloads use this pattern when vocab × \(d_{\text{model}}\) exceeds a single GPU's memory (sometime around 600k tokens × \(d_{\text{model}} = 4096\) in fp32 ≈ 10 GB).


Pipeline parallel (PP)

Layers are partitioned into stages. Stage \(s\) lives on worker \(s\). A microbatch flows worker \(0 \to 1 \to 2 \to \ldots \to S-1\) during forward, then back during backward.

The bubble problem

Naively, while worker 0 does forward on microbatch 1, workers 1..\(S-1\) idle. Then worker 1 does forward on microbatch 1 while worker 0 starts forward on microbatch 2, etc. The "fill" and "drain" of the pipeline are idle time.

For \(S\) stages and \(M\) microbatches, the bubble fraction is:

\[\text{bubble} = \frac{S - 1}{M + S - 1}\]

Examples:

  • \(S=4, M=4\): bubble = 3/7 ≈ 43%. Half the cluster idles half the time.
  • \(S=4, M=16\): bubble = 3/19 ≈ 16%.
  • \(S=4, M=64\): bubble = 3/67 ≈ 4%.

Microbatch count must massively exceed stage count. This is the single most-violated rule in production PP rollouts.

1F1B scheduling

The naive schedule does all forwards, then all backwards. 1F1B (one-forward-one-backward) interleaves: as soon as a microbatch finishes forward on the last stage, it starts its backward. This reduces peak activation memory (you don't hold \(M\) microbatches' activations simultaneously) without changing the bubble.

Interleaved 1F1B

Megatron's interleaved schedule: each worker holds multiple non-contiguous stages. Worker 0 has layers 0–7 and 16–23; worker 1 has 8–15 and 24–31. Forward goes through twice the number of stages, halving each "stage's" compute, allowing more pipeline overlap. Reduces bubble by ~half at the cost of more comm.

The PP timeline diagram

Belongs in diagrams/. The lab generates it. The mermaid sketch:

time → → →
W0: [F1 F2 F3 F4 .. .. .. .. B4 B3 B2 B1]   <-- naive
W1: [.. F1 F2 F3 F4 .. .. B4 B3 B2 B1 ..]
W2: [.. .. F1 F2 F3 F4 B4 B3 B2 B1 .. ..]
W3: [.. .. .. F1 F2 F3 B3 B2 B1 .. .. ..]
       \________/         \______/
        fill bubble        drain bubble

with "F\(m\)" = forward of microbatch \(m\), "B\(m\)" = backward.


Sequence parallel

Activations grow as \(B \times L_{\text{seq}} \times d_{\text{model}}\). For long contexts \(L_{\text{seq}}\) becomes huge. Sequence parallel shards along \(L_{\text{seq}}\) — different workers hold different positions of the same sequence.

The catch: attention is not sequence-position-local. To compute attention at position \(i\), you need K/V at positions \(1..i\) (causal). Sequence parallel attention requires inter-worker comm during attention itself.

Variants exist: Megatron's sequence parallel (modest comm increase), Ring Attention (KV blocks rotate through workers in a ring), and DeepSpeed Ulysses (sequence-parallel attention with all-to-all comm). All hit a long-context ceiling that scales further than TP alone.

For the grammar tutor (max context ~32 tokens), sequence parallel is overkill. Conceptual coverage only.


Expert parallel

Used with Mixture-of-Experts (MoE). An MoE layer has \(E\) experts; each token is routed to top-\(k\) of them (typically \(k=2\)). Expert parallel assigns each expert to a different worker.

Tokens get routed via all-to-all comm: every worker sends its tokens to the worker holding the chosen expert, the expert computes, the results all-to-all back.

Expert parallel is the parallelism for MoE; it's why MoE works at scale. Phase 36 covers MoE in depth. Phase 35 mentions it for vocabulary.


The 3D parallelism summary

Production large-model training combines:

  • DP across nodes (outermost) — replicates the (already-sharded) model.
  • TP within a node (innermost) — uses NVLink's ~600 GB/s intra-node bandwidth for the chatty TP all-reduces.
  • PP between nodes when DP scaling saturates — moves layer-boundary activations between nodes (lower bandwidth needed than TP).

A common config for a 70B model on 8-node × 8-GPU cluster: DP=2, TP=8 (intra-node), PP=4 (across pairs of nodes). The math is unforgiving — get any of the three sizes wrong and throughput collapses.

For Phase 35: we cover the pieces. We do not implement the 3D combination. Megatron-LM and DeepSpeed do, and reading their config docs (lab 03) is the takeaway.

When is model-parallel right?

Situation Recommendation
Model fits on one GPU Don't use model parallel. Use DDP or single-GPU.
Model just barely doesn't fit (5–20% over) Try ZeRO-3 / FSDP first. Simpler, faster to wire up.
Model 2–8× too big for one GPU TP within a single node (intra-node NVLink).
Model too big for a node TP + PP, with TP intra-node and PP across nodes.
MoE with many experts Expert parallel, always.
Long context (≫ 8k tokens) Add sequence parallel on top.

For the grammar tutor in this curriculum:

  • Model fits on a calculator. Don't actually need TP.
  • Lab 02 runs TP on 2 cloud GPUs as a teaching exercise: shard the embedding table, observe the all-reduce, measure speedup vs single-GPU (it'll be a slowdown because the model is too small — that's part of the lesson).

What this phase does NOT cover

  • Hand-writing an interleaved-1F1B PP scheduler. PyTorch's torch.distributed.pipeline.sync exists; Megatron's megatron/core/pipeline_parallel/schedules.py exists. We read one, we don't write one.
  • Ring Attention / DeepSpeed Ulysses implementations. Conceptual only.
  • MoE expert routing implementation. Phase 36 territory.
  • Communication compression (1-bit Adam, PowerSGD). Vocabulary only.

Next: theory/03-collectives-and-cost.md.