05 — MoE routing math + load-balancing loss; Mamba's constant-memory intuition¶
Mixture-of-Experts is a router plus N experts: each token picks its top-k experts. If the router collapses (everything goes to one expert), you lose the entire advantage. The "auxiliary loss" is a soft regularizer that keeps the experts evenly loaded. Mamba/SSM: memory per step is constant in sequence length because the model maintains a fixed-size hidden state, not a cache that grows.
Part 1 — Mixture-of-Experts routing¶
The architecture in one sentence¶
A Mixture-of-Experts (MoE) layer replaces a single feed-forward block with \(E\) parallel experts plus a router that, for each token, picks the top-\(k\) experts to evaluate (typically \(k = 1\) or \(k = 2\)). The router is a tiny linear: gates = softmax(W_r · x), shape (seq_len, E).
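For each token \(t\) with hidden state \(x_t\), the layer output is:
\[ y_t = \sum_{e \in \text{TopK}(t)} \hat{g}_{t,e} \, \text{Expert}_e(x_t) \]
where \(\hat{g}_{t,e}\) is the (re-normalized) gate weight the router assigns to expert \(e\) for token \(t\).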
The gain: only \(k\) of \(E\) experts run per token, so per-token FFN compute is about \(k\) times that of a single dense FFN (a fraction \(k/E\) of running every expert), while parameter count is \(\sim E \cdot\) dense. More params, similar FLOPs. That trade is the whole game of frontier MoEs (Mixtral, Switch, GShard).
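The parameter-versus-compute trade can be checked with a few lines of arithmetic. This is a rough sketch: the function names and the sizes (`d_model = 1024`, `d_ff = 4096`, 8 experts, top-2) are illustrative assumptions, not figures from any particular model, and biases are ignored.

```python
def ffn_params(d_model, d_ff):
    # Two weight matrices (up- and down-projection); biases ignored.
    return 2 * d_model * d_ff

def moe_param_counts(d_model, d_ff, num_experts, k):
    expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts          # W_r, shape (d_model, E)
    total = num_experts * expert + router   # parameters stored
    active = k * expert + router            # parameters touched per token
    return total, active

# Hypothetical sizes, chosen only to make the ratios easy to read.
dense = ffn_params(1024, 4096)
total, active = moe_param_counts(1024, 4096, num_experts=8, k=2)
print(total / dense)   # stored parameters, relative to one dense FFN (about E)
print(active / dense)  # per-token work, relative to one dense FFN (about k)
```

With these assumed sizes the layer stores roughly 8 times the parameters of a dense FFN but touches only about 2 times as many per token; the router's own weights are a rounding error.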
The routing math, made explicit¶
For batch size \(B\) and sequence length \(L\), with \(T = B \cdot L\) tokens:
- Logits. \(G \in \mathbb{R}^{T \times E} = X W_r\) where \(W_r \in \mathbb{R}^{d \times E}\).
- Gate softmax. \(\hat{G} = \text{softmax}(G)\) along expert axis.
- Top-k mask. For each row \(t\), keep the \(k\) largest entries; zero the rest. Re-normalize by rescaling the kept entries to sum to 1, which is equivalent to a softmax over the kept logits only. Variants exist: Switch Transformer, with \(k = 1\), uses the raw softmax probability without rescaling.
- Dispatch. Group tokens by expert: expert \(e\) receives all tokens whose mask is non-zero at column \(e\).
- Process. Each expert runs its FFN on its assigned tokens.
- Combine. Scatter outputs back, weight by the gate value, sum.
The dispatch/combine pair is where systems engineering lives: in distributed MoEs it is implemented as a permutation of tokens followed by an all-to-all collective, with the inverse permutation on the way back (Switch Transformer §4).
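The six steps above can be sketched on a toy example in plain Python. Everything here is an assumption made for illustration: tokens are scalars, the logits are given directly rather than computed as \(X W_r\), the experts are simple scaling functions, and `route` / `moe_forward` are hypothetical names. The kept gates are rescaled to sum to 1.

```python
import math

def softmax(row):
    m = max(row)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [v / s for v in exps]

def route(logits, k):
    """Per token: list of (expert, weight) for the top-k experts, weights sum to 1."""
    routes = []
    for row in logits:
        gates = softmax(row)
        top = sorted(range(len(gates)), key=lambda e: gates[e], reverse=True)[:k]
        kept = sum(gates[e] for e in top)
        routes.append([(e, gates[e] / kept) for e in top])
    return routes

def moe_forward(xs, logits, experts, k):
    # Dispatch: group (token index, gate weight) pairs by expert.
    buckets = {e: [] for e in range(len(experts))}
    for t, pairs in enumerate(route(logits, k)):
        for e, w in pairs:
            buckets[e].append((t, w))
    # Process and combine: run each expert on its tokens, scatter back, weight, sum.
    out = [0.0] * len(xs)
    for e, items in buckets.items():
        for t, w in items:
            out[t] += w * experts[e](xs[t])
    return out, buckets

# Toy setup: 3 scalar tokens, 4 experts that just scale their input.
xs = [1.0, 2.0, 3.0]
logits = [[2.0, 1.0, 0.0, -1.0],
          [0.0, 0.0, 3.0, 3.0],
          [-1.0, 5.0, 0.0, 0.0]]
experts = [lambda x, s=s: s * x for s in (1.0, 10.0, 100.0, 1000.0)]

routes = route(logits, k=2)
out, buckets = moe_forward(xs, logits, experts, k=2)
print(routes[0])   # experts 0 and 1, weighted by their rescaled gates
print(buckets)     # which (token, weight) pairs each expert received
```

The `buckets` dictionary is the single-device stand-in for dispatch: in a distributed setting each bucket would live on a different device, which is why the step turns into a collective.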
The pathology: router collapse¶
The router is a learned function. There is nothing structurally preventing it from learning to always pick expert 0. If that happens:
- One expert sees 100% of tokens (overflows its capacity, may drop tokens).
- \(E - 1\) experts see 0 tokens, get zero gradient, never train.
- Effective param count collapses to the equivalent of a single dense FFN.
This is router collapse and it is the failure mode the auxiliary loss exists to prevent.
The auxiliary loss¶
For each expert \(e\), define two statistics over the batch:
- \(f_e\) = fraction of tokens routed to \(e\) via top-k (a discrete-mask measure).
- \(P_e\) = mean of \(\hat{G}_{:,e}\) across the batch (a soft-gate measure).
The Switch Transformer auxiliary loss is:
\[ \mathcal{L}_\text{aux} = \alpha \cdot E \cdot \sum_{e=1}^{E} f_e \cdot P_e \]
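Intuitively, the sum \(\sum_e f_e P_e\) is smallest when routing is uniform, with \(f_e = P_e = 1/E\) for every expert. When \(f\) and \(P\) rank the experts the same way (the usual case, since top-\(k\) picks the largest gates), \(\sum_e f_e P_e \ge 1/E\) for \(k = 1\), with equality at uniform: this is Cauchy-Schwarz when \(f = P\) and Chebyshev's sum inequality in general. The \(\alpha\) coefficient is small (~0.01): just enough to break router collapse without dominating the main loss.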
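Why both \(f\) and \(P\)? \(P\) alone is differentiable (gate softmax) but can be gamed by the model: it can keep \(P\) nearly uniform while \(f\) stays peaked. \(f\) alone is non-differentiable (it's a top-k indicator). In the product, \(f_e\) acts as a fixed per-expert weight and the gradient flows through \(P_e\), so the router is pushed to lower its gate probability on exactly the experts that currently receive the most tokens. That correlation between expert assignment and gate confidence is the failure mode of router collapse.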
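A small numeric check makes the penalty concrete. The sketch below assumes the Switch form \(\alpha \cdot E \cdot \sum_e f_e P_e\), takes already-softmaxed gates as input, and normalizes \(f\) by \(T \cdot k\) so that it sums to 1 (that normalization for \(k > 1\) is an assumption; with \(k = 1\) it is just the fraction of tokens). `aux_loss` and the gate values are made up for illustration.

```python
def aux_loss(gates, k=1, alpha=0.01):
    """gates: T rows of softmax probabilities over E experts."""
    T, E = len(gates), len(gates[0])
    counts = [0] * E
    for row in gates:
        top = sorted(range(E), key=lambda e: row[e], reverse=True)[:k]
        for e in top:
            counts[e] += 1
    f = [c / (T * k) for c in counts]                         # hard routing fractions
    P = [sum(row[e] for row in gates) / T for e in range(E)]  # mean soft gate
    return alpha * E * sum(fe * pe for fe, pe in zip(f, P))

# Balanced: each of 4 tokens prefers a different expert.
balanced_gates = [[0.4, 0.2, 0.2, 0.2],
                  [0.2, 0.4, 0.2, 0.2],
                  [0.2, 0.2, 0.4, 0.2],
                  [0.2, 0.2, 0.2, 0.4]]
# Collapsed: every token is confidently sent to expert 0.
collapsed_gates = [[0.97, 0.01, 0.01, 0.01] for _ in range(4)]

balanced = aux_loss(balanced_gates)
collapsed = aux_loss(collapsed_gates)
print(balanced, collapsed)  # the collapsed router pays a larger penalty
```

In the balanced case the loss sits at its floor of \(\alpha\); in the collapsed case it is close to \(\alpha \cdot E\), which is the gap the router is trained to close.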
Expert capacity and dropped tokens¶
Each expert has a finite capacity \(C = \lceil \kappa \cdot T \cdot k / E \rceil\) where \(\kappa\) is the "capacity factor" (~1.0 to 1.25). If more than \(C\) tokens are routed to expert \(e\), the extras are dropped — they skip this MoE layer (residual connection still preserves the input). This avoids unbounded straggler latency in the distributed dispatch.
Capacity is a system constraint, not a learned one. Setting \(\kappa\) too low loses tokens; too high wastes memory. A common recipe is \(\kappa = 1.25\) for training and \(\kappa = 2.0\) for inference (no straggler tolerance).
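The capacity formula and the dropping rule can be sketched as follows. The first-come, first-served order in which overflow tokens are dropped is an assumption made here for simplicity, and `expert_capacity` / `apply_capacity` are hypothetical helper names; the example uses \(k = 1\).

```python
import math

def expert_capacity(num_tokens, num_experts, k, capacity_factor):
    # C = ceil(kappa * T * k / E)
    return math.ceil(capacity_factor * num_tokens * k / num_experts)

def apply_capacity(assignments, num_experts, capacity):
    """assignments: chosen expert per token (k = 1), in token order.

    Returns the token indices each expert keeps and the indices that are
    dropped (those tokens pass through on the residual connection only).
    """
    kept = {e: [] for e in range(num_experts)}
    dropped = []
    for t, e in enumerate(assignments):
        if len(kept[e]) < capacity:
            kept[e].append(t)
        else:
            dropped.append(t)   # assumed policy: later tokens overflow first
    return kept, dropped

# 8 tokens, 4 experts, a skewed router that sends 5 tokens to expert 0.
capacity = expert_capacity(num_tokens=8, num_experts=4, k=1, capacity_factor=1.25)
kept, dropped = apply_capacity([0, 0, 0, 0, 0, 1, 2, 3], num_experts=4, capacity=capacity)
print(capacity, kept, dropped)
```

With perfectly uniform routing each expert would receive 2 tokens, so \(\kappa = 1.25\) leaves one slot of slack per expert; the skewed router above still overflows it.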
§A13-scoped intuition (not a real MoE)¶
We do not add a real MoE to the grammar tutor (the model is 500k params; the architecture would be over-engineered). But the lab 00-moe-on-grammar-tutor.md walks through a 4-expert toy MoE on the 600-form corpus to make the routing math visceral. The router collapse failure mode is the /break exercise.
Part 2 — Mamba / SSMs and the constant-memory claim¶
Why attention scales O(N²) in memory¶
Self-attention computes an \(N \times N\) attention matrix. Even with a KV-cache, which reduces inference to \(O(N)\) compute per step, the cache itself grows linearly: at step \(n\) you store \(n\) keys and \(n\) values per layer per head. For long contexts, the cache dominates GPU memory.
State-space models in one diagram¶
A state-space model (SSM) maintains a fixed-size hidden state \(h_t \in \mathbb{R}^{d_\text{state}}\) and evolves it via a linear recurrence:
\[ h_t = A\, h_{t-1} + B\, x_t, \qquad y_t = C\, h_t \]
The output at step \(t\) depends only on \(x_t\) and \(h_{t-1}\) — no history beyond the current state. Memory cost: \(O(d_\text{state})\) regardless of sequence length. That is the constant-memory claim.
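To see the constant-memory claim in code, here is a minimal sketch of one SSM channel, assuming the common discrete form \(h_t = A h_{t-1} + B x_t\), \(y_t = C h_t\) with a diagonal \(A\) (stored as a list of its diagonal entries). The coefficient values and the names `ssm_step` / `ssm_run` are made up for illustration.

```python
def ssm_step(h, x, A_diag, B, C):
    """One recurrence step for a scalar input x and a diagonal A."""
    h_new = [a * hi + b * x for a, hi, b in zip(A_diag, h, B)]
    y = sum(c * hi for c, hi in zip(C, h_new))
    return h_new, y

def ssm_run(xs, A_diag, B, C):
    h = [0.0] * len(A_diag)   # the only thing carried between steps
    ys = []
    for x in xs:
        h, y = ssm_step(h, x, A_diag, B, C)
        ys.append(y)
    return ys, h

# Hypothetical coefficients with d_state = 2.
A_diag, B, C = [0.5, 0.9], [1.0, 1.0], [1.0, -1.0]

ys, h_short = ssm_run([1.0, 0.0, 0.0], A_diag, B, C)
_, h_long = ssm_run([1.0] * 1000, A_diag, B, C)
print(ys)                          # response to a single impulse
print(len(h_short), len(h_long))   # state size is the same for 3 or 1000 steps
```

The list `ys` grows with the sequence because it collects outputs, but the state `h` that must be kept to continue generating has `d_state` entries no matter how many tokens came before.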
What was wrong with classical RNNs?¶
Classical RNNs had the same recurrent form but: (a) the recurrence was non-linear (tanh), so it could not be parallelized over time; (b) the matrix \(A\) had no structure, so gradients through long-term dependencies vanished or exploded.
Mamba (and S4) fix both:
- The recurrence is linear, so it can be evaluated in parallel over time. With fixed coefficients (S4), unrolling it gives a convolution that can be computed with an FFT; with input-dependent coefficients (Mamba), it is computed with a parallel associative scan.
- \(A\) is parameterized so its eigenvalues are well-behaved (HiPPO initialization, S4's diagonal-plus-low-rank structure), which is what makes long-range dependencies actually train. Mamba's "selective scan" builds on this by making the step size and the \(B\), \(C\) projections functions of the input, so the state can choose what to keep and what to forget.
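The equivalence between a fixed-coefficient linear recurrence and a convolution can be verified numerically. The sketch below assumes a diagonal \(A\) and scalar inputs, and it evaluates the convolution directly in \(O(N^2)\) rather than with an FFT, so it illustrates the identity and not the fast algorithm. It does not cover Mamba's input-dependent case. All names and values are illustrative.

```python
def ssm_recurrent(xs, A_diag, B, C):
    """Step-by-step evaluation: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h = [0.0] * len(A_diag)
    ys = []
    for x in xs:
        h = [a * hi + b * x for a, hi, b in zip(A_diag, h, B)]
        ys.append(sum(c * hi for c, hi in zip(C, h)))
    return ys

def ssm_kernel(length, A_diag, B, C):
    """Convolution kernel K_j = C A^j B for a diagonal A."""
    return [sum(c * (a ** j) * b for a, b, c in zip(A_diag, B, C))
            for j in range(length)]

def ssm_convolution(xs, A_diag, B, C):
    """Same outputs as a causal convolution: y_t = sum_j K_j x_{t-j}."""
    K = ssm_kernel(len(xs), A_diag, B, C)
    return [sum(K[j] * xs[t - j] for j in range(t + 1))
            for t in range(len(xs))]

# Hypothetical coefficients and inputs.
A_diag, B, C = [0.5, 0.9], [1.0, 2.0], [1.0, -1.0]
xs = [1.0, 2.0, -1.0, 0.5, 3.0]

y_rec = ssm_recurrent(xs, A_diag, B, C)
y_conv = ssm_convolution(xs, A_diag, B, C)
max_diff = max(abs(a - b) for a, b in zip(y_rec, y_conv))
print(max_diff)  # agreement up to floating-point rounding
```

The kernel depends only on \(A\), \(B\), \(C\) and not on the input, which is what allows it to be precomputed and applied to the whole sequence at once during training.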
Inference memory comparison¶
For a model with \(L\) layers (here \(L\) counts layers, not sequence length), \(d_\text{state} = 16\) (Mamba default), \(d_\text{model} = 1024\):
| Architecture | Cached elements at step \(n\) | Bytes at seq_len = 8192 (fp16, 2 bytes per element) |
|---|---|---|
| Transformer (KV) | \(L \cdot 2 \cdot d_\text{model} \cdot n\) | \(L \cdot 2 \cdot 1024 \cdot 8192 \cdot 2 = 32L\) MiB |
| Mamba (SSM state) | \(L \cdot d_\text{state} \cdot d_\text{model}\) | \(L \cdot 16 \cdot 1024 \cdot 2 = 32L\) KiB |
Three orders of magnitude difference at this context length. That is what "constant memory in sequence length" buys.
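The table's arithmetic can be reproduced directly. The helper names are illustrative, and the formulas are the simplified per-layer counts used above (they ignore per-head layout and any extra buffers an implementation might keep).

```python
def kv_cache_bytes(n_layers, d_model, seq_len, bytes_per_elem=2):
    # Keys and values: 2 * d_model elements per token per layer.
    return n_layers * 2 * d_model * seq_len * bytes_per_elem

def ssm_state_bytes(n_layers, d_state, d_model, bytes_per_elem=2):
    # One d_state-sized state per model channel per layer, independent of seq_len.
    return n_layers * d_state * d_model * bytes_per_elem

kv = kv_cache_bytes(n_layers=1, d_model=1024, seq_len=8192)
ssm = ssm_state_bytes(n_layers=1, d_state=16, d_model=1024)
print(kv / 2**20, "MiB per layer")   # transformer KV cache
print(ssm / 2**10, "KiB per layer")  # SSM state
print(kv // ssm, "x ratio")
```

Doubling `seq_len` doubles `kv` and leaves `ssm` unchanged, so the ratio keeps growing with context length.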
What does Mamba give up?¶
It is not a free lunch:
- Random access into history is gone. A transformer can attend back to position 17 in detail; Mamba can only see what the hidden state happens to encode about position 17. For tasks that require precise recall (needle-in-haystack), Mamba underperforms; for tasks that just need compressed context (language modeling), it matches or beats transformers per-FLOP.
- In-context learning is weaker. Few-shot prompting depends partly on the precise-recall capability transformers have.
- Hybrid architectures (Mamba + attention layers, e.g. Jamba) keep the best of both — most layers are Mamba (cheap), a few are attention (precise recall).
§A13-scoped intuition¶
For the grammar tutor, contexts are ≤ 64 tokens; the constant-memory argument is irrelevant. We still cover it because the architectural decision (when to use what) is part of the engineer's vocabulary, and the survey lab walks through a small Mamba block to make the recurrence concrete.
What this chapter does NOT cover¶
- MoE all-to-all kernel implementations (Tutel, FasterMoE). Production / X1 territory.
- DeepSeek's MLA (Multi-head Latent Attention). Covered separately in 02-mla.md.
- The HiPPO theory behind S4's matrix \(A\). The closed-form derivation lives in the S4 paper.
- Hybrid Mamba+Attention scheduling. Jamba paper covers this.
Reference¶
- Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" (JMLR 2022). The auxiliary loss formula and capacity-factor discipline come from here.
- Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023). The selective-scan derivation that makes SSMs competitive.
- Lieber et al., "Jamba: A Hybrid Transformer-Mamba Language Model" (2024). The Mamba+attention recipe in practice.
Next: ../lab/00-moe-on-grammar-tutor.md or ../lab/02-mamba-walkthrough.md.