
05 - MoE routing math + load-balancing loss; Mamba's constant-memory intuition

Mixture-of-Experts is a router plus N experts: each token picks its top-k experts. If the router collapses (everything goes to one expert), you lose the entire advantage. The auxiliary loss is a soft regularizer that keeps the experts evenly loaded. Mamba/SSM: memory per step is constant in sequence length because the model keeps a fixed-size hidden state, not a cache that grows.

Part 1 - Mixture-of-Experts routing

The architecture in one sentence

A Mixture-of-Experts (MoE) layer replaces a single feed-forward block with $E$ parallel experts plus a router that, for each token, picks the top-$k$ experts to evaluate (typically $k = 1$ or $k = 2$). The router is a tiny linear layer followed by a softmax: gates = softmax(W_r · x), shape (seq_len, E).

For each token $t$ with hidden $x_t$, the layer output is:

\[y_t = \sum_{e \in \text{TopK}(\text{gates}_t)} \text{gates}_{t,e} \cdot \text{Expert}_e(x_t)\]

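The gain: only $k$ of $E$ experts run per token, so per-token FFN compute is $k/E$ of what running all $E$ experts would cost (about $k \times$ a single dense FFN of the same width), while the layer's parameter count is $\sim E \times$ that dense FFN. Many more parameters for a small multiple of the FLOPs: that trade is the whole game of sparse MoE models (Mixtral, Switch, GShard).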
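The parameter-versus-compute trade can be checked with a few lines of arithmetic. This is a rough sketch, assuming each expert is a two-matrix FFN without biases; the function names and the sizes (`d_model = 1024`, `d_ff = 4096`, 8 experts, top-2) are illustrative choices, not values from any particular model.

```python
def ffn_params(d_model, d_ff):
    # Up-projection plus down-projection; biases ignored for simplicity.
    return 2 * d_model * d_ff


def moe_counts(d_model, d_ff, num_experts, k):
    """Return (total parameters, parameters touched per token) for one MoE layer."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts  # the router matrix W_r
    total = num_experts * per_expert + router
    active = k * per_expert + router
    return total, active


# Hypothetical sizes, chosen only to make the ratio visible.
total, active = moe_counts(d_model=1024, d_ff=4096, num_experts=8, k=2)
dense = ffn_params(1024, 4096)
print(f"dense FFN params:  {dense:,}")
print(f"MoE total params:  {total:,}")
print(f"MoE active params: {active:,}")
```

Per-token FLOPs scale with the active count, so under these assumptions the layer holds roughly 8 times the parameters of the dense FFN while spending roughly 2 times its compute.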

The routing math, made explicit

For batch size $B$ and sequence length $L$, with $T = B \cdot L$ tokens:

Logits. $G = X W_r \in \mathbb{R}^{T \times E}$, where $X \in \mathbb{R}^{T \times d}$ stacks the token hidden states and $W_r \in \mathbb{R}^{d \times E}$.
Gate softmax. $\hat{G} = \text{softmax}(G)$ along expert axis.
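Top-k mask. For each row $t$, keep the $k$ largest entries; zero the rest. For $k \ge 2$, the kept entries are typically rescaled to sum to 1 (a plain rescale, not a second softmax over all experts; GShard-style top-2 gating does this). Switch routing ($k = 1$) instead keeps the raw gate probability as the weight, since rescaling a single entry would always give 1 and remove the router's gradient signal from the main loss. Other variants exist.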
Dispatch. Group tokens by expert: expert $e$ receives all tokens whose mask is non-zero at column $e$.
Process. Each expert runs its FFN on its assigned tokens.
Combine. Scatter outputs back, weight by the gate value, sum.

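The dispatch/combine pair is where the systems engineering lives: in distributed MoEs, dispatch is a permutation of tokens followed by an all-to-all collective that ships each token to the device holding its expert, and combine is the inverse all-to-all plus un-permutation (Switch Transformer §4).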
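The six steps can be traced end to end on a toy example. The following is a single-device sketch in plain Python, assuming scalar hidden states, hand-written router logits, and the rescale-to-sum-to-1 convention for $k = 2$; `route_top_k`, `moe_forward`, and the four lambda "experts" are hypothetical names for illustration, and no all-to-all communication is modeled.

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def route_top_k(logits, k):
    """For each token, return {expert index: rescaled gate weight} for its top-k experts."""
    routes = []
    for row in logits:
        gates = softmax(row)
        # Ties are broken in favor of the lower expert index (sorted is stable).
        top = sorted(range(len(gates)), key=lambda e: gates[e], reverse=True)[:k]
        kept = sum(gates[e] for e in top)
        routes.append({e: gates[e] / kept for e in top})
    return routes


def moe_forward(xs, logits, experts, k):
    routes = route_top_k(logits, k)
    # Dispatch: group token indices by the expert that will process them.
    buckets = {e: [] for e in range(len(experts))}
    for t, route in enumerate(routes):
        for e in route:
            buckets[e].append(t)
    # Process and combine: run each expert on its tokens, weight by the gate, sum.
    ys = [0.0] * len(xs)
    for e, token_ids in buckets.items():
        for t in token_ids:
            ys[t] += routes[t][e] * experts[e](xs[t])
    return ys, buckets


# Toy setup: 3 tokens, 4 experts, scalar "hidden states".
experts = [lambda x: x, lambda x: 2 * x, lambda x: -x, lambda x: x + 1]
logits = [[2.0, 1.0, 0.0, -1.0], [0.0, 0.0, 3.0, 3.0], [-1.0, 4.0, 0.5, 0.0]]
xs = [1.0, 2.0, 3.0]
ys, buckets = moe_forward(xs, logits, experts, k=2)
print(buckets)
print(ys)
```

The `buckets` dictionary is the dispatch table: in a distributed setting each bucket would live on a different device, which is why this step turns into a collective.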

The pathology: router collapse

The router is a learned function. There is nothing structurally preventing it from learning to always pick expert 0. If that happens:

One expert sees 100% of tokens (overflows its capacity, may drop tokens).
$E - 1$ experts see 0 tokens, get zero gradient, never train.
Effective param count collapses to the equivalent of a single dense FFN.

This is router collapse and it is the failure mode the auxiliary loss exists to prevent.

The auxiliary loss

For each expert $e$, define two statistics over the batch:

$f_e$ = fraction of tokens routed to $e$ via top-k (a discrete-mask measure).
$P_e$ = mean of $\hat{G}_{:,e}$ across the batch (a soft-gate measure).

The Switch Transformer auxiliary loss is:

\[\mathcal{L}_{\text{aux}} = \alpha \cdot E \cdot \sum_{e=1}^{E} f_e \cdot P_e\]

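Intuitively, the loss is smallest under uniform routing, where every $f_e$ and $P_e$ equals $1/E$. Under top-k routing $f$ and $P$ move together (an expert that wins more tokens also has a higher mean gate), so the sum behaves like $\sum_e P_e^2$, which is $\ge 1/E$ by Cauchy-Schwarz with equality at uniform. The factor $E$ normalizes this: for $k = 1$, $\mathcal{L}_{\text{aux}} = \alpha$ at uniform routing and approaches $\alpha E$ at full collapse. The $\alpha$ coefficient is small (~0.01): just enough to break router collapse without dominating the main loss.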

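Why both $f$ and $P$? $P$ alone is differentiable (it comes from the gate softmax) but can be gamed: the model can keep $P$ near uniform while $f$ stays peaked. $f$ alone is non-differentiable (it is a top-k indicator). In the product, gradients flow only through $P$, and $f_e$ acts as a per-expert weight, so the gates of overloaded experts are pushed down hardest. The product thus penalizes the correlation between expert assignment and gate confidence, which is exactly the signature of router collapse.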
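A small numeric check makes the scale of the loss concrete. This sketch assumes top-1 routing (the Switch setting) and takes already-softmaxed gate rows as input; `switch_aux_loss` and the two toy batches are illustrative, not taken from the paper's code.

```python
def switch_aux_loss(gates, alpha=0.01):
    """gates: one row of softmax probabilities per token, over E experts (top-1 routing)."""
    num_tokens, num_experts = len(gates), len(gates[0])
    counts = [0] * num_experts
    for row in gates:
        counts[row.index(max(row))] += 1  # hard top-1 assignment
    f = [c / num_tokens for c in counts]  # fraction of tokens per expert
    P = [sum(row[e] for row in gates) / num_tokens for e in range(num_experts)]  # mean gate
    return alpha * num_experts * sum(fe * Pe for fe, Pe in zip(f, P))


# Each token prefers a different expert: load is perfectly balanced.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Every token strongly prefers expert 0: near-total collapse.
collapsed = [[0.97, 0.01, 0.01, 0.01]] * 4

loss_balanced = switch_aux_loss(balanced)
loss_collapsed = switch_aux_loss(collapsed)
print(loss_balanced, loss_collapsed)
```

With $E = 4$ and $\alpha = 0.01$, the balanced batch gives a loss of about $\alpha$ and the collapsed batch gives nearly $\alpha E$, so the penalty grows by roughly a factor of $E$ between the two extremes.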

Expert capacity and dropped tokens

Each expert has a finite capacity $C = \lceil \kappa \cdot T \cdot k / E \rceil$ where $\kappa$ is the "capacity factor" (~1.0 to 1.25). If more than $C$ tokens are routed to expert $e$, the extras are dropped — they skip this MoE layer (residual connection still preserves the input). This avoids unbounded straggler latency in the distributed dispatch.

Capacity is a system constraint, not a learned one. Setting $\kappa$ too low drops tokens; too high wastes memory and compute on padding. A commonly used recipe is $\kappa = 1.25$ for training and $\kappa = 2.0$ for inference, where dropped tokens directly degrade the output.
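The capacity formula and the dropping rule can be sketched as follows. The first-come-first-served order used here (earlier tokens win a slot) is an assumption for illustration; real implementations differ in how they prioritize tokens, and `expert_capacity` and `apply_capacity` are hypothetical helper names.

```python
import math


def expert_capacity(num_tokens, num_experts, k, capacity_factor):
    # C = ceil(kappa * T * k / E)
    return math.ceil(capacity_factor * num_tokens * k / num_experts)


def apply_capacity(assignments, num_experts, capacity):
    """assignments[t] is the expert chosen for token t (top-1).

    Returns (kept token ids, dropped token ids). Tokens are admitted in
    sequence order until the expert is full (an illustrative policy).
    """
    load = [0] * num_experts
    kept, dropped = [], []
    for t, e in enumerate(assignments):
        if load[e] < capacity:
            load[e] += 1
            kept.append(t)
        else:
            dropped.append(t)  # token skips the MoE layer; the residual carries it
    return kept, dropped


assignments = [0, 0, 0, 0, 1, 2, 3, 0]  # expert 0 is oversubscribed
cap_train = expert_capacity(num_tokens=8, num_experts=4, k=1, capacity_factor=1.25)
cap_tight = expert_capacity(num_tokens=8, num_experts=4, k=1, capacity_factor=1.0)
_, dropped_train = apply_capacity(assignments, 4, cap_train)
_, dropped_tight = apply_capacity(assignments, 4, cap_tight)
print(cap_train, dropped_train)
print(cap_tight, dropped_tight)
```

Raising $\kappa$ from 1.0 to 1.25 gives each expert one extra slot in this toy batch and saves one token from being dropped, at the cost of allocating the larger buffer for every expert.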

§A13-scoped intuition (not a real MoE)

We do not add a real MoE to the grammar tutor (the model is 500k params; the architecture would be over-engineered). But the lab 00-moe-on-grammar-tutor.md walks through a 4-expert toy MoE on the 600-form corpus to make the routing math visceral. The router collapse failure mode is the /break exercise.

Part 2 - Mamba / SSMs and the constant-memory claim

Why attention scales O(N²) in memory

Self-attention computes an $N \times N$ attention matrix. Even with a KV cache, which reduces inference to $O(N)$ compute per generated token, the cache itself grows linearly: at step $n$ you store $n$ keys and $n$ values per layer per head. For long contexts, the cache dominates GPU memory.

State-space models in one diagram

A state-space model (SSM) maintains a fixed-size hidden state $h_t \in \mathbb{R}^{d_\text{state}}$ and evolves it via a linear recurrence:

\[h_t = A h_{t-1} + B x_t\]

\[y_t = C h_t + D x_t\]

The output at step $t$ depends only on $x_t$ and $h_{t-1}$ — no history beyond the current state. Memory cost: $O(d_\text{state})$ regardless of sequence length. That is the constant-memory claim.
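The recurrence can be run directly to see that the state never grows. This is a minimal sketch assuming a scalar input per step and a diagonal $A$ (stored as a list of its diagonal entries); `ssm_run` and the coefficient values are illustrative only.

```python
def ssm_run(xs, A_diag, B, C, D):
    """Run h_t = A h_{t-1} + B x_t, y_t = C h_t + D x_t with diagonal A and scalar x_t."""
    h = [0.0] * len(A_diag)  # fixed-size state, allocated once
    ys = []
    for x in xs:
        h = [a * hi + b * x for a, hi, b in zip(A_diag, h, B)]
        ys.append(sum(c * hi for c, hi in zip(C, h)) + D * x)
    return ys, h


A_diag, B, C, D = [0.5, 0.9], [1.0, 1.0], [1.0, 0.0], 0.0

# An impulse at step 0 decays through the state.
ys_short, h_short = ssm_run([1.0, 0.0, 0.0], A_diag, B, C, D)
print(ys_short)

# A sequence 1000 steps long leaves a state of exactly the same size.
_, h_long = ssm_run([1.0] * 1000, A_diag, B, C, D)
print(len(h_short), len(h_long))
```

Only `h` is carried between steps; the list `ys` is collected here for inspection and would not need to be stored during generation.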

What was wrong with classical RNNs?

Classical RNNs had the same recurrent form but: (a) the recurrence was non-linear (tanh), so it could not be parallelized over time; (b) the matrix $A$ had no structure, so gradients through long-term dependencies vanished or exploded.

Mamba (and S4) fix both:

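The recurrence is linear in $h$, so it can be computed in parallel over time. With fixed coefficients (S4), unrolling it gives a convolution that an FFT evaluates quickly; with input-dependent coefficients (Mamba), it is no longer a convolution, but a parallel associative scan still applies.
$A$ is parameterized so its eigenvalues are well-behaved (HiPPO initialization, S4's diagonal-plus-low-rank structure; Mamba uses a diagonal $A$). This is what makes long-range dependencies actually train. Mamba's "selective scan" is a separate ingredient: the step size and the $B$, $C$ projections are made functions of the input, so the state can choose what to keep or forget, and the resulting recurrence is computed with a hardware-aware scan.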
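Why a linear recurrence parallelizes can be shown on a scalar version, $h_t = a_t h_{t-1} + b_t$ with $h_{-1} = 0$. Each step is an affine map, and composing two affine maps gives another affine map, so the steps can be combined in any bracketing. The sketch below compares a plain loop against a Hillis-Steele style inclusive scan; it is a toy illustration of the principle, not Mamba's actual fused kernel, and all names are hypothetical.

```python
def combine(first, second):
    """Compose two steps h -> a*h + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def sequential(pairs):
    h, out = 0.0, []
    for a, b in pairs:
        h = a * h + b
        out.append(h)
    return out


def scan(pairs):
    """Inclusive scan in about log2(n) rounds; every round is independent per index."""
    acc = list(pairs)
    step = 1
    while step < len(acc):
        # Each new entry reads only the previous round's list, so a round
        # could run fully in parallel on real hardware.
        acc = [
            acc[i] if i < step else combine(acc[i - step], acc[i])
            for i in range(len(acc))
        ]
        step *= 2
    # With h_{-1} = 0, the state after step t is the b-component of the prefix.
    return [b for _, b in acc]


# Input-dependent coefficients (a_t, b_t), as in a selective SSM.
pairs = [(0.9, 1.0), (0.5, 2.0), (0.8, -1.0), (0.7, 0.5), (0.6, 3.0)]
h_seq = sequential(pairs)
h_scan = scan(pairs)
print(h_seq)
print(h_scan)
```

The non-linear RNN case has no such `combine`: a tanh between steps breaks associativity, which is the obstacle named in point (a).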

Inference memory comparison

For a model with $L$ layers (here $L$ counts layers, not sequence length as in Part 1), $d_\text{state} = 16$ (Mamba default), $d_\text{model} = 1024$, and $n$ tokens in context:

| Architecture | Cache size (elements) | At $n = 8192$, fp16 (2 bytes per element) |
| --- | --- | --- |
| Transformer (KV) | $L \cdot 2 \cdot d_\text{model} \cdot n$ | $L \cdot 2 \cdot 1024 \cdot 8192 \cdot 2$ bytes $= 32L$ MiB |
| Mamba (SSM state) | $L \cdot d_\text{state} \cdot d_\text{model}$ | $L \cdot 16 \cdot 1024 \cdot 2$ bytes $= 32L$ KiB |

A factor of 1024, roughly three orders of magnitude, at this context length, and the gap grows linearly with $n$. That is what "constant memory in sequence length" buys.
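The table's arithmetic is easy to reproduce. The two helpers below simply encode the table's formulas per layer; the function names are illustrative, and the Mamba figure follows the simplified state size used in the table rather than any specific implementation's full cache.

```python
def kv_cache_bytes(layers, d_model, n_tokens, bytes_per_elem=2):
    # Keys and values (factor 2), one d_model-sized vector per token per layer.
    return layers * 2 * d_model * n_tokens * bytes_per_elem


def ssm_state_bytes(layers, d_model, d_state, bytes_per_elem=2):
    # One d_state-sized state per model channel per layer, independent of n.
    return layers * d_state * d_model * bytes_per_elem


kv = kv_cache_bytes(layers=1, d_model=1024, n_tokens=8192)
ssm = ssm_state_bytes(layers=1, d_model=1024, d_state=16)
print(f"KV cache per layer:  {kv:,} bytes ({kv / 2**20:.0f} MiB)")
print(f"SSM state per layer: {ssm:,} bytes ({ssm / 2**10:.0f} KiB)")
print(f"ratio: {kv // ssm}x")

# Doubling the context doubles the KV cache and leaves the SSM state unchanged.
kv_double = kv_cache_bytes(layers=1, d_model=1024, n_tokens=16384)
```

The ratio simplifies to $2n / d_\text{state}$, which is why it reaches 1024 at $n = 8192$ and keeps growing with context length.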

What does Mamba give up?

It is not a free lunch:

Random access into history is gone. A transformer can attend back to position 17 in detail; Mamba can only see what the hidden state happens to encode about position 17. For tasks that require precise recall (needle-in-haystack), Mamba underperforms; for tasks that just need compressed context (language modeling), it matches or beats transformers per-FLOP.
In-context learning is weaker. Few-shot prompting depends partly on the precise-recall capability transformers have.
Hybrid architectures (Mamba + attention layers, e.g. Jamba) keep the best of both — most layers are Mamba (cheap), a few are attention (precise recall).

§A13-scoped intuition

For the grammar tutor, contexts are ≤ 64 tokens; the constant-memory argument is irrelevant. We still cover it because the architectural decision (when to use what) is part of the engineer's vocabulary, and the survey lab walks through a small Mamba block to make the recurrence concrete.

What this chapter does NOT cover

MoE all-to-all kernel implementations (Tutel, FasterMoE). Production / X1 territory.
DeepSeek's MLA (Multi-head Latent Attention). Covered separately in 02-mla.md.
The HiPPO theory behind S4's matrix $A$. The closed-form derivation lives in the S4 paper.
Hybrid Mamba+Attention scheduling. Jamba paper covers this.

Reference

Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" (JMLR 2022). The auxiliary loss formula and capacity-factor discipline come from here.
Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023). The selective-scan derivation that makes SSMs competitive.
Lieber et al., "Jamba: A Hybrid Transformer-Mamba Language Model" (2024). The Mamba+attention recipe in practice.

Next: ../lab/00-moe-on-grammar-tutor.md or ../lab/02-mamba-walkthrough.md.
