01 — Mixture of Experts (MoE)

MoE separates "how many parameters the model has" from "how much compute each token uses." The model has \(E\) "experts" (parallel FFN layers); a router assigns each token to \(k\) of them. Parameters grow with \(E\); FLOPs per token grow with \(k\). That decoupling is the whole idea.

The dense transformer has a known scaling problem: increasing the model's capability requires increasing parameter count, and parameter count drives both training cost (every parameter receives a gradient every step) and inference compute (every parameter is multiplied with activations on every forward pass).

Mixture of Experts decouples these. Add \(E\) parallel FFN blocks ("experts") per layer; for each token, route it to only \(k\) of them (typically \(k = 1\) or \(k = 2\)). Parameter count grows ∝ \(E\); FLOPs per token grow ∝ \(k\). If \(E \gg k\), you get a sparse model that has the capacity of \(E\) FFNs but the compute of \(k\) FFNs.


The math

Parameter count

A dense FFN: \(|W_{\text{dense}}| = 2 \cdot d_{\text{model}} \cdot d_{\text{ff}}\).

An MoE layer with \(E\) experts: \(|W_{\text{moe}}| = E \cdot |W_{\text{dense}}| + |W_{\text{gate}}|\), where the gate is a \(d_{\text{model}} \to E\) linear, contributing \(d_{\text{model}} \cdot E\) params (small).

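For \(d_{\text{model}} = 4096, d_{\text{ff}} = 16384\), the dense FFN has 134M params. An MoE layer with \(E = 8\) has ~1.07B params. With \(k = 1\) it costs the same expert compute per token as the dense FFN (plus the small gate matmul) at 8× the capacity; with \(k = 2\) the expert compute doubles.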
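The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that follows the section's formulas and ignores biases; the function names are illustrative, not taken from the course code.

```python
def dense_ffn_params(d_model: int, d_ff: int) -> int:
    """Weights of a two-matrix FFN (up- and down-projection), biases ignored."""
    return 2 * d_model * d_ff


def moe_layer_params(d_model: int, d_ff: int, num_experts: int) -> int:
    """E copies of the FFN plus the d_model -> E gate."""
    return num_experts * dense_ffn_params(d_model, d_ff) + d_model * num_experts


def active_params_per_token(d_model: int, d_ff: int, num_experts: int, k: int) -> int:
    """Weights one token actually touches: k experts plus the full gate."""
    return k * dense_ffn_params(d_model, d_ff) + d_model * num_experts


d_model, d_ff, E = 4096, 16384, 8
print(dense_ffn_params(d_model, d_ff))                  # 134217728
print(moe_layer_params(d_model, d_ff, E))               # 1073774592
print(active_params_per_token(d_model, d_ff, E, k=1))   # 134250496
print(active_params_per_token(d_model, d_ff, E, k=2))   # 268468224
```

The gate adds only 32,768 weights here, which is why it is usually left out of compute estimates.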

Routing

For input token activation \(x \in \mathbb{R}^{d_{\text{model}}}\):

\[\text{gate}(x) = \mathrm{softmax}(W_g x) \in \mathbb{R}^E\]

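Take the top-\(k\) indices \(\mathcal{T} = \text{top}_k(\text{gate}(x))\) and renormalize the gate weights over them: \(\tilde{g}_i = \text{gate}(x)_i / \sum_{j \in \mathcal{T}} \text{gate}(x)_j\). (With \(k = 1\) the renormalized weight would always be 1, so top-1 routers such as Switch keep the raw gate probability instead.) The token's output:

\[\text{MoE}(x) = \sum_{i \in \mathcal{T}} \tilde{g}_i \cdot \text{Expert}_i(x)\]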

For \(k = 2\): two expert forward passes, weighted sum.
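A minimal pure-Python sketch of the routing step for a single token, with the top-\(k\) weights renormalized as described. The helper names, the toy gate matrix `W_g`, and the "experts" (plain scaling functions standing in for FFNs) are all assumptions for illustration; a real implementation would batch this with a tensor library.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> Vector:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W: Sequence[Sequence[float]], x: Sequence[float]) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def top_k_route(gate_probs: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k largest gate probabilities and renormalize them to sum to 1."""
    chosen = sorted(range(len(gate_probs)), key=lambda i: gate_probs[i], reverse=True)[:k]
    norm = sum(gate_probs[i] for i in chosen)
    return [(i, gate_probs[i] / norm) for i in chosen]


def moe_forward(x: Vector, W_g: Sequence[Sequence[float]],
                experts: Sequence[Expert], k: int) -> Vector:
    """Weighted sum of the k selected experts' outputs for one token."""
    out = [0.0] * len(x)
    for i, weight in top_k_route(softmax(matvec(W_g, x)), k):
        y = experts[i](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy setup (hypothetical): expert i multiplies its input by (i + 1).
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
W_g = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 0.0]]  # E = 4, d_model = 2
x = [1.0, 0.0]

routes = top_k_route(softmax(matvec(W_g, x)), k=2)
y = moe_forward(x, W_g, experts, k=2)
print(routes)  # experts 2 and 1, with weights that sum to 1
print(y)
```

Only the two selected experts are ever called, which is where the compute saving comes from: the other experts' parameters exist but are not touched for this token.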

Load-balancing loss

Without intervention, the router collapses: a few "popular" experts get most tokens; many experts are dead weight. The cure is an auxiliary loss that pulls the routing distribution toward uniform.

Let \(f_i\) = fraction of tokens in the batch routed to expert \(i\), and \(p_i\) = mean gate probability for expert \(i\) over the batch. The Switch Transformer loss:

\[\mathcal{L}_{\text{aux}} = E \cdot \sum_i f_i \cdot p_i\]

When routing is uniform, \(f_i = p_i = 1/E\), and \(\mathcal{L}_{\text{aux}} = E \cdot E \cdot (1/E)^2 = 1\). (The factor of \(E\) keeps this balanced value at 1 regardless of how many experts there are.) When routing is fully collapsed, say all tokens go to expert 0, then \(f_0 = 1, p_0 = 1\), all others are zero, and \(\mathcal{L}_{\text{aux}} = E\). The loss rises by \(E\)× under collapse, providing a strong gradient back to the gate weights. That gradient flows through \(p_i\); \(f_i\) comes from a discrete top-\(k\) choice and is not differentiable.

Total loss: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{language}} + \alpha \cdot \mathcal{L}_{\text{aux}}\), with \(\alpha \approx 0.01\).
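The two limiting cases can be reproduced numerically. The sketch below computes \(f_i\), \(p_i\), and the auxiliary loss for top-1 routing from plain Python lists; the function name and the list-of-lists layout are assumptions for illustration, and no gradients are involved.

```python
from typing import Sequence


def switch_aux_loss(gate_probs: Sequence[Sequence[float]],
                    assignments: Sequence[int]) -> float:
    """E * sum_i f_i * p_i for one batch.

    gate_probs[t][i]: gate probability of expert i for token t.
    assignments[t]:   index of the expert token t was routed to (top-1).
    """
    num_tokens = len(gate_probs)
    num_experts = len(gate_probs[0])

    # f_i: fraction of tokens dispatched to expert i.
    f = [0.0] * num_experts
    for expert in assignments:
        f[expert] += 1.0 / num_tokens

    # p_i: mean gate probability assigned to expert i.
    p = [sum(row[i] for row in gate_probs) / num_tokens for i in range(num_experts)]

    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


E = 4
uniform = switch_aux_loss([[0.25] * E for _ in range(4)], [0, 1, 2, 3])
collapsed = switch_aux_loss([[1.0, 0.0, 0.0, 0.0] for _ in range(4)], [0, 0, 0, 0])
print(uniform)    # 1.0
print(collapsed)  # 4.0

alpha = 0.01
print(alpha * collapsed)  # contribution of the aux term to the total loss
```

With \(\alpha \approx 0.01\) the auxiliary term stays small next to the language-modeling loss even under full collapse, so it nudges the router rather than dominating training.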

Expert capacity and dropping

In a perfectly balanced batch of \(N\) tokens where each token goes to exactly \(k\) experts, each expert sees \(N \cdot k / E\) tokens. Real batches are rarely balanced, because natural language has skewed token distributions and the router's choices follow them. To bound memory per expert, MoE implementations set a capacity factor \(C\): each expert accepts at most \(C \cdot N \cdot k / E\) tokens and drops the rest. A dropped token gets zero output from the MoE layer, and the residual connection carries its activation forward unchanged.

Capacity factor \(C = 1.0\): tight, lots of drops. \(C = 1.25\): typical. \(C = 2.0\): loose, memory-hungry.
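To make dropping concrete, here is a small sketch of capacity-limited dispatch for top-1 routing. Two choices are assumptions rather than anything fixed by the text: the capacity is rounded up with `ceil`, and tokens are admitted first-come-first-served in batch order (implementations differ on both).

```python
import math
from typing import List, Sequence, Tuple


def expert_capacity(num_tokens: int, num_experts: int, k: int,
                    capacity_factor: float) -> int:
    """Max tokens one expert accepts: ceil(C * N * k / E). Rounding is an assumption."""
    return math.ceil(capacity_factor * num_tokens * k / num_experts)


def dispatch_with_capacity(assignments: Sequence[int], num_experts: int,
                           capacity: int) -> Tuple[List[List[int]], List[int]]:
    """Fill each expert's bucket in batch order; overflow tokens are dropped."""
    buckets: List[List[int]] = [[] for _ in range(num_experts)]
    dropped: List[int] = []
    for token, expert in enumerate(assignments):
        if len(buckets[expert]) < capacity:
            buckets[expert].append(token)
        else:
            dropped.append(token)  # this token gets zero MoE output
    return buckets, dropped


# A skewed batch: N = 8 tokens, E = 4 experts, half the tokens want expert 0.
assignments = [0, 0, 0, 0, 1, 1, 2, 3]

for C in (1.0, 1.25, 2.0):
    cap = expert_capacity(num_tokens=8, num_experts=4, k=1, capacity_factor=C)
    buckets, dropped = dispatch_with_capacity(assignments, num_experts=4, capacity=cap)
    print(C, cap, dropped)
# 1.0 2 [2, 3]
# 1.25 3 [3]
# 2.0 4 []
```

Raising \(C\) removes the drops, but every expert's buffer is sized for `cap` tokens whether or not it fills, which is the memory cost the capacity factor trades against.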


Variants

  • Switch Transformer (Google). Top-1 routing (\(k=1\)). Single-expert per token. Maximum sparsity.
  • Mixtral (Mistral). Top-2 routing, 8 experts. The "8 × 7B" architecture.
  • DeepSeek-V2 (DeepSeek). Top-6 routing over 160 fine-grained routed experts + 2 shared experts (always-on); the smaller DeepSeek-V2-Lite uses 64 routed experts. More expressive routing.
  • Soft MoE (DeepMind). Continuous routing — every expert sees a combination of tokens weighted by the gate. Differentiable, no discrete top-k, no auxiliary loss. Higher compute, smoother optimization.

The choice of router and capacity is an active research area. The Switch + capacity-factor + load-balance recipe is the standard baseline.


Expert parallelism

At training scale, MoE requires a specialized parallelism strategy. Experts are placed on different workers, and tokens are routed via all-to-all communication: every worker sends its tokens to the worker holding the chosen expert. The expert computes, and a second all-to-all returns the results to the workers the tokens came from.

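Comm cost per token per layer: 2 all-to-alls, each carrying one \(d_{\text{model}}\)-sized activation vector per routed copy of the token (\(k\) copies under top-\(k\) routing). In bytes, that is \(d_{\text{model}}\) times the bytes per element, per copy, per direction.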
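As a rough back-of-the-envelope for that traffic, the sketch below estimates forward-pass all-to-all bytes. It assumes the worst case in which every selected expert lives on a remote worker, counts one activation vector per routed copy of a token, and defaults to 2 bytes per element (bf16/fp16); the function names are illustrative.

```python
def all_to_all_bytes_per_token(d_model: int, k: int, bytes_per_element: int = 2) -> int:
    """Forward-pass bytes one token moves through one MoE layer (dispatch + combine)."""
    one_way = k * d_model * bytes_per_element  # k routed copies of the activation
    return 2 * one_way                         # send to experts, then send results back


def all_to_all_bytes_per_step(num_tokens: int, num_moe_layers: int, d_model: int,
                              k: int, bytes_per_element: int = 2) -> int:
    """Upper bound on forward all-to-all traffic for a whole batch."""
    return num_tokens * num_moe_layers * all_to_all_bytes_per_token(
        d_model, k, bytes_per_element)


print(all_to_all_bytes_per_token(d_model=4096, k=2))  # 32768
print(all_to_all_bytes_per_step(num_tokens=1_000_000, num_moe_layers=32,
                                d_model=4096, k=2))   # 1048576000000
```

The backward pass repeats both exchanges for the gradients, so training traffic is roughly double this estimate.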

For our microscopic grammar tutor: no expert parallelism needed (everything on one CPU). The all-to-all primitives are Phase 35 vocabulary; here we just acknowledge that MoE serving at scale is a different topology.


Failure modes

  1. Router collapse. Most tokens go to a few experts. Cured by the aux loss but the cure is imperfect — chronic mild imbalance is normal.
  2. Token dropping at low capacity factor. Some tokens get zero contribution from the MoE layer. The residual stream covers it but quality suffers in long-tail cases.
  3. Training instability. MoE loss curves have sharper spikes than dense. The aux loss helps; aggressive weight decay or low LR sometimes needed.
  4. Expert specialization isn't always semantic. Studies show MoE experts often specialize on syntactic / tokenization features more than semantic ones, e.g., one expert per token type. Cool but unsurprising; experts are trained from scratch with no inductive bias toward semantic categories.

Would MoE help the grammar tutor?

Let's run the test from theory/00-motivation.md:

  1. MoE's bottleneck: "capacity growing faster than compute." We want more model capacity without paying more FLOPs per token.
  2. Does the grammar tutor have it? No. The tutor is ~500k params and trains in seconds on CPU. Capacity is not the bottleneck — the corpus is so small that capacity overshoots already (you'll see this in Phase 19's overfitting curves).
  3. What would MoE cost? New code: ~150 LOC for routing + load-balance loss. New failure modes: router collapse, dropped tokens. New tuning: \(E, k, C, \alpha\).
  4. Verdict: Never.

Lab 00 (lab/00-moe-on-grammar-tutor.md) does run a 2-expert MoE, but as a teaching exercise, not because it helps. The honest finding will be: same perplexity as dense, more parameters, more training instability. Reading that result and naming why is the whole point.

When MoE would help (the counterfactual)

If the grammar tutor were a 5-language version (English + Spanish + French + German + Italian) and the corpus were 100× larger:

  • Each language has its own conjugation patterns. Router might learn to specialize one expert per language family.
  • Capacity becomes a bottleneck — dense models hit a quality ceiling.
  • Then MoE earns its keep.

That counterfactual is a future hypothesis, not the current scope. Phase 36 notes it explicitly because that is the honest thing to do. Don't extend the curriculum to fit the technique; let the technique stay shelved until the scope demands it.


What this phase does NOT cover

  • Training a real MoE. Conceptually doable in experiments/36-moe-on-grammar-tutor/, but it's a stub. The 2-expert grammar-tutor MoE is for seeing routing fire, not for production training.
  • Expert parallelism implementation. Phase 35 vocabulary only; here we just acknowledge it.
  • Continuous (Soft) MoE. Concept mentioned; not derived in math depth.
  • MoE on inference servers (e.g., Mixtral on vLLM). Conceptual; Phase 33+34 already taught the serving pieces.
  • Routing analysis tools (which token went to which expert). Mentioned in lab 00 as an optional extra.

Next: theory/02-mla.md.