Lab 00: A 2-expert MoE variant of MiniGPT-grammar (and why it doesn't help)
Goal: swap the FFN in one transformer block for a small MoE (2 experts as a smoke test, then 4 experts with top-2 routing for the actual run). Train locally on CPU. Compare honestly against the dense baseline. Confirm the negative result: MoE doesn't help the grammar tutor.
Estimated time: 3–4 hours.
Prereq: Phase 17 MiniGPT-grammar trained; Phase 18 training loop in src/minitrain/; Phase 19 inspection hooks for monitoring loss. Borja has read theory/01-moe.md.
What you produce
A directory experiments/36-moe-on-grammar-tutor/ containing:
- moe_block.py: local (experiment-scoped) module with a MoEBlock class implementing routing + the expert FFNs + load-balancing aux loss. Per the phase plan, this lives in the experiment directory (or as src/minimodel/_moe_block.py if Borja prefers: a single optional file under an existing module, not a new top-level module).
- train_moe.py: training script; builds a MiniGPT-grammar variant with the MoE block replacing one or two of the dense FFNs.
- compare.py: runs the baseline (Phase 18 dense MiniGPT) and the MoE variant side by side, with the same data, same seed, and same step count.
- manifest.json: final metrics, parameter counts, perplexity, training time.
- findings.md: the honest report on the comparison.
TODOs
Block A: implement the MoE block
# experiments/36-moe-on-grammar-tutor/moe_block.py
# Skeleton: Borja writes the body. ~150 LOC target.
import torch
from torch import nn
import torch.nn.functional as F


class MoEBlock(nn.Module):
    """Top-k routing over E experts. Returns (output, aux_loss).

    For the grammar tutor at d_model=64, d_ff=256, set E=2, k=2 to start
    (every token goes to every expert; degenerate, mostly for testing).
    Then E=4, k=2 for the actual experiment.
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int):
        super().__init__()
        # Router: one logit per expert for every token.
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an ordinary dense FFN of the same shape.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.GELU(),
                nn.Linear(d_ff, d_model),
            )
            for _ in range(num_experts)
        ])
        self.top_k = top_k
        self.num_experts = num_experts

    def forward(self, x):
        # x: [B, T, d_model]
        # 1. compute gate logits, softmax, top-k indices and weights
        # 2. for each expert, gather the tokens assigned to it
        # 3. compute the expert's FFN on its assigned tokens
        # 4. scatter back into the output, weighted by gate
        # 5. compute auxiliary load-balance loss (theory/01-moe.md §2)
        # 6. return (output [B, T, d_model], aux_loss [scalar])
        ...
Implementation notes:
- No capacity dropping. At grammar-tutor scale every token can fit. Capacity factor = ∞.
- Differentiable top-k. PyTorch's torch.topk is not differentiable with respect to the indices, but gradients do flow through the selected values (the softmax gate weights). That's enough.
- Aux loss scale. Start with \(\alpha = 0.01\). Tune if routing collapses.
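Steps 1 and 4 of the skeleton (softmax, top-k selection, gate weighting) can be illustrated for a single token in plain Python, without PyTorch. This is only a sketch under stated assumptions: the helper names `softmax` and `route_token` are hypothetical, and rescaling the kept weights so they sum to 1 is an assumption that theory/01-moe.md may or may not share.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(gate_logits, top_k, renormalize=True):
    """Return dense gate weights with exactly top_k non-zero entries.

    renormalize=True rescales the kept weights to sum to 1
    (an assumption, not something the lab text specifies).
    """
    probs = softmax(gate_logits)
    # Indices of the top_k largest probabilities (ties go to the lower index).
    keep = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:top_k]
    weights = [probs[i] if i in keep else 0.0 for i in range(len(probs))]
    if renormalize:
        kept_mass = sum(weights)
        weights = [w / kept_mass for w in weights]
    return weights


# One token, E=4 experts, k=2: experts 0 and 2 have the largest logits.
weights = route_token([2.0, 0.5, 1.0, -1.0], top_k=2)
```

In the real block the same selection would be done batched on tensors; the per-token view is just easier to check by hand.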
Tests in tests/minimodel/test_moe_block.py (if you place the block in src/minimodel/):
- test_routing_is_top_k: every token gets exactly \(k\) non-zero gate weights.
- test_aux_loss_is_one_for_uniform: for uniform routing, aux loss equals 1.
- test_aux_loss_grows_under_imbalance: feed all tokens to expert 0; aux loss should be \(E\).
- test_total_params_grow_as_E: param count of MoE with \(E\) experts ≈ \(E\) × dense FFN.
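The two aux-loss tests above expect 1 for uniform routing and \(E\) for full collapse. Those values are what a Switch-style loss \(E \sum_i f_i P_i\) gives before multiplying by \(\alpha\), where \(f_i\) is the fraction of assignments sent to expert \(i\) and \(P_i\) is the mean gate probability for expert \(i\). That form is an assumption here; the definition in theory/01-moe.md §2 is the authority. A minimal arithmetic check, with the hypothetical helper `load_balance_loss`:

```python
def load_balance_loss(assign_fraction, mean_gate_prob):
    """Unscaled aux loss E * sum_i f_i * P_i (assumed Switch-style form)."""
    num_experts = len(assign_fraction)
    return num_experts * sum(
        f * p for f, p in zip(assign_fraction, mean_gate_prob)
    )


E = 4
# Uniform routing: every expert gets 1/E of the tokens and 1/E of the mass.
uniform = load_balance_loss([1 / E] * E, [1 / E] * E)
# Full collapse: expert 0 gets every token and all of the gate probability.
collapsed = load_balance_loss([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
```

Under this form the value added to the language-modeling loss would be \(\alpha\) times the result, so 0.01 for uniform routing at the suggested starting scale.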
Block B: modify the model
Take the Phase 17 MiniGPT-grammar and replace exactly one block's FFN with the MoE block. Single FFN replacement is enough — replacing all of them adds proportional cost without proportional insight at this scale.
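The parameter cost of one replacement can be worked out by hand from the Block A skeleton (two biased linear layers per expert, a bias-free gate). The sketch below uses the d_model=64, d_ff=256 sizes from the skeleton's docstring; the function names are hypothetical.

```python
def ffn_params(d_model, d_ff):
    """Linear(d_model, d_ff) + Linear(d_ff, d_model), both with biases."""
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)


def moe_params(d_model, d_ff, num_experts):
    """E expert FFNs plus the bias-free gate Linear(d_model, num_experts)."""
    return num_experts * ffn_params(d_model, d_ff) + d_model * num_experts


dense = ffn_params(64, 256)    # 33088
moe = moe_params(64, 256, 4)   # 132608
extra = moe - dense            # 99520 added by swapping one FFN
```

The gate adds only 256 parameters here, so the MoE block is very close to \(E\) times the dense FFN, which is what test_total_params_grow_as_E checks.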
The model factory should accept --moe-layer=1 (which block index gets the MoE) and --num-experts=4 --top-k=2.
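One possible way to expose these flags is a small argparse parser. This is a sketch, not the Phase 18 CLI: the function names, the defaults, and the sanity check on top-k are all assumptions.

```python
import argparse


def build_parser():
    """Parser for the three MoE flags named in the lab text."""
    parser = argparse.ArgumentParser(description="MoE variant flags (sketch).")
    parser.add_argument("--moe-layer", type=int, default=1,
                        help="index of the block whose FFN becomes the MoE")
    parser.add_argument("--num-experts", type=int, default=4)
    parser.add_argument("--top-k", type=int, default=2)
    return parser


def parse_moe_args(argv):
    """Parse argv and reject a top-k that exceeds the expert count."""
    args = build_parser().parse_args(argv)
    if not 1 <= args.top_k <= args.num_experts:
        raise ValueError("--top-k must be between 1 and --num-experts")
    return args


args = parse_moe_args(["--moe-layer=1", "--num-experts=4", "--top-k=2"])
```

argparse turns the dashed flag names into `args.moe_layer`, `args.num_experts`, and `args.top_k`.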
Block C: the comparison protocol
For a fair comparison:
- Same seed (e.g., 36000).
- Same data shards (Phase 12 corpus splits).
- Same LR schedule (Phase 18 cosine + warmup).
- Same step count (e.g., 5000 steps).
- Same batch size.
- Same eval probe set (Phase 20 eval harness).
Run both models. Record:
- Final training loss.
- Final validation perplexity.
- Total parameters.
- Training wall time.
- Per-expert load distribution (MoE only): mean and stddev of the token share assigned to each of the 4 experts.
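The per-expert load numbers can be derived from raw assignment counts. The sketch below assumes the counts have already been collected (one integer per expert) and uses the population standard deviation; both the helper name `load_stats` and that choice of estimator are assumptions. The counts in the example are made-up inputs, not results.

```python
from statistics import mean, pstdev


def load_stats(token_counts):
    """Summarize how evenly assignments are spread across experts."""
    total = sum(token_counts)
    shares = [c / total for c in token_counts]
    return {
        "shares": shares,
        "mean": mean(shares),
        "stddev": pstdev(shares),  # population std over the E experts
        "dead": sum(1 for c in token_counts if c == 0),
    }


balanced = load_stats([250, 250, 250, 250])  # illustrative counts only
skewed = load_stats([900, 100, 0, 0])        # illustrative counts only
```

The `stddev` and `dead` values map onto the `load_balance_stddev` and `experts_dead_count` fields of the manifest.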
Block D: the routing analysis
For the MoE model, instrument the forward pass to log which expert(s) each token gets routed to. After training, produce two analysis artifacts:
- Per-token expert assignment histogram — for a representative validation batch, plot the distribution of routing. Check: is it uniform? Skewed? Are some experts dead?
- Per-token-class expert assignment — group tokens by their corpus label (e.g., "verb-infinitive", "verb-past", "subject-pronoun"). Do experts specialize on linguistic categories, or on something arbitrary like token frequency?
Commit both plots to experiments/36-moe-on-grammar-tutor/ as routing-analysis.png and expert-by-class.png.
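Before plotting, the per-class analysis reduces to a table of routing shares per token class. A minimal sketch, assuming the forward-pass logging yields (token_class, expert_index) pairs; that record format and the helper name `expert_by_class` are hypothetical, and the records below are toy inputs.

```python
from collections import Counter, defaultdict


def expert_by_class(records, num_experts):
    """Map each token class to its share of assignments per expert.

    records: iterable of (token_class, expert_index) pairs. With top-2
    routing each token contributes two records, one per selected expert.
    """
    counts = defaultdict(Counter)
    for token_class, expert in records:
        counts[token_class][expert] += 1
    table = {}
    for token_class, per_expert in counts.items():
        total = sum(per_expert.values())
        table[token_class] = [per_expert[e] / total for e in range(num_experts)]
    return table


toy_records = [
    ("verb-past", 0), ("verb-past", 0), ("verb-past", 1),
    ("subject-pronoun", 3),
]
table = expert_by_class(toy_records, num_experts=4)
```

Rows that look alike across all classes suggest the experts are not specializing on the linguistic labels.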
Block E: the honest findings
findings.md — 300–600 words. Address each of:
- Did the MoE variant achieve lower perplexity than the dense baseline? (Expected: no, or marginally — within noise.)
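- How many more parameters does the MoE have? (Expected: the MoE block holds about \(E\)× the parameters of the dense FFN it replaces, so roughly 4× at \(E=4\), plus a small gate.)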
- Was the load-balance loss working — did experts get roughly equal token counts?
- Did experts specialize linguistically, or arbitrarily?
- Did training take longer? Why?
- Bottom line: should the grammar tutor adopt MoE? Answer: no, with reasoning.
Be honest. If by some chance the MoE does beat the baseline (it can happen due to noise, regularization effects, or a happy random init), document it but also report the variance: rerun both with 3 different seeds and report mean ± std. Don't cherry-pick.
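The mean ± std report can be sketched as follows. The decision rule (the gain must exceed the sum of the two standard deviations) is a deliberately crude assumption, not a statistical test, and the perplexities are placeholder numbers chosen to show the mechanics, not measurements.

```python
from statistics import mean, stdev


def summarize(ppls):
    """Mean and sample standard deviation over per-seed perplexities."""
    return mean(ppls), stdev(ppls)


def gain_exceeds_noise(dense_ppls, moe_ppls):
    """Return (verdict, delta); delta > 0 means the MoE had lower perplexity."""
    d_mean, d_std = summarize(dense_ppls)
    m_mean, m_std = summarize(moe_ppls)
    delta = d_mean - m_mean
    # Assumed rule of thumb: the gain must be larger than both spreads combined.
    return delta > (d_std + m_std), delta


# Placeholder values for three seeds each; replace with real runs.
dense_ppls = [10.0, 10.2, 9.8]
moe_ppls = [9.9, 10.3, 9.7]
exceeds, delta = gain_exceeds_noise(dense_ppls, moe_ppls)
```

With these placeholders the MoE mean is slightly lower but well inside the seed-to-seed spread, which is the situation findings.md should report as "within noise".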
Block F: manifest
experiments/36-moe-on-grammar-tutor/manifest.json:
{
  "seed": 36000,
  "lab": "00-moe-on-grammar-tutor",
  "num_experts": 4,
  "top_k": 2,
  "moe_layer_index": 1,
  "alpha_aux_loss": 0.01,
  "steps": 5000,
  "dense_baseline": {
    "params": "<filled in>",
    "final_val_ppl": "<filled in>",
    "wall_time_s": "<filled in>"
  },
  "moe_variant": {
    "params": "<filled in>",
    "final_val_ppl": "<filled in>",
    "wall_time_s": "<filled in>",
    "load_balance_stddev": "<filled in>",
    "experts_dead_count": "<filled in>"
  },
  "verdict_for_grammar_tutor": "<adopt / defer / never>",
  "lesson_notes": "<the bottom-line paragraph from findings.md>"
}
Constraints
- CPU only. No cloud spend.
- 5000 steps maximum. Convergence is fast on this corpus; more steps don't change the answer.
- No exotic optimizers. AdamW from Phase 18. Same LR schedule as the baseline.
- No special MoE tricks. No expert dropout, no expert noise injection, no fancy routing. Vanilla top-2 + aux loss. The point is to see MoE, not to overfit it to win.
- Bench across 3 seeds, not 1. If the variance overwhelms the dense-vs-MoE delta, report that honestly. Don't claim a win that's within noise.
Stop conditions
You're done when:
- experiments/36-moe-on-grammar-tutor/{moe_block.py, train_moe.py, compare.py, findings.md, manifest.json, routing-analysis.png, expert-by-class.png} all exist.
- findings.md answers all six questions in Block E.
- manifest.json has the comparison numbers.
- The verdict for the grammar tutor is defer or never; if it is adopt, you have 3-seed evidence showing the gain exceeds the variance.
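The mechanical part of these stop conditions can be checked with a short script. This is an optional sketch: the helper names are hypothetical, and treating any string value that starts with "<" as an unfilled manifest placeholder is an assumption based on the template in Block F.

```python
import tempfile
from pathlib import Path

REQUIRED_ARTIFACTS = [
    "moe_block.py", "train_moe.py", "compare.py", "findings.md",
    "manifest.json", "routing-analysis.png", "expert-by-class.png",
]


def missing_artifacts(exp_dir):
    """List required files that do not exist under exp_dir."""
    exp_dir = Path(exp_dir)
    return [name for name in REQUIRED_ARTIFACTS if not (exp_dir / name).is_file()]


def unfilled_fields(manifest, prefix=""):
    """Recursively list manifest keys whose value is still a '<...>' placeholder."""
    pending = []
    for key, value in manifest.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            pending += unfilled_fields(value, prefix=path + ".")
        elif isinstance(value, str) and value.startswith("<"):
            pending.append(path)
    return pending


# Demo on a throwaway directory holding only two of the seven files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("moe_block.py", "findings.md"):
        (Path(tmp) / name).write_text("")
    still_missing = missing_artifacts(tmp)

todo = unfilled_fields({"steps": 5000, "moe_variant": {"params": "<filled in>"}})
```

Whether findings.md really answers all six questions still needs a human read; the script only covers file presence and manifest completeness.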
Hint of last resort
If routing collapses (one expert gets >80% of tokens):
- First check: is alpha_aux_loss too low? Try 0.05.
- Second check: gate weights initialized to zero? Initialize \(W_g\) with small random values (std=0.02).
- Third check: top-\(k\) vs \(E\). For \(E=2, k=2\), routing is degenerate (every token goes to every expert), and the aux loss has no leverage. Use \(E=4, k=2\).
If MoE training is unstable (loss spikes): halve the LR. MoE training tends to be more sensitive to LR than dense training.
If MoE beats the baseline by a comfortable margin (>10% perplexity reduction with low variance): take it seriously, but also check that you didn't accidentally make the dense baseline weaker than Phase 18's. The comparison must be apples-to-apples.
When to consult solutions/
After committing findings.md. Solution lives in solutions/00-moe-ref.md — written at phase open after Borja's Phase 17 + 18 are in. The reference solution includes the expected ranges of the comparison numbers (within bands of variance) so Borja can sanity-check his run.
Next lab: lab/01-mla-math-exercise.md.