
Lab 00: A top-2 MoE variant of MiniGPT-grammar (and why it doesn't help)

Goal: swap the FFN in one transformer block for a top-2 MoE layer (E=2 as a smoke test, E=4 for the actual experiment). Train locally on CPU. Compare honestly against the dense baseline. Expected outcome: a negative result, i.e. MoE doesn't help the grammar tutor.

Estimated time: 3–4 hours.

Prereq: Phase 17 MiniGPT-grammar trained; Phase 18 training loop in src/minitrain/; Phase 19 inspection hooks for monitoring loss. Borja has read theory/01-moe.md.

What you produce

A directory experiments/36-moe-on-grammar-tutor/ containing:

moe_block.py: local (experiment-scoped) module with a MoEBlock class implementing routing, the expert FFNs, and the load-balancing aux loss. Per the phase plan, this lives in the experiment directory, or as src/minimodel/_moe_block.py if Borja prefers (a single optional file under an existing module, not a new top-level module).
train_moe.py: training script that builds a MiniGPT-grammar variant with the MoE block replacing one of the dense FFNs (see Block B).
compare.py: runs the baseline (Phase 18 dense MiniGPT) and the MoE variant side by side, with the same data, the same seed, and the same step count.
manifest.json: final metrics, parameter counts, perplexity, training time.
findings.md: the honest report on the comparison.
routing-analysis.png and expert-by-class.png: the two routing plots from Block D.

TODOs

Block A: implement the MoE block

# experiments/36-moe-on-grammar-tutor/moe_block.py
# Skeleton — Borja writes the body. ~150 LOC target.

import torch
from torch import nn
import torch.nn.functional as F

class MoEBlock(nn.Module):
    """Top-k routing over E experts. Returns (output, aux_loss).

    For the grammar tutor at d_model=64, d_ff=256, set E=2, k=2 to start
    (every token goes to every expert: degenerate, useful only for testing).
    Then E=4, k=2 for the actual experiment.
    """
    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.GELU(),
                nn.Linear(d_ff, d_model),
            )
            for _ in range(num_experts)
        ])
        self.top_k = top_k
        self.num_experts = num_experts

    def forward(self, x):
        # x: [B, T, d_model]
        # 1. compute gate logits, softmax, top-k indices and weights
        # 2. for each expert, gather the tokens assigned to it
        # 3. compute the expert's FFN on its assigned tokens
        # 4. scatter back into the output, weighted by gate
        # 5. compute auxiliary load-balance loss (theory/01-moe.md §2)
        # 6. return (output [B, T, d_model], aux_loss [scalar])
        ...

Implementation notes:

No capacity dropping. At grammar-tutor scale every token can fit. Capacity factor = ∞.
Differentiable top-k. torch.topk returns integer indices, which carry no gradient, but the selected gate weights (the softmax values) do carry gradients back to the gate. That is enough to train the router.
Aux loss scale. Start with \(\alpha = 0.01\). Tune if routing collapses.
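The gradient note above can be checked with a minimal gating sketch. The helper name `top_k_gate` and the renormalization of the k selected weights are assumptions made for illustration, not part of the lab spec. The only point is that the values returned by `torch.topk` carry gradients back to the gate while the indices do not.

```python
import torch
import torch.nn.functional as F


def top_k_gate(logits: torch.Tensor, k: int):
    """Hypothetical helper: logits [N, E] -> (weights [N, k], indices [N, k], probs [N, E])."""
    probs = F.softmax(logits, dim=-1)
    weights, indices = torch.topk(probs, k, dim=-1)
    # Assumption: renormalize so each token's k selected weights sum to 1.
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, indices, probs


torch.manual_seed(0)
gate = torch.nn.Linear(64, 4, bias=False)  # d_model=64, E=4
tokens = torch.randn(10, 64)               # 10 flattened tokens
weights, indices, probs = top_k_gate(gate(tokens), k=2)

# Backprop through the largest selected weight of each token only. The per-token
# sum of weights is constant after renormalization and would give a zero gradient.
weights[:, 0].sum().backward()
has_gate_grad = gate.weight.grad is not None and gate.weight.grad.abs().sum().item() > 0
```

If the block skips renormalization, the weights are the raw softmax values and still carry gradients; which variant matches theory/01-moe.md is for Borja to decide.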

Tests in tests/minimodel/test_moe_block.py (if you place the block in src/minimodel/):

test_routing_is_top_k — every token gets exactly \(k\) non-zero gate weights.
test_aux_loss_is_one_for_uniform — for uniform routing, aux loss equals 1.
test_aux_loss_grows_under_imbalance — feed all tokens to expert 0; aux loss should be \(E\).
test_total_params_grow_as_E — param count of MoE with \(E\) experts ≈ \(E\) × dense FFN.
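The two aux-loss tests pin down specific values, so it helps to see one definition that produces them. The sketch below assumes a Switch-Transformer-style loss, \(E \sum_i f_i P_i\), where \(f_i\) is the share of assignments going to expert \(i\) and \(P_i\) is its mean gate probability. theory/01-moe.md §2 is not reproduced here, so the exact form, the function name `load_balance_loss`, and the choice to return the loss before scaling by \(\alpha\) are all assumptions.

```python
import torch
import torch.nn.functional as F


def load_balance_loss(probs: torch.Tensor, indices: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Assumed form: E * sum_i f_i * P_i (unscaled, before multiplying by alpha)."""
    one_hot = F.one_hot(indices, num_experts).float()  # [N, k, E]
    f = one_hot.sum(dim=(0, 1)) / indices.numel()      # share of assignments per expert
    p = probs.mean(dim=0)                              # mean gate probability per expert
    return num_experts * torch.sum(f * p)


E = 4
# Uniform case: flat gate probabilities, assignments spread evenly (k=2).
uniform_probs = torch.full((8, E), 1.0 / E)
uniform_idx = torch.tensor([[0, 1], [2, 3]]).repeat(4, 1)  # [8, 2]
uniform_loss = load_balance_loss(uniform_probs, uniform_idx, E)

# Collapsed case: all probability mass and every assignment on expert 0 (k=1).
collapsed_probs = F.one_hot(torch.zeros(8, dtype=torch.long), E).float()
collapsed_idx = torch.zeros(8, 1, dtype=torch.long)
collapsed_loss = load_balance_loss(collapsed_probs, collapsed_idx, E)
```

Under this definition the uniform case evaluates to 1 and the fully collapsed case to \(E\). Note that with k=2 a token is always dispatched to two distinct experts, so the collapse test has to state how "all tokens to expert 0" is constructed (here: k=1 with one-hot gate probabilities).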

Block B: modify the model

Take the Phase 17 MiniGPT-grammar and replace exactly one block's FFN with the MoE block. Single FFN replacement is enough — replacing all of them adds proportional cost without proportional insight at this scale.
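To make the cost of a single replacement concrete, the parameter arithmetic can be worked out by hand. This sketch assumes the expert layout from the Block A skeleton (two biased linear layers per expert, a bias-free gate) and the d_model=64, d_ff=256 sizes quoted in its docstring; the helper names are hypothetical.

```python
def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameter count of one linear layer: weight matrix plus optional bias."""
    return n_in * n_out + (n_out if bias else 0)


def dense_ffn_params(d_model: int, d_ff: int) -> int:
    """Linear(d_model, d_ff) followed by Linear(d_ff, d_model), both with bias."""
    return linear_params(d_model, d_ff) + linear_params(d_ff, d_model)


def moe_params(d_model: int, d_ff: int, num_experts: int) -> int:
    """E copies of the dense FFN plus a bias-free gate of shape [d_model, E]."""
    gate = linear_params(d_model, num_experts, bias=False)
    return num_experts * dense_ffn_params(d_model, d_ff) + gate


dense = dense_ffn_params(64, 256)  # 33088
moe = moe_params(64, 256, 4)       # 132608
extra = moe - dense                # 99520, roughly three extra FFNs plus the gate
```

These counts cover only the replaced FFN, not the rest of the model, so the whole-model ratio reported in manifest.json will be smaller.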

The model factory should accept --moe-layer=1 (which block index gets the MoE) and --num-experts=4 --top-k=2.
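One possible shape for those flags, using the standard-library `argparse`. The flag names come from the text above; the function names, the defaults, and the `top_k <= num_experts` check are assumptions added for illustration.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="MiniGPT-grammar model factory (sketch)")
    # Assumption: omitting --moe-layer means "build the dense baseline".
    parser.add_argument("--moe-layer", type=int, default=None,
                        help="index of the block whose FFN becomes the MoE block")
    parser.add_argument("--num-experts", type=int, default=4)
    parser.add_argument("--top-k", type=int, default=2)
    return parser


def parse_model_args(argv=None) -> argparse.Namespace:
    parser = build_arg_parser()
    args = parser.parse_args(argv)
    if args.top_k > args.num_experts:
        parser.error("--top-k cannot exceed --num-experts")
    return args


args = parse_model_args(["--moe-layer=1", "--num-experts=4", "--top-k=2"])
```

`argparse` maps `--moe-layer` to `args.moe_layer`, so the factory can branch on `args.moe_layer is None` to build the baseline.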

Block C: the comparison protocol

For a fair comparison:

Same seed (e.g., 36000).
Same data shards (Phase 12 corpus splits).
Same LR schedule (Phase 18 cosine + warmup).
Same step count (e.g., 5000 steps).
Same batch size.
Same eval probe set (Phase 20 eval harness).
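One way to keep the two runs from drifting apart is to derive the MoE configuration from the baseline configuration, so that only the MoE fields can differ. The sketch below is hypothetical: `RunConfig`, its field names, and the batch size value are placeholders, not the Phase 18 API.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical run description; the field names are assumptions."""
    seed: int
    steps: int
    batch_size: int
    moe_layer: Optional[int] = None  # None means dense baseline
    num_experts: int = 0
    top_k: int = 0


SHARED_FIELDS = ("seed", "steps", "batch_size")


def same_protocol(a: RunConfig, b: RunConfig) -> bool:
    """True when every protocol field that must match actually matches."""
    return all(getattr(a, name) == getattr(b, name) for name in SHARED_FIELDS)


# batch_size=32 is a placeholder; use whatever Phase 18 used.
baseline = RunConfig(seed=36000, steps=5000, batch_size=32)
moe = replace(baseline, moe_layer=1, num_experts=4, top_k=2)
protocol_ok = same_protocol(baseline, moe)
```

Data shards, LR schedule, and eval probe set would be added to `SHARED_FIELDS` in the same way once their identifiers are known.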

Run both models. Record:

Final training loss.
Final validation perplexity.
Total parameters.
Training wall time.
Per-expert load distribution (MoE only): mean and stddev of token assignment across the 4 experts.
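The per-expert load numbers can be computed from a flat list of logged expert ids. This is a sketch under assumptions: the function name `load_stats` is hypothetical, the statistics are taken over assignment fractions rather than raw counts, and an expert counts as dead only if it received no assignments at all.

```python
from collections import Counter
from statistics import mean, pstdev


def load_stats(expert_ids, num_experts: int) -> dict:
    """Summarize how assignments are spread over experts.

    expert_ids: one entry per (token, selected expert) pair, so a top-2
    router contributes two entries per token.
    """
    if not expert_ids:
        raise ValueError("no routing assignments were logged")
    counts = Counter(expert_ids)
    total = len(expert_ids)
    fractions = [counts.get(e, 0) / total for e in range(num_experts)]
    return {
        "fractions": fractions,
        "mean": mean(fractions),      # always 1 / num_experts
        "stddev": pstdev(fractions),  # 0.0 means perfectly balanced
        "dead": sum(1 for f in fractions if f == 0),
    }


balanced = load_stats([0, 1, 2, 3, 0, 1, 2, 3], num_experts=4)
skewed = load_stats([0, 0, 0, 1], num_experts=4)
```

Because the mean of the fractions is fixed at 1/E, the stddev and the dead count are the informative numbers for manifest.json.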

Block D: the routing analysis

For the MoE model, instrument the forward pass to log which expert(s) each token gets routed to. After training, produce two analysis artifacts:

Per-token expert assignment histogram — for a representative validation batch, plot the distribution of routing. Check: is it uniform? Skewed? Are some experts dead?
Per-token-class expert assignment — group tokens by their corpus label (e.g., "verb-infinitive", "verb-past", "subject-pronoun"). Do experts specialize on linguistic categories, or on something arbitrary like token frequency?

Commit both as plots in experiments/36-moe-on-grammar-tutor/routing-analysis.png and expert-by-class.png.
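Before plotting, the per-class table behind expert-by-class.png can be built with the standard library alone. The sketch assumes the forward-pass instrumentation yields (class label, expert id) pairs; that record format and the function name are hypothetical, and the records in the usage lines are toy values, not measurements.

```python
from collections import Counter, defaultdict


def expert_share_by_class(records, num_experts: int) -> dict:
    """Map each token class to the fraction of its assignments per expert.

    records: iterable of (class_label, expert_id) pairs.
    """
    counts = defaultdict(Counter)
    for label, expert in records:
        counts[label][expert] += 1
    table = {}
    for label, per_expert in counts.items():
        total = sum(per_expert.values())
        table[label] = [per_expert.get(e, 0) / total for e in range(num_experts)]
    return table


# Toy records for illustration only.
toy_records = [
    ("verb-past", 0), ("verb-past", 0), ("verb-past", 1), ("verb-past", 0),
    ("subject-pronoun", 2), ("subject-pronoun", 3),
]
table = expert_share_by_class(toy_records, num_experts=4)
```

Each row sums to 1, so rows can be drawn directly as a stacked bar or heatmap. To separate linguistic specialization from a frequency effect, the same function can be run with frequency buckets as the labels.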

Block E: the honest findings

findings.md — 300–600 words. Address each of:

Did the MoE variant achieve lower perplexity than the dense baseline? (Expected: no, or marginally — within noise.)
How many more parameters does the MoE have? (Expected: the MoE layer holds about E× the dense FFN's parameters plus a small gate, so about 3 extra FFNs' worth at E=4.)
Was the load-balance loss working — did experts get roughly equal token counts?
Did experts specialize linguistically, or arbitrarily?
Did training take longer? Why?
Bottom line: should the grammar tutor adopt MoE? Answer: no, with reasoning.

Be honest. If by some chance the MoE does beat the baseline (it can happen due to noise, regularization effects, or a happy random init), document it but also report the variance: rerun both with 3 different seeds and report mean ± std. Don't cherry-pick.
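The mean ± std report can be produced with the `statistics` module. The decision rule below (the gain must exceed the sum of the two standard deviations) is a crude heuristic chosen for this sketch, not a statistical test prescribed by the lab, and the numbers in the usage lines are made-up placeholders, not results.

```python
from statistics import mean, stdev


def summarize(values):
    """Return (mean, sample standard deviation) across seeds."""
    return mean(values), stdev(values)


def gain_exceeds_noise(dense_ppl, moe_ppl) -> bool:
    """Heuristic: is the MoE perplexity gain larger than the combined seed noise?"""
    dense_mean, dense_std = summarize(dense_ppl)
    moe_mean, moe_std = summarize(moe_ppl)
    gain = dense_mean - moe_mean  # positive means the MoE has lower perplexity
    return gain > dense_std + moe_std


# Placeholder perplexities for three seeds each (illustrative only).
dense_runs = [10.0, 10.2, 9.8]
moe_runs = [9.9, 10.3, 10.1]
moe_wins = gain_exceeds_noise(dense_runs, moe_runs)
```

With only three seeds the standard deviation estimate is itself noisy, so findings.md should report the raw per-seed values alongside the summary.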

Block F: manifest

experiments/36-moe-on-grammar-tutor/manifest.json:

{
  "seed": 36000,
  "lab": "00-moe-on-grammar-tutor",
  "num_experts": 4,
  "top_k": 2,
  "moe_layer_index": 1,
  "alpha_aux_loss": 0.01,
  "steps": 5000,
  "dense_baseline": {
    "params": "<filled in>",
    "final_val_ppl": "<filled in>",
    "wall_time_s": "<filled in>"
  },
  "moe_variant": {
    "params": "<filled in>",
    "final_val_ppl": "<filled in>",
    "wall_time_s": "<filled in>",
    "load_balance_stddev": "<filled in>",
    "experts_dead_count": "<filled in>"
  },
  "verdict_for_grammar_tutor": "<adopt / defer / never>",
  "lesson_notes": "<the bottom-line paragraph from findings.md>"
}

Constraints

CPU only. No cloud spend.
5000 steps maximum. Convergence is fast on this corpus; more steps don't change the answer.
No exotic optimizers. AdamW from Phase 18. Same LR schedule as the baseline.
No special MoE tricks. No expert dropout, no expert noise injection, no fancy routing. Vanilla top-2 + aux loss. The point is to see MoE, not to overfit it to win.
Bench across 3 seeds, not 1. If the variance overwhelms the dense-vs-MoE delta, report that honestly. Don't claim a win that's within noise.

Stop conditions

You're done when:

experiments/36-moe-on-grammar-tutor/{moe_block.py, train_moe.py, compare.py, findings.md, manifest.json, routing-analysis.png, expert-by-class.png} all exist.
findings.md answers all six questions in Block E.
manifest.json has the comparison numbers.
The verdict for the grammar tutor is defer or never. If it is adopt, you have 3-seed evidence showing that the gain exceeds the seed-to-seed variance.
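The mechanical parts of this checklist (files present, manifest filled in) can be verified with a short script. This is a sketch: the function names are hypothetical, and it assumes unfilled manifest values still look like the angle-bracket placeholders in the Block F template.

```python
import json
from pathlib import Path

REQUIRED_FILES = [
    "moe_block.py", "train_moe.py", "compare.py", "findings.md",
    "manifest.json", "routing-analysis.png", "expert-by-class.png",
]


def find_placeholders(node, path=""):
    """Yield the JSON paths whose value is still a '<...>' template string."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_placeholders(value, f"{path}/{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from find_placeholders(value, f"{path}/{i}")
    elif isinstance(node, str) and node.startswith("<") and node.endswith(">"):
        yield path


def check_done(exp_dir: Path) -> list:
    """Return a list of human-readable problems; an empty list means the checks pass."""
    problems = [f"missing {name}" for name in REQUIRED_FILES
                if not (exp_dir / name).exists()]
    manifest = exp_dir / "manifest.json"
    if manifest.exists():
        data = json.loads(manifest.read_text())
        problems += [f"unfilled {p}" for p in find_placeholders(data)]
    return problems
```

Calling `check_done(Path("experiments/36-moe-on-grammar-tutor"))` from the repository root lists what is still open. It cannot judge whether findings.md actually answers the six questions.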

Hint of last resort

If routing collapses (one expert gets >80% of tokens):

First check: is alpha_aux_loss too low? Try 0.05.
Second check: gate weights initialized to zero? Initialize \(W_g\) with small random values (std=0.02).
Third check: top-\(k\) vs \(E\) — for \(E=2, k=2\), routing is degenerate (every token goes to every expert), and the aux loss has no leverage. Use \(E=4, k=2\).
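The collapse threshold and the gate initialization from the checks above can be written down in a few lines. This is a sketch: `is_collapsed` is a hypothetical helper, and the gate shape assumes d_model=64 and E=4 as in Block A.

```python
import torch
from torch import nn


def is_collapsed(fractions, threshold: float = 0.8) -> bool:
    """True when a single expert receives more than `threshold` of the assignments."""
    return max(fractions) > threshold


torch.manual_seed(36000)
gate = nn.Linear(64, 4, bias=False)  # W_g: d_model=64 -> E=4 logits
# Re-initialize W_g with small random values (std=0.02), as suggested above.
nn.init.normal_(gate.weight, mean=0.0, std=0.02)

collapsed = is_collapsed([0.85, 0.05, 0.05, 0.05])
healthy = is_collapsed([0.30, 0.25, 0.25, 0.20])
```

Small gate weights keep the initial logits close to zero, so the softmax starts near uniform and the aux loss has room to act before any expert dominates.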

If MoE training is unstable (loss spikes): halve the LR. MoE training tends to be more sensitive to LR than dense training.

If MoE beats the baseline by a comfortable margin (>10% perplexity reduction with low variance): take it seriously, but also check that you didn't accidentally make the dense baseline weaker than Phase 18's. The comparison must be apples-to-apples.

When to consult `solutions/`

After committing findings.md. Solution lives in solutions/00-moe-ref.md — written at phase open after Borja's Phase 17 + 18 are in. The reference solution includes the expected ranges of the comparison numbers (within bands of variance) so Borja can sanity-check his run.

Next lab: lab/01-mla-math-exercise.md.