Lab 00 — A 2-expert MoE variant of MiniGPT-grammar (and why it doesn't help)

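Goal: swap the FFN in one transformer block for an MoE block (2 experts as a degenerate smoke test, then 4 experts with top-2 routing for the actual experiment). Train locally on CPU. Honestly compare to the dense baseline under identical conditions. Confirm the expected negative result: MoE doesn't help the grammar tutor.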

Estimated time: 3–4 hours.

Prereq: Phase 17 MiniGPT-grammar trained; Phase 18 training loop in src/minitrain/; Phase 19 inspection hooks for monitoring loss. Borja has read theory/01-moe.md.


What you produce

A directory experiments/36-moe-on-grammar-tutor/ containing:

  • moe_block.py — local (experiment-scoped) module with a MoEBlock class implementing routing + 2 expert FFNs + load-balancing aux loss. Per the phase plan, this lives in the experiment directory (or as src/minimodel/_moe_block.py if Borja prefers — single optional file under an existing module, not a new top-level module).
  • train_moe.py — training script: builds a MiniGPT-grammar variant with the MoE block replacing exactly one of the dense FFNs (see Block B).
  • compare.py — runs the baseline (Phase 18 dense MiniGPT) and the MoE variant, side-by-side: same data, same seed, same step count.
  • manifest.json — final metrics, parameter counts, perplexity, training time.
  • findings.md — the honest report on the comparison.

TODOs

Block A — implement the MoE block

# experiments/36-moe-on-grammar-tutor/moe_block.py
# Skeleton — Borja writes the body. ~150 LOC target.

import torch
from torch import nn
import torch.nn.functional as F

class MoEBlock(nn.Module):
    """Top-k routing over E experts. Returns (output, aux_loss).

    For the grammar tutor at d_model=64, d_ff=256, set E=2, k=2 to start
    (every token goes to every expert — degenerate, mostly for testing).
    Then E=4, k=2 for the actual experiment.
    """
    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.GELU(),
                nn.Linear(d_ff, d_model),
            )
            for _ in range(num_experts)
        ])
        self.top_k = top_k
        self.num_experts = num_experts

    def forward(self, x):
        # x: [B, T, d_model]
        # 1. compute gate logits, softmax, top-k indices and weights
        # 2. for each expert, gather the tokens assigned to it
        # 3. compute the expert's FFN on its assigned tokens
        # 4. scatter back into the output, weighted by gate
        # 5. compute auxiliary load-balance loss (theory/01-moe.md §2)
        # 6. return (output [B, T, d_model], aux_loss [scalar])
        ...

Implementation notes:

  • No capacity dropping. At grammar-tutor scale every token can fit. Capacity factor = ∞.
  • Differentiable top-k. torch.topk has no gradient with respect to the indices it returns, but gradients do flow through the selected gate weights (the softmax values). That is enough to train the gate.
  • Aux loss scale. Start with \(\alpha = 0.01\). Tune if routing collapses.
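The differentiability note can be checked directly. This is a minimal sketch, not the lab's reference implementation: the tensor sizes are arbitrary, and renormalising the kept weights so they sum to 1 is an assumption (the notes above do not say whether to renormalise).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, num_experts, top_k = 8, 4, 2            # toy sizes, chosen for illustration
x = torch.randn(5, d_model)                      # 5 tokens, already flattened from [B, T, d_model]
gate = torch.nn.Linear(d_model, num_experts, bias=False)

probs = F.softmax(gate(x), dim=-1)               # [5, E] router probabilities
top_w, top_idx = torch.topk(probs, k=top_k, dim=-1)  # values carry grad; indices are integers
top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # assumption: renormalise the kept weights

# Any loss that depends on the kept weights sends gradient back into the gate.
# (A plain top_w.sum() would be constant after renormalising, so weight the slots.)
loss = (top_w * torch.arange(top_k, dtype=top_w.dtype)).sum()
loss.backward()

grad_reaches_gate = bool(gate.weight.grad.abs().sum() > 0)
indices_are_integer = top_idx.dtype == torch.int64
```

Both flags come out true: the gate receives gradient through the selected softmax values even though the integer indices themselves carry none.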

Tests in tests/minimodel/test_moe_block.py (if you place the block in src/minimodel/):

  • test_routing_is_top_k — every token gets exactly \(k\) non-zero gate weights.
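  • test_aux_loss_is_one_for_uniform — for perfectly uniform routing, the unscaled aux loss (before multiplying by \(\alpha\)) equals 1.
  • test_aux_loss_grows_under_imbalance — force all tokens to expert 0 (for example with \(k=1\), since with \(k=2\) every token also needs a second expert); the unscaled aux loss should be \(E\).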
  • test_total_params_grow_as_E — param count of MoE with \(E\) experts ≈ \(E\) × dense FFN.
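The two aux-loss targets (1 and \(E\)) are consistent with a Switch-style loss of the form \(E \sum_i f_i P_i\), where \(f_i\) is the fraction of assignments sent to expert \(i\) and \(P_i\) is the mean router probability for expert \(i\). That form is an assumption here, since theory/01-moe.md §2 is not reproduced in this lab; the function name `load_balance_loss` is hypothetical. A sketch under that assumption:

```python
import torch

def load_balance_loss(probs: torch.Tensor, top_idx: torch.Tensor) -> torch.Tensor:
    """Assumed Switch-style form: E * sum_i f_i * P_i (unscaled, before alpha).

    probs:   [N, E] router softmax outputs
    top_idx: [N, k] chosen expert indices (integer tensor)
    """
    num_experts = probs.shape[-1]
    counts = torch.bincount(top_idx.reshape(-1), minlength=num_experts).to(probs.dtype)
    f = counts / counts.sum()     # fraction of assignments per expert (no gradient)
    p = probs.mean(dim=0)         # mean router probability per expert (carries gradient)
    return num_experts * torch.sum(f * p)

num_experts, num_tokens = 4, 8

# Uniform case: flat probabilities, assignments spread evenly (k=1).
uniform_probs = torch.full((num_tokens, num_experts), 1.0 / num_experts)
uniform_idx = (torch.arange(num_tokens) % num_experts).unsqueeze(-1)
uniform_loss = load_balance_loss(uniform_probs, uniform_idx)

# Collapsed case: every token picks expert 0 with probability 1 (k=1).
collapsed_probs = torch.zeros(num_tokens, num_experts)
collapsed_probs[:, 0] = 1.0
collapsed_idx = torch.zeros(num_tokens, 1, dtype=torch.long)
collapsed_loss = load_balance_loss(collapsed_probs, collapsed_idx)
```

Under this form `uniform_loss` is 1 and `collapsed_loss` is 4, matching the two test descriptions for \(E=4\). If the theory note defines the loss differently, the targets in the tests should follow that definition instead.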

Block B — modify the model

Take the Phase 17 MiniGPT-grammar and replace exactly one block's FFN with the MoE block. Single FFN replacement is enough — replacing all of them adds proportional cost without proportional insight at this scale.

The model factory should accept --moe-layer=1 (which block index gets the MoE) and --num-experts=4 --top-k=2.
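One possible shape for those flags, using the standard library's argparse. The flag names come from the text above; the defaults, help strings, and the `validate` helper are assumptions for illustration, and how the parsed values reach the Phase 17 model factory is left open.

```python
import argparse

def build_arg_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="MiniGPT-grammar with one MoE FFN")
    parser.add_argument("--moe-layer", type=int, default=1,
                        help="index of the transformer block whose FFN becomes the MoE block")
    parser.add_argument("--num-experts", type=int, default=4,
                        help="number of expert FFNs (E)")
    parser.add_argument("--top-k", type=int, default=2,
                        help="experts selected per token (k)")
    return parser

def validate(args: argparse.Namespace) -> None:
    """Hypothetical sanity checks before the model is built."""
    if args.moe_layer < 0:
        raise ValueError("--moe-layer must be a non-negative block index")
    if not 1 <= args.top_k <= args.num_experts:
        raise ValueError("--top-k must satisfy 1 <= k <= --num-experts")

# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_arg_parser().parse_args(["--moe-layer=1", "--num-experts=4", "--top-k=2"])
validate(args)
```

argparse maps `--moe-layer` to `args.moe_layer`, so the factory can read `args.moe_layer`, `args.num_experts`, and `args.top_k` directly.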

Block C — the comparison protocol

For a fair comparison:

  • Same seed for both models within a run (e.g., 36000); repeat the pair over 3 seeds as required under Constraints.
  • Same data shards (Phase 12 corpus splits).
  • Same LR schedule (Phase 18 cosine + warmup).
  • Same step count (e.g., 5000 steps).
  • Same batch size.
  • Same eval probe set (Phase 20 eval harness).

Run both models. Record:

  • Final training loss.
  • Final validation perplexity.
  • Total parameters.
  • Training wall time.
  • Per-expert load distribution (MoE only) — mean and stddev of the token share across the 4 experts.
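The load-distribution numbers can be derived from a flat list of logged expert indices. This is a sketch with made-up toy assignments, not measured data; the function name `load_stats` and the returned keys are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def load_stats(assignments, num_experts):
    """Summarise how token assignments are spread over the experts.

    assignments: flat list of expert indices, one entry per (token, slot) pair.
    """
    counts = Counter(assignments)
    total = len(assignments)
    fractions = [counts.get(e, 0) / total for e in range(num_experts)]
    return {
        "fractions": fractions,
        "mean": mean(fractions),        # always 1/E, kept only as a sanity check
        "stddev": pstdev(fractions),    # 0 for perfectly balanced routing
        "dead": sum(1 for f in fractions if f == 0),
    }

# Toy log: expert 0 is overloaded and expert 3 never fires.
toy_assignments = [0, 1, 0, 2, 1, 0, 2, 0]
stats = load_stats(toy_assignments, num_experts=4)
```

Because the mean share is fixed at \(1/E\), the stddev and the dead-expert count are the values that actually describe balance; they map onto `load_balance_stddev` and `experts_dead_count` in the manifest.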

Block D — the routing analysis

For the MoE model, instrument the forward pass to log which expert(s) each token gets routed to. After training, produce two analysis artifacts:

  • Per-token expert assignment histogram — for a representative validation batch, plot the distribution of routing. Check: is it uniform? Skewed? Are some experts dead?
  • Per-token-class expert assignment — group tokens by their corpus label (e.g., "verb-infinitive", "verb-past", "subject-pronoun"). Do experts specialize on linguistic categories, or on something arbitrary like token frequency?

Commit both as plots in experiments/36-moe-on-grammar-tutor/routing-analysis.png and expert-by-class.png.
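Before plotting, the per-class view reduces to a small table of expert shares per token class. A minimal sketch, assuming the routing log can be flattened into (token_class, expert_index) pairs; the function name and the toy records are illustrative, and only the class labels come from the examples above.

```python
from collections import Counter, defaultdict

def expert_share_by_class(records, num_experts):
    """Return {token_class: [share of expert 0, share of expert 1, ...]}.

    records: iterable of (token_class, expert_index) pairs from the routing log.
    """
    by_class = defaultdict(Counter)
    for token_class, expert in records:
        by_class[token_class][expert] += 1

    table = {}
    for token_class, counts in by_class.items():
        total = sum(counts.values())
        table[token_class] = [counts.get(e, 0) / total for e in range(num_experts)]
    return table

# Toy records, invented for the example (not real routing output).
toy_records = [
    ("verb-past", 0), ("verb-past", 0), ("verb-past", 1), ("verb-past", 0),
    ("subject-pronoun", 2), ("subject-pronoun", 3),
]
table = expert_share_by_class(toy_records, num_experts=4)
```

Each row sums to 1, so the table can be drawn as a stacked bar chart or heatmap. A row dominated by one expert suggests specialisation on that class; near-identical rows suggest the router is ignoring the category.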

Block E — the honest findings

findings.md — 300–600 words. Address each of:

  1. Did the MoE variant achieve lower perplexity than the dense baseline? (Expected: no, or marginally — within noise.)
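  2. How many more parameters does the MoE have? (Expected: with \(E=4\), the MoE layer holds about 4× the parameters of the dense FFN it replaces, i.e. roughly three extra FFNs' worth plus a small gate.)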
  3. Was the load-balance loss working — did experts get roughly equal token counts?
  4. Did experts specialize linguistically, or arbitrarily?
  5. Did training take longer? Why?
  6. Bottom line: should the grammar tutor adopt MoE? Answer: no, with reasoning.
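For question 2, the expected parameter delta can be worked out ahead of the run. The arithmetic below assumes the layer shapes from the Block A skeleton (two biased linear layers per expert, a bias-free gate) with d_model=64, d_ff=256, and E=4; the helper names are hypothetical. It counts only the replaced layer, not the whole model.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights and biases of Linear(d_model, d_ff) followed by Linear(d_ff, d_model)."""
    up = d_model * d_ff + d_ff
    down = d_ff * d_model + d_model
    return up + down

def moe_params(d_model: int, d_ff: int, num_experts: int) -> int:
    """E expert FFNs plus a bias-free gate Linear(d_model, E)."""
    return num_experts * ffn_params(d_model, d_ff) + d_model * num_experts

dense = ffn_params(64, 256)          # 33_088
moe = moe_params(64, 256, 4)         # 132_608
extra = moe - dense                  # 99_520
ratio = moe / dense                  # about 4.01
```

The total-parameter gap between the two models in manifest.json should therefore be close to `extra`, since every other layer is shared between the dense and MoE variants.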

Be honest. If the MoE does beat the baseline (noise, regularization effects, or a lucky random init can all cause this), document it, and report the variance from the 3-seed runs as mean ± std for both models. Don't cherry-pick.
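A small helper can make the "within noise" judgment explicit. This is a sketch: the perplexity lists are placeholders invented for the example, not results, and the decision rule (gain must exceed the sum of the two standard deviations) is a deliberately crude assumption rather than a formal significance test.

```python
from statistics import mean, stdev

def summarize(ppls):
    """Return (mean, sample std) of per-seed validation perplexities."""
    return mean(ppls), stdev(ppls)

def gain_exceeds_noise(dense_ppls, moe_ppls):
    """True if the MoE's mean perplexity gain is larger than the combined spread."""
    dense_mean, dense_std = summarize(dense_ppls)
    moe_mean, moe_std = summarize(moe_ppls)
    gain = dense_mean - moe_mean          # positive means the MoE is better
    return gain > (dense_std + moe_std)

# Placeholder numbers for three seeds each (NOT measured values).
dense_placeholder = [10.0, 10.2, 9.8]
moe_noisy_placeholder = [9.9, 10.3, 9.7]   # tiny gain, large spread
moe_clear_placeholder = [8.0, 8.1, 7.9]    # large gain, small spread

noisy_verdict = gain_exceeds_noise(dense_placeholder, moe_noisy_placeholder)
clear_verdict = gain_exceeds_noise(dense_placeholder, moe_clear_placeholder)
```

With the placeholders, `noisy_verdict` is False and `clear_verdict` is True. With only three seeds the standard deviation is itself a rough estimate, so findings.md should report the raw per-seed numbers alongside mean ± std.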

Block F — manifest

experiments/36-moe-on-grammar-tutor/manifest.json:

{
  "seed": 36000,
  "lab": "00-moe-on-grammar-tutor",
  "num_experts": 4,
  "top_k": 2,
  "moe_layer_index": 1,
  "alpha_aux_loss": 0.01,
  "steps": 5000,
  "dense_baseline": {
    "params": "<filled in>",
    "final_val_ppl": "<filled in>",
    "wall_time_s": "<filled in>"
  },
  "moe_variant": {
    "params": "<filled in>",
    "final_val_ppl": "<filled in>",
    "wall_time_s": "<filled in>",
    "load_balance_stddev": "<filled in>",
    "experts_dead_count": "<filled in>"
  },
  "verdict_for_grammar_tutor": "<adopt / defer / never>",
  "lesson_notes": "<the bottom-line paragraph from findings.md>"
}

Constraints

  • CPU only. No cloud spend.
  • 5000 steps maximum. Convergence is fast on this corpus; more steps don't change the answer.
  • No exotic optimizers. AdamW from Phase 18. Same LR schedule as the baseline.
  • No special MoE tricks. No expert dropout, no expert noise injection, no fancy routing. Vanilla top-2 + aux loss. The point is to see MoE, not to overfit it to win.
  • Bench across 3 seeds, not 1. If the variance overwhelms the dense-vs-MoE delta, report that honestly. Don't claim a win that's within noise.

Stop conditions

You're done when:

  1. experiments/36-moe-on-grammar-tutor/{moe_block.py, train_moe.py, compare.py, findings.md, manifest.json, routing-analysis.png, expert-by-class.png} all exist.
  2. findings.md answers all six questions in Block E.
  3. manifest.json has the comparison numbers.
  4. The verdict for the grammar tutor is defer or never; if it is adopt, you have 3-seed evidence showing the gain exceeds the variance.
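Stop conditions 3 and 4 can be checked mechanically. The sketch below assumes the manifest keeps the "<...>" placeholder convention from Block F until a value is filled in; the function names and the toy manifest are hypothetical.

```python
import json

ALLOWED_VERDICTS = {"adopt", "defer", "never"}

def unfilled_fields(node, path=""):
    """Return dotted paths whose value still looks like a '<...>' placeholder."""
    if isinstance(node, dict):
        found = []
        for key, value in node.items():
            found += unfilled_fields(value, f"{path}.{key}" if path else key)
        return found
    if isinstance(node, str) and node.startswith("<") and node.endswith(">"):
        return [path]
    return []

def verdict_is_valid(manifest) -> bool:
    return manifest.get("verdict_for_grammar_tutor") in ALLOWED_VERDICTS

# Toy manifest with dummy values, only to exercise the checks.
toy_manifest = json.loads("""
{
  "seed": 36000,
  "moe_variant": {"params": 12345, "final_val_ppl": "<filled in>"},
  "verdict_for_grammar_tutor": "<adopt / defer / never>"
}
""")
missing = unfilled_fields(toy_manifest)
valid = verdict_is_valid(toy_manifest)
```

For the toy manifest, `missing` lists `moe_variant.final_val_ppl` and `verdict_for_grammar_tutor`, and `valid` is False. On the real file, loaded with `json.load`, an empty `missing` list and a True `valid` indicate the manifest is complete.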

Hint of last resort

If routing collapses (one expert gets >80% of tokens):

  • First check: is alpha_aux_loss too low? Try 0.05.
  • Second check: gate weights initialized to zero? Initialize \(W_g\) with small random values (std=0.02).
  • Third check: top-\(k\) vs \(E\) — for \(E=2, k=2\), routing is degenerate (every token goes to every expert), and the aux loss has no leverage. Use \(E=4, k=2\).
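The collapse threshold and the gate re-initialisation from the first two checks can be written down as follows. This is a sketch: `is_collapsed` is a hypothetical helper, the fractions are toy values, and the gate shape assumes d_model=64 and E=4 from Block A.

```python
import torch
from torch import nn

def is_collapsed(expert_fractions, threshold: float = 0.8) -> bool:
    """True if any single expert receives more than `threshold` of the assignments."""
    return max(expert_fractions) > threshold

collapsed = is_collapsed([0.85, 0.05, 0.05, 0.05])   # toy values
balanced = is_collapsed([0.25, 0.25, 0.25, 0.25])    # toy values

# Re-initialise the gate with small random weights (std=0.02), as suggested above.
torch.manual_seed(0)
gate = nn.Linear(64, 4, bias=False)
nn.init.normal_(gate.weight, mean=0.0, std=0.02)
gate_std = gate.weight.std().item()
```

`collapsed` is True and `balanced` is False. Small gate weights keep the initial logits near zero, so the softmax starts close to uniform and no expert gets an early head start.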

If MoE training is unstable (loss spikes): halve the LR, and rerun the dense baseline at the same LR so the Block C protocol still holds. MoE training is more sensitive to LR than dense training.

If MoE beats the baseline by a comfortable margin (>10% perplexity reduction with low variance): take it seriously, but also check that you didn't accidentally make the dense baseline weaker than Phase 18's. The comparison must be apples-to-apples.

When to consult solutions/

After committing findings.md. Solution lives in solutions/00-moe-ref.md — written at phase open after Borja's Phase 17 + 18 are in. The reference solution includes the expected ranges of the comparison numbers (within bands of variance) so Borja can sanity-check his run.


Next lab: lab/01-mla-math-exercise.md.