Break 00: MoE with a broken router (all tokens to one expert); the degenerate case

If you remove the auxiliary loss from the MoE, the router quickly collapses to a single expert. The training loss looks "fine" (it keeps falling), but the model behaves like an ordinary FFN with N-1 untrained experts. This /break demonstrates the failure by disabling the aux loss.

What you'll do

Disable the auxiliary load-balancing loss in the toy MoE built in docs/phase-36-frontier-architectures/lab/00-moe-on-grammar-tutor.md. Watch the router collapse to a single expert within 200 steps; observe that the main loss does not signal the failure.

Step 1: Locate the MoE block

src/minimoe/moe_block.py          # the toy MoE layer (Phase 36 lab 00)
src/minimoe/loss.py               # the combined train loss = main + aux

Step 2: Introduce the bug

In src/minimoe/loss.py, the combined loss currently weights the auxiliary load-balancing term by alpha=0.01. Zero it out:

# OLD
total_loss = main_loss + 0.01 * aux_load_balancing_loss(gates, mask)

# NEW (the broken version)
total_loss = main_loss + 0.0 * aux_load_balancing_loss(gates, mask)

One numeric constant. The aux loss is still computed and logged (so we have a metric to detect the collapse), but with a zero coefficient it contributes no gradient to the update.
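
To make the "computed but inert" point concrete, here is a minimal sketch in plain Python. The helper names `combine_losses` and `sensitivity_to_aux` are hypothetical and are not taken from src/minimoe/loss.py; scalar floats stand in for tensors, and a finite difference stands in for autograd.

```python
def combine_losses(main_loss: float, aux_loss: float, alpha: float = 0.01):
    """Hypothetical stand-in for the loss combination in the lab.

    Returns the total used for the update and the raw aux value,
    so the aux value can still be logged when alpha is 0.
    """
    total = main_loss + alpha * aux_loss
    return total, aux_loss


def sensitivity_to_aux(main_loss: float, aux_loss: float, alpha: float,
                       eps: float = 1e-3) -> float:
    """Central finite difference of the total with respect to the aux loss."""
    hi, _ = combine_losses(main_loss, aux_loss + eps, alpha)
    lo, _ = combine_losses(main_loss, aux_loss - eps, alpha)
    return (hi - lo) / (2 * eps)


# Illustrative values only (main and aux roughly as logged mid-run).
healthy = sensitivity_to_aux(1.872, 0.892, alpha=0.01)
broken = sensitivity_to_aux(1.872, 0.892, alpha=0.0)

print(f"alpha=0.01 -> d(total)/d(aux) = {healthy:.4f}")
print(f"alpha=0.0  -> d(total)/d(aux) = {broken:.4f}")
```

With alpha at zero the total no longer depends on the aux term at all, so no change to the router can be rewarded for improving balance, even though the aux value is still available for logging.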

Step 3: Record the break

learners/borja/phase-36/notes/breaks.md:

- bug-id: 36-01
  concept: MoE load-balancing aux loss
  symptom: main train loss falls normally; but the per-expert token count
           collapses — after ~200 steps, expert 0 sees 95%+ of tokens and the
           other 3 experts see ~0. Held-out val loss is worse than the dense
           baseline despite 4× the params.
  hidden_cause: the aux loss coefficient is zero; nothing pushes the router
                to spread tokens across experts.
  hint_1: "Log per-expert token counts. Plot them across steps. What's the trend?"
  hint_2: "What does the aux loss penalize, and what does its coefficient control?"
  hint_3: "diff src/minimoe/loss.py. Has anything multiplying the aux loss changed?"
  fix_diff: restore alpha=0.01 in the loss combination.
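
The first hint asks for per-expert token counts. A minimal sketch of that bookkeeping follows, assuming top-1 routing where each token is assigned exactly one expert index; the function names are hypothetical and the assignment list is made-up illustration data, not output from the lab.

```python
from collections import Counter


def expert_token_counts(assignments, num_experts):
    """Count how many tokens were routed to each expert (top-1 routing assumed)."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]


def expert_fractions(counts):
    """Convert raw counts to the fraction of tokens each expert received."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


# Made-up batch of 10 tokens, each tagged with its chosen expert index.
assignments = [0, 0, 2, 0, 1, 0, 0, 3, 0, 0]
counts = expert_token_counts(assignments, num_experts=4)
fractions = expert_fractions(counts)

print(counts)     # [7, 1, 1, 1]
print(fractions)  # [0.7, 0.1, 0.1, 0.1]
```

Logging these fractions every few steps and plotting them is enough to see the trend the hint points at: one line climbing toward 1 while the others fall toward 0.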

Step 4: Verify it's observable

Run the command just moe-train --steps 500. Expected output with the bug in place:

step=  10  main=2.401  aux=0.124  expert_counts=[ 240  238  244  254 ]
step= 100  main=2.018  aux=0.310  expert_counts=[ 540  191  119  126 ]
step= 200  main=1.872  aux=0.892  expert_counts=[ 893  62   18   3  ]
step= 300  main=1.748  aux=0.988  expert_counts=[ 945  20   8    3  ]
step= 400  main=1.665  aux=0.996  expert_counts=[ 962  10   3    1  ]
step= 500  main=1.601  aux=0.998  expert_counts=[ 970  4    2    0  ]   <-- collapse
val_loss=2.143  (dense_baseline=1.890)   <-- 4× params, *worse* val loss

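The aux_load_balancing_loss metric rises toward 1.0, the maximum of the unscaled sum \(\sum_e f_e P_e\): with all mass on one expert, that expert has \(f_e \approx P_e \approx 1\) and every other term vanishes. (The original Switch Transformer loss also multiplies this sum by the number of experts.) Main loss looks fine. The held-out val loss is the smoking gun.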
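
As a worked check of that limit, the sketch below evaluates the plain sum of \(f_e P_e\) for a balanced and a fully collapsed router. It assumes \(f_e\) is the fraction of tokens routed to expert e and \(P_e\) is the mean router probability for expert e; the function name is hypothetical, and the lab's logged metric may be scaled or normalized differently, so absolute values need not match the log above.

```python
def load_balance_sum(token_fractions, mean_router_probs):
    """Plain sum over experts of f_e * P_e, with no extra scaling factor."""
    if len(token_fractions) != len(mean_router_probs):
        raise ValueError("need one f_e and one P_e per expert")
    return sum(f * p for f, p in zip(token_fractions, mean_router_probs))


# Four experts, perfectly balanced: every f_e and P_e equals 1/4.
balanced = load_balance_sum([0.25] * 4, [0.25] * 4)

# Four experts, fully collapsed: expert 0 gets every token and all probability.
collapsed = load_balance_sum([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])

print(balanced)   # 0.25
print(collapsed)  # 1.0
```

For N experts the plain sum therefore ranges from 1/N when routing is uniform up to 1 at full collapse, which is why a value creeping toward 1.0 signals a single dominant expert.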

The test tests/phase36/test_moe_balance.py::test_no_expert_above_50_percent goes red.
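
The body of that test is not shown here, so the following is only a guess at the kind of check its name suggests: no single expert should receive more than half of the routed tokens. The helper names and the 0.5 default are assumptions; the two count lists are copied from the step 10 and step 500 lines of the log above.

```python
def max_expert_share(counts):
    """Largest fraction of routed tokens received by any single expert."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no tokens were routed")
    return max(counts) / total


def no_expert_above(counts, threshold=0.5):
    """True when every expert's share of tokens is at or below the threshold."""
    return max_expert_share(counts) <= threshold


step_10 = [240, 238, 244, 254]   # from the log: still balanced
step_500 = [970, 4, 2, 0]        # from the log: collapsed

print(f"step 10:  max share = {max_expert_share(step_10):.3f}")
print(f"step 500: max share = {max_expert_share(step_500):.3f}")
print(no_expert_above(step_10), no_expert_above(step_500))
```

A check like this depends only on routing statistics, so it can fail even while the main loss keeps improving.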

Step 5: The teaching moment

Two lessons, one bug:

Main loss alone does not detect MoE pathology. It will keep falling because the single active expert is still learning a fine FFN. The architecture is silently degenerating to a 1-expert model with 3 dead experts.
The aux loss is structural, not cosmetic. Without it, the system is not an MoE in any meaningful sense. The 0.01 coefficient looks small; deleting it deletes the architecture.

The fix is one constant; the lesson is that the loss function is part of the architectural definition, not a tunable training detail.

Hard rules respected

Single bug (one numeric constant).
Reversible in 1 line.
Observable (per-expert token counts diverge; val loss regresses).
No security impact.
Tests not modified.

Next: when green, re-read ../theory/05-moe-routing-math-and-mamba-intuition.md for the formal \(f_e P_e\) derivation.