Break 00 — Replace GELU with identity in TenseMLP¶
We break GELU so that it becomes the identity function. Your 2-layer MLP should collapse to a single linear transformation; predict which metric will tell you so before you look at the loss.
Anchors: LYNX_CORTEX.md §4 / PHASE 9; .claude/commands/break.md.
The break¶
In src/minimodel/nn/activations.py:
Single-line edit. The class still exists, still has the right name, still composes inside Sequential. The forward pass becomes:
H1 = X @ W1.T + b1 # (4, 16) — same as before
H1 = H1 # identity, no nonlinearity
logits = H1 @ W2.T + b2 # (4, 5)
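As a minimal sketch of what such a single-line break could look like, assuming a NumPy-backed activation class with a forward method: the class bodies below are hypothetical illustrations, not the repository's actual code, and the tanh approximation of GELU is an assumption about which variant is used.

```python
import math
import numpy as np


class GELU:
    """Reference activation (hypothetical body): tanh approximation of GELU."""

    def forward(self, x: np.ndarray) -> np.ndarray:
        c = math.sqrt(2.0 / math.pi)
        return 0.5 * x * (1.0 + np.tanh(c * (x + 0.044715 * x**3)))


class BrokenGELU:
    """Same name-shaped interface, but forward is the identity (the break)."""

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x  # no nonlinearity: the input passes through unchanged


x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
gelu_out = GELU().forward(x)          # negative inputs are squashed toward 0
broken_out = BrokenGELU().forward(x)  # identical to x
```

Both classes accept and return arrays of the same shape, which is why the break still composes inside Sequential and raises no error.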
Predict, then run¶
Two Linear layers with no nonlinearity between them compose, as affine maps, into a single linear layer:
\[ \text{logits} = (X W_1^\top + b_1)\, W_2^\top + b_2 = X W_{\text{eff}}^\top + b_{\text{eff}} \]
where \(W_{\text{eff}} = W_2 W_1\) and \(b_{\text{eff}} = W_2 b_1 + b_2\), and \(W_{\text{eff}}\) has rank at most min(23, 16, 5) = 5. The §A13 tense-classification task does have an exactly linear ground-truth (one-hot verb ⊕ person → 5 tenses), so you might think nothing changes. But:
- The expressible function family is identical — a single linear layer can represent every conjugation function the broken MLP can.
- The optimization trajectory differs: the broken MLP overparametrizes a rank-5 target with a (16, 23) + (5, 16) = 448-param composite (weights only, biases excluded), vs. the 1-layer's (5, 23) = 115 params. Adam's per-parameter step gets diluted across redundant directions.
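The collapse argument and the parameter counts above can be checked numerically. This is a minimal sketch with random weights, assuming the shapes used in this break (batch 4, input 23, hidden 16, output 5); the variable names are illustrative and the weights are not the trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D_IN, D_HID, D_OUT = 4, 23, 16, 5

X = rng.normal(size=(B, D_IN))
W1, b1 = rng.normal(size=(D_HID, D_IN)), rng.normal(size=D_HID)
W2, b2 = rng.normal(size=(D_OUT, D_HID)), rng.normal(size=D_OUT)

# Broken MLP: two affine layers with identity in between.
H1 = X @ W1.T + b1                  # (4, 16)
logits_two_layer = H1 @ W2.T + b2   # (4, 5)

# Equivalent single affine layer.
W_eff = W2 @ W1                     # (5, 23)
b_eff = W2 @ b1 + b2                # (5,)
logits_one_layer = X @ W_eff.T + b_eff

assert np.allclose(logits_two_layer, logits_one_layer)

rank = np.linalg.matrix_rank(W_eff)   # bounded by min(23, 16, 5) = 5
params_two_layer = W1.size + W2.size  # 368 + 80 = 448 (weights only)
params_one_layer = W_eff.size         # 5 * 23 = 115
```

The equality holds for any weights, not just these random ones, which is the sense in which the two models express the same function family.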
Predictions¶
- Final validation accuracy: roughly the same (>85%), maybe slightly lower or noisier.
- Steps to convergence: more (~1.5–2×) because of the optimizer dilution above.
- Effective rank of W2 @ W1: at most 5 (the bottleneck).
- The MLP trained with GELU should reach lower loss on a held-out adversarial test set (it can fit the slight nonlinearity in person/tense agreement, like 3rd-person -s).
Write your predictions in learners/borja/phase-09/notes/breaks.md before running. The point is to commit a hypothesis.
Observe¶
Run experiments/09-tense-mlp/train.py with the broken activation. Compare against the previous run:
Diagnostics to plot:
- Loss curve overlay: gelu vs identity.
- np.linalg.matrix_rank(W2.data @ W1.data) at end of training: should be ≤ 5.
- Held-out accuracy on the 10 hardest examples (e.g., 3rd-person past-participle of irregulars).
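Two of these diagnostics reduce to a few lines of NumPy. The sketch below assumes the per-step losses of each run are available as 1-D arrays and the layer weights as plain arrays; the helper names are hypothetical, and the loss curves are synthetic exponentials standing in for real logs, not measured results.

```python
import numpy as np


def steps_to_threshold(losses, threshold):
    """Return the first step whose loss is <= threshold, or None if never reached."""
    below = np.flatnonzero(np.asarray(losses) <= threshold)
    return int(below[0]) if below.size else None


def composite_rank(W1, W2):
    """Numerical rank of the collapsed map W2 @ W1."""
    return int(np.linalg.matrix_rank(W2 @ W1))


# Synthetic stand-ins for the two logged loss curves (NOT real training data).
steps = np.arange(200)
loss_gelu = 1.6 * np.exp(-steps / 20.0)
loss_identity = 1.6 * np.exp(-steps / 35.0)

t_gelu = steps_to_threshold(loss_gelu, 0.1)
t_identity = steps_to_threshold(loss_identity, 0.1)
slowdown = t_identity / t_gelu  # compare against your predicted 1.5-2x

# Random stand-ins for the trained weights, with this break's shapes.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 23))
W2 = rng.normal(size=(5, 16))
rank = composite_rank(W1, W2)   # cannot exceed 5 for these shapes
```

With real runs, replace the synthetic arrays with the logged losses and pass W1.data and W2.data to composite_rank; choosing the threshold before looking at the curves keeps the comparison honest.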
Symptom Borja will see¶
- Convergence works but is slower. Loss does not blow up; this is the subtle case.
- Adversarial accuracy slightly worse.
- matrix_rank(W2 @ W1) == 5.
Hidden cause (one sentence)¶
GELU was replaced by identity, collapsing the MLP to a single effective linear map of rank at most 5.
Hint cascade¶
- The forward pass shows (B, 16) → (B, 5), but is the intermediate representation ever actually nonlinear? Print the H1 distribution.
- What is np.linalg.matrix_rank(W2.data @ W1.data)? Is it bounded by something architectural?
- The class GELU in nn/activations.py was changed. Compare the forward method to Phase 9 §02 theory.
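One way to act on the first hint is to probe the activation directly for additivity, since a linear map satisfies f(a + b) = f(a) + f(b) and GELU does not. This is a minimal sketch; looks_linear, gelu, and identity are hypothetical helpers written for illustration, and in the repository you would pass the forward method of the real activation instead.

```python
import math
import numpy as np


def looks_linear(activation, dim=16, trials=100, seed=0, atol=1e-8):
    """True if activation(a + b) matches activation(a) + activation(b) on random inputs."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        a, b = rng.normal(size=dim), rng.normal(size=dim)
        if not np.allclose(activation(a + b), activation(a) + activation(b), atol=atol):
            return False  # additivity violated: the function is nonlinear
    return True


def gelu(x):
    """Tanh approximation of GELU (assumed variant)."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + np.tanh(c * (x + 0.044715 * x**3)))


def identity(x):
    """What the broken activation computes."""
    return x


healthy_is_linear = looks_linear(gelu)      # False for a real GELU
broken_is_linear = looks_linear(identity)   # True for the broken one
```

A probe like this catches the break without training anything, which makes it a cheap unit test to keep next to the activation.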
Fix diff¶
Why this teaches the concept¶
Universal-approximation theorems (Cybenko 1989, Hornik et al. 1989) require a nonlinearity between layers. Without one, depth does not add expressivity. The §A13 task happens to be linearly separable, so the symptom is degraded optimization, not degraded representation. This is the kind of bug that ships to production — the model "works", but worse than it should — and the only way to catch it is to understand why GELU is there. Phase 17's transformer FFN uses GELU for the same reason; Phase 10's residual derivation needs nonzero curvature in f(x) to make sense.
Next: Phase 10's /break on the residual highway.