
Break 00 — Replace GELU with identity in `TenseMLP`¶

We break GELU so that it becomes the identity function. Your 2-layer MLP should collapse to a single linear transformation. Predict which metric will reveal this before you look at the loss.

Anchors: LYNX_CORTEX.md §4 / PHASE 9; .claude/commands/break.md.

The break¶

In src/minimodel/nn/activations.py:

class GELU(Module):
    def forward(self, x: Tensor) -> Tensor:
        return x                       # was: return x.gelu()

Single-line edit. The class still exists, still has the right name, still composes inside Sequential. The forward pass becomes:

H1 = X @ W1.T + b1                    # (4, 16) — same as before
H1 = H1                               # identity, no nonlinearity
logits = H1 @ W2.T + b2               # (4, 5)

Predict, then run¶

Two layers of Linear with no nonlinearity between them is, by composition of affine maps, equivalent to a single linear layer:

\[ \text{logits} = (X W_1^{\top} + b_1) W_2^{\top} + b_2 = X (W_1^{\top} W_2^{\top}) + (b_1 W_2^{\top} + b_2) = X W_{\text{eff}}^{\top} + b_{\text{eff}} \]

where \(W_{\text{eff}} = W_2 W_1\) has rank at most min(23, 16, 5) = 5. The §A13 tense-classification task does have an exactly linear ground-truth (one-hot verb ⊕ person → 5 tenses), so you might think nothing changes. But:

The expressible function family is identical — a single linear layer can represent every conjugation function the broken MLP can.
The optimization trajectory differs: the broken MLP overparametrizes a rank-5 target with a (16, 23) + (5, 16) = 448-weight composite, versus the (5, 23) = 115 weights of a single layer (biases excluded in both counts). The hypothesis to test is that Adam's per-parameter steps get diluted across redundant directions.
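The collapse and the weight counts above can be checked numerically. The following is a minimal NumPy sketch, not the project's own code: it uses random stand-in weights with the shapes quoted in the text (batch 4, 23 inputs, 16 hidden units, 5 tenses) and plain arrays instead of the repository's `Tensor` and `Linear` classes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes follow the text: batch 4, 23 inputs, 16 hidden units, 5 tenses.
B, D_IN, D_HID, D_OUT = 4, 23, 16, 5
X = rng.normal(size=(B, D_IN))
W1, b1 = rng.normal(size=(D_HID, D_IN)), rng.normal(size=D_HID)
W2, b2 = rng.normal(size=(D_OUT, D_HID)), rng.normal(size=D_OUT)

# Two Linear layers with the identity in between (the broken MLP).
two_layer = (X @ W1.T + b1) @ W2.T + b2

# The single affine map they collapse to.
W_eff = W2 @ W1            # shape (5, 23)
b_eff = b1 @ W2.T + b2     # shape (5,)
one_layer = X @ W_eff.T + b_eff

print("outputs match:", np.allclose(two_layer, one_layer))

# Weight counts quoted in the text (biases excluded).
composite_params = W1.size + W2.size   # 16*23 + 5*16
single_params = W_eff.size             # 5*23
print("weights:", composite_params, "vs", single_params)
```

Because the equality is purely algebraic, it should hold for any weights, trained or not; only the optimization path to those weights differs between the two parametrizations.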

Predictions¶

Final validation accuracy: roughly the same (>85%), maybe slightly lower or noisier.
Steps to convergence: more (~1.5–2×) because of the optimizer dilution above.
Effective rank of W2 @ W1: at most 5 (the bottleneck).
The MLP trained with GELU should reach lower loss on a held-out adversarial test set, because it can fit the slight nonlinearity in person/tense agreement (e.g., the 3rd-person -s).

Write your predictions in learners/borja/phase-09/notes/breaks.md before running. The point is to commit a hypothesis.

Observe¶

Run experiments/09-tense-mlp/train.py with the broken activation. Compare against the previous run:

just exp 09-tense-mlp --tag broken-gelu-identity

Diagnostics to plot:

Loss curve overlay — gelu vs identity.
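np.linalg.matrix_rank(W2.data @ W1.data) at end of training: should be ≤ 5. This bound follows from the (5, 16) and (16, 23) shapes alone, so it also holds for the GELU run; what changes in the broken run is that W2 @ W1 is the entire model.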
Held-out accuracy on the 10 hardest examples (e.g., 3rd-person past-participle of irregulars).
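The rank diagnostic can be sketched as follows. This is an illustration under assumptions: `W1` and `W2` are random stand-ins for the trained `W1.data` and `W2.data`, and printing the singular values alongside the integer rank is a suggestion rather than part of the original exercise, since the spectrum may show how evenly the 5 available directions are used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the trained weights (W1.data, W2.data in the text).
W1 = rng.normal(size=(16, 23))
W2 = rng.normal(size=(5, 16))

# Effective linear map of the broken MLP.
W_eff = W2 @ W1

# Integer rank, bounded by the smallest dimension in the chain (5).
rank = np.linalg.matrix_rank(W_eff)

# Singular values give a finer view than the integer rank.
singular_values = np.linalg.svd(W_eff, compute_uv=False)

print("rank:", rank)
print("singular values:", np.round(singular_values, 3))
```

With real checkpoints, replace the random arrays with the trained weight arrays; the shape of `W_eff` guarantees at most 5 singular values regardless of training.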

Symptom Borja will see¶

Convergence works but is slower. Loss does not blow up; this is the subtle case.
Adversarial accuracy slightly worse.
matrix_rank(W2 @ W1) == 5.

Hidden cause (one sentence)¶

GELU was replaced by identity, collapsing the MLP to an effective rank-5 linear map.

Hint cascade¶

The forward pass shows (B, 16) → (B, 5), but is the intermediate representation ever actually nonlinear? Print the distribution of H1.
What is np.linalg.matrix_rank(W2.data @ W1.data)? Is it bounded by something architectural?
The class GELU in nn/activations.py was changed. Compare the forward method to Phase 9 §02 theory.
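One way to act on the first hint is to probe the activation directly for additivity: a linear function satisfies f(a + b) = f(a) + f(b), while GELU does not. The sketch below is an assumption-laden illustration: `looks_linear` and `gelu_tanh` are hypothetical helper names, the probe runs on plain NumPy arrays rather than the repository's `Tensor`, and `gelu_tanh` is the common tanh approximation of GELU, used only as a reference nonlinearity.

```python
import numpy as np


def gelu_tanh(x):
    """Tanh approximation of GELU (reference nonlinearity for the probe)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def identity(x):
    """What the broken GELU.forward computes."""
    return x


def looks_linear(fn, n=1000, seed=0, tol=1e-6):
    """Return True if fn(a + b) matches fn(a) + fn(b) on random inputs."""
    rng = np.random.default_rng(seed)
    a, b = rng.normal(size=n), rng.normal(size=n)
    max_gap = np.max(np.abs(fn(a + b) - (fn(a) + fn(b))))
    return bool(max_gap < tol)


print("identity looks linear:", looks_linear(identity))
print("gelu looks linear:", looks_linear(gelu_tanh))
```

A probe like this could be adapted into a regression test for the activation module, so that an activation that silently became linear fails a check instead of only slowing convergence.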

Fix diff¶

class GELU(Module):
    def forward(self, x: Tensor) -> Tensor:
        return x.gelu()                # restored

Why this teaches the concept¶

Universal-approximation theorems (Cybenko 1989, Hornik et al. 1989) require a nonlinearity between layers. Without one, depth does not add expressivity. The §A13 task happens to be linearly separable, so the symptom is degraded optimization, not degraded representation. This is the kind of bug that ships to production — the model "works", but worse than it should — and the only way to catch it is to understand why GELU is there. Phase 17's transformer FFN uses GELU for the same reason; Phase 10's residual derivation needs nonzero curvature in f(x) to make sense.

Next: Phase 10's /break on the residual highway.
