04 — Worked forward + backward through Linear → GELU → Linear¶
Here we take apart a 2-layer MLP with a GELU activation in the middle and show every matmul, every bias broadcast, and every gradient with real shapes. This is the exercise that turns the "chain rule formula" from Phase 7 into a numerical object that fits in working memory. We do it on a batch from §A13 (4 examples, 23-d one-hot of verb⊕person; output of 5 logits).
Anchors: LYNX_CORTEX.md §4 / PHASE 9; LYNX_CORTEX_ADDENDUM.md §A13. Phase 4 §02 chain rule; Phase 8 §03 matmul gradient.
Setup¶
A two-layer MLP with GELU in the middle, written in Phase 9's minimodel API:
mlp = Sequential(
    Linear(23, 16),  # W1: (16, 23), b1: (16,)
    GELU(),
    Linear(16, 5),   # W2: (5, 16), b2: (5,)
)
Input batch X is shape (B, 23) with B = 4. Each row is one_hot(verb_id) ⊕ one_hot(person_id) — eight verbs in this micro-grid (4 regular: work, play, walk, talk; 4 irregular: be, have, do, go) padded to 20 + 3 persons = 23 features. Targets y are tense indices in {0, 1, 2, 3, 4}.
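A minimal NumPy sketch of how such a batch could be assembled. The helper name `encode` and the specific (verb_id, person_id) pairs are illustrative assumptions, not part of the minimodel API; only the 20 + 3 layout comes from the description above.

```python
import numpy as np

N_VERB_SLOTS = 20   # 8 verbs in use, padded to 20 slots
N_PERSONS = 3

def encode(verb_id: int, person_id: int) -> np.ndarray:
    """Concatenate one_hot(verb_id) and one_hot(person_id) into a 23-d row."""
    row = np.zeros(N_VERB_SLOTS + N_PERSONS, dtype=np.float32)
    row[verb_id] = 1.0
    row[N_VERB_SLOTS + person_id] = 1.0
    return row

# Hypothetical (verb_id, person_id) pairs for a batch of B = 4.
pairs = [(0, 0), (3, 1), (4, 2), (7, 0)]
X = np.stack([encode(v, p) for v, p in pairs])

print(X.shape)        # (4, 23)
print(X.sum(axis=1))  # each row has exactly two active features
```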
We will track shapes through forward and through the reverse-mode gradient computed by Phase 8's autograd, but write down what each backward op does by hand.
Forward pass with explicit shapes¶
Layer 1 — Linear(23, 16)¶
The matmul cost is B · 16 · 23 = 4 · 368 = 1472 multiply-adds. The bias b1 broadcasts from (16,) to (4, 16) along axis 0.
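The parameter shapes imply the forward formula Z1 = X @ W1.T + b1. A NumPy sketch with random stand-in values (the seed and the random initialization are assumptions for illustration only) makes the shapes and the broadcast concrete:

```python
import numpy as np

rng = np.random.default_rng(0)       # arbitrary seed, illustration only
B = 4
X = rng.standard_normal((B, 23))     # stand-in for the one-hot batch
W1 = rng.standard_normal((16, 23))   # stored as (out_features, in_features)
b1 = rng.standard_normal(16)

Z1 = X @ W1.T + b1   # (4, 23) @ (23, 16) -> (4, 16); b1 broadcasts over rows

multiply_adds = B * 16 * 23
print(Z1.shape, multiply_adds)   # (4, 16) 1472
```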
Activation — GELU¶
Pointwise, no shape change. We use the tanh approximation:
For an entry Z1[i, j] = 1.2, GELU(1.2) ≈ 1.0617; for Z1[i, j] = -0.7, GELU(-0.7) ≈ -0.1694. (Sanity values; Phase 17 derives the approximation.)
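The tanh approximation is usually written \(\mathrm{GELU}(z) \approx 0.5\,z\,\bigl(1 + \tanh(\sqrt{2/\pi}\,(z + 0.044715\,z^3))\bigr)\). A small pure-Python sketch for reproducing sanity values by hand; the function name `gelu_tanh` is an illustrative choice, not the minimodel API:

```python
import math

def gelu_tanh(z: float) -> float:
    """Tanh approximation of GELU for a single scalar input."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

for z in (1.2, 0.0, -0.7):
    print(f"GELU({z:+.1f}) = {gelu_tanh(z):+.4f}")
```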
Layer 2 — Linear(16, 5)¶
Cost: B · 5 · 16 = 320 multiply-adds.
Loss¶
Cross-entropy on the 5 tenses:
The full forward path is two matmuls, one pointwise GELU, one softmax, one gather, one mean — 6 ops total. Memorize this count; it is identical (modulo factors) for every two-layer MLP you will ever write.
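The loss is the usual mean negative log-likelihood, \(L = -\frac{1}{B}\sum_i \log p_{i,\,y_i}\) with \(p = \mathrm{softmax}(\text{logits})\). The sketch below walks through the softmax, gather, and mean steps in NumPy. The logits and targets are made-up values, and subtracting the row maximum is a common stability trick assumed here rather than something this page prescribes.

```python
import numpy as np

logits = np.array([[ 0.2, 1.0, 2.1, -0.3, 0.0],
                   [ 1.5, 0.1, 0.1,  0.4, 0.2],
                   [ 0.0, 0.0, 0.0,  0.0, 0.0],
                   [-1.0, 2.0, 0.3,  0.3, 0.1]])   # (4, 5), hypothetical values
y = np.array([2, 0, 4, 1])                          # tense indices in {0..4}
B = logits.shape[0]

shifted = logits - logits.max(axis=1, keepdims=True)              # stability shift
p = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)  # softmax
picked = p[np.arange(B), y]                                       # gather true-class probs
loss = -np.log(picked).mean()                                     # mean over the batch

print(p.shape, picked.shape, float(loss))
```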
Backward pass — chain rule, op by op¶
We propagate dL/d• from the loss back to the parameters. Symbol: \(\nabla_x L \equiv \partial L / \partial x\).
Step 1 — softmax + cross-entropy fused gradient¶
This is the most elegant identity in the whole MLP. Phase 4 §02 derives it; Phase 8 §03 implements it.
Shape: (4, 5). Concretely, if row 0's true tense is 2 and p[0] = [0.1, 0.2, 0.5, 0.1, 0.1], then (p - onehot(y))[0] = [0.1, 0.2, -0.5, 0.1, 0.1], divided by B = 4. The negative entry is the gradient pushing the true-class logit up.
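The identity in question is \(\nabla_{\text{logits}} L = (p - \mathrm{onehot}(y)) / B\). A short NumPy check of the row-0 example, using only the numbers quoted above:

```python
import numpy as np

B = 4
p0 = np.array([0.1, 0.2, 0.5, 0.1, 0.1])   # softmax output for row 0
onehot0 = np.zeros(5)
onehot0[2] = 1.0                            # true tense is 2

grad_row0 = (p0 - onehot0) / B
print(grad_row0)   # approximately [0.025, 0.05, -0.125, 0.025, 0.025]
```

Each row of this gradient sums to zero, because the softmax probabilities and the one-hot target both sum to one.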
Step 2 — backward through Linear(16, 5)¶
Forward was logits = H1 @ W2.T + b2. So:
Shapes:
∇_W2 = (5, 4) @ (4, 16) → (5, 16) matches W2
∇_b2 = sum_{axis=0} (4, 5) → (5,) matches b2
∇_H1 = (4, 5) @ (5, 16) → (4, 16) matches H1
This is the standard pattern. Phase 8 §03 spells out why W's gradient picks up the input-side activation transposed: it falls out of d(x @ W.T)/dW = x and the chain rule.
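Written out, the three gradients that match these shapes are ∇_W2 = (∇_logits)^T @ H1, ∇_b2 = sum of ∇_logits over axis 0, and ∇_H1 = ∇_logits @ W2. A NumPy sketch with random stand-in tensors (the values are arbitrary; only the shapes matter here):

```python
import numpy as np

rng = np.random.default_rng(0)              # arbitrary seed
H1 = rng.standard_normal((4, 16))           # input to Linear(16, 5)
W2 = rng.standard_normal((5, 16))
grad_logits = rng.standard_normal((4, 5))   # stand-in for (p - onehot(y)) / B

grad_W2 = grad_logits.T @ H1        # (5, 4) @ (4, 16) -> (5, 16)
grad_b2 = grad_logits.sum(axis=0)   # (4, 5) -> (5,)
grad_H1 = grad_logits @ W2          # (4, 5) @ (5, 16) -> (4, 16)

assert grad_W2.shape == W2.shape
assert grad_b2.shape == (5,)
assert grad_H1.shape == H1.shape
```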
Step 3 — backward through GELU¶
Pointwise multiplication by the GELU derivative:
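With the tanh approximation, GELU'(z) can be expressed in closed form (homework: Lab 01 implements it). Numerically, GELU'(1.2) ≈ 1.118, GELU'(0.0) = 0.5, GELU'(-0.7) ≈ 0.024. Note that GELU'(z) is not positive everywhere: GELU has a shallow minimum near z ≈ -0.75, and the derivative is slightly negative to the left of it. The difference from ReLU is that the derivative vanishes only at that single point, so there is no hard-zero gradient mask over a whole half-line.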
Shape: (4, 16). Same as H1.
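Without giving away the closed form, the derivative can be checked numerically with a central difference. The sketch below is pure Python; `gelu_tanh` and `num_grad` are illustrative names, and the step size `h` is an arbitrary choice.

```python
import math

def gelu_tanh(z: float) -> float:
    """Tanh approximation of GELU for a single scalar input."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

def num_grad(f, z: float, h: float = 1e-5) -> float:
    """Central-difference estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in (1.2, 0.0, -0.7, -2.0):
    print(f"GELU'({z:+.1f}) ~ {num_grad(gelu_tanh, z):+.4f}")
```

The last input probes the left tail, where GELU is not monotonic. A hand-written closed form can be compared against `num_grad` at the same points.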
Step 4 — backward through Linear(23, 16)¶
Shapes:
∇_W1 = (16, 4) @ (4, 23) → (16, 23) matches W1
∇_b1 = sum_{axis=0} (4, 16) → (16,) matches b1
∇_X = (4, 16) @ (16, 23) → (4, 23) matches X
∇_X isn't a parameter gradient, but Phase 13's embedding layer will consume it (the embedding's backward needs the gradient with respect to its output, which is X here).
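Steps 1 to 4 can be strung together and checked against a finite difference. The sketch below is a plain NumPy stand-in, not the Phase 8 autograd or the minimodel API; the random initialization, the seed, and the function names are assumptions. Step 3 uses a numerical GELU derivative because the closed form is left as homework.

```python
import math
import numpy as np

def gelu(z):
    """Tanh-approximate GELU, elementwise."""
    return 0.5 * z * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z**3)))

def gelu_grad(z, h=1e-5):
    """Central-difference stand-in for the closed-form GELU'(z)."""
    return (gelu(z + h) - gelu(z - h)) / (2.0 * h)

def loss_and_grads(X, y, W1, b1, W2, b2):
    B = X.shape[0]
    # Forward
    Z1 = X @ W1.T + b1
    H1 = gelu(Z1)
    logits = H1 @ W2.T + b2
    shifted = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(B), y]).mean()
    # Backward, Steps 1 to 4
    g_logits = p.copy()
    g_logits[np.arange(B), y] -= 1.0
    g_logits /= B
    g_W2, g_b2, g_H1 = g_logits.T @ H1, g_logits.sum(axis=0), g_logits @ W2
    g_Z1 = g_H1 * gelu_grad(Z1)
    g_W1, g_b1, g_X = g_Z1.T @ X, g_Z1.sum(axis=0), g_Z1 @ W1
    return loss, {"W1": g_W1, "b1": g_b1, "W2": g_W2, "b2": g_b2, "X": g_X}

rng = np.random.default_rng(0)   # arbitrary seed
X = rng.standard_normal((4, 23))
y = np.array([2, 0, 4, 1])
W1, b1 = 0.1 * rng.standard_normal((16, 23)), np.zeros(16)
W2, b2 = 0.1 * rng.standard_normal((5, 16)), np.zeros(5)

loss, grads = loss_and_grads(X, y, W1, b1, W2, b2)

# Finite-difference check on a single weight, W1[3, 5].
eps = 1e-5
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[3, 5] += eps
W1_minus[3, 5] -= eps
numeric = (loss_and_grads(X, y, W1_plus, b1, W2, b2)[0]
           - loss_and_grads(X, y, W1_minus, b1, W2, b2)[0]) / (2.0 * eps)

print({name: g.shape for name, g in grads.items()})
print(abs(numeric - grads["W1"][3, 5]) < 1e-6)
```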
Memory at the moment of backward¶
Phase 18 (training loop) will care that backward keeps the forward activations alive. For our 2-layer MLP at B = 4:
- X, (4, 23): 92 floats
- Z1, (4, 16): 64 floats (kept for GELU's backward)
- H1, (4, 16): 64 floats (kept for Linear(16, 5)'s backward)
- logits, (4, 5): 20 floats (kept for softmax-CE's backward)
- Parameters W1, b1, W2, b2: 368 + 16 + 80 + 5 = 469 floats
Total activations to keep: 240 floats. At FP32 that's 960 bytes, a small fraction of the 32 KB L1 data cache on each core of the i5-8250U. This is the §A13 microscopic-scope dividend.
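The tally is easy to reproduce from the shapes alone. A pure-Python sketch, with illustrative variable names:

```python
shapes = {"X": (4, 23), "Z1": (4, 16), "H1": (4, 16), "logits": (4, 5)}
floats = {name: rows * cols for name, (rows, cols) in shapes.items()}
total_floats = sum(floats.values())
total_bytes = total_floats * 4   # FP32 = 4 bytes per float

n_params = 16 * 23 + 16 + 5 * 16 + 5   # W1, b1, W2, b2

print(floats)                                # {'X': 92, 'Z1': 64, 'H1': 64, 'logits': 20}
print(total_floats, total_bytes, n_params)   # 240 960 469
```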
Parameter count vs forward FLOPs vs backward FLOPs¶
| Operation | FLOPs (approx) |
|---|---|
| Forward X @ W1.T | 2 · B · 16 · 23 |
| Forward H1 @ W2.T | 2 · B · 5 · 16 |
| Backward ∇_W1 | 2 · 16 · B · 23 |
| Backward ∇_X | 2 · B · 16 · 23 |
| Backward ∇_W2 | 2 · 5 · B · 16 |
| Backward ∇_H1 | 2 · B · 5 · 16 |
Add it up: backward is ~2× the forward cost for a chain of linears. This is the headline ratio that drives every memory/compute budget you'll ever do in Phase 18+ (mlflow run reports), Phase 22 (KV cache), Phase 27 (Flash attention).
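Plugging in B = 4 makes the ratio concrete. This sketch counts matmul FLOPs only and ignores the bias adds, GELU, and softmax, which is the same simplification the table makes; the variable names are illustrative.

```python
B = 4
layers = [(23, 16), (16, 5)]   # (in_features, out_features)

forward_flops = sum(2 * B * n_out * n_in for n_in, n_out in layers)
# Each Linear's backward runs two matmuls of the forward size: grad_W and grad_input.
backward_flops = sum(2 * (2 * B * n_out * n_in) for n_in, n_out in layers)

print(forward_flops, backward_flops, backward_flops / forward_flops)   # 3584 7168 2.0
```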
A common bug this exercise prevents¶
Transposing the wrong matrix in ∇_W. The pattern is ∇_W = (∇_out)^T @ x_in, not ∇_W = ∇_out @ x_in^T. Both produce a matrix of the right shape when the batch size, in_features, and out_features all coincide, so a shape-only unit test on Linear(4, 4) with B = 4 passes silently. Run the lab on Linear(23, 16) and the wrong version crashes immediately with a shape mismatch. This is why Phase 9's tests use non-square shapes.
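The failure mode is easy to reproduce in NumPy. The tensors below are random stand-ins with an arbitrary seed; only the shapes are taken from this page.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed

# Square case, B = in = out = 4: both formulas give a (4, 4) result.
g_out, x_in = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
right = g_out.T @ x_in
wrong = g_out @ x_in.T
same_shape = right.shape == wrong.shape
same_values = np.allclose(right, wrong)
print(same_shape, same_values)

# Non-square case, Linear(23, 16) with B = 4: the wrong formula cannot even run.
g_out, x_in = rng.standard_normal((4, 16)), rng.standard_normal((4, 23))
try:
    g_out @ x_in.T   # (4, 16) @ (23, 4): inner dimensions disagree
    crashed = False
except ValueError:
    crashed = True
print(crashed)
```

In the square case the shapes agree while the values do not, which is why a test that compares against a numerical gradient catches what a shape assertion misses.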
Citation¶
- Goodfellow, Bengio, Courville, Deep Learning, MIT Press, 2016. Ch. 6 §6.5 "Back-Propagation and Other Differentiation Algorithms" works out the same chain by hand.
- The fused softmax-CE gradient identity is decades older than the transformer literature; the softmax output layer it relies on goes back to Bridle (1990).
One-paragraph recap¶
A 2-layer MLP on a (B, 23) input does two matmuls of shape (B,23)×(23,16) and (B,16)×(16,5) with a pointwise GELU in between, costing roughly 1800 multiply-adds for forward at B = 4 (448 per example) and twice that for backward. The backward pass propagates (p - onehot(y))/B through the chain, with each Linear contributing the pattern ∇_W = ∇_out^T @ x_in, ∇_b = sum(∇_out, axis=0), ∇_in = ∇_out @ W. GELU contributes a smooth pointwise multiplier that is mostly positive and slightly negative for z below about -0.75, with no hard-zero gradient mask. The total activation memory at B=4 is under 1 KB. Memorize the shape arithmetic of this page: every later phase (attention, MoE, LoRA) reuses the same chain-rule pattern.