04 — Worked forward + backward through Linear → GELU → Linear

Here we take apart a 2-layer MLP with a GELU activation in the middle and show every matmul, every bias broadcast, and every gradient with real shapes. This is the exercise that turns the "chain rule" of Phase 7 into a numerical object that fits in working memory. We do it on a §A13 batch (4 examples, 23-d one-hot of verb⊕person; output 5 logits).

Anchors: LYNX_CORTEX.md §4 / PHASE 9; LYNX_CORTEX_ADDENDUM.md §A13. Phase 4 §02 chain rule; Phase 8 §03 matmul gradient.


Setup

A two-layer MLP with GELU in the middle, written in Phase 9's minimodel API:

mlp = Sequential(
    Linear(23, 16),   # W1: (16, 23), b1: (16,)
    GELU(),
    Linear(16, 5),    # W2: (5, 16),  b2: (5,)
)

Input batch X is shape (B, 23) with B = 4. Each row is one_hot(verb_id) ⊕ one_hot(person_id): the micro-grid has eight verbs (4 regular: work, play, walk, talk; 4 irregular: be, have, do, go) whose one-hot is padded to 20 slots, followed by 3 person slots, for 20 + 3 = 23 features. Targets y are tense indices in {0, 1, 2, 3, 4}.
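A minimal sketch of that encoding in plain Python, so the 23-feature layout is concrete. The verb-to-id assignment (the order the verbs are listed above), the helper name `encode`, and the four (verb, person) pairs are assumptions made for illustration; §A13 may assign ids differently.

```python
N_VERB_SLOTS = 20   # verb one-hot width (8 verbs used, the rest is padding)
N_PERSONS = 3       # person one-hot width

# Hypothetical id assignment: the order in which the verbs are listed in the text.
VERBS = ["work", "play", "walk", "talk", "be", "have", "do", "go"]

def encode(verb, person_id):
    """Concatenate one_hot(verb_id) and one_hot(person_id) into 23 features."""
    row = [0.0] * (N_VERB_SLOTS + N_PERSONS)
    row[VERBS.index(verb)] = 1.0
    row[N_VERB_SLOTS + person_id] = 1.0
    return row

# A made-up batch of B = 4 (verb, person) pairs.
X = [encode("work", 0), encode("be", 2), encode("go", 1), encode("talk", 0)]

print(len(X), len(X[0]))          # 4 23
print([sum(row) for row in X])    # [2.0, 2.0, 2.0, 2.0]
```

Each row carries exactly two ones (one verb slot, one person slot), so `X @ W1.T` simply adds two columns of W1 per example.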

We will track shapes through forward and through the reverse-mode gradient computed by Phase 8's autograd, but write down what each backward op does by hand.


Forward pass with explicit shapes

Layer 1 — Linear(23, 16)

X        : (B, 23)        = (4, 23)
W1       : (16, 23)
b1       : (16,)
Z1 = X @ W1.T + b1
   shape : (4, 16)

The matmul cost is B · 16 · 23 = 4 · 368 = 1472 multiply-adds. The bias b1 broadcasts from (16,) to (4, 16) along axis 0.

Activation — GELU

H1 = GELU(Z1)
   shape : (4, 16)

Pointwise, no shape change. We use the tanh approximation:

\[ \text{GELU}(z) \approx 0.5\,z\,\bigl(1 + \tanh\bigl[\sqrt{2/\pi}\,(z + 0.044715\,z^{3})\bigr]\bigr) \]

For an entry Z1[i, j] = 1.2, GELU(1.2) ≈ 1.0617; for Z1[i, j] = -0.7, GELU(-0.7) ≈ -0.1694. (Sanity values; Phase 17 derives the approximation.)
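The sanity values can be reproduced with a few lines of standard-library Python. This is a scalar sketch, not the minimodel `GELU` module; `gelu_exact` (z · Φ(z), written with the error function) is included only as a comparison point and is not part of the original text.

```python
import math

def gelu_tanh(z):
    """Tanh approximation of GELU for a single scalar."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

def gelu_exact(z):
    """Exact GELU, z * Phi(z), with the normal CDF written via erf."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in (1.2, 0.0, -0.7):
    print(f"z={z:+.1f}  tanh-approx={gelu_tanh(z):+.4f}  exact={gelu_exact(z):+.4f}")
```

The two columns should agree to about three decimal places, which is why the approximation is usually treated as interchangeable with the exact form at this scale.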

Layer 2 — Linear(16, 5)

H1       : (4, 16)
W2       : (5, 16)
b2       : (5,)
logits = H1 @ W2.T + b2
   shape : (4, 5)

Cost: B · 5 · 16 = 320 multiply-adds.

Loss

Cross-entropy on the 5 tenses:

p = softmax(logits, axis=-1)      # (4, 5)
loss = -mean(log(p[arange(B), y])) # scalar

The full forward path is two matmuls, one pointwise GELU, one softmax, one gather, one mean: 6 ops total. Memorize this count; every two-layer MLP classifier you will ever write runs the same sequence, and only the dimensions change.
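The forward path can be traced end to end in dependency-free Python. This is a sketch under stated assumptions: the helpers (`linear`, `softmax_rows`, `shape`, `rand_matrix`), the random initialisation, the verb and person ids, and the targets `y` are all made up for illustration and are not the Phase 9 minimodel API.

```python
import math
import random

def gelu(z):
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

def linear(X, W, b):
    """X @ W.T + b for X (B, n_in), W (n_out, n_in), b (n_out,) as nested lists."""
    return [[sum(x * w for x, w in zip(row, w_row)) + b_j
             for w_row, b_j in zip(W, b)]
            for row in X]

def softmax_rows(logits):
    out = []
    for row in logits:
        m = max(row)                          # subtract the row max for stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def shape(M):
    return (len(M), len(M[0]))

def rand_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
B, n_in, n_hidden, n_out = 4, 23, 16, 5

# Hypothetical batch: verb slot in 0..19, person slot in 20..22.
verbs, persons = [0, 1, 4, 7], [0, 1, 2, 0]
X = [[1.0 if j == v or j == 20 + p else 0.0 for j in range(n_in)]
     for v, p in zip(verbs, persons)]
y = [2, 0, 4, 1]                              # made-up tense indices

W1, b1 = rand_matrix(n_hidden, n_in), [0.0] * n_hidden
W2, b2 = rand_matrix(n_out, n_hidden), [0.0] * n_out

Z1 = linear(X, W1, b1)                        # (4, 16)
H1 = [[gelu(z) for z in row] for row in Z1]   # (4, 16)
logits = linear(H1, W2, b2)                   # (4, 5)
p = softmax_rows(logits)                      # (4, 5)
loss = -sum(math.log(p[i][y[i]]) for i in range(B)) / B   # gather, then mean

print(shape(Z1), shape(H1), shape(logits), shape(p))
print(f"loss = {loss:.4f}  (log 5 = {math.log(5):.4f} for a uniform guess)")
```

With small random weights the predicted distribution is close to uniform, so the loss should land near log 5.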


Backward pass — chain rule, op by op

We propagate dL/d• from the loss back to the parameters. Symbol: \(\nabla_x L \equiv \partial L / \partial x\).

Step 1 — softmax + cross-entropy fused gradient

This is the most elegant identity in the whole MLP. Phase 4 §02 derives it; Phase 8 §03 implements it.

\[ \nabla_{\text{logits}} L = \frac{1}{B}\,(p - \mathrm{onehot}(y)) \]

Shape: (4, 5). Concretely, if row 0's true tense is 2 and p[0] = [0.1, 0.2, 0.5, 0.1, 0.1], then (p - onehot(y))[0] = [0.1, 0.2, -0.5, 0.1, 0.1], divided by B = 4. The negative entry is the gradient pushing the true-class logit up.
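The row-0 arithmetic can be checked directly. In this sketch only row 0 comes from the text; the other three rows of `p` and their targets are made-up filler so that B = 4, and `softmax_ce_grad` is a hypothetical helper name.

```python
def softmax_ce_grad(p, y):
    """(p - onehot(y)) / B for probabilities p (B, C) and integer targets y (B,)."""
    batch = len(p)
    return [[(p_ij - (1.0 if j == y_i else 0.0)) / batch
             for j, p_ij in enumerate(row)]
            for row, y_i in zip(p, y)]

p = [
    [0.10, 0.20, 0.50, 0.10, 0.10],   # row 0 from the text, true tense 2
    [0.70, 0.10, 0.10, 0.05, 0.05],   # filler
    [0.20, 0.20, 0.20, 0.20, 0.20],   # filler
    [0.05, 0.05, 0.10, 0.60, 0.20],   # filler
]
y = [2, 0, 4, 3]

grad_logits = softmax_ce_grad(p, y)
print([round(g, 4) for g in grad_logits[0]])   # [0.025, 0.05, -0.125, 0.025, 0.025]
print([round(sum(row), 12) for row in grad_logits])
```

Every row of the gradient sums to zero (up to rounding), because both p and the one-hot target sum to 1: whatever is pushed onto the true class is taken from the others.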

Step 2 — backward through Linear(16, 5)

Forward was logits = H1 @ W2.T + b2. So:

\[ \nabla_{W2} L = (\nabla_{\text{logits}} L)^{\!\top} \cdot H1 \qquad \nabla_{b2} L = \sum_{i=1}^{B} (\nabla_{\text{logits}} L)_{i,\cdot} \qquad \nabla_{H1} L = (\nabla_{\text{logits}} L) \cdot W2 \]

Shapes:

∇_W2     = (5, 4) @ (4, 16) → (5, 16)   matches W2
∇_b2     = sum_{axis=0} (4, 5) → (5,)   matches b2
∇_H1     = (4, 5) @ (5, 16) → (4, 16)   matches H1

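This is the standard pattern. Phase 8 §03 spells out why W's gradient is the upstream gradient transposed times the input-side activation: for out = x @ W.T, each entry out[i, k] depends on W[k, j] with coefficient x[i, j], and the chain rule sums those contributions over the batch index i.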
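The three formulas fit in one small function, and a finite-difference probe confirms the ∇_W pattern. This is a plain-Python sketch, not Phase 8's autograd: `linear_backward`, `matmul`, `transpose`, and the random stand-in for the upstream gradient are illustrative assumptions. The check uses a surrogate loss L = sum(grad_logits ⊙ out), whose gradient with respect to W is exactly the formula's output; the bias is left out of the surrogate because it does not affect ∂L/∂W.

```python
import random

def matmul(left, right):
    """left @ right for nested-list matrices."""
    cols = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in left]

def transpose(M):
    return [list(col) for col in zip(*M)]

def linear_backward(grad_out, x_in, W):
    """Gradients for out = x_in @ W.T + b, given grad_out = dL/d(out)."""
    grad_W = matmul(transpose(grad_out), x_in)      # (n_out, B) @ (B, n_in)
    grad_b = [sum(col) for col in zip(*grad_out)]   # sum over the batch axis
    grad_in = matmul(grad_out, W)                   # (B, n_out) @ (n_out, n_in)
    return grad_W, grad_b, grad_in

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
B, n_in, n_out = 4, 16, 5
H1, W2 = rand_matrix(B, n_in), rand_matrix(n_out, n_in)
grad_logits = rand_matrix(B, n_out)                 # stand-in upstream gradient

grad_W2, grad_b2, grad_H1 = linear_backward(grad_logits, H1, W2)

def surrogate_loss(W):
    """L = sum(grad_logits * (H1 @ W.T)); dL/dW should equal grad_W2."""
    out = matmul(H1, transpose(W))
    return sum(g * o for g_row, o_row in zip(grad_logits, out)
               for g, o in zip(g_row, o_row))

eps = 1e-6
W_plus = [row[:] for row in W2]
W_plus[3][7] += eps                                 # nudge a single weight
numeric = (surrogate_loss(W_plus) - surrogate_loss(W2)) / eps

print(len(grad_W2), len(grad_W2[0]))                # 5 16
print(len(grad_b2))                                 # 5
print(len(grad_H1), len(grad_H1[0]))                # 4 16
print(abs(numeric - grad_W2[3][7]) < 1e-4)
```

The same function applies unchanged to Linear(23, 16) in Step 4; only the shapes passed in differ.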

Step 3 — backward through GELU

Pointwise multiplication by the GELU derivative:

\[ \nabla_{Z1} L = \nabla_{H1} L \;\odot\; \text{GELU}'(Z1) \]

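With the tanh approximation, GELU'(z) can be expressed in closed form (homework; Lab 01 implements it). Numerically, GELU'(1.2) ≈ 1.118, GELU'(0.0) = 0.5, GELU'(-0.7) ≈ 0.024. Note that GELU'(z) is not positive everywhere: GELU has a shallow minimum near z ≈ -0.75, and the derivative is slightly negative below it (GELU'(-1.0) ≈ -0.083). What does hold is that, unlike ReLU, there is no hard-zero gradient mask: the derivative vanishes only at that minimum, not on the whole negative half-line, though it decays toward zero for very negative z.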
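One way to write the closed form is to apply the product rule to 0.5 · z · (1 + tanh u) with u = √(2/π) · (z + 0.044715 z³). The sketch below does that for a scalar and compares it with a central finite difference; it is an illustration under those assumptions and not the Lab 01 reference solution.

```python
import math

C = math.sqrt(2.0 / math.pi)

def gelu_tanh(z):
    return 0.5 * z * (1.0 + math.tanh(C * (z + 0.044715 * z ** 3)))

def gelu_tanh_grad(z):
    """Derivative of the tanh-approximate GELU via the product and chain rules."""
    u = C * (z + 0.044715 * z ** 3)
    du_dz = C * (1.0 + 3.0 * 0.044715 * z ** 2)
    t = math.tanh(u)
    # d/dz [0.5 * z * (1 + t)] = 0.5 * (1 + t) + 0.5 * z * (1 - t^2) * du/dz
    return 0.5 * (1.0 + t) + 0.5 * z * (1.0 - t * t) * du_dz

def central_diff(f, z, h=1e-5):
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in (1.2, 0.0, -0.7):
    analytic = gelu_tanh_grad(z)
    numeric = central_diff(gelu_tanh, z)
    print(f"z={z:+.1f}  analytic={analytic:+.6f}  numeric={numeric:+.6f}")
```

If the two columns disagree beyond the fifth or sixth decimal, the usual culprit is a dropped factor of 3 in `du_dz` or a missing `1 - t * t`.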

Shape: (4, 16). Same as H1.

Step 4 — backward through Linear(23, 16)

\[ \nabla_{W1} L = (\nabla_{Z1} L)^{\!\top} \cdot X \qquad \nabla_{b1} L = \sum_{i=1}^{B} (\nabla_{Z1} L)_{i,\cdot} \qquad \nabla_{X} L = (\nabla_{Z1} L) \cdot W1 \]

Shapes:

∇_W1     = (16, 4) @ (4, 23) → (16, 23)   matches W1
∇_b1     = sum_{axis=0} (4, 16) → (16,)   matches b1
∇_X      = (4, 16) @ (16, 23) → (4, 23)   matches X

∇_X isn't a parameter gradient, but Phase 13's embedding layer will consume it (the embedding's backward needs the gradient with respect to its output, which is X here).


Memory at the moment of backward

Phase 18 (training loop) will care that backward keeps the forward activations alive. For our 2-layer MLP at B = 4:

  • X(4, 23) — 92 floats
  • Z1(4, 16) — 64 floats (kept for GELU's backward)
  • H1(4, 16) — 64 floats (kept for Linear(16, 5)'s backward)
  • logits(4, 5) — 20 floats (kept for softmax-CE's backward)
  • Parameters — W1, b1, W2, b2: 368 + 16 + 80 + 5 = 469 floats

Total activations to keep: 240 floats. At FP32 that's 960 bytes, a small fraction of the 32 KB L1 data cache on each core of the i5-8250U. This is the §A13 microscopic-scope dividend.
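The tally above is plain shape arithmetic, which a short script can reproduce. The dictionaries below just restate the shapes listed in this section; the 4 bytes per float assumes FP32, as in the text.

```python
from math import prod

B = 4
BYTES_PER_FLOAT = 4   # FP32

activations = {"X": (B, 23), "Z1": (B, 16), "H1": (B, 16), "logits": (B, 5)}
parameters = {"W1": (16, 23), "b1": (16,), "W2": (5, 16), "b2": (5,)}

act_floats = sum(prod(s) for s in activations.values())
param_floats = sum(prod(s) for s in parameters.values())

print(act_floats, act_floats * BYTES_PER_FLOAT)   # 240 960
print(param_floats)                               # 469
```

Changing `B` shows the asymmetry that matters later: activation memory scales linearly with the batch size, while the parameter count does not move.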


Parameter count vs forward FLOPs vs backward FLOPs

Phase                    FLOPs (approx)
Forward   X @ W1.T       2 · B · 16 · 23
Forward   H1 @ W2.T      2 · B · 5 · 16
Backward  ∇_W1           2 · 16 · B · 23
Backward  ∇_X            2 · B · 16 · 23
Backward  ∇_W2           2 · 5 · B · 16
Backward  ∇_H1           2 · B · 5 · 16

Add it up: backward is ~2× the forward cost for a chain of linears. This is the headline ratio that drives every memory/compute budget you'll ever do in Phase 18+ (mlflow run reports), Phase 22 (KV cache), Phase 27 (Flash attention).
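Adding up the matmul rows makes the ratio explicit. The sketch counts one multiply-add as 2 FLOPs, as the table does, and ignores bias adds, GELU, and softmax, which are small next to the matmuls; `matmul_flops` is a hypothetical helper.

```python
def matmul_flops(m, k, n):
    """FLOPs for an (m, k) @ (k, n) product, counting one multiply-add as 2."""
    return 2 * m * k * n

B = 4
forward = (
    matmul_flops(B, 23, 16)      # X @ W1.T
    + matmul_flops(B, 16, 5)     # H1 @ W2.T
)
backward = (
    matmul_flops(16, B, 23)      # grad_W1 = grad_Z1.T @ X
    + matmul_flops(B, 16, 23)    # grad_X  = grad_Z1 @ W1
    + matmul_flops(5, B, 16)     # grad_W2 = grad_logits.T @ H1
    + matmul_flops(B, 5, 16)     # grad_H1 = grad_logits @ W2
)

print(forward, backward, backward / forward)   # 3584 7168 2.0
```

Each forward matmul spawns two backward matmuls of the same size (one for the weight gradient, one for the input gradient), which is where the factor of 2 comes from.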


A common bug this exercise prevents

Transposing the wrong matrix in ∇_W. The pattern is ∇_W = (∇_out)^T @ x_in, not ∇_W = ∇_out @ x_in^T. Both produce a matrix of the right shape when the batch size, input width, and output width all coincide, so a shape-only unit test on Linear(4, 4) at B = 4 passes silently. Run the lab on Linear(23, 16) and the shape mismatch crashes immediately. This is why Phase 9's tests use non-square shapes.
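The failure mode can be demonstrated with shape arithmetic alone, no numbers required. The helpers below (`matmul_shape`, `t`, `grad_w_shapes`) are hypothetical names for this sketch, not part of Phase 9's test suite.

```python
def matmul_shape(a, b):
    """Result shape of a 2-D matmul; raises if the inner dimensions disagree."""
    if a[1] != b[0]:
        raise ValueError(f"shape mismatch: {a} @ {b}")
    return (a[0], b[1])

def t(shape):
    """Shape of the transpose."""
    return (shape[1], shape[0])

def grad_w_shapes(batch, n_in, n_out):
    grad_out, x_in = (batch, n_out), (batch, n_in)
    correct = matmul_shape(t(grad_out), x_in)        # (grad_out)^T @ x_in
    try:
        wrong = matmul_shape(grad_out, t(x_in))      # grad_out @ x_in^T (the bug)
    except ValueError as err:
        wrong = str(err)
    return correct, wrong

print(grad_w_shapes(4, 4, 4))     # ((4, 4), (4, 4))
print(grad_w_shapes(4, 23, 16))   # ((16, 23), 'shape mismatch: (4, 16) @ (23, 4)')
```

In the all-fours case both expressions return (4, 4), so only a value comparison against a finite difference would expose the bug; with non-square shapes the inner dimensions 16 and 23 clash before any value is computed.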


Citation

  • Goodfellow, Bengio, Courville, Deep Learning, MIT Press, 2016. Ch. 6 §6.5 "Back-Propagation and Other Differentiation Algorithms" works out the same chain by hand.
  • The fused softmax-CE gradient identity also appears in Vaswani et al. 2017 §A.1 footnote (and predates it by decades — Bridle 1990).

One-paragraph recap

A 2-layer MLP on a (B, 23) input does two matmuls of shape (B,23)×(23,16) and (B,16)×(16,5) with a pointwise GELU in between, costing 448 multiply-adds per example (roughly 1800 for the B = 4 batch) for forward and twice that for backward. The backward pass propagates (p - onehot(y))/B through the chain, with each Linear contributing the pattern ∇_W = ∇_out^T @ x_in, ∇_b = sum(∇_out, axis=0), ∇_in = ∇_out @ W. GELU contributes a smooth pointwise multiplier GELU'(Z1) with no hard-zero gradient mask (it turns slightly negative below z ≈ -0.75). The total activation memory at B=4 is under 1 KB. Memorize the shape arithmetic of this page: every later phase (attention, MoE, LoRA) reuses the same chain-rule pattern.

