
01 — The transformer block: Pre-LN anatomy

The block is astonishingly simple once you see it clearly: two sublayers, two LayerNorms, two residual additions. The subtlety is the order: LN before the sublayer, not after. That small difference won the war against Post-LN because it trains more stably.

The block, in one diagram

        ┌─────────── residual stream (d_model) ─────────────┐
        │                                                   │
   x ──>┤                                                   │
        │                                                   │
        ├──> LayerNorm ──> MultiHeadAttention ──┐            │
        │                                       ▼            │
        ├──────────────────────────────────────(+)──> z ─────┤
        │                                                    │
        ├──> LayerNorm ──> FFN ────────────────┐             │
        │                                       ▼            │
        └──────────────────────────────────────(+)──> y ─────┘

In equations:

\[ \begin{aligned} z &= x + \text{MHA}(\text{LN}_1(x)) \\ y &= z + \text{FFN}(\text{LN}_2(z)) \end{aligned} \]

That's the whole block. Two normalizations, two sublayers, two additions. The output \(y\) has the same shape as \(x\): \((T, d_\text{model})\) — sequence length \(T\), residual width \(d_\text{model}\). Stacking blocks is straightforward because shapes match.

Pre-LN vs Post-LN: the one piece of history that matters

The original transformer (Vaswani et al. 2017) used Post-LN:

\[ \begin{aligned} z &= \text{LN}_1(x + \text{MHA}(x)) \\ y &= \text{LN}_2(z + \text{FFN}(z)) \end{aligned} \]

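LN sits outside the residual addition, directly on the skip path, so there is no unnormalized identity route from a block's input to its output. This made training fragile: Xiong et al. (2020) show that at initialization the expected gradients of the parameters near the output layer are large, so a full-size learning rate in the first steps can make the loss diverge. Post-LN stacks, especially deep ones, therefore needed careful learning-rate warmup schedules.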

Pre-LN (analyzed in Xiong et al. 2020, "On Layer Normalization in the Transformer Architecture") moves LN inside the residual branch, normalizing the input to each sublayer rather than the sum:

\[y = x + \text{sublayer}(\text{LN}(x))\]

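This leaves the skip path as a pure identity, which keeps gradients well-behaved at initialization and makes warmup far less critical. The price is that the residual stream itself is never normalized inside the stack. GPT-2 and most later decoder-only models with published architectures (LLaMA and PaLM, for example) normalize before the sublayer. We use Pre-LN. Post-LN is mentioned for context only.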
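The two orderings differ only in where LN sits relative to the addition. A minimal NumPy sketch of both, assuming a parameter-free LN (γ = 1, β = 0) and a stand-in sublayer (a fixed random linear map, not real attention or FFN); `layer_norm`, `pre_ln_step`, and `post_ln_step` are illustrative names, not the lab's API:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Parameter-free LN over the last axis (gamma = 1, beta = 0 assumed).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_ln_step(x, sublayer):
    # Pre-LN: normalize the sublayer input; the skip path is untouched.
    return x + sublayer(layer_norm(x))

def post_ln_step(x, sublayer):
    # Post-LN: normalize the sum, so LN sits on the skip path itself.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
T, d_model = 4, 8
W = rng.normal(scale=0.1, size=(d_model, d_model))
sublayer = lambda h: h @ W  # stand-in for MHA or FFN (assumption)

x = rng.normal(size=(T, d_model))
y_pre = pre_ln_step(x, sublayer)
y_post = post_ln_step(x, sublayer)

# Removing the sublayer write from the Pre-LN output recovers x: identity path.
skip_pre = y_pre - sublayer(layer_norm(x))
# The Post-LN output is re-normalized: each row has mean ~0 and variance ~1.
post_row_mean = y_post.mean(axis=-1)
post_row_var = y_post.var(axis=-1)
```

Both steps preserve the `(T, d_model)` shape. Only the Pre-LN step passes `x` through unchanged on the skip path.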

A tiny but important detail: with Pre-LN, the input to the LM head is not yet normalized — the residual stream after the last block has accumulated unnormalized sublayer outputs. Modern transformers therefore add a final LayerNorm after the last block and before the LM head. Don't forget it (Lab 01 will catch this).
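One way to see why the final LN matters is to run a few Pre-LN steps and look at the per-position statistics of the stream. This is a sketch under loud assumptions: the sublayers are random linear maps rather than trained attention or FFN, LN is parameter-free, and the sizes are arbitrary.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Parameter-free LN over the last axis (gamma = 1, beta = 0 assumed).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
T, d_model, n_layers = 4, 64, 2

h = rng.normal(size=(T, d_model))
for _ in range(n_layers):
    for _ in range(2):  # two sublayers per block
        # Stand-in sublayer: random linear map applied to the normalized input.
        W = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
        h = h + layer_norm(h) @ W

var_before = h.var(axis=-1)      # per-position variance of the raw stream
h_final = layer_norm(h)          # the final LN in front of the LM head
var_after = h_final.var(axis=-1)

print(var_before)  # not pinned to 1: the stream has accumulated raw writes
print(var_after)   # close to 1 at every position
```

In this toy setup the raw stream's variance drifts away from 1 as writes accumulate, and the final LN puts every position back on a common scale before the logits are computed.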

LayerNorm, briefly

You built the intuition for LN in Phase 10. The formula:

\[\text{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta\]

where \(\mu, \sigma^2\) are computed over the feature dimension (the last axis, of size \(d_\text{model}\)), not over the batch or sequence axes. \(\gamma, \beta \in \mathbb{R}^{d_\text{model}}\) are learned per-LN-instance. \(\varepsilon \approx 10^{-5}\) for fp32 (sometimes \(10^{-6}\)).
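The formula maps almost one-to-one onto NumPy. A minimal sketch, assuming the biased variance (division by \(d_\text{model}\)), which is what the formula above suggests but is not stated explicitly; the function name and test values are illustrative:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LN over the last axis. x: (..., d_model); gamma, beta: (d_model,)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)  # biased variance (assumption)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

T, d_model = 3, 8
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(T, d_model))
gamma = np.ones(d_model)   # identity scale, as at initialization
beta = np.zeros(d_model)   # zero shift, as at initialization

out = layer_norm(x, gamma, beta)
row_mean = out.mean(axis=-1)  # ~0 per position
row_var = out.var(axis=-1)    # ~1 per position

# Statistics are per position: normalizing one row alone gives the same result.
first_row_alone = layer_norm(x[:1], gamma, beta)
```

Because the statistics are taken over the feature axis only, the result for one position does not depend on the other positions in the sequence or on batch size.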

Two parameters per feature: scale \(\gamma\) and shift \(\beta\), so each LN instance holds \(2 \times d_\text{model}\) parameters. For a Mini-GPT with 2 layers and \(d_\text{model} = 64\), the LNs inside blocks hold \(2 \text{ layers} \times 2 \text{ LNs} \times 2 \times d_\text{model} = 512\) params, plus \(2 \times d_\text{model} = 128\) for the final LN. Total LN params: 640, a rounding error in the full inventory.
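The count can be checked mechanically. A small sketch, assuming \(d_\text{model} = 64\) (the value implied by the 128-parameter final LN) and counting \(\gamma\) and \(\beta\) separately for every LN instance; `ln_param_count` is a hypothetical helper:

```python
def ln_param_count(n_layers, d_model, lns_per_block=2, with_final_ln=True):
    """Return (params in block LNs, params in final LN, total)."""
    per_ln = 2 * d_model                         # gamma and beta
    in_blocks = n_layers * lns_per_block * per_ln
    final = per_ln if with_final_ln else 0
    return in_blocks, final, in_blocks + final

in_blocks, final, total = ln_param_count(n_layers=2, d_model=64)
print(in_blocks, final, total)  # 512 128 640
```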

LayerNorm is sometimes replaced by RMSNorm (Zhang & Sennrich 2019) in modern transformers — drop the mean subtraction and the \(\beta\). Cheaper, similar performance. Phase 17 sticks with LayerNorm for fidelity to GPT-2. RMSNorm is a one-line swap if needed in a later phase.
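For comparison, here is a sketch of RMSNorm in the same style. It assumes \(\varepsilon\) is added inside the square root, which is one common convention rather than the only one, and the names are illustrative:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm over the last axis: no mean subtraction and no beta."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

T, d_model = 3, 8
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, size=(T, d_model))
gamma = np.ones(d_model)

out = rms_norm(x, gamma)
mean_square = np.mean(out * out, axis=-1)  # ~1 per position
scaled = rms_norm(3.0 * x, gamma)          # rescaling the input barely matters
```

Unlike LayerNorm, the output keeps a nonzero mean when the input has one. Only the scale is normalized.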

The full forward, layer-by-layer

Let \(T\) = sequence length, \(|V|\) = vocab size, \(L\) = \(n_\text{layers}\).

Input:  tokens (T,) int32
        ↓
[1]     h = E[tokens]          shape (T, d_model)         token embedding lookup
        ↓
        (no PE added here — RoPE applied inside attention)
        ↓
[2]     for l in range(L):
            h = h + MHA_l(LN_a(h))    shape (T, d_model)   attention sublayer
            h = h + FFN_l(LN_b(h))    shape (T, d_model)   FFN sublayer
        ↓
[3]     h = LN_final(h)           shape (T, d_model)       final norm
        ↓
[4]     logits = h @ E.T          shape (T, |V|)           tied LM head
        ↓
Output: logits (T, |V|) float32

Six tensors flow through this forward, all with predictable shapes. Borja's Lab 00 traces them by hand on a tiny configuration.
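The same forward can be traced in a few lines of NumPy to confirm the shapes. This is only a shape trace under stated assumptions: the `mha` and `ffn` callables are stand-in linear maps (so there is no cross-position mixing, no RoPE, and no mask here), LN is parameter-free, and all sizes are made up.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Parameter-free LN over the last axis (gamma = 1, beta = 0 assumed).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def forward(tokens, E, blocks):
    """tokens: (T,) ints; E: (|V|, d_model); blocks: list of (mha, ffn) callables."""
    shapes = [tokens.shape]
    h = E[tokens]                    # [1] embedding lookup -> (T, d_model)
    shapes.append(h.shape)
    for mha, ffn in blocks:          # [2] L Pre-LN blocks
        h = h + mha(layer_norm(h))   # attention sublayer
        shapes.append(h.shape)
        h = h + ffn(layer_norm(h))   # FFN sublayer
        shapes.append(h.shape)
    h = layer_norm(h)                # [3] final norm
    shapes.append(h.shape)
    logits = h @ E.T                 # [4] tied LM head -> (T, |V|)
    shapes.append(logits.shape)
    return logits, shapes

rng = np.random.default_rng(0)
V, d_model, T, L = 11, 8, 5, 2

def make_linear():
    # Stand-in sublayer: a fixed random linear map (assumption, not real MHA/FFN).
    W = rng.normal(scale=0.1, size=(d_model, d_model))
    return lambda h: h @ W

E = rng.normal(size=(V, d_model))
blocks = [(make_linear(), make_linear()) for _ in range(L)]
tokens = np.array([3, 1, 4, 1, 5], dtype=np.int32)

logits, shapes = forward(tokens, E, blocks)
for s in shapes:
    print(s)
```

Every intermediate tensor between the embedding lookup and the final norm has shape `(T, d_model)`. Only the first and last entries differ.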

A reading of the residual stream

Three observations from the residual-stream view:

1. The residual stream is read by every sublayer and written by every sublayer. Nothing else can communicate. There's no global state, no skip-across-blocks. Information flows only through the stream.
2. Sublayers are additive. A given block can either contribute (the addition is nonzero) or no-op (the addition is near zero, which happens for some heads in trained models). The skip is always available.
3. Attention writes the result of cross-position mixing; FFN writes the result of per-position transformation. Attention is communication; FFN is computation. The block pairs them.
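The additive view can be made concrete: the final stream equals the initial embedding plus the sum of every sublayer's write. A sketch with stand-in sublayers (random linear maps, an assumption made purely for illustration):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Parameter-free LN over the last axis (gamma = 1, beta = 0 assumed).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
T, d_model, n_layers = 4, 8, 3

x0 = rng.normal(size=(T, d_model))  # what the embedding put on the stream
h = x0.copy()
writes = []
for _ in range(n_layers):
    for _ in range(2):  # attention-like write, then FFN-like write
        W = rng.normal(scale=0.1, size=(d_model, d_model))
        delta = layer_norm(h) @ W   # the sublayer reads LN(h)...
        writes.append(delta)
        h = h + delta               # ...and writes by addition

# The stream is nothing more than the starting point plus all the writes.
reconstructed = x0 + sum(writes)
```

A sublayer whose write is zero leaves the stream untouched, which is the no-op case described in the second observation.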

Causal masking: still required, even with RoPE

A common confusion: RoPE provides positional info, but it does not make attention causal. You still need to mask the attention matrix so each position can only attend to itself and earlier positions:

\[\text{mask}[i, j] = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{otherwise} \end{cases}\]

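The mask is added to the pre-softmax attention scores, so masked entries get zero weight after the softmax. Position 0 attends to position 0 only. Position 7 attends to 0–7. Without this mask, every position sees future tokens during training, which is ruinous: the model learns to copy the next token from its input instead of predicting it, a shortcut that does not exist at generation time because the future tokens have not been produced yet.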
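A minimal NumPy sketch of the additive mask and its effect on the softmax. The random scores stand in for real \(QK^\top/\sqrt{d}\) values, and the helper names are illustrative:

```python
import numpy as np

def causal_mask(T):
    # 0 where j <= i (allowed), -inf where j > i (future positions).
    allowed = np.tril(np.ones((T, T), dtype=bool))
    return np.where(allowed, 0.0, -np.inf)

def softmax(scores):
    # Row-wise softmax; subtracting the row max keeps exp() stable.
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

T = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))          # stand-in attention scores
weights = softmax(scores + causal_mask(T))

print(np.round(weights, 3))
```

The printed matrix is lower triangular: row 0 puts all its weight on position 0, and every row still sums to 1 because the diagonal entry is always allowed.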

The causal mask lives inside the MHA implementation (Phase 15). Lab 03 verifies, by perturbation, that the mask is correctly wired all the way to the output.
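The idea behind a perturbation check can be sketched on a single attention layer: change the input at one position and confirm that no earlier output moves. This is not the lab's code. It assumes a toy single-head attention without RoPE or an output projection, and random weights.

```python
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    """Toy single-head causal self-attention. x: (T, d)."""
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    allowed = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(allowed, scores, -np.inf)   # causal mask
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
T, d = 6, 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
x = rng.normal(size=(T, d))
out = causal_attention(x, Wq, Wk, Wv)

pos = 3                      # perturb the input at this position only
x_pert = x.copy()
x_pert[pos] += 1.0
out_pert = causal_attention(x_pert, Wq, Wk, Wv)

earlier_unchanged = np.allclose(out[:pos], out_pert[:pos])
later_changed = not np.allclose(out[pos:], out_pert[pos:])
print(earlier_unchanged, later_changed)
```

If the mask were missing or transposed, outputs before `pos` would change as well, which is the failure this kind of check is meant to expose.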

What the block looks like in code (sketch only; implementation in lab)

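class TransformerBlock:
    """Pre-LN block: each sublayer reads LN(x) and adds its output back to x."""

    def __init__(self, d_model, n_heads, d_ff):
        self.ln1 = LayerNorm(d_model)                     # norm before attention
        self.attn = MultiHeadAttention(d_model, n_heads)  # from Phase 15
        self.ln2 = LayerNorm(d_model)                     # norm before FFN
        self.ffn = FFN(d_model, d_ff)

    def forward(self, x):
        # x: (T, d_model); the output has the same shape
        x = x + self.attn(self.ln1(x))  # attention sublayer + residual add
        x = x + self.ffn(self.ln2(x))   # FFN sublayer + residual add
        return x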

Five non-trivial lines. Phase 17's Lab 00 builds this by hand; Lab 01 stacks it.

What this file does NOT cover

The attention math. Phase 15. Treat MHA as: input \((T, d_\text{model})\), output \((T, d_\text{model})\), applies Q/K/V projections, RoPE on Q and K, scaled dot-product attention with causal mask, output projection.
The FFN. Next file (02-ffn-and-activations.md).
Tied embedding + LM head. File 03.

Next: 02-ffn-and-activations.md