Lab 00 — One transformer block, by hand

Read theory/01-transformer-block.md and theory/02-ffn-and-activations.md first. Do not consult solutions/.

Objective

Implement a single Pre-LN transformer block in NumPy and verify the forward pass against a hand-derived reference on a tiny toy configuration. The point is complete numerical control — if you can't trace one block on a 2-token, \(d_\text{model} = 4\) example, you don't understand it.

Setup

A new module, src/minimodel/transformer/, with three implementation files plus __init__.py:

src/minimodel/transformer/
├── __init__.py
├── layer_norm.py
├── ffn.py
└── block.py

Each file gets a BLUEPRINT.md first (you've done this drill in Phases 7, 8, 13, 15 — same pattern). MultiHeadAttention is imported from Phase 15's module.

Tasks

Task 1 — `LayerNorm`

In src/minimodel/transformer/layer_norm.py:

import numpy as np
from numpy.typing import NDArray


class LayerNorm:
    """LayerNorm over the last axis. Parameters: gamma (scale), beta (shift)."""
    def __init__(self, d_model: int, eps: float = 1e-5):
        ...

    def __call__(self, x: NDArray[np.float64]) -> NDArray[np.float64]:
        """x: (..., d_model) → (..., d_model). Normalises over the last axis."""

Constraints:

Pure NumPy. The backward pass goes through the Phase 8 autograd tensor, so your LayerNorm should accept and return autograd tensors, not raw arrays. (This assumes your Phase 8 autograd interface is finished; confirm at phase-open.)
Validate shapes: input's last axis must equal d_model.
Compute \(\mu, \sigma^2\) over the last axis, not over batch or sequence.

Property tests (add to tests/test_phase17_layer_norm.py):

Zero mean, unit variance after normalisation. For input \(x \sim \mathcal{N}(0, I)\), LN(x) (with \(\gamma = 1, \beta = 0\)) should have \(\approx 0\) mean and \(\approx 1\) variance along the last axis.
Identity under \(\gamma = 1, \beta = 0\) on already-normalised input.
Shift / scale recovery. Set \(\gamma = 2, \beta = 5\); verify output mean ≈ 5, variance ≈ 4.
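
The three properties above can be sketched on raw arrays before the autograd-backed class exists. `layer_norm_ref` is a hypothetical helper, not the `LayerNorm` class this task asks for, and it assumes the biased variance (dividing by \(d_\text{model}\)); adjust if your theory notes define it differently.

```python
import numpy as np


def layer_norm_ref(x, gamma, beta, eps=1e-5):
    """Raw-array LayerNorm over the last axis (hypothetical helper, no autograd)."""
    if x.shape[-1] != gamma.shape[-1]:
        raise ValueError(f"last axis is {x.shape[-1]}, expected {gamma.shape[-1]}")
    mu = x.mean(axis=-1, keepdims=True)   # per-row mean, shape (..., 1)
    var = x.var(axis=-1, keepdims=True)   # biased variance (assumption)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta


rng = np.random.default_rng(0)
d_model = 64
x = rng.standard_normal((3, 5, d_model))  # (batch, sequence, d_model)
ones, zeros = np.ones(d_model), np.zeros(d_model)

# 1. Zero mean, unit variance along the last axis.
y = layer_norm_ref(x, ones, zeros)
assert np.allclose(y.mean(axis=-1), 0.0, atol=1e-6)
assert np.allclose(y.var(axis=-1), 1.0, atol=1e-3)

# 2. Near-identity on already-normalised input (eps makes it approximate).
assert np.allclose(layer_norm_ref(y, ones, zeros), y, atol=1e-3)

# 3. Shift / scale recovery with gamma = 2, beta = 5.
z = layer_norm_ref(x, 2.0 * ones, 5.0 * ones)
assert np.allclose(z.mean(axis=-1), 5.0, atol=1e-6)
assert np.allclose(z.var(axis=-1), 4.0, atol=1e-2)
```

The tolerances on the variance checks are looser than on the mean because `eps` pulls the normalised variance slightly below 1.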

Task 2 — `FFN`

In src/minimodel/transformer/ffn.py:

class FFN:
    """Position-wise feed-forward: GELU(x W1 + b1) W2 + b2."""
    def __init__(self, d_model: int, d_ff: int):
        ...

    def __call__(self, x): ...

Constraints:

Use the approximate GELU: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3))).
Initialise \(W_1, W_2\) with Xavier (Phase 10 default).
Biases zero at init.
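
As a rough illustration of these constraints, the sketch below writes the approximate GELU and one common form of Xavier initialisation on raw arrays. The names `gelu_tanh`, `xavier_uniform`, and `ffn_ref` are hypothetical, and the uniform variant of Xavier is an assumption; use whatever your Phase 10 default actually is.

```python
import numpy as np


def gelu_tanh(x):
    """Approximate (tanh-based) GELU, as specified in the constraints."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform init (assumed variant): U(-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def ffn_ref(x, W1, b1, W2, b2):
    """Position-wise feed-forward on raw arrays: GELU(x W1 + b1) W2 + b2."""
    return gelu_tanh(x @ W1 + b1) @ W2 + b2


rng = np.random.default_rng(0)
d_model, d_ff = 4, 8
W1, b1 = xavier_uniform(d_model, d_ff, rng), np.zeros(d_ff)     # biases zero at init
W2, b2 = xavier_uniform(d_ff, d_model, rng), np.zeros(d_model)

x = rng.standard_normal((2, d_model))   # T = 2 tokens
out = ffn_ref(x, W1, b1, W2, b2)
print(out.shape)  # (2, 4)
```

Because the FFN is position-wise, each row of `x` is transformed independently; permuting the rows of the input permutes the rows of the output in the same way.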

Task 3 — `TransformerBlock`

In src/minimodel/transformer/block.py:

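from .ffn import FFN
from .layer_norm import LayerNorm
# MultiHeadAttention: import from your Phase 15 module (path depends on your layout).


class TransformerBlock: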
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        self.ln1 = LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)  # from Phase 15
        self.ln2 = LayerNorm(d_model)
        self.ffn = FFN(d_model, d_ff)

    def __call__(self, x):
        # Pre-LN: norm → sublayer → residual add
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x

Task 4 — hand-derived reference

Pick a tiny config: \(d_\text{model} = 4, n_\text{heads} = 1, d_\text{ff} = 8\), \(T = 2\) tokens. Initialise all parameters with explicit small values you write down (not random — use a fixed list of fractions like \(0.1, 0.2, \ldots\)). Compute the block's forward pass on paper, step by step:

1. \(\text{LN}_1(x)\): compute \(\mu, \sigma\) per row, normalise, scale, shift.
2. \(\text{MHA}(\cdot)\): Q, K, V projections, dot products, softmax, output projection. (With one head, this is a single attention.)
3. Residual: \(z = x + \text{MHA output}\).
4. \(\text{LN}_2(z)\): same procedure as step 1, on \(z\).
5. \(\text{FFN}(\cdot)\): up-project, GELU, down-project.
6. Final residual: \(y = z + \text{FFN output}\).

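Now run your NumPy implementation on the same inputs and parameters. Assert np.allclose(numpy_result, hand_result, atol=1e-5). Note that np.allclose also applies its default rtol=1e-5 on top of atol; pass rtol=0 if you want a strictly absolute check.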
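
One way to organise the comparison is stage by stage, so that a mismatch points at a specific step rather than at the whole block. The sketch below is only an illustration: `fixed_params` and `max_abs_errors` are hypothetical helpers, and it checks just the \(\text{LN}_1\) stage with \(\gamma = 1, \beta = 0\), \(\epsilon = 10^{-5}\), with the paper arithmetic written out in the comments.

```python
import numpy as np


def fixed_params(shape, start=0.1, step=0.1):
    """Fill an array with start, start + step, ... in row-major order."""
    n = int(np.prod(shape))
    return (start + step * np.arange(n)).reshape(shape)


def max_abs_errors(hand, computed):
    """Map each stage name to the max absolute error between the two results."""
    return {
        name: float(np.max(np.abs(np.asarray(hand[name]) - computed[name])))
        for name in hand
    }


# Toy input: T = 2 tokens, d_model = 4 -> rows [0.1..0.4] and [0.5..0.8].
x = fixed_params((2, 4))

# Paper values for LN_1: each row has deviations (-0.15, -0.05, 0.05, 0.15)
# from its mean and variance 0.05 / 4 = 0.0125, so both rows normalise to
# the same vector.
row = np.array([-0.15, -0.05, 0.05, 0.15]) / np.sqrt(0.0125 + 1e-5)
hand = {"ln1": np.stack([row, row])}

# Values from code (a raw-array LayerNorm written inline for this sketch).
mu = x.mean(axis=-1, keepdims=True)
var = x.var(axis=-1, keepdims=True)
computed = {"ln1": (x - mu) / np.sqrt(var + 1e-5)}

errors = max_abs_errors(hand, computed)
assert all(err < 1e-5 for err in errors.values()), errors
```

Adding one dictionary entry per step (attention output, first residual, and so on) extends the same check to the full block.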

This is tedious. It will take 1-2 hours. Do it anyway. You will catch:

LayerNorm axis bugs (normalising over the wrong axis).
GELU approximation errors (using the exact form when you said approximate, or vice versa).
Residual ordering bugs (x + LN(attn(x)) instead of x + attn(LN(x))).
Q/K/V shape confusion.

Task 5 — sanity tests

Add to tests/test_phase17_block.py:

Shape preservation. Block input \((T, d_\text{model})\) → output \((T, d_\text{model})\), same shape.
No-op-at-zero. If all parameters are zero (or close), the block should be near-identity (since both sublayers contribute ~0). Verify \(|\text{block}(x) - x| < 10^{-3}\) for normalised input.
Differentiability check. A forward + backward + gradient norm check using the Phase 8 autograd. (If autograd not yet wired through the block, skip and document.)
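
The first two tests might look roughly like the sketch below. To keep it runnable on its own it uses `block_ref`, a hypothetical raw-array stand-in; in your test file you would call your own `TransformerBlock` instead. The stand-in assumes no attention mask and no biases in the attention projections, which may not match your Phase 15 module.

```python
import numpy as np


def gelu_tanh(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta


def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))  # subtract row max for stability
    return e / e.sum(axis=-1, keepdims=True)


def block_ref(x, p):
    """Stand-in single-head Pre-LN block on raw (T, d_model) arrays (assumed layout)."""
    h = layer_norm(x, p["g1"], p["b1"])
    q, k, v = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    z = x + attn @ p["Wo"]                                   # first residual
    h2 = layer_norm(z, p["g2"], p["b2"])
    return z + gelu_tanh(h2 @ p["W1"] + p["c1"]) @ p["W2"] + p["c2"]


def zero_params(d_model, d_ff):
    """All-zero parameter dictionary with the shapes block_ref expects."""
    shapes = {
        "g1": (d_model,), "b1": (d_model,), "g2": (d_model,), "b2": (d_model,),
        "Wq": (d_model, d_model), "Wk": (d_model, d_model),
        "Wv": (d_model, d_model), "Wo": (d_model, d_model),
        "W1": (d_model, d_ff), "c1": (d_ff,),
        "W2": (d_ff, d_model), "c2": (d_model,),
    }
    return {name: np.zeros(shape) for name, shape in shapes.items()}


def test_shape_preserved():
    rng = np.random.default_rng(0)
    p = {k: 0.1 * rng.standard_normal(v.shape) for k, v in zero_params(4, 8).items()}
    x = rng.standard_normal((2, 4))
    assert block_ref(x, p).shape == x.shape


def test_noop_at_zero():
    x = np.random.default_rng(1).standard_normal((2, 4))
    y = block_ref(x, zero_params(4, 8))
    assert np.max(np.abs(y - x)) < 1e-3


test_shape_preserved()
test_noop_at_zero()
```

With every parameter at zero, both sublayers output exactly zero here, so the residual path passes `x` through unchanged; the `1e-3` bound leaves room for the "or close" case.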

Measurements to capture

Wall-clock for one block forward on \((T, d_\text{model}) = (8, 64)\). Should be sub-millisecond on Borja's CPU.
Hand-derived vs NumPy match: max absolute error < \(10^{-5}\). Save as experiments/<date>-phase-17-block-by-hand/diff.csv.
Manifest: experiments/<date>-phase-17-block-by-hand/manifest.json per src/utils/seeding.py.
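
For the wall-clock number, a median over repeated calls tends to be more stable than a single timing. The sketch below is one possible harness; `median_forward_seconds` is a hypothetical helper, and the matrix multiply is only a placeholder for your block's forward pass.

```python
import time

import numpy as np


def median_forward_seconds(forward, x, repeats=200, warmup=10):
    """Median wall-clock time of forward(x) over `repeats` calls, in seconds."""
    for _ in range(warmup):          # warm caches before measuring
        forward(x)
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        forward(x)
        samples.append(time.perf_counter() - t0)
    return float(np.median(samples))


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))      # (T, d_model) = (8, 64)
W = rng.standard_normal((64, 64))


def stand_in(a):
    """Placeholder workload; replace with a call to your block."""
    return a @ W


t = median_forward_seconds(stand_in, x)
print(f"median forward: {t * 1e6:.1f} microseconds")
```

Record the repeat count alongside the number so the measurement can be reproduced.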

Acceptance

layer_norm.py, ffn.py, block.py all exist and pass shape tests.
Hand-derived 2-token, \(d_\text{model}=4\) example matches NumPy to \(10^{-5}\).
Property tests for LayerNorm, FFN, and block pass.
No PyTorch used. No transformers lib used.
Lab notes capture at least one bug caught by the hand-derived check.

Pitfalls to expect

GELU approximate vs exact. Standardise on one. If you use the exact erf-based GELU (for example via scipy.special.erf) for the hand reference and the approximate tanh-based GELU in code, they disagree by ~0.0001 and your 1e-5 check fails. Use the approximate form in both.
LayerNorm axis. x.mean(axis=-1, keepdims=True), not axis=0. The keepdims is mandatory for broadcasting.
MHA output shape. With \(n_\text{heads} = 1\), MHA collapses to single-head attention, but the output projection \(W_O\) still runs (it's \(d_\text{model} \to d_\text{model}\)). Don't skip it.
__call__ vs forward. Your Phase 7/8 modules used forward(). The PyTorch convention is for __call__ to wrap forward(). Stay consistent with the rest of src/minimodel.
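
The first two pitfalls are easy to see numerically. The sketch below measures the gap between the two GELU forms on a grid and shows what happens without `keepdims`. It uses `math.erf` from the standard library rather than SciPy, purely to stay dependency-free; the helper names are hypothetical.

```python
import math

import numpy as np


def gelu_tanh(x):
    """Approximate (tanh-based) GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def gelu_erf(x):
    """Exact GELU via math.erf, applied elementwise."""
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))


# Pitfall 1: the two GELU forms differ by more than the 1e-5 tolerance.
grid = np.linspace(-5.0, 5.0, 2001)
gap = float(np.max(np.abs(gelu_tanh(grid) - gelu_erf(grid))))
print(f"max |tanh GELU - erf GELU| on [-5, 5]: {gap:.2e}")
assert gap > 1e-5

# Pitfall 2: keepdims keeps the reduced axis so the mean broadcasts per row.
a = np.arange(8.0).reshape(2, 4)            # (T, d_model) = (2, 4)
mu_keep = a.mean(axis=-1, keepdims=True)    # shape (2, 1)
centred = a - mu_keep                       # each row now sums to zero

mu_flat = a.mean(axis=-1)                   # shape (2,), reduced axis dropped
try:
    a - mu_flat                             # (2, 4) vs (2,) cannot broadcast
    raised = False
except ValueError:
    raised = True
print("subtraction without keepdims raised ValueError:", raised)
```

When \(T\) happens to equal \(d_\text{model}\), the subtraction without `keepdims` does not raise; it silently broadcasts along the wrong axis, which is the harder bug to spot.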

Next: 01-assemble-mini-gpt.md