Lab 00 — One transformer block, by hand
Read theory/01-transformer-block.md and theory/02-ffn-and-activations.md first. Do not consult solutions/.
Objective
Implement a single Pre-LN transformer block in NumPy and verify the forward pass against a hand-derived reference on a tiny toy configuration. The point is complete numerical control — if you can't trace one block on a 2-token, \(d_\text{model} = 4\) example, you don't understand it.
Setup
A new module: src/minimodel/transformer/. Three files:
- layer_norm.py
- ffn.py
- block.py
Each file gets a BLUEPRINT.md first (you've done this drill in Phases 7, 8, 13, 15 — same pattern). MultiHeadAttention is imported from Phase 15's module.
Tasks
Task 1 — LayerNorm
In src/minimodel/transformer/layer_norm.py:
    import numpy as np
    from numpy.typing import NDArray


    class LayerNorm:
        """LayerNorm over the last axis. Parameters: gamma (scale), beta (shift)."""

        def __init__(self, d_model: int, eps: float = 1e-5):
            ...

        def __call__(self, x: NDArray[np.float64]) -> NDArray[np.float64]:
            """x: (..., d_model) → (..., d_model). Normalises over the last axis."""
Constraints:
- Pure NumPy. The backward pass goes through the Phase 8 autograd tensor, so your LayerNorm should accept and return autograd tensors, not raw arrays. (This applies if your Phase 8 autograd interface is finished; confirm at phase-open.)
- Validate shapes: the input's last axis must equal d_model.
- Compute \(\mu, \sigma^2\) over the last axis, not over batch or sequence.
Property tests (add to tests/test_phase17_layer_norm.py):
- Zero mean, unit variance after normalisation. For input \(x \sim \mathcal{N}(0, I)\), LN(x) (with \(\gamma = 1, \beta = 0\)) should have \(\approx 0\) mean and \(\approx 1\) variance along the last axis.
- Identity under \(\gamma = 1, \beta = 0\) on already-normalised input.
- Shift / scale recovery. Set \(\gamma = 2, \beta = 5\); verify output mean ≈ 5, variance ≈ 4.
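The three properties above can be checked against a plain-array reference before the autograd version exists. This is a minimal sketch, not the lab solution: `layer_norm_ref` is a hypothetical helper, it works on raw NumPy arrays instead of Phase 8 tensors, and it assumes the biased variance (dividing by d_model) with `eps` added inside the square root.

```python
import numpy as np


def layer_norm_ref(x, gamma, beta, eps=1e-5):
    """Plain-array LayerNorm over the last axis (hypothetical reference helper)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)  # biased variance: divides by d_model
    return gamma * (x - mu) / np.sqrt(var + eps) + beta


rng = np.random.default_rng(0)
d_model = 64
x = rng.standard_normal((8, d_model))
ones, zeros = np.ones(d_model), np.zeros(d_model)

# Zero mean, unit variance with gamma = 1, beta = 0.
y = layer_norm_ref(x, ones, zeros)
assert np.allclose(y.mean(axis=-1), 0.0, atol=1e-6)
assert np.allclose(y.var(axis=-1), 1.0, atol=1e-3)

# Identity on already-normalised input (up to the eps in the denominator).
assert np.allclose(layer_norm_ref(y, ones, zeros), y, atol=1e-3)

# Shift / scale recovery with gamma = 2, beta = 5.
z = layer_norm_ref(x, 2.0 * ones, 5.0 * ones)
assert np.allclose(z.mean(axis=-1), 5.0, atol=1e-6)
assert np.allclose(z.var(axis=-1), 4.0, atol=1e-2)
```

The variance tolerances are looser than the mean tolerances because `eps` makes the normalised variance slightly smaller than 1.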
Task 2 — FFN
In src/minimodel/transformer/ffn.py:
    class FFN:
        """Position-wise feed-forward: GELU(x W1 + b1) W2 + b2."""

        def __init__(self, d_model: int, d_ff: int):
            ...

        def __call__(self, x): ...
Constraints:
- Use the approximate GELU: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3))).
- Initialise \(W_1, W_2\) with Xavier (Phase 10 default).
- Biases zero at init.
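As a rough illustration of these constraints, the sketch below writes the tanh GELU exactly as given, a position-wise forward pass on raw arrays, and a Xavier initialiser. The helper names are hypothetical, and the uniform Glorot variant is an assumption: the Phase 10 default may be the normal variant or use a different gain.

```python
import math

import numpy as np


def gelu_tanh(x):
    """Approximate (tanh) GELU, matching the formula in the constraints."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def gelu_exact(v):
    """Exact GELU for one scalar, via the standard-library math.erf."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def xavier_uniform(rng, fan_in, fan_out):
    """Glorot/Xavier uniform init (assumed; the Phase 10 default may differ)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def ffn_forward(x, W1, b1, W2, b2):
    """GELU(x W1 + b1) W2 + b2, applied independently to each position."""
    return gelu_tanh(x @ W1 + b1) @ W2 + b2


rng = np.random.default_rng(0)
d_model, d_ff = 4, 8
W1, W2 = xavier_uniform(rng, d_model, d_ff), xavier_uniform(rng, d_ff, d_model)
b1, b2 = np.zeros(d_ff), np.zeros(d_model)  # biases zero at init

x = rng.standard_normal((2, d_model))
out = ffn_forward(x, W1, b1, W2, b2)
assert out.shape == x.shape

# The two GELU forms differ by more than the lab's 1e-5 tolerance at x = 1.
gap = abs(gelu_tanh(np.array([1.0]))[0] - gelu_exact(1.0))
assert 1e-5 < gap < 1e-3
```

Because the FFN is position-wise, running it on a single row gives the same result as the matching row of the batched output, which makes a cheap extra test.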
Task 3 — TransformerBlock
In src/minimodel/transformer/block.py:
    class TransformerBlock:
        def __init__(self, d_model: int, n_heads: int, d_ff: int):
            self.ln1 = LayerNorm(d_model)
            self.attn = MultiHeadAttention(d_model, n_heads)  # from Phase 15
            self.ln2 = LayerNorm(d_model)
            self.ffn = FFN(d_model, d_ff)

        def __call__(self, x):
            # Pre-LN: norm → sublayer → residual add
            x = x + self.attn(self.ln1(x))
            x = x + self.ffn(self.ln2(x))
            return x
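To see why the order inside each residual line matters, the sketch below compares the Pre-LN wiring with the swapped ordering on a tiny input. `toy_sublayer` is a hypothetical stand-in for attention or the FFN (a fixed linear map), and `ln` fixes gamma = 1, beta = 0; neither is part of the lab code.

```python
import numpy as np


def ln(x, eps=1e-5):
    """LayerNorm with gamma = 1, beta = 0, over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def toy_sublayer(h):
    """Stand-in for attention or the FFN: a fixed linear map (hypothetical)."""
    W = np.arange(16, dtype=np.float64).reshape(4, 4) / 10.0
    return h @ W


x = np.array([[1.0, 2.0, 3.0, 5.0],
              [0.5, -1.0, 2.0, 0.0]])

pre_ln = x + toy_sublayer(ln(x))   # the wiring used in the block above
swapped = x + ln(toy_sublayer(x))  # norm applied after the sublayer instead

# Both variants preserve the shape, so a shape test alone cannot tell them apart.
assert pre_ln.shape == swapped.shape == x.shape
assert not np.allclose(pre_ln, swapped)
```

Only a value-level comparison, such as the hand-derived reference in Task 4, distinguishes the two.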
Task 4 — hand-derived reference
Pick a tiny config: \(d_\text{model} = 4, n_\text{heads} = 1, d_\text{ff} = 8\), \(T = 2\) tokens. Initialise all parameters with explicit small values you write down (not random — use a fixed list of fractions like \(0.1, 0.2, \ldots\)). Compute the block's forward pass on paper, step by step:
1. \(\text{LN}_1(x)\): compute \(\mu, \sigma\) per row, normalise, scale, shift.
2. \(\text{MHA}(\cdot)\): Q, K, V projections, dot products, softmax, output projection. (With one head, this is a single attention.)
3. Residual: \(z = x + \text{MHA output}\).
4. \(\text{LN}_2(z)\): same procedure as step 1, on \(z\).
5. \(\text{FFN}(\cdot)\): up-project, GELU, down-project.
6. Final residual: \(y = z + \text{FFN output}\).
Now run your NumPy implementation on the same inputs and parameters. Assert np.allclose(numpy_result, hand_result, atol=1e-5).
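For orientation, the six steps can be traced on raw arrays as below, printing every intermediate so each one can be compared with the paper derivation. This is a sketch under stated assumptions, not the Phase 15 module: attention here has no mask, no projection biases, and scores scaled by 1/sqrt(d_model) (one head, so d_head = d_model). The `grid` helper and all parameter values are illustrative choices. Non-trivial gamma and beta are used because with gamma = 1, beta = 0 every normalised row sums to zero, and evenly spaced weight matrices then produce identical columns.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta


def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))  # row max subtracted for stability
    return e / e.sum(axis=-1, keepdims=True)


def gelu_tanh(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def grid(rows, cols, start):
    """Written-down parameters: start, start + 0.01, ... in row-major order."""
    return start + 0.01 * np.arange(rows * cols, dtype=np.float64).reshape(rows, cols)


d_model, d_ff = 4, 8
x = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.4, 0.1, 0.3, 0.2]])  # T = 2 tokens

gamma1, beta1 = np.array([1.0, 0.5, 2.0, 1.5]), np.array([0.1, 0.2, 0.3, 0.4])
gamma2, beta2 = np.array([0.5, 1.0, 1.5, 2.0]), np.array([0.4, 0.3, 0.2, 0.1])
Wq, Wk, Wv, Wo = (grid(d_model, d_model, s) for s in (0.01, 0.02, 0.03, 0.04))
W1, b1 = grid(d_model, d_ff, 0.05), np.zeros(d_ff)
W2, b2 = grid(d_ff, d_model, 0.06), np.zeros(d_model)

h1 = layer_norm(x, gamma1, beta1)               # step 1: LN_1(x)
q, k, v = h1 @ Wq, h1 @ Wk, h1 @ Wv             # step 2: Q, K, V projections
weights = softmax(q @ k.T / np.sqrt(d_model))   #         scaled dot products, softmax
attn_out = (weights @ v) @ Wo                   #         output projection W_O
z = x + attn_out                                # step 3: first residual
h2 = layer_norm(z, gamma2, beta2)               # step 4: LN_2(z)
ffn_out = gelu_tanh(h2 @ W1 + b1) @ W2 + b2     # step 5: up-project, GELU, down-project
y = z + ffn_out                                 # step 6: final residual

assert y.shape == x.shape
assert np.allclose(weights.sum(axis=-1), 1.0)   # each attention row is a distribution
for name, value in [("h1", h1), ("weights", weights), ("z", z), ("h2", h2), ("y", y)]:
    print(name, np.round(value, 6), sep="\n")   # intermediates to compare with paper
```

Comparing intermediates instead of only the final `y` localises a mismatch to a single step, which is usually where the bugs listed below show up.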
This is tedious. It will take 1-2 hours. Do it anyway. You will catch:
- LayerNorm axis bugs (normalising over the wrong axis).
- GELU approximation errors (using the exact form when you said approximate, or vice versa).
- Residual ordering bugs (x + LN(attn(x)) instead of x + attn(LN(x))).
- Q/K/V shape confusion.
Task 5 — sanity tests
Add to tests/test_phase17_block.py:
- Shape preservation. Block input \((T, d_\text{model})\) → output \((T, d_\text{model})\), same shape.
- No-op-at-zero. If all parameters are zero (or close), the block should be near-identity (since both sublayers contribute ~0). Verify \(|\text{block}(x) - x| < 10^{-3}\) for normalised input.
- Differentiability check. Run a forward pass, a backward pass, and a gradient-norm check using the Phase 8 autograd. (If autograd is not yet wired through the block, skip and document.)
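The first two tests can be written against any callable block. The sketch below is one possible shape for them: `make_block` is a hypothetical stand-in (single head, raw arrays, LayerNorm fixed at gamma = 1, beta = 0) used only so the example runs on its own, and the near-identity check interprets \(|\text{block}(x) - x|\) as the maximum absolute difference, which is an assumption. In the lab, pass your real TransformerBlock instead.

```python
import numpy as np


def _ln(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def _gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def make_block(d_model, d_ff, scale, rng):
    """Stand-in single-head Pre-LN block with weights drawn at the given scale."""
    Wq, Wk, Wv, Wo = (scale * rng.standard_normal((d_model, d_model)) for _ in range(4))
    W1 = scale * rng.standard_normal((d_model, d_ff))
    W2 = scale * rng.standard_normal((d_ff, d_model))

    def block(x):
        h = _ln(x)
        s = (h @ Wq) @ (h @ Wk).T / np.sqrt(d_model)
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)
        z = x + (w @ (h @ Wv)) @ Wo
        return z + _gelu(_ln(z) @ W1) @ W2

    return block


def check_shape_preserved(block, x):
    assert block(x).shape == x.shape


def check_near_identity(block, x, tol=1e-3):
    assert np.max(np.abs(block(x) - x)) < tol


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))  # normalised input, (T, d_model) = (8, 64)

check_shape_preserved(make_block(64, 256, 1.0, rng), x)
check_near_identity(make_block(64, 256, 0.0, rng), x)   # all weights exactly zero
check_near_identity(make_block(64, 256, 1e-4, rng), x)  # weights close to zero
```

With all weights at zero both sublayers return exactly zero, so the block reduces to the identity; the 1e-4 case covers the "or close" wording.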
Measurements to capture
- Wall-clock for one block forward on \((T, d_\text{model}) = (8, 64)\). Should be sub-millisecond on Borja's CPU.
- Hand-derived vs NumPy match: max absolute error < \(10^{-5}\). Save as experiments/<date>-phase-17-block-by-hand/diff.csv.
- Manifest: experiments/<date>-phase-17-block-by-hand/manifest.json per src/utils/seeding.py.
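One way to capture the first two measurements is sketched below. Everything here is illustrative: the helper names and CSV column names are hypothetical, the timed function is a stand-in for your block, the two arrays are placeholders in place of real results, and the file is written to a temporary directory instead of the experiments/ path. The manifest is left out because its format is defined by src/utils/seeding.py.

```python
import csv
import tempfile
import time
from pathlib import Path

import numpy as np


def median_wall_clock(fn, x, repeats=200):
    """Median seconds per call of fn(x), measured with time.perf_counter."""
    fn(x)  # warm-up call, excluded from the timings
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(x)
        samples.append(time.perf_counter() - t0)
    return float(np.median(samples))


def write_diff_csv(path, numpy_result, hand_result):
    """Write one row per element and return the max absolute error."""
    err = np.abs(numpy_result - hand_result)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["row", "col", "numpy", "hand", "abs_err"])  # hypothetical columns
        for (i, j), e in np.ndenumerate(err):
            writer.writerow([i, j, numpy_result[i, j], hand_result[i, j], e])
    return float(err.max())


# Stand-in for block(x); replace with your TransformerBlock instance.
seconds = median_wall_clock(lambda a: a + np.tanh(a), np.zeros((8, 64)))
print(f"median forward time: {seconds * 1e3:.4f} ms")

# Placeholder arrays only, so the sketch runs; use your real results here.
numpy_result = np.zeros((2, 4))
hand_result = numpy_result + 1e-7
out_path = Path(tempfile.mkdtemp()) / "diff.csv"
max_err = write_diff_csv(out_path, numpy_result, hand_result)
assert max_err < 1e-5
```

The median is used instead of the mean so that a single slow call, for example from a context switch, does not dominate the reported time.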
Acceptance
- layer_norm.py, ffn.py, block.py all exist and pass shape tests.
- Hand-derived 2-token, \(d_\text{model}=4\) example matches NumPy to \(10^{-5}\).
- Property tests for LayerNorm, FFN, and block pass.
- No PyTorch used. No transformers lib used.
- Lab notes capture at least one bug caught by the hand-derived check.
Pitfalls to expect
- GELU approximate vs exact. Standardise on one. If you use the exact GELU based on scipy.special.erf for the hand reference and the approximate tanh-based GELU in code, they disagree by ~0.0001 and your 1e-5 check fails. Use the approximate form in both.
- LayerNorm axis. Use x.mean(axis=-1, keepdims=True), not axis=0. The keepdims is mandatory for broadcasting.
- MHA output shape. With \(n_\text{heads} = 1\), MHA collapses to single-head attention, but the output projection \(W_O\) still runs (it's \(d_\text{model} \to d_\text{model}\)). Don't skip it.
- __call__ vs forward. Your Phase 7/8 modules used forward(). PyTorch convention is __call__ wrapping forward. Stay consistent with the rest of src/minimodel.
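The LayerNorm axis pitfall is easy to reproduce on a 2-token, \(d_\text{model} = 4\) array. The short sketch below shows the three variants side by side; the variable names are illustrative only.

```python
import numpy as np

x = np.arange(8, dtype=np.float64).reshape(2, 4)  # (T, d_model) = (2, 4)

mu_good = x.mean(axis=-1, keepdims=True)  # shape (2, 1): broadcasts row-wise
mu_flat = x.mean(axis=-1)                 # shape (2,): cannot broadcast against (2, 4)
mu_wrong_axis = x.mean(axis=0)            # shape (4,): broadcasts, but averages over tokens

assert (x - mu_good).shape == (2, 4)
assert np.allclose((x - mu_good).mean(axis=-1), 0.0)

try:
    x - mu_flat
    raised = False
except ValueError:
    raised = True
assert raised  # (2, 4) minus (2,) is a broadcasting error

# The axis=0 version runs without error, which is what makes it dangerous.
assert not np.allclose((x - mu_wrong_axis).mean(axis=-1), 0.0)
```

Dropping keepdims only raises here because \(T \neq d_\text{model}\). With a square input such as (4, 4), the subtraction would broadcast silently along the wrong axis.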
Next: 01-assemble-mini-gpt.md