Lab 00 — Attention by Hand¶
Goal: derive a 2-token, single-head attention computation on paper, then implement single-head attention in NumPy and verify the two agree to 1e-5.
Estimated time: 90–120 minutes.
Prereq: all five theory/ files read.
What you produce¶
A directory experiments/15-attention-by-hand/ containing:
- paper_derivation.md: your handwritten or typed step-by-step derivation of the toy example below. Numbers, not symbols.
- attention.py: your NumPy implementation, importing from src/minimodel/attention/attention.py.
- verify.py: script that runs your implementation on the toy example and asserts agreement with the paper numbers.
- verify_output.txt: captured printout showing both sets of numbers and the per-element difference.
- manifest.json.
- README.md (1–2 paragraphs).
The toy example¶
Tokens: \(T = 2\). Embedding dim: \(d = 2\). Per-head dim: \(d_k = d_v = 2\) (so single-head fills the full dimension).
Inputs:
Weights (chosen to give whole-number intermediate values when possible):
No mask. Single head. Scaled dot-product attention.
TODOs¶
Block A — derive on paper¶
In paper_derivation.md, without writing any code, compute step by step:
- \(Q = X W_Q\) — what's the matrix?
- \(K = X W_K\) — what's the matrix?
- \(V = X W_V\) — what's the matrix?
- \(S = Q K^\top\) — what's the matrix?
- \(S / \sqrt{d_k} = S / \sqrt{2}\) — divide elementwise.
- Apply softmax row-wise. For row 0: \(\text{softmax}((s_{00}, s_{01}) / \sqrt{2})\). Use the stability rewrite (subtract max).
- Repeat for row 1.
- Multiply \(A V\), where \(A\) is the matrix of softmax rows from the two previous steps. Show the resulting matrix.
Write out every intermediate matrix with all four entries filled in.
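As an illustration of the stability rewrite only, here is one softmax row worked through with hypothetical, already-scaled scores \((z_0, z_1) = (1, 3)\). These numbers are an assumption chosen for the example; they are not the lab's toy values.

\[
\text{softmax}(1, 3) = \frac{(e^{1-3},\ e^{3-3})}{e^{1-3} + e^{3-3}} = \frac{(e^{-2},\ 1)}{e^{-2} + 1} \approx \frac{(0.135335,\ 1)}{1.135335} \approx (0.119203,\ 0.880797)
\]

Subtracting the row max (3) leaves the result unchanged but keeps every exponent at or below zero, so no term can overflow. The two entries sum to 1, which is a quick check worth repeating for each row of your own derivation.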
Block B — NumPy implementation¶
Before writing code, read src/minimodel/attention/BLUEPRINT.md. Then in src/minimodel/attention/attention.py:
- Implement single_head_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray, mask: np.ndarray | None = None) -> np.ndarray.
- Use the stable softmax (softmax_stable helper, or inline it).
- Five lines max in the body. If you find yourself writing 20 lines, you're over-thinking.
Block C — verify¶
In verify.py:
- Set up the toy inputs and weights exactly as above.
- Compute Q, K, V using NumPy.
- Call single_head_attention(Q, K, V).
- Compare element-wise to your paper numbers. The max element-wise difference must be < 1e-5.
- Print both matrices side-by-side, with the per-element diff. Capture to verify_output.txt.
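The comparison step might be sketched as below. The helper name compare and both arrays are stand-ins invented for this example: in your verify.py, paper would hold the numbers from paper_derivation.md and code would be the return value of single_head_attention.

```python
import numpy as np


def compare(paper: np.ndarray, code: np.ndarray, tol: float = 1e-5) -> float:
    """Print both matrices entry by entry with the absolute difference,
    then assert that the largest difference is below tol."""
    diff = np.abs(paper - code)
    print(f"{'idx':>8} {'paper':>12} {'code':>12} {'abs diff':>12}")
    for idx in np.ndindex(paper.shape):
        print(f"{str(idx):>8} {paper[idx]:12.6f} {code[idx]:12.6f} {diff[idx]:12.2e}")
    max_diff = float(diff.max())
    assert max_diff < tol, f"max diff {max_diff:.2e} exceeds {tol:.0e}"
    return max_diff


# Stand-in values: "paper" is rounded to six decimals, "code" is full precision.
paper = np.array([[0.333333, 0.666667], [0.5, 0.5]])
code = np.array([[1 / 3, 2 / 3], [0.5, 0.5]])
max_diff = compare(paper, code)
print("max abs diff:", max_diff)
```

Redirecting standard output when running the script (for example `python verify.py > verify_output.txt`) is one simple way to capture the printout.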
Block D — explore: what happens without the scaling?¶
The variance argument in theory/02-scaled-dot-product.md says we divide by \(\sqrt{d_k}\) to prevent softmax saturation when \(d_k\) is large. Let's see it.
- In verify.py, run the toy example with \(d_k = 64\) instead of \(d_k = 2\) (draw the entries of \(X\) from \(\mathcal{N}(0, 1)\), \(W_*\) random orthogonal). Run twice: once with the \(/\sqrt{d_k}\) scaling, once without.
- For each, print the attention matrix A. The max entry of \(A\) per row should be:
  - Scaled: close to \(1/T\) if the queries are roughly orthogonal to the keys; softmax is doing its job.
  - Unscaled: very close to 1.0; one position dominates, softmax has saturated.
- Confirm this in the printout. Note it in README.md.
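A sketch of the experiment under stated assumptions: the sequence length T = 8 is a hypothetical choice (the lab does not fix T for this block), the seed 42 is taken from the manifest, and the random orthogonal weights are built from the QR decomposition of a Gaussian matrix, which is one common construction rather than necessarily the one the reference solution uses.

```python
import numpy as np


def softmax_stable(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction."""
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def random_orthogonal(rng: np.random.Generator, n: int) -> np.ndarray:
    """Orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q


rng = np.random.default_rng(42)
T, d_k = 8, 64  # T is a hypothetical choice for this sketch

X = rng.normal(size=(T, d_k))
W_Q = random_orthogonal(rng, d_k)
W_K = random_orthogonal(rng, d_k)

Q, K = X @ W_Q, X @ W_K
S = Q @ K.T

A_scaled = softmax_stable(S / np.sqrt(d_k))
A_unscaled = softmax_stable(S)

max_scaled = A_scaled.max(axis=-1)
max_unscaled = A_unscaled.max(axis=-1)
print("max entry per row, scaled:  ", max_scaled.round(3))
print("max entry per row, unscaled:", max_unscaled.round(3))
```

Dividing the scores by a factor greater than 1 can only lower each row's largest softmax entry, so the unscaled row maxima are never smaller than the scaled ones. How sharply the two cases separate on a single draw depends on the seed and on T, so inspect the printout rather than assuming every row crosses a given threshold.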
Block E — manifest¶
{
"experiment": "15-attention-by-hand",
"date": "YYYY-MM-DD",
"seed": 42,
"versions": { "python": "3.11.x", "numpy": "X.Y.Z" },
"results_summary": {
"max_abs_diff_paper_vs_code": null,
"softmax_max_entry_scaled_d_k_64": null,
"softmax_max_entry_unscaled_d_k_64": null
}
}
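If you would rather generate the manifest than type it by hand, something like the following could fill in the date and version fields automatically. The function name build_manifest is hypothetical; the keys mirror the template above, and the three result values are left as None until you have real numbers.

```python
import datetime
import json
import platform

import numpy as np


def build_manifest(max_abs_diff, max_entry_scaled, max_entry_unscaled, seed=42):
    """Assemble the manifest dict using the same keys as the template."""
    return {
        "experiment": "15-attention-by-hand",
        "date": datetime.date.today().isoformat(),
        "seed": seed,
        "versions": {
            "python": platform.python_version(),
            "numpy": np.__version__,
        },
        "results_summary": {
            "max_abs_diff_paper_vs_code": max_abs_diff,
            "softmax_max_entry_scaled_d_k_64": max_entry_scaled,
            "softmax_max_entry_unscaled_d_k_64": max_entry_unscaled,
        },
    }


# Placeholder results; replace None with the values your scripts measure.
manifest = build_manifest(None, None, None)
print(json.dumps(manifest, indent=2))
```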
Constraints¶
- No PyTorch. (Anti-goal §10.)
- Paper first, code second. If you write the NumPy first and then "derive" the paper version, you've defeated the lab. The point is to know what answer the math predicts before running anything.
- Stable softmax. Use max-subtraction. No naive exp(x) / sum(exp(x)).
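To see why the naive form is ruled out, the sketch below runs both versions on deliberately large scores. The input values are hypothetical and far bigger than anything the toy example produces; they are chosen only to trigger overflow in float64.

```python
import numpy as np


def softmax_naive(x: np.ndarray) -> np.ndarray:
    """Direct exp(x) / sum(exp(x)); overflows for large inputs."""
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def softmax_stable(x: np.ndarray) -> np.ndarray:
    """Same function, but with the row max subtracted before exponentiating."""
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


big = np.array([[1000.0, 1000.0]])  # hypothetical scores, chosen to overflow

# Silence the overflow/invalid warnings so the demo prints cleanly.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(big)
stable = softmax_stable(big)

print("naive: ", naive)
print("stable:", stable)
```

The naive version evaluates exp(1000), which is infinite in float64, and then divides infinity by infinity. The stable version only ever exponentiates values at or below zero.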
Stop conditions¶
Done when:
- All six files committed.
- max_abs_diff_paper_vs_code < 1e-5.
- The unscaled \(d_k = 64\) case clearly shows softmax saturation (max entry per row > 0.95).
- README.md describes both findings in 2–3 sentences each.
Pitfalls¶
- Softmax of \((s_0, s_1)\) at \(s_0 = s_1\) should give \((0.5, 0.5)\), not \((1, 0)\). Sanity-check by hand.
- np.sqrt(d_k) returns a NumPy scalar (np.float64, which is a subclass of Python float); you can divide a NumPy matrix by it directly. Don't construct a NumPy array for a scalar.
- Numerical precision in fp32. Your max diff might be 1e-7 or 1e-6 depending on the order of operations. 1e-5 is the locked threshold.
- W_K is not transposed at the weight level. Don't try to "fix" the asymmetry by transposing; the asymmetry is the point (see theory/01-query-key-value.md).
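Two of these pitfalls can be checked in a few lines. The matrices below are hypothetical values picked for the illustration, not the lab's toy weights.

```python
import numpy as np

# Pitfall: the scale factor is already a scalar, so divide by it directly.
scale = np.sqrt(2)
print(type(scale).__name__, isinstance(scale, float))

S_demo = np.array([[2.0, 4.0], [6.0, 8.0]])
scaled_demo = S_demo / scale  # shape is unchanged, no array wrapper needed

# Pitfall: Q K^T is generally not symmetric, because W_Q and W_K differ.
X = np.eye(2)                              # hypothetical input
W_Q = np.array([[1.0, 2.0], [0.0, 1.0]])   # hypothetical weights
W_K = np.array([[1.0, 0.0], [3.0, 1.0]])
Q, K = X @ W_Q, X @ W_K
scores = Q @ K.T
print(scores)
print("symmetric?", np.allclose(scores, scores.T))
```

With these values, the score of token 0 attending to token 1 differs from the score of token 1 attending to token 0, which is the asymmetry the last pitfall refers to.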
When to consult solutions/¶
After all six files committed and assertions pass. Solution at solutions/00-attention-by-hand-ref.md.
Next lab: 01-multi-head-attention.md.