Lab 00: Attention by Hand

Goal: derive a 2-token, single-head attention computation on paper, then implement single-head attention in NumPy and verify that the two agree to within 1e-5.

Estimated time: 90–120 minutes.

Prereq: all five theory/ files read.

What you produce

A directory experiments/15-attention-by-hand/ containing:

paper_derivation.md — your handwritten or typed step-by-step derivation of the toy example below. Numbers, not symbols.
attention.py — your NumPy implementation, importing from src/minimodel/attention/attention.py.
verify.py — script that runs your implementation on the toy example and asserts agreement with the paper numbers.
verify_output.txt — captured printout showing both sets of numbers and the per-element difference.
manifest.json.
README.md (1–2 paragraphs).

The toy example

Tokens: \(T = 2\). Embedding dim: \(d = 2\). Per-head dim: \(d_k = d_v = 2\) (so the single head spans the full embedding dimension).

Inputs:

\[ X = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \]

Weights (chosen to give whole-number intermediate values when possible):

\[ W_Q = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad W_K = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad W_V = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix} \]

No mask. Single head. Scaled dot-product attention.
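
For reference, the whole computation can be written compactly. This is the standard unmasked scaled dot-product form; \(S\) and \(A\) match the names used in Block A, while \(O\) for the output is a label introduced here for convenience and may not match the theory files.

\[ S = Q K^\top, \qquad A = \operatorname{softmax}_{\text{row}}\!\left(\frac{S}{\sqrt{d_k}}\right), \qquad O = A V \]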

TODOs

Block A: derive on paper

In paper_derivation.md, without writing any code, compute step by step:

\(Q = X W_Q\) — what's the matrix?
\(K = X W_K\) — what's the matrix?
\(V = X W_V\) — what's the matrix?
\(S = Q K^\top\) — what's the matrix?
\(S / \sqrt{d_k} = S / \sqrt{2}\) — divide elementwise.
Apply softmax row-wise. For row 0: \(\text{softmax}((s_{00}, s_{01}) / \sqrt{2})\). Use the stability rewrite (subtract max).
Repeat for row 1.
Multiply \(A V\). Show the resulting matrix.

Write out every intermediate matrix with all four entries filled in.
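
The stability rewrite in the softmax step relies on softmax being unchanged when the same constant is subtracted from every entry of a row. As a sketch for a two-entry row \(z = (z_0, z_1)\) with \(m = \max(z_0, z_1)\) (the names \(z\) and \(m\) are introduced here for illustration):

\[ \operatorname{softmax}(z)_i = \frac{e^{z_i}}{e^{z_0} + e^{z_1}} = \frac{e^{z_i - m}}{e^{z_0 - m} + e^{z_1 - m}} \]

After the subtraction the largest exponent is exactly 0, so one term in the denominator equals 1 and the other is at most 1.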

Block B: NumPy implementation

Before writing code, read src/minimodel/attention/BLUEPRINT.md. Then in src/minimodel/attention/attention.py:

Implement single_head_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray, mask: np.ndarray | None = None) -> np.ndarray.
Use the stable softmax (softmax_stable helper, or inline it).
Five lines max in the body. If you find yourself writing 20 lines, you're over-thinking.
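
Once the function exists, a few cheap property checks can catch shape and normalization mistakes before the paper comparison. The helper below is a sketch, not part of the repo: `check_attention_properties` is a hypothetical name, and it only assumes the signature given above. It does not implement attention itself. It relies on the fact that passing the identity matrix as V makes the output equal to the attention matrix.

```python
import numpy as np


def check_attention_properties(attn_fn, T=3, d_k=4, seed=0):
    """Run cheap sanity checks on any single-head attention callable.

    attn_fn is assumed to follow the lab's signature:
    attn_fn(Q, K, V, mask=None) -> array of shape (T, d_v).
    """
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((T, d_k))
    K = rng.standard_normal((T, d_k))

    # Shape check: one output row per query, d_v columns.
    d_v = 5
    V = rng.standard_normal((T, d_v))
    out = attn_fn(Q, K, V)
    assert out.shape == (T, d_v), f"expected {(T, d_v)}, got {out.shape}"

    # With V = I the output equals the attention matrix A itself,
    # so every row must be non-negative and sum to 1.
    A = attn_fn(Q, K, np.eye(T))
    assert np.all(A >= 0.0), "attention weights must be non-negative"
    assert np.allclose(A.sum(axis=1), 1.0), "each row of A must sum to 1"
    return A
```

A call such as `check_attention_properties(single_head_attention)` would run the checks against your implementation.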

Block C: verify

In verify.py:

Set up the toy inputs and weights exactly as above.
Compute Q, K, V using NumPy.
Call single_head_attention(Q, K, V).
Compare element-wise to your paper numbers. The max element-wise difference must be < 1e-5.
Print both matrices side-by-side, with the per-element diff. Capture to verify_output.txt.
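
One possible shape for the comparison step is sketched below. `report_diff` is a hypothetical helper name, and it prints the two matrices entry by entry rather than as literal side-by-side grids. The numbers are placeholders and deliberately not the toy example's answer.

```python
import numpy as np


def report_diff(paper, code, tol=1e-5):
    """Print paper vs. code values entry by entry and return the max abs diff."""
    paper = np.asarray(paper, dtype=np.float64)
    code = np.asarray(code, dtype=np.float64)
    assert paper.shape == code.shape, "shapes must match"
    diff = np.abs(paper - code)
    print(f"{'index':>8} {'paper':>12} {'code':>12} {'abs diff':>12}")
    for idx in np.ndindex(paper.shape):
        print(f"{str(idx):>8} {paper[idx]:12.8f} {code[idx]:12.8f} {diff[idx]:12.2e}")
    max_diff = float(diff.max())
    print(f"max abs diff = {max_diff:.2e} (threshold {tol:.0e})")
    assert max_diff < tol, f"max abs diff {max_diff:.2e} exceeds {tol:.0e}"
    return max_diff


# Placeholder numbers only; these are NOT the toy example's answer.
paper_numbers = [[0.25, 0.75], [0.6, 0.4]]
code_numbers = np.array([[0.25, 0.75], [0.6, 0.4]]) + 1e-8
max_diff = report_diff(paper_numbers, code_numbers)
```

Redirecting the script's standard output to a file, for example `python verify.py > verify_output.txt`, is one way to capture the printout.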

Block D: explore what happens without the scaling

The variance argument in theory/02-scaled-dot-product.md says we divide by \(\sqrt{d_k}\) to prevent softmax saturation when \(d_k\) is large. Let's see it.

In verify.py, rerun the computation with \(d_k = 64\) instead of \(d_k = 2\): draw the entries of \(X\) i.i.d. from \(\mathcal{N}(0, 1)\) and use random orthogonal matrices for \(W_Q\), \(W_K\), and \(W_V\). Run it twice: once with the \(/\sqrt{d_k}\) scaling, once without.
For each, print the attention matrix A. The max entry of \(A\) per row should be:
Scaled: typically well below 1.0 and closer to the uniform value \(1/T\), because the scaled scores have roughly unit variance. Softmax is doing its job.
Unscaled: very close to 1.0 — one position dominates, softmax has saturated.
Confirm this in the printout. Note it in README.md.
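
The setup for this block can be sketched as follows. Random orthogonal weights are drawn here via the QR factorization of a Gaussian matrix, which is one common way to do it and not necessarily what the reference solution uses. `random_orthogonal` and `row_max` are hypothetical helper names, and `T = 8` is an assumption because the block does not fix the number of tokens. The sketch stops at the raw scores so the softmax comparison is left to you.

```python
import numpy as np


def random_orthogonal(d, rng):
    """Return a d x d orthogonal matrix from the QR factorization of a Gaussian matrix."""
    G = rng.standard_normal((d, d))
    Qmat, R = np.linalg.qr(G)
    # Fix column signs so the result does not depend on the QR sign convention.
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Qmat * signs


def row_max(A):
    """Largest attention weight in each row of A."""
    return np.asarray(A).max(axis=1)


rng = np.random.default_rng(42)  # seed 42 matches the manifest template
d = 64
T = 8  # assumption: the lab does not fix T for this block
X = rng.standard_normal((T, d))
W_Q = random_orthogonal(d, rng)
W_K = random_orthogonal(d, rng)
Q, K = X @ W_Q, X @ W_K
S = Q @ K.T
print("std of unscaled scores:", S.std())
print("std of scaled scores:  ", (S / np.sqrt(d)).std())
```

Applying your stable softmax to `S` and to `S / np.sqrt(d)`, then passing each result to `row_max`, gives the two numbers the manifest asks for.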

Block E: manifest

{
  "experiment": "15-attention-by-hand",
  "date": "YYYY-MM-DD",
  "seed": 42,
  "versions": { "python": "3.11.x", "numpy": "X.Y.Z" },
  "results_summary": {
    "max_abs_diff_paper_vs_code": null,
    "softmax_max_entry_scaled_d_k_64": null,
    "softmax_max_entry_unscaled_d_k_64": null
  }
}
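
Rather than filling in the template by hand, the date and version fields can be detected at run time. The sketch below assumes you already hold the three measured values in variables; `build_manifest` is a hypothetical helper name, and the zeros passed in are placeholders.

```python
import datetime
import json
import platform

import numpy as np


def build_manifest(max_abs_diff, max_scaled, max_unscaled, seed=42):
    """Fill the manifest template with measured values and detected versions."""
    return {
        "experiment": "15-attention-by-hand",
        "date": datetime.date.today().isoformat(),  # YYYY-MM-DD
        "seed": seed,
        "versions": {
            "python": platform.python_version(),
            "numpy": np.__version__,
        },
        "results_summary": {
            "max_abs_diff_paper_vs_code": float(max_abs_diff),
            "softmax_max_entry_scaled_d_k_64": float(max_scaled),
            "softmax_max_entry_unscaled_d_k_64": float(max_unscaled),
        },
    }


# Placeholder values; substitute the numbers your own run produced.
manifest = build_manifest(max_abs_diff=0.0, max_scaled=0.0, max_unscaled=0.0)
print(json.dumps(manifest, indent=2))
```

Writing the same string to manifest.json inside the experiment directory completes the block.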

Constraints

No PyTorch. (Anti-goal §10.)
Paper first, code second. If you write the NumPy first and then "derive" the paper version, you've defeated the lab. The point is to know what answer the math predicts before running anything.
Stable softmax. Use max-subtraction. No naive exp(x) / sum(exp(x)).
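
To see why the naive form is ruled out, the sketch below compares it with max-subtraction on scores large enough to overflow. `softmax_rows` is a hypothetical name used only here; it is not the repo's `softmax_stable` helper, whose signature may differ. The scores are illustrative and unrelated to the toy example.

```python
import numpy as np


def softmax_rows(z):
    """Row-wise softmax with max-subtraction (hypothetical name, not the repo helper)."""
    z = np.asarray(z, dtype=np.float64)
    shifted = z - z.max(axis=-1, keepdims=True)  # largest entry per row becomes 0
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


scores = np.array([[1000.0, 1001.0], [3.0, 3.0]])

# Naive form: exp(1000) overflows to inf, and inf / inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

stable = softmax_rows(scores)
print("naive :", naive)
print("stable:", stable)
```

The first row of the naive result is nan, while the stable version returns finite weights that sum to 1. The second row, with equal scores, comes out as an even split in both versions.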

Stop conditions

Done when:

All six files committed.
max_abs_diff_paper_vs_code < 1e-5.
The unscaled \(d_k = 64\) case clearly shows softmax saturation (max entry per row > 0.95).
README.md describes both findings in 2–3 sentences each.

Pitfalls

Softmax of \((s_0, s_1)\) at \(s_0 = s_1\) should give \((0.5, 0.5)\), not \((1, 0)\). Sanity-check by hand.
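np.sqrt(d_k) returns a scalar (np.float64, which also behaves as a Python float); you can divide a NumPy matrix by it directly. Don't wrap the scalar in an array.
Numerical precision. In fp32 your max diff might be 1e-7 or 1e-6 depending on the order of operations; in NumPy's default float64 it is usually limited by how many decimals you carried on paper. 1e-5 is the locked threshold.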
W_K is not transposed at the weight level. Don't try to "fix" the asymmetry by transposing — the asymmetry is the point (see theory/01-query-key-value.md).
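
The scalar-division and precision pitfalls can be checked quickly in an interpreter. This is a minimal sketch with a placeholder matrix, not the toy scores.

```python
import numpy as np

d_k = 2
scale = np.sqrt(d_k)
print(type(scale))               # a NumPy scalar type, not an ndarray
print(isinstance(scale, float))  # np.float64 also subclasses Python float

M = np.array([[1.0, 2.0], [3.0, 4.0]])  # placeholder matrix, not the toy scores
scaled = M / scale                      # broadcasting divides every entry
print(scaled.shape)

# Machine epsilon indicates the rounding scale of each dtype:
# fp32 sits near 1e-7, float64 is many orders of magnitude smaller.
print(np.finfo(np.float32).eps)
print(np.finfo(np.float64).eps)
```

Both epsilons sit well under the 1e-5 threshold, which leaves room for a few accumulated rounding errors.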

When to consult `solutions/`

Consult it only after all six files are committed and the assertions pass. The solution is at solutions/00-attention-by-hand-ref.md.

Next lab: 01-multi-head-attention.md.