Lab 00 — Permutation Equivariance, in Numbers¶
Goal: demonstrate empirically that attention without positional information is permutation-equivariant. Show that adding sinusoidal PE breaks this property.
Estimated time: 30–45 minutes.
Prereq: theory/00-motivation.md read. Phase 15's src/minimodel/attention/ exists.
What you produce¶
A directory experiments/16-permutation-equivariance/ containing:
- demo.py: script that runs attention with and without PE on a 3-token sequence and shows what happens under permutation.
- demo_output.txt: captured printout.
- manifest.json.
- README.md.
TODOs¶
Block A — set up¶
- Use MultiHeadAttention(d_model=8, n_heads=1, seed=0) from Phase 15.
- Input: 3 tokens forming a verb-grammar fragment: he, work, I (token IDs from your Phase 14 tokenizer; embed via the Phase 13 embedding). Stack as \(X \in \mathbb{R}^{3 \times 8}\).
- Linguistic motivation: "he work I" is ungrammatical; "I work he" is also ungrammatical; only with positional info can the model prefer one ordering over another (or rather: prefer "he works" over "works he" later in Phase 18). Without PE, all orderings look the same.
Block B — without PE¶
- Compute Y = mha.forward(X, mask=None).
- Permute the input: X_perm = X[[2, 0, 1]] (a cyclic reordering: rows 2, 0, 1 of X, in that order).
- Compute Y_perm = mha.forward(X_perm, mask=None).
- Assert: np.allclose(Y_perm, Y[[2, 0, 1]], atol=1e-6).
This demonstrates equivariance: the attention output on the permuted input equals the same permutation applied to the attention output on the original input. Nothing in the output reveals the order in which the tokens arrived.
- Print Y, Y[[2, 0, 1]], and Y_perm side-by-side. Verify visually.
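The Block B check can be tried in isolation before wiring in the Phase 15 class. The sketch below is a minimal stand-in, not the course's implementation: `toy_attention`, the random weight matrices, and the random `X` are all hypothetical substitutes for `MultiHeadAttention` and the embedded tokens.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def toy_attention(X, Wq, Wk, Wv):
    # Hypothetical single-head, unmasked attention (stand-in for mha.forward).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
T, d_model = 3, 8
X = rng.normal(size=(T, d_model))  # random stand-in for the embedded tokens
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
              for _ in range(3))

perm = [2, 0, 1]
Y = toy_attention(X, Wq, Wk, Wv)
Y_perm = toy_attention(X[perm], Wq, Wk, Wv)

# Equivariance: permuting the input permutes the output rows the same way.
max_diff_no_pe = np.abs(Y_perm - Y[perm]).max()
assert np.allclose(Y_perm, Y[perm], atol=1e-6)
print(f"max |Y_perm - Y[perm]| = {max_diff_no_pe:.2e}")
```

Any residual difference comes only from floating-point summation order, which is why the tolerance is tight.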
Block C — with PE¶
- Use sinusoidal PE: pe = sinusoidal_pe(3, 8) (from Phase 16's src/minimodel/positional/sinusoidal.py).
- Compute Y_pe = mha.forward(X + pe).
- Compute Y_perm_pe = mha.forward(X_perm + pe).
- Assert: not np.allclose(Y_perm_pe, Y_pe[[2, 0, 1]], atol=1e-3).
This shows that, with PE, the model does distinguish the permutation: the output is no longer just a reordering of the original output.
- Print the diff matrix Y_perm_pe - Y_pe[[2, 0, 1]] (it should have non-trivial entries: the PE has broken equivariance).
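The Block C check can be sketched the same way. Everything here is an assumption for illustration: `toy_attention` and `toy_sinusoidal_pe` are hypothetical stand-ins for the Phase 15 and Phase 16 code, using the standard sin/cos formula with base 10000, and `X` is random rather than embedded tokens.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def toy_attention(X, Wq, Wk, Wv):
    # Hypothetical single-head, unmasked attention (stand-in for mha.forward).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def toy_sinusoidal_pe(T, d_model):
    # Stand-in for sinusoidal_pe: sin on even columns, cos on odd columns.
    pos = np.arange(T)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000.0 ** (i / d_model))
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
T, d_model = 3, 8
X = rng.normal(size=(T, d_model))  # random stand-in for the embedded tokens
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
              for _ in range(3))
pe = toy_sinusoidal_pe(T, d_model)

perm = [2, 0, 1]
Y_pe = toy_attention(X + pe, Wq, Wk, Wv)
Y_perm_pe = toy_attention(X[perm] + pe, Wq, Wk, Wv)

# PE is added after permuting, so each token meets a different PE row.
diff = Y_perm_pe - Y_pe[perm]
assert not np.allclose(Y_perm_pe, Y_pe[perm], atol=1e-3)
print(np.round(diff, 3))
```

The key detail is that `pe` is indexed by position, not by token: `X[perm] + pe` and `(X + pe)[perm]` are different inputs.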
Block D — interpret¶
In README.md (1–2 paragraphs), answer:
- Why does the without-PE test pass? State the permutation-equivariance theorem in your own words and reference the 3-token example.
- Why does the with-PE output differ from a plain reordering? The PE rows differ by position, so adding them after permuting gives each token a position-specific signature that differs from the one it carried in the original order.
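As a starting point for the README, one way to write the argument is sketched below for a single head with no mask. The notation is assumed here rather than taken from the lab code: \(\Pi\) is the permutation matrix with \(X_{\text{perm}} = \Pi X\), \(W_Q, W_K, W_V\) are the projection matrices, \(d_k\) is the key dimension, and \(E\) is the matrix of positional encodings.

\[
\begin{aligned}
S &= \frac{X W_Q W_K^\top X^\top}{\sqrt{d_k}}, \qquad \mathrm{Attn}(X) = \mathrm{softmax}(S)\, X W_V \\
\mathrm{Attn}(\Pi X) &= \mathrm{softmax}\!\left(\Pi S \Pi^\top\right) \Pi X W_V \\
&= \Pi\, \mathrm{softmax}(S)\, \Pi^\top \Pi\, X W_V \\
&= \Pi\, \mathrm{Attn}(X)
\end{aligned}
\]

The second step uses the fact that a row-wise softmax commutes with permuting rows and columns together, and the last step uses \(\Pi^\top \Pi = I\). An output projection, if present, multiplies on the right and does not change the argument. With positional encodings, Block C compares \(\mathrm{Attn}(\Pi X + E)\) against \(\Pi\, \mathrm{Attn}(X + E) = \mathrm{Attn}(\Pi X + \Pi E)\); the two inputs are equal only if \(\Pi E = E\), which fails when the rows of \(E\) differ by position.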
Block E — manifest¶
{
"experiment": "16-permutation-equivariance",
"date": "YYYY-MM-DD",
"seed": 0,
"versions": { "python": "3.11.x", "numpy": "X.Y.Z" },
"config": {
"d_model": 8,
"n_heads": 1,
"T": 3,
"pe_scheme_compared": "sinusoidal"
},
"results_summary": {
"without_PE_equivariance_max_diff": null,
"with_PE_equivariance_max_diff": null
}
}
The without-PE diff should be < 1e-6. The with-PE diff should be > 1e-3.
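A possible way to fill the two null fields is to take the maximum absolute entry of each difference matrix and write it into the manifest. The helper names `max_equivariance_diff` and `build_manifest` are hypothetical, and the arrays below are dummy placeholders that only exercise the code; they are not lab results.

```python
import json
import platform
from datetime import date

import numpy as np

def max_equivariance_diff(Y_perm, Y, perm):
    # Largest absolute entry of Y_perm - Y[perm], as a plain float for JSON.
    return float(np.abs(np.asarray(Y_perm) - np.asarray(Y)[perm]).max())

def build_manifest(no_pe_diff, pe_diff):
    # Mirrors the manifest layout above, with versions read at run time.
    return {
        "experiment": "16-permutation-equivariance",
        "date": date.today().isoformat(),
        "seed": 0,
        "versions": {
            "python": platform.python_version(),
            "numpy": np.__version__,
        },
        "config": {
            "d_model": 8,
            "n_heads": 1,
            "T": 3,
            "pe_scheme_compared": "sinusoidal",
        },
        "results_summary": {
            "without_PE_equivariance_max_diff": no_pe_diff,
            "with_PE_equivariance_max_diff": pe_diff,
        },
    }

# Dummy placeholder arrays; in demo.py, pass the real Y, Y_perm, Y_pe, Y_perm_pe.
perm = [2, 0, 1]
Y = np.arange(24.0).reshape(3, 8)
no_pe_diff = max_equivariance_diff(Y[perm], Y, perm)        # identical -> 0.0
pe_diff = max_equivariance_diff(Y[perm] + 0.5, Y, perm)     # offset -> 0.5

manifest = build_manifest(no_pe_diff, pe_diff)
print(json.dumps(manifest, indent=2))
# To save: open("manifest.json", "w").write(json.dumps(manifest, indent=2))
```

Converting to `float` matters because NumPy scalar types are not serializable by the standard `json` module.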
Constraints¶
- No new code in src/. Use the existing MultiHeadAttention and sinusoidal_pe. This lab is a demonstration, not an implementation lab.
- Seeded. Reproducible.
Stop conditions¶
Done when:
- All four files committed.
- Both assertions pass (one for equivariance without PE, one for non-equivariance with PE).
- README.md explains the result.
Pitfalls¶
- Permutation index confusion. X[[2, 0, 1]] means "take row 2, row 0, row 1 in that order". Confirm by printing X and X[[2, 0, 1]] to make sure you understand.
- Tolerance. Use 1e-6 for without-PE and 1e-3 for with-PE (the PE values are O(1), so the diff after attention is non-trivial).
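The indexing pitfall is easy to check on a tiny array where each row is labeled with its own index. This is a small illustrative sketch; the array `X` here is a made-up example, not the lab input.

```python
import numpy as np

# Row i is filled with the value i, so the row order is easy to read off.
X = np.array([[0, 0], [1, 1], [2, 2]])
perm = [2, 0, 1]

X_perm = X[perm]      # new row 0 is old row 2, new row 1 is old row 0, ...
print(X_perm[:, 0])   # → [2 0 1]

# The inverse permutation undoes the reordering.
inv = np.argsort(perm)
print(inv)            # → [1 2 0]
assert np.array_equal(X_perm[inv], X)
```

Note that [2, 0, 1] is a cyclic shift of all three rows, not a swap of two, so applying it twice does not restore the original order.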
When to consult solutions/¶
After all four files are committed. Solution at solutions/00-permutation-equivariance-ref.md.
Next lab: 01-sinusoidal-pe.md.