Lab 03 — Extrapolation Compare: Sinusoidal vs Learned vs RoPE¶
Goal: compare the three PE schemes at sequence lengths beyond their (notional) training length. Identify the winner for Phase 17.
Estimated time: 60–90 minutes.
Prereq: labs 00, 01, 02 committed.
What you produce¶
A directory experiments/16-extrapolation/ containing:
- compare.py: script that runs attention with each PE scheme at \(T \in \{64, 96, 128, 192, 256\}\).
- attention_pattern_comparison.png: 3-row, 5-column grid of attention heatmaps.
- extrapolation_metric.png: entropy of attention vs \(T\) for each scheme.
- manifest.json.
- README.md.
Background¶
In a real training setup, you train at one context length (say \(T = 64\)) and test extrapolation by running inference at \(T > 64\). Phase 16 doesn't train, so we use untrained models and simulate "trained at T=64" by:
- Learned PE: allocate a (64, d) learned matrix. For \(T > 64\), no extension is possible: the model literally has no embedding for positions 64+. We'll handle this by either (a) wrapping around (pos % 64) or (b) using the position-63 embedding for all positions ≥ 64. Document the choice.
- Sinusoidal PE: works for any T, because it's a formula.
- RoPE: works for any T, because it's a formula.
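The two workarounds for learned PE differ only in how a position is mapped to a row of the table. The sketch below makes that mapping explicit. The function name `position_indices` and the random `table` are hypothetical stand-ins, not the contents of the lab's learned-PE module.

```python
import numpy as np

def position_indices(T, T_max=64, mode="clamp"):
    """Map positions 0..T-1 onto valid rows of a (T_max, d) table."""
    pos = np.arange(T)
    if mode == "clamp":  # option (b): reuse the last embedding for pos >= T_max
        return np.minimum(pos, T_max - 1)
    if mode == "wrap":   # option (a): pos % T_max
        return pos % T_max
    raise ValueError(f"unknown mode: {mode!r}")

# Untrained stand-in for the learned matrix (assumption: d = 32).
rng = np.random.default_rng(0)
table = rng.normal(size=(64, 32))

pe_clamp = table[position_indices(96, mode="clamp")]  # rows 64..95 equal row 63
pe_wrap = table[position_indices(96, mode="wrap")]    # rows 64..95 equal rows 0..31
```

Either way the result has shape (T, d). What changes is which in-range positions the out-of-range ones become indistinguishable from.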
The "extrapolation" question becomes: how does the attention pattern change as we increase \(T\) beyond what's likely to have been seen? Are nearby positions still attended to? Do far positions become noise?
TODOs¶
Block A — set up¶
- Use Phase 15's MultiHeadAttention(d_model=32, n_heads=2, seed=0). Single layer, no training.
- Use a fixed input embedding: X[t] = e_t where e_t ~ N(0, I), seeded.
- Three PE schemes:
  - Sinusoidal: from src/minimodel/positional/sinusoidal.py.
  - Learned: from src/minimodel/positional/learned.py with T_max=64. For positions ≥ 64, use the position-63 embedding (document this).
  - RoPE: integrated into the attention forward (modify MultiHeadAttention for this experiment, or wrap).
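As a reference point for the setup, here is a minimal sketch of the seeded input and an additive sinusoidal PE. It assumes the standard sine/cosine formula with base 10000; the actual code in src/minimodel/positional/sinusoidal.py may be organized differently, and `make_inputs` is a hypothetical helper name.

```python
import numpy as np

def sinusoidal_pe(T, d_model, base=10000.0):
    """Sinusoidal table of shape (T, d_model); d_model must be even."""
    pos = np.arange(T)[:, None]             # (T, 1)
    i = np.arange(0, d_model, 2)[None, :]   # (1, d_model / 2)
    angles = pos / base ** (i / d_model)    # (T, d_model / 2)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)            # even feature indices
    pe[:, 1::2] = np.cos(angles)            # odd feature indices
    return pe

def make_inputs(T, d_model=32, seed=0):
    """Fixed input embedding: each row e_t ~ N(0, I), seeded."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(T, d_model))

X = make_inputs(96) + sinusoidal_pe(96, 32)
```

Because the table is a closed-form function of the position, the first 64 rows at T = 96 are identical to the full table at T = 64. That is the sense in which the scheme "works for any T".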
Block B — sweep T and record attention patterns¶
- For each \(T \in \{64, 96, 128, 192, 256\}\) and each PE scheme:
- Generate \(X \in \mathbb{R}^{T \times 32}\).
- Apply PE (or RoPE inside attention).
- Run forward, capture the attention matrix for head 0.
- Compute the entropy of each row: \(H_i = -\sum_j A_{ij} \log A_{ij}\).
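The sweep can be sketched as follows. Phase 15's MultiHeadAttention is not reproduced here, so a hypothetical single-head `attention_weights` with random projections stands in for it (assumption: head dimension 16, no masking). Only `row_entropy` corresponds directly to the formula above, with \(0 \log 0\) treated as 0.

```python
import numpy as np

def attention_weights(X, W_q, W_k):
    """Row-wise softmax of scaled dot-product scores (stand-in for one head)."""
    Q, K = X @ W_q, X @ W_k
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

def row_entropy(A):
    """H_i = -sum_j A_ij log A_ij, with zero entries contributing 0."""
    safe = np.where(A > 0, A, 1.0)  # log(1) = 0 where A_ij == 0
    return -(A * np.log(safe)).sum(axis=-1)

rng = np.random.default_rng(0)
d_model, d_head = 32, 16
W_q = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
W_k = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)

entropies = {}
for T in (64, 96, 128, 192, 256):
    X = rng.normal(size=(T, d_model))  # the PE would be added here
    A = attention_weights(X, W_q, W_k)
    entropies[T] = row_entropy(A)      # one value per query row
```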
Block C — plot 1: attention pattern grid¶
- 3 rows (one per PE scheme), 5 columns (one per \(T\)).
- Each cell is the \(T \times T\) attention heatmap.
- Annotate: rows = PE scheme, columns = \(T\).
- Save as attention_pattern_comparison.png.
Expected observations:
- Sinusoidal: attention patterns look reasonable up to \(T = 128\) and become noisier at \(T = 256\), where position offsets fall far outside the notional \(T = 64\) training range.
- Learned: at \(T = 64\) patterns are valid; at \(T > 64\) the model attends oddly (rows 64+ all look like "row 63" because every position ≥ 64 reuses the position-63 embedding).
- RoPE: patterns scale gracefully with \(T\) — no visible degradation.
Block D — plot 2: average attention entropy¶
- For each \(T\) and each PE scheme, compute mean attention entropy \(\bar{H}(T) = \frac{1}{T} \sum_i H_i\).
- Plot \(\bar{H}(T)\) vs \(T\). Three curves.
- Save as extrapolation_metric.png.
Lower entropy means more peaked attention (the model is confident about which positions to attend to). Higher entropy means attention closer to uniform: the model is "lost".
Expected:
- RoPE: roughly constant entropy with \(T\) (graceful extrapolation).
- Sinusoidal: entropy rises with \(T\) (attention degrades).
- Learned: entropy spikes for \(T > 64\) (broken).
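To put a scale on these entropy readings: each row of \(A\) is a probability distribution over \(T\) positions, so with natural logarithms its entropy is bounded as shown below. The normalized quantity \(\tilde{H}\) is not one of the lab's deliverables; it is offered only as an optional aid when comparing different \(T\).

```latex
0 \le H_i \le \log T,
\qquad
H_i = \log T \iff A_{ij} = \tfrac{1}{T} \ \text{for all } j,
\qquad
\tilde{H}(T) = \frac{\bar{H}(T)}{\log T} \in [0, 1].
```

For example, perfectly uniform attention gives \(\bar{H} = \log 64 \approx 4.16\) at \(T = 64\) and \(\log 256 \approx 5.55\) at \(T = 256\), so part of any rise in \(\bar{H}\) with \(T\) comes from the longer sequence alone.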
Block E — write up¶
In README.md, answer:
- Which scheme has the most stable attention patterns across \(T\)? RoPE — argue with reference to plots.
- Where does learned PE fail catastrophically? At \(T > T_\text{max}\).
- Recommendation for Phase 17. Based on these results, recommend RoPE for Phase 17's Mini-GPT. If RoPE proves too complex, fall back to sinusoidal. Do not use learned PE.
Block F — manifest¶
{
  "experiment": "16-extrapolation",
  "date": "YYYY-MM-DD",
  "seed": 0,
  "versions": { "python": "3.11.x", "numpy": "X.Y.Z", "matplotlib": "X.Y.Z" },
  "config": {
    "d_model": 32,
    "n_heads": 2,
    "T_sweep": [64, 96, 128, 192, 256],
    "learned_T_max": 64,
    "trained": false
  },
  "schemes_compared": ["sinusoidal", "learned", "rope"],
  "phase_17_recommendation": "rope_with_sinusoidal_fallback"
}
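One way to fill in the date and version placeholders is to generate the manifest from the running environment rather than typing it by hand. The sketch below uses only the standard library; `build_manifest`, `write_manifest`, and `pkg_version` are hypothetical helper names, and the recommendation string is copied from the template above.

```python
import json
import platform
from datetime import date
from importlib.metadata import PackageNotFoundError, version

def pkg_version(name):
    """Installed version of a package, or a marker if it is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return "not installed"

def build_manifest(seed=0):
    """Manifest dict with the same keys as the template above."""
    return {
        "experiment": "16-extrapolation",
        "date": date.today().isoformat(),  # YYYY-MM-DD
        "seed": seed,
        "versions": {
            "python": platform.python_version(),
            "numpy": pkg_version("numpy"),
            "matplotlib": pkg_version("matplotlib"),
        },
        "config": {
            "d_model": 32,
            "n_heads": 2,
            "T_sweep": [64, 96, 128, 192, 256],
            "learned_T_max": 64,
            "trained": False,
        },
        "schemes_compared": ["sinusoidal", "learned", "rope"],
        "phase_17_recommendation": "rope_with_sinusoidal_fallback",
    }

def write_manifest(path="manifest.json", seed=0):
    """Serialize the manifest as indented JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_manifest(seed), f, indent=2)
```

Calling `write_manifest()` at the end of compare.py keeps the recorded versions in step with the environment that actually produced the plots.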
Constraints¶
- No training. This is a comparison of an architectural property, not learned behavior. With random weights, all three schemes produce noisy attention; the structure is what differs.
- No PyTorch.
- Document the learned-PE-beyond-T_max handling clearly. Some readers will mistake the workaround (reusing position 63, or wrap-around) for the learned-PE limitation itself. Clarify that the limitation is even worse without a workaround: positions ≥ T_max have no embedding at all, so the result is silently undefined.
Stop conditions¶
Done when:
- All files listed under "What you produce" committed.
- Both plots saved.
- README.md makes the Phase 17 recommendation explicitly.
Pitfalls¶
- No training means no real "PE comparison". This lab compares architectural stability, not learned performance. The trained comparison happens implicitly in Phase 17/18. Be honest about this in README.md.
- Entropy can be misleading. A perfectly uniform attention has high entropy and the model is "lost". A degenerate attention (one entry near 1) has low entropy and the model is "confident but possibly wrong". Use entropy as a shape indicator, not a quality measure.
- mha.forward doesn't currently accept a RoPE option. For this lab, either monkey-patch the forward or wrap it in a helper. Document the hack. The clean integration with MultiHeadAttention happens in Phase 17.
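For the helper route in the last pitfall, a minimal sketch is shown below. It assumes the common RoPE formulation that rotates consecutive feature pairs with base 10000. `apply_rope` and `rope_attention` are hypothetical names that operate on one head's query and key matrices, and they do not reflect the real interface of mha.forward.

```python
import numpy as np

def apply_rope(x, base=10000.0):
    """Rotate feature pairs (2c, 2c+1) of x, shape (T, d), by pos * freq_c."""
    T, d = x.shape
    if d % 2 != 0:
        raise ValueError("head dimension must be even")
    pos = np.arange(T)[:, None]                # (T, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)  # (d / 2,)
    angles = pos * freqs[None, :]              # (T, d / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def rope_attention(Q, K):
    """Single-head attention weights with RoPE applied to Q and K."""
    d = Q.shape[-1]
    scores = apply_rope(Q) @ apply_rope(K).T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)
```

The rotation touches only queries and keys, never the values, and the score between positions m and n then depends on the offset m - n instead of on m and n separately. No position table is involved, so the helper accepts any T.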
When to consult solutions/¶
After all deliverables are committed and the Phase 17 recommendation is made. The solution is at solutions/03-extrapolation-ref.md.
End of Phase 16 labs. Write PHASE_16_REPORT.md (include the Phase 17 PE recommendation prominently) and fill in learners/borja/phase-16/reflections.md.
The next phase assembles all of this into a working transformer block. The PE choice is locked in at Phase 17, so make it carefully.