Lab 03 - Extrapolation Compare: Sinusoidal vs Learned vs RoPE

Goal: compare the three PE schemes at sequence lengths beyond their (notional) training length, and identify the winner for Phase 17.

Estimated time: 60–90 minutes.

Prereq: labs 00, 01, 02 committed.

What you produce

A directory experiments/16-extrapolation/ containing:

compare.py — script that runs attention with each PE scheme at \(T \in \{64, 96, 128, 192, 256\}\).
attention_pattern_comparison.png — 3-row, N-column grid of attention heatmaps.
extrapolation_metric.png — entropy-of-attention vs \(T\) for each scheme.
manifest.json.
README.md.

Background

In a real training setup, you train at one context length (say \(T = 64\)) and test extrapolation by running inference at \(T > 64\). Phase 16 doesn't train, so we use untrained models and simulate "trained at T=64" by:

Learned PE: allocate a (64, d) learned matrix. For \(T > 64\), no extension is possible: the model literally has no embedding for positions 64 and above. The two workarounds are (a) wrapping around (pos % 64) or (b) reusing the position-63 embedding for all positions ≥ 64. This lab uses (b), as specified in Block A. Document the choice.
Sinusoidal PE: works for any T — it's a formula.
RoPE: works for any T — it's a formula.

The "extrapolation" question becomes: how does the attention pattern change as we increase \(T\) beyond what's likely to have been seen? Are nearby positions still attended to? Do far positions become noise?
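
The two learned-PE workarounds described above come down to how the position index is mapped back into the table. A minimal sketch, assuming the learned matrix is a plain NumPy array; `learned_pe_lookup` and its `mode` argument are hypothetical names, not part of src/minimodel:

```python
import numpy as np

T_MAX = 64  # notional training length from the lab text


def learned_pe_lookup(table, T, mode="clamp"):
    """Return a (T, d) array of position embeddings from a (T_max, d) table.

    mode="clamp": positions >= T_max reuse the last row (option b).
    mode="wrap":  positions are taken modulo T_max (option a).
    """
    t_max = table.shape[0]
    pos = np.arange(T)
    if mode == "clamp":
        idx = np.minimum(pos, t_max - 1)
    elif mode == "wrap":
        idx = pos % t_max
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return table[idx]


rng = np.random.default_rng(0)
table = rng.normal(size=(T_MAX, 32))  # untrained stand-in for the learned matrix
pe_clamp = learned_pe_lookup(table, 96, mode="clamp")
pe_wrap = learned_pe_lookup(table, 96, mode="wrap")
```

With clamping, position 70 receives row 63 of the table. With wrapping, it receives row 6. Neither is a real embedding for position 70, which is the point to make in README.md.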

TODOs

Block A - set up

Use Phase 15's MultiHeadAttention(d_model=32, n_heads=2, seed=0). Single layer, no training.
Use a fixed input embedding: X[t] = e_t where e_t ~ N(0, I), seeded.
Three PE schemes:
Sinusoidal: from src/minimodel/positional/sinusoidal.py.
Learned: from src/minimodel/positional/learned.py with T_max=64. For positions ≥ 64, use the position-63 embedding (document this).
RoPE: integrated into the attention forward (modify MultiHeadAttention for this experiment, or wrap).
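
One way to build the fixed, seeded input is sketched below. Drawing a single matrix at the largest \(T\) and slicing it is an assumption of this sketch (the lab only requires that the input be seeded), and `make_inputs` is a hypothetical helper name:

```python
import numpy as np

D_MODEL = 32
T_SWEEP = [64, 96, 128, 192, 256]


def make_inputs(t_sweep, d_model, seed=0):
    """Draw e_t ~ N(0, I) once at the largest T, then slice a prefix per T."""
    rng = np.random.default_rng(seed)
    full = rng.normal(size=(max(t_sweep), d_model))
    return {T: full[:T].copy() for T in t_sweep}


inputs = make_inputs(T_SWEEP, D_MODEL, seed=0)
```

Because every shorter input is a prefix of the longer ones, differences between columns of the sweep come from \(T\) and the PE scheme rather than from fresh noise.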

Block B - sweep T and record attention patterns

For each \(T \in \{64, 96, 128, 192, 256\}\) and each PE scheme:
Generate \(X \in \mathbb{R}^{T \times 32}\).
Apply PE (or RoPE inside attention).
Run forward, capture the attention matrix for head 0.
Compute the entropy of each row: \(H_i = -\sum_j A_{ij} \log A_{ij}\).
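
The row entropy above can be computed as in the following sketch. It assumes the attention matrix is a NumPy array whose rows sum to 1. The `eps` clipping is an addition of this sketch so that zero entries (for example under a causal mask) contribute 0 instead of NaN:

```python
import numpy as np


def row_entropy(A, eps=1e-12):
    """Entropy of each row of an attention matrix A (rows sum to 1), in nats."""
    A = np.asarray(A, dtype=np.float64)
    # Clip inside the log only, so a zero entry contributes 0 * log(eps) = 0.
    return -np.sum(A * np.log(np.clip(A, eps, None)), axis=-1)


uniform = np.full((4, 4), 0.25)  # every row spreads attention evenly
one_hot = np.eye(4)              # every row attends to exactly one key
H_uniform = row_entropy(uniform)
H_one_hot = row_entropy(one_hot)
```

A uniform row over 4 keys gives \(\log 4\) and a one-hot row gives 0, which are the two extremes of the metric.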

Block C - plot 1: attention pattern grid

3 rows (one per PE scheme), 5 columns (one per \(T\)).
Each cell is the \(T \times T\) attention heatmap.
Annotate: rows = PE scheme, columns = \(T\).
Save as attention_pattern_comparison.png.
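
The grid described above could be drawn along these lines. This is a sketch, assuming the captured head-0 matrices are stored in a dict keyed by `(scheme, T)`. The dict layout and the name `plot_attention_grid` are hypothetical:

```python
import matplotlib

matplotlib.use("Agg")  # render to file without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_attention_grid(patterns, schemes, t_sweep, path=None):
    """Draw one heatmap per (scheme, T); patterns[(scheme, T)] is a (T, T) array."""
    fig, axes = plt.subplots(
        len(schemes), len(t_sweep),
        figsize=(3 * len(t_sweep), 3 * len(schemes)),
        squeeze=False,
    )
    for r, scheme in enumerate(schemes):
        for c, T in enumerate(t_sweep):
            ax = axes[r, c]
            ax.imshow(patterns[(scheme, T)], cmap="viridis", aspect="auto")
            ax.set_xticks([])
            ax.set_yticks([])
            if r == 0:
                ax.set_title(f"T = {T}")  # column annotation
            if c == 0:
                ax.set_ylabel(scheme)     # row annotation
    fig.tight_layout()
    if path is not None:
        fig.savefig(path, dpi=150)
    return fig, axes
```

Each cell uses its own color scale here. That is a deliberate choice in this sketch: attention weights shrink roughly as \(1/T\), so a shared scale would wash out the larger matrices. Call it with `path="attention_pattern_comparison.png"` to write the file.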

Expected observations:

Sinusoidal: attention patterns look reasonable up to \(T = 128\) and become noisier at \(T = 256\) (long-range structure beyond the notional training length of 64).
Learned: at \(T = 64\) patterns are valid; at \(T > 64\) the model attends weirdly (rows 64+ all carry the position-63 embedding, so they differ only through their token content).
RoPE: patterns scale gracefully with \(T\) — no visible degradation.

Block D - plot 2: average attention entropy

For each \(T\) and each PE scheme, compute mean attention entropy \(\bar{H}(T) = \frac{1}{T} \sum_i H_i\).
Plot \(\bar{H}(T)\) vs \(T\). Three curves.
Save as extrapolation_metric.png.

Lower entropy = more peaked attention (model is confident about who to attend to). Higher entropy = uniform = the model is "lost".
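
One caveat when reading the curves: for unmasked attention, a uniform row over \(T\) keys has entropy \(\log T\), so the ceiling itself grows with \(T\). The sketch below computes \(\bar{H}(T)\) and, as an optional addition that the lab does not require, a version normalized by \(\log T\). The function names are hypothetical:

```python
import numpy as np

T_SWEEP = [64, 96, 128, 192, 256]


def mean_entropy(A, eps=1e-12):
    """Mean row entropy of attention matrix A, in nats."""
    A = np.asarray(A, dtype=np.float64)
    H = -np.sum(A * np.log(np.clip(A, eps, None)), axis=-1)
    return float(H.mean())


def normalized_mean_entropy(A):
    """Mean entropy divided by log T, so 1.0 means uniform attention."""
    T = A.shape[-1]
    return mean_entropy(A) / np.log(T)


# Uniform (unmasked) attention reaches the ceiling log T at every T.
ceiling = {T: mean_entropy(np.full((T, T), 1.0 / T)) for T in T_SWEEP}
```

Plotting `ceiling` as a dashed reference line next to the three curves makes it easier to tell a real rise in entropy from the growth of the ceiling. If the Phase 15 attention is causal, row \(i\) can attend to only \(i + 1\) keys and its ceiling is \(\log(i + 1)\) instead.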

Expected:

RoPE: roughly constant entropy with \(T\) (graceful extrapolation).
Sinusoidal: entropy rises with \(T\) (attention degrades).
Learned: entropy spikes for \(T > 64\) (broken).

Block E - write up

In README.md, answer:

Which scheme has the most stable attention patterns across \(T\)? RoPE — argue with reference to plots.
Where does learned PE fail catastrophically? At \(T > T_\text{max}\).
Recommendation for Phase 17. Based on these results, recommend RoPE for Phase 17's Mini-GPT. If RoPE proves too complex, fall back to sinusoidal. Do not use learned PE.

Block F - manifest

{
  "experiment": "16-extrapolation",
  "date": "YYYY-MM-DD",
  "seed": 0,
  "versions": { "python": "3.11.x", "numpy": "X.Y.Z", "matplotlib": "X.Y.Z" },
  "config": {
    "d_model": 32,
    "n_heads": 2,
    "T_sweep": [64, 96, 128, 192, 256],
    "learned_T_max": 64,
    "trained": false
  },
  "schemes_compared": ["sinusoidal", "learned", "rope"],
  "phase_17_recommendation": "rope_with_sinusoidal_fallback"
}
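
The placeholder fields in the manifest above (date and versions) can be filled in programmatically instead of by hand. A sketch using only the standard library; `package_version` and `build_manifest` are hypothetical helper names:

```python
import json
import platform
from datetime import date
from importlib import metadata


def package_version(name):
    """Installed version of a package, or "unknown" if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "unknown"


def build_manifest():
    """Assemble the manifest dict with the fields shown in the lab text."""
    return {
        "experiment": "16-extrapolation",
        "date": date.today().isoformat(),
        "seed": 0,
        "versions": {
            "python": platform.python_version(),
            "numpy": package_version("numpy"),
            "matplotlib": package_version("matplotlib"),
        },
        "config": {
            "d_model": 32,
            "n_heads": 2,
            "T_sweep": [64, 96, 128, 192, 256],
            "learned_T_max": 64,
            "trained": False,
        },
        "schemes_compared": ["sinusoidal", "learned", "rope"],
        "phase_17_recommendation": "rope_with_sinusoidal_fallback",
    }


manifest_text = json.dumps(build_manifest(), indent=2)
```

Writing `manifest_text` to experiments/16-extrapolation/manifest.json at the end of compare.py keeps the recorded versions in sync with the environment that produced the plots.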

Constraints

No training. This is a comparison of an architectural property, not learned behavior. With random weights, all three schemes produce noisy attention; the structure is what differs.
No PyTorch.
Document the learned-PE-beyond-T_max handling clearly. Some readers will mistake the workaround itself (clamping to position 63, or wrap-around) for the "learned PE limitation". Clarify that the real limitation is worse: without a workaround, positions ≥ T_max have no embedding at all, so the lookup is undefined.

Stop conditions

Done when:

All five files listed under "What you produce" committed.
Both plots saved.
README.md makes the Phase 17 recommendation explicitly.

Pitfalls

No training means no real "PE comparison". This lab compares architectural stability, not learned performance. The trained comparison happens implicitly in Phase 17/18. Be honest about this in README.md.
Entropy can be misleading. A perfectly uniform attention has high entropy and the model is "lost". A degenerate attention (one entry near 1) has low entropy and the model is "confident but possibly wrong". Use entropy as a shape indicator, not a quality measure.
mha.forward doesn't currently accept a RoPE option. For this lab, either monkey-patch the forward or wrap it in a helper. Document the hack. The clean integration with MultiHeadAttention happens in Phase 17.
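
One shape the helper could take is sketched below: a standalone single-head attention that applies RoPE to queries and keys and bypasses `mha.forward` entirely. Everything here is an assumption of the sketch. The rotation pairs consecutive features with base 10000 (the course's RoPE may use a different layout), `W_q` and `W_k` are random stand-ins for head 0's projection weights from the Phase 15 object, and no causal mask is applied:

```python
import numpy as np


def rope(x, base=10000.0):
    """Rotate consecutive feature pairs of x (shape (T, d), d even) by position."""
    T, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)        # (d/2,) frequencies
    angles = np.arange(T)[:, None] * inv_freq[None, :]  # (T, d/2) angles
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


def softmax(z):
    """Row-wise softmax with the usual max subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def rope_attention_weights(X, W_q, W_k):
    """Single-head attention weights with RoPE on queries and keys only."""
    q = rope(X @ W_q)
    k = rope(X @ W_k)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores)


rng = np.random.default_rng(0)
d_model, d_head, T = 32, 16, 96  # d_head = d_model / n_heads
X = rng.normal(size=(T, d_model))
W_q = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
W_k = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
A = rope_attention_weights(X, W_q, W_k)
```

A useful sanity check for whichever hack is used: for a fixed query vector and key vector, the pre-softmax score should depend only on the offset between their positions, not on the absolute positions.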

When to consult `solutions/`

After all five files are committed and the Phase 17 recommendation is made. Solution at solutions/03-extrapolation-ref.md.

End of Phase 16 labs. Write PHASE_16_REPORT.md (include the Phase 17 PE recommendation prominently) and fill in learners/borja/phase-16/reflections.md.

The next phase assembles all of this into a working transformer block. The PE choice locks in Phase 17 — make it carefully.