Lab 03 — Extrapolation Compare: Sinusoidal vs Learned vs RoPE

Goal: compare the three PE schemes at sequence lengths beyond their (notional) training length, and identify the winner for Phase 17.

Estimated time: 60–90 minutes.

Prereq: labs 00, 01, 02 committed.


What you produce

A directory experiments/16-extrapolation/ containing:

  • compare.py — script that runs attention with each PE scheme at \(T \in \{64, 96, 128, 192, 256\}\).
  • attention_pattern_comparison.png — 3-row, N-column grid of attention heatmaps.
  • extrapolation_metric.png — entropy-of-attention vs \(T\) for each scheme.
  • manifest.json.
  • README.md.

Background

In a real training setup, you train at one context length (say \(T = 64\)) and test extrapolation by running inference at \(T > 64\). Phase 16 doesn't train, so we use untrained models and simulate "trained at T=64" by:

  • Learned PE: allocate a (64, d) learned matrix. For \(T > 64\), no extension is possible: the model literally has no embedding for positions 64 and above. The two usual workarounds are (a) wrapping around (pos % 64) or (b) reusing the position-63 embedding for all positions ≥ 64. This lab uses (b); document the choice.
  • Sinusoidal PE: works for any T — it's a formula.
  • RoPE: works for any T — it's a formula.
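The two learned-PE workarounds and the "it's a formula" property of sinusoidal PE can be sketched in a few lines of NumPy. This is a minimal illustration, not the contents of src/minimodel/positional/: the names `learned_position_ids` and `sinusoidal_pe` are hypothetical, the random table stands in for an untrained learned matrix, and the sinusoidal layout (sin on even columns, cos on odd columns, base 10000) is the common convention and is assumed here.

```python
import numpy as np

T_MAX = 64  # notional training length used in this lab


def learned_position_ids(T, t_max=T_MAX, mode="clamp"):
    """Map positions 0..T-1 onto rows of a (t_max, d) learned table."""
    pos = np.arange(T)
    if mode == "clamp":
        # Positions >= t_max all reuse the last trained row (t_max - 1).
        return np.minimum(pos, t_max - 1)
    if mode == "wrap":
        # Positions >= t_max cycle back to row 0.
        return pos % t_max
    raise ValueError(f"unknown mode: {mode}")


def sinusoidal_pe(T, d):
    """Sinusoidal table of shape (T, d); d is assumed even."""
    pos = np.arange(T)[:, None]
    freqs = 1.0 / (10000.0 ** (np.arange(0, d, 2) / d))
    angles = pos * freqs[None, :]
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


rng = np.random.default_rng(0)
table = rng.normal(size=(T_MAX, 32))            # untrained stand-in, shape (64, 32)
pe_learned = table[learned_position_ids(96)]    # shape (96, 32); rows 63..95 identical
pe_sin = sinusoidal_pe(96, 32)                  # shape (96, 32); every row distinct
```

With clamping, every position from 63 onward receives the same vector, so the model cannot tell them apart; with wrapping, position 64 is indistinguishable from position 0. Either way the table carries no information about positions it never allocated.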

The "extrapolation" question becomes: how does the attention pattern change as we increase \(T\) beyond what's likely to have been seen? Are nearby positions still attended to? Do far positions become noise?

TODOs

Block A — set up

  • Use Phase 15's MultiHeadAttention(d_model=32, n_heads=2, seed=0). Single layer, no training.
  • Use a fixed input embedding: X[t] = e_t where e_t ~ N(0, I), seeded.
  • Three PE schemes:
      • Sinusoidal: from src/minimodel/positional/sinusoidal.py.
      • Learned: from src/minimodel/positional/learned.py with T_max=64. For positions ≥ 64, use the position-63 embedding (document this).
      • RoPE: integrated into the attention forward (modify MultiHeadAttention for this experiment, or wrap).
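For the RoPE path, one option is to rotate the per-head queries and keys before the score computation. The sketch below is standalone and does not assume anything about Phase 15's MultiHeadAttention internals: `apply_rope` and `rope_attention_weights` are hypothetical helper names, the interleaved pairing of adjacent feature dimensions is one common convention among several, and no causal mask is applied.

```python
import numpy as np


def apply_rope(x, base=10000.0):
    """Rotate adjacent feature pairs of x, shape (T, d_head) with d_head even,
    by an angle proportional to the row's position."""
    T, d = x.shape
    pos = np.arange(T)[:, None]
    freqs = base ** (-np.arange(0, d, 2) / d)    # one frequency per pair, shape (d/2,)
    angles = pos * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x, dtype=float)
    out[:, 0::2] = x_even * cos - x_odd * sin    # 2-D rotation, first coordinate
    out[:, 1::2] = x_even * sin + x_odd * cos    # 2-D rotation, second coordinate
    return out


def rope_attention_weights(Q, K):
    """Softmax attention weights for one head with RoPE applied to Q and K."""
    Qr, Kr = apply_rope(Q), apply_rope(K)
    scores = Qr @ Kr.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)
```

Because each pair is rotated by an angle linear in position, the dot product between a rotated query at position i and a rotated key at position j depends on the positions only through the offset j - i. Nothing in the function refers to a maximum length, which is why it runs unchanged at any T in the sweep.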

Block B — sweep T and record attention patterns

  • For each \(T \in \{64, 96, 128, 192, 256\}\) and each PE scheme:
      • Generate \(X \in \mathbb{R}^{T \times 32}\).
      • Apply PE (or RoPE inside attention).
      • Run forward, capture the attention matrix for head 0.
      • Compute the entropy of each row: \(H_i = -\sum_j A_{ij} \log A_{ij}\).
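The row-entropy formula above translates directly to NumPy. A minimal sketch, assuming `A` is a row-stochastic (T, T) array and using the usual convention that zero entries contribute nothing (0 · log 0 = 0); the function names are illustrative.

```python
import numpy as np


def row_entropy(A):
    """Entropy in nats of each row of a row-stochastic attention matrix."""
    A = np.asarray(A, dtype=float)
    # Replace zeros by 1 inside the log so that 0 * log(0) evaluates to 0
    # instead of nan (relevant if a mask zeroes out some entries).
    log_a = np.log(np.where(A > 0, A, 1.0))
    return -(A * log_a).sum(axis=-1)


def mean_entropy(A):
    """Average row entropy, the quantity plotted against T in Block D."""
    return float(row_entropy(A).mean())
```

Two sanity checks are worth running before the sweep: a uniform matrix with all entries 1/T should give log T for every row, and an identity matrix (each row attends to exactly one position) should give 0.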

Block C — plot 1: attention pattern grid

  • 3 rows (one per PE scheme), 5 columns (one per \(T\)).
  • Each cell is the \(T \times T\) attention heatmap.
  • Annotate: rows = PE scheme, columns = \(T\).
  • Save as attention_pattern_comparison.png.
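The grid described above could be laid out roughly as follows. This is a sketch under stated assumptions: `plot_attention_grid` is a hypothetical helper, `patterns[scheme][T]` is assumed to hold the captured (T, T) head-0 matrix, and the colormap and figure size are arbitrary choices.

```python
import numpy as np
import matplotlib

matplotlib.use("Agg")  # headless backend: write files without opening a window
import matplotlib.pyplot as plt


def plot_attention_grid(patterns, schemes, t_sweep, path):
    """Draw one heatmap per (scheme, T) cell and save the figure to `path`."""
    n_rows, n_cols = len(schemes), len(t_sweep)
    fig, axes = plt.subplots(
        n_rows, n_cols, figsize=(3 * n_cols, 3 * n_rows), squeeze=False
    )
    for r, scheme in enumerate(schemes):
        for c, T in enumerate(t_sweep):
            ax = axes[r, c]
            ax.imshow(patterns[scheme][T], cmap="viridis", aspect="auto")
            ax.set_xticks([])
            ax.set_yticks([])
            if r == 0:
                ax.set_title(f"T = {T}")   # column annotation
            if c == 0:
                ax.set_ylabel(scheme)      # row annotation
    fig.tight_layout()
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return fig
```

Each cell gets its own color scale here, which makes within-cell structure visible but means colors are not comparable across cells. If cross-cell comparison matters, passing shared `vmin` and `vmax` values to `imshow` is one alternative, at the cost of washing out the larger-T panels whose entries are smaller on average.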

Expected observations:

  • Sinusoidal: attention patterns look reasonable up to \(T = 128\), become noisier at \(T = 256\) (offsets this long lie well outside the notional training length).
  • Learned: at \(T = 64\) patterns are valid; at \(T > 64\) the pattern degenerates: rows 64+ all share the position-63 embedding, so they are positionally indistinguishable and look like "row 63".
  • RoPE: patterns scale gracefully with \(T\) — no visible degradation.

Block D — plot 2: average attention entropy

  • For each \(T\) and each PE scheme, compute mean attention entropy \(\bar{H}(T) = \frac{1}{T} \sum_i H_i\).
  • Plot \(\bar{H}(T)\) vs \(T\). Three curves.
  • Save as extrapolation_metric.png.

Lower entropy = more peaked attention (model is confident about who to attend to). Higher entropy = uniform = the model is "lost".

Expected:

  • RoPE: roughly constant entropy with \(T\) (graceful extrapolation).
  • Sinusoidal: entropy rises with \(T\) (attention degrades).
  • Learned: entropy spikes for \(T > 64\) (broken).
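When reading the curves, it helps to know that the entropy of a fully uniform row over \(T\) keys is \(\log T\), so the ceiling itself rises from about 4.16 nats at \(T = 64\) to about 5.55 nats at \(T = 256\). The sketch below computes that baseline and an optional normalized entropy. It assumes unmasked attention (every row sees all \(T\) keys; under a causal mask, row \(i\) has the lower ceiling \(\log(i+1)\)), and the helper names are illustrative rather than part of the lab's required output.

```python
import numpy as np

T_SWEEP = [64, 96, 128, 192, 256]


def uniform_entropy(T):
    """Entropy in nats of a uniform attention row over T keys."""
    return float(np.log(T))


def normalized_entropy(mean_h, T):
    """Mean entropy as a fraction of the uniform ceiling (1.0 = fully uniform)."""
    return mean_h / uniform_entropy(T)


# Baseline values to draw as a dashed reference line on the entropy plot.
baseline = {T: uniform_entropy(T) for T in T_SWEEP}
```

Plotting the baseline next to the three curves, or plotting the normalized value instead, separates "entropy rose because the scheme degraded" from "entropy rose because there are more keys to spread over".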

Block E — write up

In README.md, answer:

  1. Which scheme has the most stable attention patterns across \(T\)? RoPE — argue with reference to plots.
  2. Where does learned PE fail catastrophically? At \(T > T_\text{max}\).
  3. Recommendation for Phase 17. Based on these results, recommend RoPE for Phase 17's Mini-GPT. If RoPE proves too complex, fall back to sinusoidal. Do not use learned PE.

Block F — manifest

{
  "experiment": "16-extrapolation",
  "date": "YYYY-MM-DD",
  "seed": 0,
  "versions": { "python": "3.11.x", "numpy": "X.Y.Z", "matplotlib": "X.Y.Z" },
  "config": {
    "d_model": 32,
    "n_heads": 2,
    "T_sweep": [64, 96, 128, 192, 256],
    "learned_T_max": 64,
    "trained": false
  },
  "schemes_compared": ["sinusoidal", "learned", "rope"],
  "phase_17_recommendation": "rope_with_sinusoidal_fallback"
}
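Rather than filling in the date and version placeholders by hand, compare.py could generate the manifest at run time. A minimal sketch using only the standard library; `build_manifest`, `write_manifest`, and `installed_version` are hypothetical names, and the field values are copied from the template above.

```python
import json
import platform
from datetime import date
from importlib import metadata


def installed_version(pkg):
    """Return the installed version of a package, or a marker if it is absent."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return "not installed"


def build_manifest(seed=0):
    return {
        "experiment": "16-extrapolation",
        "date": date.today().isoformat(),           # YYYY-MM-DD
        "seed": seed,
        "versions": {
            "python": platform.python_version(),
            "numpy": installed_version("numpy"),
            "matplotlib": installed_version("matplotlib"),
        },
        "config": {
            "d_model": 32,
            "n_heads": 2,
            "T_sweep": [64, 96, 128, 192, 256],
            "learned_T_max": 64,
            "trained": False,
        },
        "schemes_compared": ["sinusoidal", "learned", "rope"],
        "phase_17_recommendation": "rope_with_sinusoidal_fallback",
    }


def write_manifest(path, manifest):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
        f.write("\n")
```

The recommendation field is hard-coded to match the template; if the plots end up supporting a different conclusion, that string should be changed to reflect what the README argues.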

Constraints

  • No training. This is a comparison of an architectural property, not learned behavior. With random weights, all three schemes produce noisy attention; the structure is what differs.
  • No PyTorch.
  • Document the learned-PE-beyond-T_max handling clearly. Some readers will take the reuse of the position-63 embedding to be the "learned PE limitation"; clarify that the limitation is even worse without such a workaround (positions ≥ T_max simply have no embedding, so the lookup is undefined).

Stop conditions

Done when:

  1. All five files listed under "What you produce" committed.
  2. Both plots saved.
  3. README.md makes the Phase 17 recommendation explicitly.

Pitfalls

  • No training means no real "PE comparison". This lab compares architectural stability, not learned performance. The trained comparison happens implicitly in Phase 17/18. Be honest about this in README.md.
  • Entropy can be misleading. A perfectly uniform attention has high entropy and the model is "lost". A degenerate attention (one entry near 1) has low entropy and the model is "confident but possibly wrong". Use entropy as a shape indicator, not a quality measure.
  • mha.forward doesn't currently accept a RoPE option. For this lab, either monkey-patch the forward or wrap it in a helper. Document the hack. The clean integration with MultiHeadAttention happens in Phase 17.

When to consult solutions/

After all five files are committed and the Phase 17 recommendation is made. Solution at solutions/03-extrapolation-ref.md.


End of Phase 16 labs. Write PHASE_16_REPORT.md (include the Phase 17 PE recommendation prominently) and fill in learners/borja/phase-16/reflections.md.

The next phase assembles all of this into a working transformer block. The PE choice locks in Phase 17 — make it carefully.