Phase 16 — Quiz (human-readable mirror)

Human-readable mirror of the canonical data/quizzes/phase-16-positional-encodings.yaml.

Source: data/quizzes/phase-16-positional-encodings.yaml.


q-16-01 — Why is attention permutation-equivariant without PE? (single)

  • softmax(P Q K^T P^T / sqrt(d_k)) P V = P softmax(Q K^T / sqrt(d_k)) V
  • softmax(Q K^T / sqrt(d_k)) V = softmax(Q^T K / sqrt(d_k)) V
  • softmax(Q K^T) V = V softmax(Q K^T)
  • softmax is invariant under any unitary transformation

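Permuting the rows of Q, K, and V by P permutes the output rows by the same P: the row-wise softmax satisfies softmax(P A P^T) = P softmax(A) P^T, and P^T P = I cancels in front of V. No operation in the block depends on absolute position.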
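
The identity in the correct option can be checked numerically. The following is a minimal NumPy sketch, not part of the quiz source: the helper names `softmax` and `attention`, the shapes, and the random seed are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Single-head scaled dot-product attention with no positional encoding.
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)          # arbitrary seed (assumption)
T, d = 6, 4                             # illustrative sequence length and width
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

perm = rng.permutation(T)
P = np.eye(T)[perm]                     # permutation matrix: P @ X reorders rows of X

lhs = attention(P @ Q, P @ K, P @ V)    # permute the inputs first
rhs = P @ attention(Q, K, V)            # permute the output afterwards
equivariant = np.allclose(lhs, rhs)
print(equivariant)
```

If a positional encoding were added to the inputs before projecting to Q and K, the two sides would generally differ, which is the point of the question.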


q-16-02 — What does RoPE encode that sinusoidal does not? (multi)

  • RoPE encodes relative position via a dot-product identity
  • RoPE is applied inside the attention computation, not at the input
  • RoPE has learnable parameters; sinusoidal does not
  • RoPE extrapolates better to lengths beyond the training distribution

Both RoPE and vanilla sinusoidal have zero learnable parameters.


q-16-03 — Find the bug: cos(PE[t], PE[t+1]) ≈ 0 (free)

Expected to contain: smooth.

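Sinusoidal PE has smooth phase relationships, so adjacent positions should be cosine-similar: the cosine similarity of PE[t] and PE[t+1] is the average of cos(ω_i) over the frequencies ω_i, which is the same for every t and close to 1 for typical settings. Near-orthogonal adjacent rows suggest the PE table was shuffled, randomized, or otherwise built incorrectly.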
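
A quick diagnostic along these lines is sketched below. It assumes the standard interleaved sin/cos layout with base 10000; the function names `sinusoidal_pe` and `adjacent_cosine`, the table size, and the random Gaussian table used as a stand-in for a broken encoding are illustrative, not taken from the quiz source.

```python
import numpy as np

def sinusoidal_pe(T, d_model, base=10000.0):
    # Interleaved layout: even columns sin, odd columns cos (assumed convention).
    pos = np.arange(T)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / base ** (i / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def adjacent_cosine(pe):
    # Cosine similarity between each row and the next one.
    a, b = pe[:-1], pe[1:]
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return (a * b).sum(axis=1) / norms

pe = sinusoidal_pe(T=64, d_model=64)
good = adjacent_cosine(pe)

# Stand-in for a broken table: rows with no phase structure at all.
rng = np.random.default_rng(0)
bad = adjacent_cosine(rng.normal(size=pe.shape))

print(good.min(), bad.mean())
```

A correct table should give adjacent similarities that are high and constant across t, while the unstructured table should average near zero.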


q-16-04 — When does RoPE win on §A13? (single)

  • Sinusoidal (it's the original)
  • RoPE (relative position extrapolates better)
  • They are identical in extrapolation
  • Learned PE with weight decay

RoPE attention scores depend only on relative offsets, so offsets seen during training stay in-distribution at longer sequence lengths. Sinusoidal at T=20 is out of distribution (OOD).


q-16-05 — RoPE: relative-position identity (single)

For RoPE, the dot product (R_θ(t) q) · (R_θ(s) k) depends only on which quantity?

  • t + s
  • t - s
  • t × s
  • max(t, s)

Since (R_θ(t) q) · (R_θ(s) k) = q^T R_θ(t)^T R_θ(s) k and R_θ(t)^T R_θ(s) = R_θ(s - t), the dot product depends only on the relative offset between t and s. Su et al. 2021 §3.4.
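
The identity can be verified with a small numerical experiment. This is a sketch under stated assumptions: the function `rope` is a hypothetical helper that rotates interleaved pairs of dimensions with base 10000, and the vector size and positions are arbitrary.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate each pair (x[2i], x[2i+1]) by the angle pos * base**(-2i/d).
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)           # arbitrary seed (assumption)
q, k = rng.normal(size=8), rng.normal(size=8)

score_a = rope(q, 3) @ rope(k, 7)        # offset s - t = 4
score_b = rope(q, 103) @ rope(k, 107)    # same offset, shifted by 100
score_c = rope(q, 3) @ rope(k, 8)        # different offset

same_offset_matches = np.isclose(score_a, score_b)
print(same_offset_matches, np.isclose(score_a, score_c))
```

Shifting both positions by the same amount leaves the score unchanged, while changing the offset changes it for generic q and k.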