Phase 16 — Quiz (human-readable mirror)
Human-readable mirror of the canonical source: data/quizzes/phase-16-positional-encodings.yaml.
q-16-01 — Why is attention permutation-equivariant without PE? (single)
- softmax(P Q K^T P^T / sqrt(d_k)) P V = P softmax(Q K^T / sqrt(d_k)) V ✓
- softmax(Q K^T / sqrt(d_k)) V = softmax(Q^T K / sqrt(d_k)) V
- softmax(Q K^T) V = V softmax(Q K^T)
- softmax is invariant under any unitary transformation
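Permutation P commutes through the entire attention block: no operation depends on absolute position. Row-wise softmax satisfies softmax(P A P^T) = P softmax(A) P^T, and P^T P = I cancels against the permuted values, leaving P softmax(Q K^T / sqrt(d_k)) V.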
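The identity in the correct option can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names (`matmul`, `softmax_rows`, `attention`, `permute_rows`, `random_matrix`) and the sizes `T = 5`, `d_k = 4` are illustrative assumptions, not part of the quiz source.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(a):
    """Row-wise softmax, subtracting the row max for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """Scaled dot-product attention with no positional encoding."""
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def permute_rows(a, perm):
    """Left-multiply by a permutation matrix: output row i is a[perm[i]]."""
    return [a[i] for i in perm]


def random_matrix(rng, rows, cols):
    """Hypothetical helper: a rows x cols matrix of standard normal samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, d_k = 5, 4  # illustrative sizes
Q, K, V = (random_matrix(rng, T, d_k) for _ in range(3))
perm = list(range(T))
rng.shuffle(perm)

# Left side: attention applied to permuted inputs.
lhs = attention(permute_rows(Q, perm), permute_rows(K, perm), permute_rows(V, perm))
# Right side: permute the output of attention on the original inputs.
rhs = permute_rows(attention(Q, K, V), perm)

max_gap = max(abs(x - y) for row_l, row_r in zip(lhs, rhs) for x, y in zip(row_l, row_r))
print(f"max |lhs - rhs| = {max_gap:.2e}")
```

The two sides should agree up to floating-point rounding, since permuting the keys only reorders the terms in each softmax sum.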
q-16-02 — What does RoPE encode that sinusoidal does not? (multi)
- RoPE encodes relative position via a dot-product identity ✓
- RoPE is applied inside the attention computation, not at the input ✓
- RoPE has learnable parameters; sinusoidal does not
- RoPE extrapolates better to lengths beyond the training distribution ✓
Both RoPE and vanilla sinusoidal have zero learnable parameters.
q-16-03 — Find the bug: cos(PE[t], PE[t+1]) ≈ 0 (free)
Expected to contain: smooth.
Sinusoidal PE has smooth phase relationships, so adjacent positions should be cosine-similar. Near-orthogonal adjacent rows suggest the PE table was shuffled or randomized.
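A quick way to see the expected smoothness is to compare adjacent rows of a sinusoidal table against a random table. This is a sketch under assumptions: the interleaved sin/cos layout, `base=10000.0`, `d_model=64`, and 50 positions are illustrative choices, and the function names are hypothetical.

```python
import math
import random


def sinusoidal_pe(num_positions, d_model, base=10000.0):
    """Interleaved sin/cos table: one (sin, cos) pair per frequency."""
    table = []
    for t in range(num_positions):
        row = []
        for i in range(0, d_model, 2):
            angle = t / base ** (i / d_model)
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def adjacent_similarities(table):
    """Cosine similarity between each row and the next one."""
    return [cosine(table[t], table[t + 1]) for t in range(len(table) - 1)]


pe = sinusoidal_pe(num_positions=50, d_model=64)
smooth_sims = adjacent_similarities(pe)

# Baseline: a table of independent Gaussian rows, standing in for a
# randomized PE (an assumption about what the "bug" might look like).
rng = random.Random(0)
random_table = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(50)]
random_sims = adjacent_similarities(random_table)
random_mean = sum(random_sims) / len(random_sims)

print(f"sinusoidal: min adjacent cosine  = {min(smooth_sims):.3f}")
print(f"random:     mean adjacent cosine = {random_mean:.3f}")
```

For the sinusoidal table the adjacent similarity is the same at every position, because it reduces to the average of cos(ω_i) over the frequencies ω_i. A value near zero therefore points at the table rather than at the position.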
q-16-04 — When does RoPE win on §A13? (single)
- Sinusoidal (it's the original)
- RoPE (relative position extrapolates better) ✓
- They are identical in extrapolation
- Learned PE with weight decay
RoPE's relative-position identity stays in-distribution for any length. Sinusoidal PE at T=20 is out-of-distribution (OOD).
q-16-05 — RoPE: relative-position identity (single)
For RoPE, the dot product (R_θ(t) q) · (R_θ(s) k) depends only on which quantity?
- t + s
- t - s ✓
- t × s
- max(t, s)
Since (R_θ(t) q) · (R_θ(s) k) = q^T R_θ(t)^T R_θ(s) k and R_θ(t)^T R_θ(s) = R_θ(s - t), the dot product depends only on the relative offset. Su et al. 2021 §3.4.
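The identity can be illustrated with a small numerical check. This is a minimal sketch, assuming the common convention of rotating consecutive coordinate pairs with frequencies `base ** (-i / d)`; the function name `rope_rotate`, the dimension 8, and the chosen positions are illustrative, not taken from the quiz source.

```python
import math
import random


def rope_rotate(x, t, base=10000.0):
    """Rotate each pair (x[i], x[i+1]) by the angle t * theta_i."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)
        c, s = math.cos(t * theta), math.sin(t * theta)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
dim = 8  # illustrative head dimension (must be even)
q_vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
k_vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Same offset s - t = 4 at two different absolute positions.
score_near = dot(rope_rotate(q_vec, 3), rope_rotate(k_vec, 7))
score_far = dot(rope_rotate(q_vec, 103), rope_rotate(k_vec, 107))
# Different offset s - t = 5.
score_other = dot(rope_rotate(q_vec, 3), rope_rotate(k_vec, 8))

print(f"offset 4 at t=3:   {score_near:.6f}")
print(f"offset 4 at t=103: {score_far:.6f}")
print(f"offset 5 at t=3:   {score_other:.6f}")
```

The first two scores should match up to rounding because both pairs share the offset 4, while the third is free to differ.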