Phase 13: Quiz (human-readable mirror)
Human-readable mirror of the canonical quiz file.
Source: data/quizzes/phase-13-embeddings.yaml.
q-13-01: Lookup-is-matmul identity (single)
Given an embedding table E of shape (V, d) and a one-hot vector x_i (length V), which expression equals E[i]?
- `x_i @ E^T`
- `x_i @ E` ✓
- `E @ x_i^T`
- `E[i, :].sum()`
`x_i @ E` selects the i-th row because the one-hot has a 1 only at index `i`. This is why `nn.Embedding` is implemented as a gather: it returns the same row without building the one-hot vector or doing the full matmul.
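The identity can be checked on a toy table. This is a minimal sketch in plain Python; the helper names `one_hot` and `vec_matmul` and the values in `E` are illustrative assumptions, not part of the quiz source.

```python
def one_hot(i, V):
    """Length-V vector with 1.0 at index i and 0.0 elsewhere."""
    return [1.0 if k == i else 0.0 for k in range(V)]


def vec_matmul(x, E):
    """Row vector x (length V) times matrix E (V x d), giving a length-d vector."""
    d = len(E[0])
    return [sum(x[k] * E[k][j] for k in range(len(E))) for j in range(d)]


# Toy embedding table: V = 4 tokens, d = 3 dimensions (arbitrary values).
E = [
    [0.1, 0.2, 0.3],
    [1.0, 1.1, 1.2],
    [2.0, 2.1, 2.2],
    [3.0, 3.1, 3.2],
]

i = 2
via_matmul = vec_matmul(one_hot(i, len(E)), E)  # touches all V * d entries
via_gather = E[i]                               # touches only d entries

assert via_matmul == via_gather
```

Both paths return row `i`, but the matmul path does work proportional to `V * d` while the gather reads a single row.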
q-13-02: Why tied embeddings? (multi)
- Halves the parameter count of those two layers ✓
- Empirically improves perplexity by ~5% (Press & Wolf 2017) ✓
- Lets you swap in a pretrained embedding without retraining the LM head
- Forces input and output vocab sizes to match ✓
Tying halves the parameters of those two layers, improves perplexity per Press & Wolf 2017, and forces `V_in == V_out` (which is why tied models tend to break during refactors).
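The parameter saving is plain arithmetic. The sketch below assumes a bias-free LM head and uses illustrative sizes `V = 512`, `d = 128`; the function name `embedding_and_head_params` is hypothetical.

```python
def embedding_and_head_params(V, d, tied):
    """Parameter count of the input embedding plus a bias-free LM head."""
    embed = V * d                  # input embedding table, shape (V, d)
    head = 0 if tied else V * d    # a tied head reuses the embedding weights
    return embed + head


V, d = 512, 128  # illustrative sizes, not taken from a specific model
untied = embedding_and_head_params(V, d, tied=False)
tied = embedding_and_head_params(V, d, tied=True)

print(untied, tied)  # 131072 65536
assert tied * 2 == untied
```

The halving applies only to these two layers; the rest of the model is unaffected.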
q-13-03: Effective rank: when is it healthy? (free)
Expected to contain: collapse.
`r_eff = 22 << min(V, d) = 128` means the embedding has collapsed. Most tokens occupy a tiny subspace, hurting expressivity. Weight decay 0.1 is the standard fix.
q-13-04: Find the bug: shape mismatch with tied embeddings (single)
A refactor changes the tokenizer's vocab from 512 to 256 but the Embedding module still has vocab_size=512. The LM head is tied. What shape does logits have on a (B=4, T=8) input?
- `(4, 8, 256)`
- `(4, 8, 512)` ✓
- `(4, 8, 128)`
- `(4, 8)`
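The tied head multiplies by `embedding.weight.T`, which has shape `(d, 512)`, so the logits have shape `(B, T, 512)`, here `(4, 8, 512)`. Downstream code sized for the 256-token vocab (such as the cross-entropy setup) then fails on the shape mismatch.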
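The shape reasoning can be traced without a framework. This sketch assumes `d_model = 128` (the question does not state `d`), and `check_vocab_consistency` is a hypothetical guard showing one way to fail fast after such a refactor.

```python
def tied_logits_shape(batch, seq_len, d_model, embed_vocab_size):
    """Shape of hidden @ embedding.weight.T for a tied LM head."""
    hidden_shape = (batch, seq_len, d_model)     # (B, T, d)
    head_shape = (d_model, embed_vocab_size)     # embedding.weight.T
    assert hidden_shape[-1] == head_shape[0]     # inner dimensions must agree
    return hidden_shape[:-1] + (head_shape[1],)


def check_vocab_consistency(tokenizer_vocab_size, embed_vocab_size):
    """Hypothetical guard: raise early if tokenizer and embedding disagree."""
    if tokenizer_vocab_size != embed_vocab_size:
        raise ValueError(
            f"tokenizer vocab {tokenizer_vocab_size} != "
            f"embedding vocab {embed_vocab_size}"
        )


# The embedding still has vocab_size=512, so the logits follow it,
# regardless of the tokenizer's new 256-token vocab.
logits_shape = tied_logits_shape(batch=4, seq_len=8, d_model=128, embed_vocab_size=512)
print(logits_shape)  # (4, 8, 512)
```

Calling `check_vocab_consistency(256, 512)` at model construction would surface the refactor bug before any forward pass runs.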
q-13-05: Embedding gradient sparsity (single)
In CBOW training with batch size B and a one-hot per example (no repeats), how many rows of E receive a non-zero gradient per backward step?
- All `V` rows
- Exactly `B` rows ✓
- Exactly `d` rows
- Exactly 1 row
Each example touches only its embedded token's row. This sparsity is why mature frameworks store embedding gradients as `IndexedSlices` or scatter updates.
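The sparsity can be illustrated with a scatter-add over token indices. This is a simplified sketch, not any framework's implementation: `embedding_grad_rows` is a hypothetical helper, the upstream gradients are made-up values, and each example contributes a single token as in the question.

```python
def embedding_grad_rows(token_ids, upstream_grads):
    """Sparse gradient w.r.t. E: maps row index -> accumulated gradient vector.

    Rows that never appear in token_ids are absent, i.e. their gradient is zero.
    """
    grad = {}
    for tok, g in zip(token_ids, upstream_grads):
        row = grad.setdefault(tok, [0.0] * len(g))
        for j, gj in enumerate(g):
            row[j] += gj  # scatter-add into the row for this token
    return grad


V, d = 10, 2
token_ids = [7, 1, 4]                                # B = 3, no repeats
upstream = [[0.5, -0.5], [1.0, 2.0], [-1.0, 0.25]]   # one gradient vector per example

grad = embedding_grad_rows(token_ids, upstream)
assert len(grad) == len(token_ids)   # exactly B rows touched
assert len(grad) < V                 # the other V - B rows get no gradient
```

If the batch did contain repeats, the repeated token's contributions would accumulate into one row, so fewer than `B` rows would be touched.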