Phase 13 — Quiz (human-readable mirror)

Human-readable mirror of the canonical data/quizzes/phase-13-embeddings.yaml.

Source: data/quizzes/phase-13-embeddings.yaml.

q-13-01 — Lookup-is-matmul identity (single)

Given an embedding table E of shape (V, d) and a one-hot vector x_i (length V), which expression equals E[i]?

x_i @ E^T
x_i @ E ✓
E @ x_i^T
E[i, :].sum()

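x_i @ E selects the i-th row because the one-hot vector has a 1 only at index i, so the product is a weighted sum of the rows of E in which every weight except the i-th is zero. This identity is why nn.Embedding can be implemented as a gather: it returns the same result as the matmul without materializing the one-hot vector.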
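The identity can be checked numerically. The following is a minimal NumPy sketch, not the course's reference code; the sizes V = 6, d = 4 and the index i = 2 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, i = 6, 4, 2                # illustrative sizes and index (assumed)
E = rng.normal(size=(V, d))      # embedding table, one row per token

x_i = np.zeros(V)
x_i[i] = 1.0                     # one-hot vector for token i

via_matmul = x_i @ E             # (V,) @ (V, d) -> (d,)
via_gather = E[i]                # direct row lookup, no one-hot needed

print(np.allclose(via_matmul, via_gather))  # True
```

The gather reads d numbers, while the matmul multiplies and sums V * d of them, almost all against zero.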

q-13-02 — Why tied embeddings? (multi)

Halves the parameter count of those two layers ✓
Empirically improves perplexity by ~5% (Press & Wolf 2017) ✓
Lets you swap in a pretrained embedding without retraining the LM head
Forces input and output vocab sizes to match ✓

Tying halves the parameter count of the two layers, improves perplexity per Press & Wolf 2017, and forces V_in == V_out, which is why a vocabulary change during a refactor can break a tied model.
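To make the parameter count and the shape constraint concrete, here is a small NumPy sketch of weight tying. The helper names `embed` and `lm_head` and the sizes V = 512, d = 128 are assumptions chosen for illustration.

```python
import numpy as np

V, d = 512, 128                      # illustrative sizes (assumed)
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))          # the single shared table


def embed(token_ids):
    """Input side: look up one row of E per token id."""
    return E[token_ids]


def lm_head(hidden):
    """Output side: reuse E transposed, so logits have V entries."""
    return hidden @ E.T


token_ids = np.array([[1, 5, 9]])    # (B=1, T=3)
hidden = embed(token_ids)            # (1, 3, d)
logits = lm_head(hidden)             # (1, 3, V)

untied_params = 2 * V * d            # separate input and output tables
tied_params = V * d                  # one table used twice

print(hidden.shape, logits.shape)
print(untied_params, tied_params)
```

Because both sides read the same array, the input vocabulary size and the number of output classes are the same number by construction.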

q-13-03 — Effective rank: when is it healthy? (free)

Expected to contain: collapse.

r_eff = 22 << min(V, d) = 128 means the embedding has collapsed: most tokens occupy a small subspace, which hurts expressivity. The fix suggested here is weight decay of 0.1.
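The quiz does not spell out how r_eff is computed. One common definition, assumed here, is the exponential of the Shannon entropy of the normalized singular values. The sketch below uses that definition on two synthetic 128 x 128 matrices; the function name `effective_rank` is hypothetical.

```python
import numpy as np


def effective_rank(E):
    """Entropy-based effective rank (an assumed definition, see lead-in)."""
    s = np.linalg.svd(E, compute_uv=False)   # singular values, descending
    p = s / s.sum()                          # normalize to a distribution
    p = p[p > 0]                             # drop zeros so log is defined
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


healthy = np.eye(128)                        # all 128 directions used equally
collapsed = np.zeros((128, 128))
collapsed[:4, :4] = np.eye(4)                # only 4 directions carry signal

print(effective_rank(healthy))               # close to 128
print(effective_rank(collapsed))             # close to 4
```

Under this definition a value far below min(V, d), such as 22 against 128, indicates that a few directions carry most of the spectrum.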

q-13-04 — Find the bug: shape mismatch with tied embeddings (single)

A refactor changes the tokenizer's vocab from 512 to 256 but the Embedding module still has vocab_size=512. The LM head is tied. What shape does logits have on a (B=4, T=8) input?

(4, 8, 256)
(4, 8, 512) ✓
(4, 8, 128)
(4, 8)

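The tied head multiplies by embedding.weight.T, which has shape (d, 512), so logits are (B, T, 512). The tokenizer now emits only 256 ids, so any code that expects 256 classes fails on the shape mismatch. A plain cross-entropy over class indices would still run, because ids below 256 are valid targets for 512-way logits, so the bug can also go unnoticed.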
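The shape can be reproduced with plain arrays. This is a sketch under assumptions: d = 128 is not given in the question, and `check_vocab` is a hypothetical guard, not part of any library.

```python
import numpy as np

B, T, d = 4, 8, 128                  # d = 128 is an assumption
stale_vocab = 512                    # what the Embedding module still uses
tokenizer_vocab = 256                # what the tokenizer now produces

E = np.zeros((stale_vocab, d))       # stands in for embedding.weight
hidden = np.zeros((B, T, d))         # stands in for the final hidden states
logits = hidden @ E.T                # tied head: (B, T, d) @ (d, 512)

print(logits.shape)                  # (4, 8, 512)


def check_vocab(logits, vocab_size):
    """Hypothetical guard: fail loudly when the head and tokenizer disagree."""
    if logits.shape[-1] != vocab_size:
        raise ValueError(
            f"logits have {logits.shape[-1]} classes, tokenizer has {vocab_size}"
        )


try:
    check_vocab(logits, tokenizer_vocab)
except ValueError as err:
    print("mismatch caught:", err)
```

An explicit check like this turns the mismatch into an immediate error at the point where the two vocabulary sizes meet.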

q-13-05 — Embedding gradient sparsity (single)

In CBOW training with batch size B and a one-hot per example (no repeats), how many rows of E receive a non-zero gradient per backward step?

All V rows
Exactly B rows ✓
Exactly d rows
Exactly 1 row

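Each example touches only the row of its embedded token, so with B distinct tokens exactly B rows receive a non-zero gradient. This sparsity is why mature frameworks represent embedding gradients sparsely, for example as IndexedSlices in TensorFlow, or apply them as scatter updates.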
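The row count can be verified by writing the backward pass of a lookup by hand. The sketch below assumes V = 10, d = 3, and B = 4 distinct token ids, and uses a random upstream gradient as a stand-in for a real loss.

```python
import numpy as np

V, d, B = 10, 3, 4                          # illustrative sizes (assumed)
rng = np.random.default_rng(0)

token_ids = np.array([7, 1, 4, 8])          # B distinct ids, no repeats
grad_out = rng.normal(size=(B, d))          # upstream gradient per looked-up row

# Backward of a gather is a scatter-add into the rows that were read.
grad_E = np.zeros((V, d))
np.add.at(grad_E, token_ids, grad_out)

touched = np.flatnonzero(np.any(grad_E != 0, axis=1))
print(touched)                              # [1 4 7 8]
print(len(touched) == B)                    # True
```

`np.add.at` accumulates correctly even if an id repeats, in which case fewer than B rows would be touched. A sparse representation stores only the B (index, row) pairs instead of the full (V, d) array.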