Phase 13 — Quiz (human-readable mirror)

Human-readable mirror of the canonical data/quizzes/phase-13-embeddings.yaml.

Source: data/quizzes/phase-13-embeddings.yaml.


q-13-01 — Lookup-is-matmul identity (single)

Given an embedding table E of shape (V, d) and a one-hot vector x_i (length V), which expression equals E[i]?

  • x_i @ E^T
  • x_i @ E
  • E @ x_i^T
  • E[i, :].sum()

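x_i @ E selects the i-th row of E because the one-hot vector has a 1 only at index i, so every other row is multiplied by zero. This is why nn.Embedding can be implemented as a gather: it returns the same result as the matmul without paying for the O(V·d) multiply.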
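
The identity can be checked numerically. The following is a minimal NumPy sketch; the sizes V = 10, d = 4 and the index i = 3 are arbitrary values chosen for illustration, not taken from the quiz.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                      # illustrative sizes (assumed)
E = rng.standard_normal((V, d))   # embedding table, shape (V, d)

i = 3
x_i = np.zeros(V)
x_i[i] = 1.0                      # one-hot vector of length V

via_matmul = x_i @ E              # (V,) @ (V, d) -> (d,)
via_gather = E[i]                 # direct row lookup (gather)

print(np.allclose(via_matmul, via_gather))
```

The other matmul options fail on shape alone when V != d: x_i @ E^T multiplies a length-V vector by a (d, V) matrix, and E @ x_i^T multiplies a (V, d) matrix by a length-V column.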


q-13-02 — Why tied embeddings? (multi)

  • Halves the parameter count of those two layers
  • Empirically improves perplexity by ~5% (Press & Wolf 2017)
  • Lets you swap in a pretrained embedding without retraining the LM head
  • Forces input and output vocab sizes to match

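Tying halves the parameter count of the two layers, improves perplexity per Press & Wolf 2017, and forces V_in == V_out, which is why tied models tend to break during refactors that change the vocabulary size. It does not let you swap in a pretrained embedding independently, because the LM head shares the same weights.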
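
As a rough illustration of weight tying, the sketch below reuses one matrix for both the input lookup and the output projection. The helper names embed and lm_head, the sizes V = 512 and d = 128, and the 0.02 init scale are assumptions made for this example.

```python
import numpy as np

V, d = 512, 128                   # illustrative sizes (assumed)
rng = np.random.default_rng(0)
E = rng.standard_normal((V, d)) * 0.02   # the single shared weight matrix

def embed(token_ids):
    """Input side: gather rows of E (hypothetical helper)."""
    return E[token_ids]

def lm_head(hidden):
    """Output side: project with the transpose of the same E (hypothetical helper)."""
    return hidden @ E.T

token_ids = np.array([[1, 5, 9]])
hidden = embed(token_ids)         # shape (1, 3, d)
logits = lm_head(hidden)          # shape (1, 3, V)

untied_params = 2 * V * d         # separate input and output matrices
tied_params = E.size              # one shared matrix
print(logits.shape, untied_params, tied_params)
```

Because both sides read the same array, the output vocabulary size is whatever the embedding's row count is; there is no way to set them independently.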


q-13-03 — Effective rank: when is it healthy? (free)

Expected to contain: collapse.

r_eff = 22 ≪ min(V, d) = 128 means the embedding has collapsed: most tokens occupy a small subspace, which hurts expressivity. Weight decay of 0.1 is the standard fix.
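
The quiz does not state which definition of r_eff it uses. The sketch below assumes one common choice, the exponential of the Shannon entropy of the normalized singular values; the function name effective_rank and the test matrices are hypothetical.

```python
import numpy as np

def effective_rank(E, eps=1e-12):
    """Entropy-based effective rank (an assumed definition, not the quiz's)."""
    s = np.linalg.svd(E, compute_uv=False)   # singular values, descending
    p = s / s.sum()                          # normalize to a distribution
    p = p[p > eps]                           # drop numerically zero entries
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))

rng = np.random.default_rng(0)
V, d = 512, 128                              # illustrative sizes (assumed)

healthy = rng.standard_normal((V, d))        # spreads over all d directions
# Collapsed table: every row lies in a 4-dimensional subspace.
collapsed = rng.standard_normal((V, 4)) @ rng.standard_normal((4, d))

print(effective_rank(healthy), effective_rank(collapsed))
```

Under this definition the value is bounded above by the number of non-zero singular values, so the collapsed table cannot score above 4 no matter how large V and d are.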


q-13-04 — Find the bug: shape mismatch with tied embeddings (single)

A refactor changes the tokenizer's vocab from 512 to 256 but the Embedding module still has vocab_size=512. The LM head is tied. What shape does logits have on a (B=4, T=8) input?

  • (4, 8, 256)
  • (4, 8, 512)
  • (4, 8, 128)
  • (4, 8)

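The tied head multiplies by embedding.weight.T, which has shape (d, 512), so logits have shape (B, T, 512) = (4, 8, 512). The tokenizer now emits only 256 ids, so any downstream code that assumes a 256-wide last dimension hits a shape mismatch.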
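
The shape arithmetic can be reproduced with plain arrays. In this sketch d = 128 is an assumed hidden size, and check_vocab is a hypothetical guard showing one way such a mismatch might be caught at construction time rather than deep inside training.

```python
import numpy as np

B, T, d = 4, 8, 128          # d is an assumed hidden size
stale_vocab = 512            # Embedding module still built with the old size
tokenizer_vocab = 256        # tokenizer vocab after the refactor

rng = np.random.default_rng(0)
E = rng.standard_normal((stale_vocab, d))   # stands in for embedding.weight
hidden = rng.standard_normal((B, T, d))

logits = hidden @ E.T        # (B, T, d) @ (d, 512) -> (B, T, 512)
print(logits.shape)

def check_vocab(weight, vocab_size):
    """Hypothetical guard: fail fast if the table and tokenizer disagree."""
    if weight.shape[0] != vocab_size:
        raise ValueError(
            f"embedding has {weight.shape[0]} rows, tokenizer has {vocab_size} ids"
        )

try:
    check_vocab(E, tokenizer_vocab)
    mismatch_caught = False
except ValueError:
    mismatch_caught = True
```

Since the head is tied, the logits width follows the embedding's row count, not the tokenizer, so the stale 512 propagates to the output.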


q-13-05 — Embedding gradient sparsity (single)

In CBOW training with batch size B and a one-hot per example (no repeats), how many rows of E receive a non-zero gradient per backward step?

  • All V rows
  • Exactly B rows
  • Exactly d rows
  • Exactly 1 row

Each example touches only its own token's row, so with B distinct tokens exactly B rows receive a non-zero gradient. This sparsity is why mature frameworks store embedding gradients as IndexedSlices or apply them as scatter updates.
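
The sparsity can be seen by writing the backward pass of a lookup by hand. This is a simplified sketch of the lookup step only, not a full CBOW model; the sizes and the upstream gradient are made up, and np.add.at stands in for a framework's scatter-add.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, B = 50, 8, 6                                # illustrative sizes (assumed)
E = rng.standard_normal((V, d))

# One token per example, no repeats within the batch.
token_ids = rng.choice(V, size=B, replace=False)

# Stand-in for dL/d(output of the lookup); strictly positive so no row
# of the upstream gradient is accidentally all zeros.
upstream = 1.0 + rng.random((B, d))

# Backward of out = E[token_ids]: scatter-add upstream rows into grad_E.
grad_E = np.zeros_like(E)
np.add.at(grad_E, token_ids, upstream)

rows_touched = int(np.count_nonzero(np.abs(grad_E).sum(axis=1)))
print(rows_touched, "of", V, "rows have a non-zero gradient")
```

If the batch contained repeated tokens, np.add.at would accumulate into the same row and fewer than B rows would be touched, which is why the question specifies no repeats.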