English · Español
01 — E[i] = one_hot(i) @ E: the lookup-is-matmul identity¶
🇪🇸 Esta es la identidad más útil de Phase 13. Una búsqueda en una matriz de embeddings es matemáticamente idéntica a un producto matriz-vector con un one-hot. Lo importante: nunca materializamos el one-hot. La implementación es indexación; la matemática es matmul.
The identity¶
Let \(E \in \mathbb{R}^{V \times d}\) be an embedding matrix and \(i \in \{0, 1, \ldots, V-1\}\) a token id. Let \(e_i \in \{0, 1\}^V\) be the one-hot vector with a 1 at position \(i\).
Then:
The left side is indexing (constant time, no math). The right side is a matmul (a row-vector times the matrix). They are literally the same vector. The proof is the definition of matrix multiplication:
since \((e_i)_j = 0\) for \(j \neq i\) and \((e_i)_i = 1\).
Why this matters¶
Practical: implementation choice¶
For the forward pass, never materialise the one-hot. Indexing is \(O(1)\) + memory copy of \(d\) floats. Matmul with a one-hot is \(O(Vd)\) and allocates a \(V\)-dimensional zero vector. Same result, dramatically different cost.
# Slow, materialises one-hot:
def embed_slow(i: int, E: np.ndarray) -> np.ndarray:
one_hot = np.zeros(E.shape[0])
one_hot[i] = 1
return one_hot @ E
# Fast, indexed:
def embed_fast(i: int, E: np.ndarray) -> np.ndarray:
return E[i]
For a batch of \(B\) token ids: E[ids] produces shape (B, d) in \(O(Bd)\) time. This is what the Phase 13 lab implements.
Conceptual: gradient flow¶
Because lookup is matmul, gradient flow through a lookup is well-defined by autograd's standard chain rule. If a downstream loss \(\mathcal{L}\) depends on \(E[i]\) with gradient \(g = \partial \mathcal{L} / \partial E[i] \in \mathbb{R}^d\), then \(\partial \mathcal{L} / \partial E\) is a sparse matrix with row \(i\) equal to \(g\) and all other rows zero.
# In autograd terms:
# Forward: v = E[i]
# Backward: dE = scatter(g, into=row[i], shape=(V, d))
# (all other rows zero)
For a batch of ids \([i_0, i_1, \ldots, i_{B-1}]\) and per-row gradients \(g_b\):
- If all \(i_b\) are distinct, each gradient lands in its own row of \(\partial \mathcal{L} / \partial E\).
- If two batch positions share the same id (e.g., the word
theappears twice in a sentence), their gradients sum into the same row. This is the standard chain rule for "the same parameter used multiple times."
This is the correct (and only) gradient. It's also why lookup tables get sparse-friendly optimizers (Adagrad's per-row adaptive rates, sparse SGD updates) in production — most rows have zero gradient on most steps.
The autograd-compatible implementation sketch¶
In src/minimodel/embedding.py:
class Embedding:
def __init__(self, num_embeddings: int, embedding_dim: int):
# Initialise small Gaussian, scale 0.02 (GPT-style)
self.E = Parameter(np.random.randn(num_embeddings, embedding_dim) * 0.02)
def __call__(self, ids: NDArray[np.int64]) -> Tensor:
"""ids: (B,) or (B, T) int → (B, d) or (B, T, d) tensor with grad."""
return self.E[ids] # autograd handles the gather and the scatter-back
The autograd machinery (Phase 8) must support integer indexing into a Tensor. If your Phase 8 autograd doesn't support this yet, here's the operation to add:
- Forward:
Tensor.gather(ids)returns the rows. - Backward: given upstream gradient \(g\) with shape
(B, d), produces gradient forself.Eof shape(V, d)by scattering:np.add.at(grad_E, ids, g). (Thenp.add.atis unbuffered — handles duplicate ids correctly.)
np.add.at is the right primitive. grad_E[ids] += g (the obvious code) silently drops duplicate-ids gradient contributions. Use np.add.at(grad_E, ids, g). Lab 00 will test this.
Tied input/output embeddings (forward reference)¶
Phase 17 will tie the input embedding matrix with the LM-head output projection. When tied:
- The same parameter \(E\) is used for both the input lookup (\(E[ids]\)) and the output unembed (\(h \cdot E^\top\)).
- Gradients from both contributions sum into \(\partial \mathcal{L} / \partial E\).
This is fine and correct under autograd, but it requires that your tensor library treats "same parameter reused" the standard way (gradient is the sum of all contributions). Lab 00's gradient test will exercise the single-use case; Phase 17 will rely on the same property for tied embeddings.
A common confusion: embedding vs one-hot encoding for labels¶
The lookup-is-matmul identity applies to input embeddings. For targets (the labels in cross-entropy loss), we sometimes use one-hot:
Or equivalently, just \(-\log q_{y^*}\) where \(y^*\) is the integer index. Same math, integer-form is cheaper. Phase 05's cross_entropy_from_logits(z, y_star) uses the integer form.
So: input tokens → integer indices into \(E\) (lookup). Target tokens → integer indices into the loss (no one-hot ever materialised). Symmetric and efficient.
What this file does NOT cover¶
- Positional embeddings. Phase 16. Token embeddings here; positional info added later.
- Multi-token (subword) embeddings. Phase 11 / 12 handled tokenization; here every token is a single integer.
- The training objective. Next file (CBOW).
Next: 02-cbow-skipgram.md