Lab 00: Implement the Embedding module
Read theory/00-motivation.md and theory/01-embedding-as-lookup.md. Do not consult solutions/.
Objective
Implement an Embedding module backed by the Phase 8 autograd tensor. Lookup must be cheap (no one-hot materialisation), the backward pass must correctly handle duplicate-id batches via np.add.at, and the module must support save / load of trained embeddings.
Setup
Create a new file src/minimodel/embedding.py. Tests go in tests/test_phase13_embedding.py. The module uses your Phase 7/8 Parameter and Tensor types.
Tasks
Task 1: Embedding class
import pathlib

import numpy as np
from numpy.typing import NDArray

# Parameter and Tensor come from your Phase 7/8 autograd code.

class Embedding:
    """Look up dense vectors by integer id. Backed by a (V, d) parameter matrix."""

    def __init__(self, num_embeddings: int, embedding_dim: int, init_scale: float = 0.02):
        self.num_embeddings = num_embeddings
        self.embedding_dim = embedding_dim
        self.E = Parameter(np.random.randn(num_embeddings, embedding_dim) * init_scale)

    def __call__(self, ids: NDArray[np.int64]) -> Tensor:
        """ids: (B,) or (B, T) int → (B, d) or (B, T, d) tensor with grad."""
        ...

    def save(self, path: pathlib.Path) -> None:
        """Save the embedding matrix as a .npy file plus a small JSON manifest."""
        ...

    @classmethod
    def load(cls, path: pathlib.Path) -> "Embedding":
        """Load from save()."""
        ...
Constraints:
- Pure NumPy + your autograd. No PyTorch.
- init_scale = 0.02 matches GPT-2's embedding init.
- Validate ids: an integer array of any dimensionality with values in [0, num_embeddings). Out-of-range ids should raise IndexError.
- save / load must round-trip exactly (bit-equal) and preserve the autograd Parameter wrapping.
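The validation constraint is worth spelling out, because plain NumPy indexing accepts negative ids and wraps them around instead of raising. The following is a minimal sketch, not the reference solution: `validate_ids` is a hypothetical helper name, the `TypeError` for non-integer input is an assumption beyond what the lab requires, and `table` is a bare array standing in for `E.value`.

```python
import numpy as np
from numpy.typing import NDArray


def validate_ids(ids: NDArray[np.int64], num_embeddings: int) -> None:
    """Raise unless ids is an integer array with values in [0, num_embeddings)."""
    if not np.issubdtype(ids.dtype, np.integer):
        # Assumption: rejecting float ids with TypeError is a design choice,
        # the lab text only mandates IndexError for out-of-range values.
        raise TypeError(f"ids must be an integer array, got dtype {ids.dtype}")
    # Check bounds explicitly: table[-1] would silently return the last row.
    if ids.size and (ids.min() < 0 or ids.max() >= num_embeddings):
        raise IndexError(f"ids must lie in [0, {num_embeddings})")


table = np.arange(12, dtype=np.float64).reshape(4, 3)  # stand-in for E.value, V=4, d=3
ids = np.array([[0, 3], [3, 1]])                       # shape (B, T) = (2, 2)
validate_ids(ids, num_embeddings=4)
rows = table[ids]                                      # row gather, shape (2, 2, 3)
```

Fancy indexing with an id array of shape `ids.shape` returns an array of shape `ids.shape + (d,)`, which is why the same lookup handles `(B,)` and `(B, T)` inputs without a one-hot matrix.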
Task 2: gradient wiring
The autograd's gather op must correctly produce a gradient with shape (num_embeddings, embedding_dim) from an upstream gradient with shape ids.shape + (embedding_dim,). Critical: use np.add.at, not +=, for the scatter-back.
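import numpy as np
from numpy.typing import NDArray

def gather_backward(upstream_grad: NDArray, ids: NDArray, shape: tuple) -> NDArray:
    # shape is (num_embeddings, embedding_dim); rows never looked up stay zero.
    grad_E = np.zeros(shape, dtype=upstream_grad.dtype)
    np.add.at(grad_E, ids, upstream_grad)  # CRITICAL: not grad_E[ids] += upstream_grad
    return grad_E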
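Why np.add.at? Consider ids = [3, 3] (the same id twice) with upstream [[1, 2], [4, 5]]. We need row 3 of grad_E to be [1+4, 2+5] = [5, 7]. The naive grad_E[ids] += upstream_grad expands to grad_E[ids] = grad_E[ids] + upstream_grad: NumPy reads row 3 into a temporary once per occurrence (both copies still zero), adds, and writes the results back, so [1, 2] is written and then overwritten by [4, 5]. np.add.at performs unbuffered in-place addition, applying every occurrence of an index, and gives the correct [5, 7].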
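The difference is easy to reproduce on bare arrays. This short sketch uses the same ids and upstream values as the example above; the array names `buffered` and `unbuffered` are illustrative only.

```python
import numpy as np

ids = np.array([3, 3])
upstream = np.array([[1.0, 2.0], [4.0, 5.0]])

buffered = np.zeros((5, 2))
buffered[ids] += upstream             # read-modify-write through a temporary

unbuffered = np.zeros((5, 2))
np.add.at(unbuffered, ids, upstream)  # accumulates once per occurrence of each id

print(buffered[3])    # [4. 5.]  (second write overwrote the first)
print(unbuffered[3])  # [5. 7.]  (both contributions summed)
```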
Task 3: property tests
In tests/test_phase13_embedding.py:
1. Shape-check. Embedding(64, 32)(np.array([1, 2, 3])) returns a Tensor of shape (3, 32).
2. Lookup correctness. Embedding(V, d)(np.array([i])) equals E.value[i:i+1].
3. Out-of-range raises. Embedding(64, 32)(np.array([100])) raises IndexError.
4. Gradient flows. Build a tiny graph: y = Embedding(64, 32)(ids).sum(); backprop; assert that E.grad is nonzero exactly at the rows in ids.
5. Duplicate-id gradient (the critical test). With ids = [3, 3] and upstream grad_y = ones((2, d)), assert E.grad[3] is [2, 2, ..., 2] (sum), not [1, 1, ..., 1] (last-write-wins).
6. Save / load round-trip. Train embeddings briefly, save, load into a fresh Embedding, assert bit-equal to the original.
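As a rough guide to what the gradient tests assert, here is a sketch of the "gradient flows" and "duplicate-id" checks written against the standalone gather_backward from Task 2 rather than the full module. It is an illustration under that assumption: in the real test file the gradient would come from `Embedding(...)(ids).sum()` followed by your autograd's backward call, and the test function names here are hypothetical.

```python
import numpy as np
from numpy.typing import NDArray


def gather_backward(upstream_grad: NDArray, ids: NDArray, shape: tuple) -> NDArray:
    grad_E = np.zeros(shape, dtype=upstream_grad.dtype)
    np.add.at(grad_E, ids, upstream_grad)
    return grad_E


def test_gradient_touches_only_looked_up_rows() -> None:
    V, d = 64, 32
    ids = np.array([[1, 5], [9, 5]])       # (B, T) batch containing a repeated id
    upstream = np.ones(ids.shape + (d,))   # gradient of .sum() w.r.t. the lookup output
    grad = gather_backward(upstream, ids, (V, d))
    touched = np.flatnonzero(np.any(grad != 0, axis=1))
    assert touched.tolist() == [1, 5, 9]   # nonzero exactly at the rows in ids


def test_duplicate_ids_accumulate() -> None:
    d = 4
    ids = np.array([3, 3])
    grad = gather_backward(np.ones((2, d)), ids, (8, d))
    assert np.array_equal(grad[3], np.full(d, 2.0))  # sum, not last-write-wins


test_gradient_touches_only_looked_up_rows()
test_duplicate_ids_accumulate()
```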
Task 4: benchmark
Time the lookup against the naive approach of materialising a one-hot matrix and multiplying it by the embedding matrix:
import time

import numpy as np

ids = np.random.randint(0, 64, size=8192)
E = Embedding(64, 32)

t0 = time.perf_counter()
for _ in range(1000):
    out = E(ids)
t1 = time.perf_counter()

# Compare to materialised one-hot:
def slow_embed(ids, E_value):
    one_hot = np.zeros((len(ids), E_value.shape[0]), dtype=np.float64)
    one_hot[np.arange(len(ids)), ids] = 1
    return one_hot @ E_value

t2 = time.perf_counter()
for _ in range(1000):
    out_slow = slow_embed(ids, E.E.value)
t3 = time.perf_counter()
Expected: (t1 - t0) is ~100-1000× smaller than (t3 - t2). Save to experiments/<date>-phase-13-embedding/lookup_timing.csv.
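Before trusting the timings, it may help to confirm that both paths compute the same values, and to settle on a CSV layout. The sketch below assumes a bare array `table` in place of `E.E.value`; `timing_csv` and its column names are hypothetical, and writing the returned text to the experiments directory is left to you.

```python
import csv
import io

import numpy as np

rng = np.random.default_rng(0)
table = rng.standard_normal((64, 32))   # stand-in for E.E.value
ids = rng.integers(0, 64, size=8192)

fast = table[ids]                       # direct row gather

one_hot = np.zeros((len(ids), table.shape[0]), dtype=np.float64)
one_hot[np.arange(len(ids)), ids] = 1
slow = one_hot @ table                  # materialised one-hot baseline

same = np.allclose(fast, slow)          # both paths should agree


def timing_csv(lookup_s: float, one_hot_s: float) -> str:
    """Format one timing row as CSV text (hypothetical column names)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["lookup_s", "one_hot_s", "speedup"])
    writer.writerow([lookup_s, one_hot_s, one_hot_s / lookup_s])
    return buf.getvalue()
```

With the variables from the benchmark above, the call would be `timing_csv(t1 - t0, t3 - t2)`.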
Measurements to capture
- All 6 property tests passing.
- Lookup timing vs one-hot timing.
- np.add.at correctness test result.
Acceptance
- src/minimodel/embedding.py exists; mypy --strict clean.
- All 6 property tests pass.
- Duplicate-id gradient test passes (most subtle bug).
- Lookup is at least 100× faster than the one-hot baseline.
- Save / load round-trip is bit-exact.
Pitfalls to expect
- grad_E[ids] += g is silently wrong for duplicate ids. This is the canonical bug. The test in Task 3.5 will catch it; do not skip that test.
- Float dtype mismatch. If your autograd uses float32 but the embedding init produces float64, gradient accumulation may downcast and lose precision. Pick one dtype (probably float64 for the curriculum, float32 in real systems) and be consistent.
- load() re-randomises. Easy mistake: Embedding(num, dim) re-initialises in __init__, then load is supposed to overwrite self.E.value. If you forget the overwrite, you get a fresh random matrix instead of the saved one.
- JSON manifest vs binary file. Save the float matrix as .npy (binary, exact) and metadata (shape, dtype, version) as .json (text). Don't try to serialise the matrix to JSON: text round-tripping can lose precision if floats are formatted with too few digits, and the file is much larger and slower to parse.
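The last two pitfalls can be illustrated on a bare matrix. This is a sketch under stated assumptions, not the reference layout: it treats the save path as a directory, and the file names `weights.npy` and `manifest.json`, the manifest keys, and the helper names are all hypothetical. In the real `load()`, the returned array would be assigned into `self.E.value` after constructing the module, so the random init is overwritten.

```python
import json
import pathlib
import tempfile

import numpy as np
from numpy.typing import NDArray


def save_matrix(matrix: NDArray, directory: pathlib.Path) -> None:
    """Write weights as binary .npy and metadata as a JSON manifest."""
    directory.mkdir(parents=True, exist_ok=True)
    np.save(directory / "weights.npy", matrix)
    manifest = {"shape": list(matrix.shape), "dtype": str(matrix.dtype), "version": 1}
    (directory / "manifest.json").write_text(json.dumps(manifest))


def load_matrix(directory: pathlib.Path) -> NDArray:
    """Read the matrix back and cross-check it against the manifest."""
    manifest = json.loads((directory / "manifest.json").read_text())
    matrix = np.load(directory / "weights.npy")
    if list(matrix.shape) != manifest["shape"] or str(matrix.dtype) != manifest["dtype"]:
        raise ValueError("weights.npy does not match manifest.json")
    return matrix


with tempfile.TemporaryDirectory() as tmp:
    original = np.random.randn(64, 32) * 0.02
    save_matrix(original, pathlib.Path(tmp))
    restored = load_matrix(pathlib.Path(tmp))
    # np.array_equal compares element values exactly, with no tolerance.
    round_trip_ok = np.array_equal(original, restored) and original.dtype == restored.dtype
```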
Next: 01-train-cbow.md