
Lab 00 — Implement the `Embedding` module¶

Read theory/00-motivation.md and theory/01-embedding-as-lookup.md. Do not consult solutions/.

Objective¶

Implement an Embedding module backed by the Phase 8 autograd tensor. Lookup must be cheap (no one-hot materialisation), the backward pass must correctly handle duplicate-id batches via np.add.at, and the module must support save / load of trained embeddings.

Setup¶

Create a new file src/minimodel/embedding.py, with tests in tests/test_phase13_embedding.py. The module uses your Phase 7/8 Parameter and Tensor types.

Tasks¶

Task 1 — `Embedding` class¶

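import pathlib

import numpy as np
from numpy.typing import NDArray

# Parameter and Tensor come from your own Phase 7/8 autograd code.

class Embedding: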
    """Look up dense vectors by integer id. Backed by a (V, d) parameter matrix."""

    def __init__(self, num_embeddings: int, embedding_dim: int, init_scale: float = 0.02):
        self.num_embeddings = num_embeddings
        self.embedding_dim = embedding_dim
        self.E = Parameter(np.random.randn(num_embeddings, embedding_dim) * init_scale)

    def __call__(self, ids: NDArray[np.int64]) -> Tensor:
        """ids: (B,) or (B, T) int → (B, d) or (B, T, d) tensor with grad."""
        ...

    def save(self, path: pathlib.Path) -> None:
        """Save the embedding matrix as a .npy file plus a small JSON manifest."""

    @classmethod
    def load(cls, path: pathlib.Path) -> "Embedding":
        """Load from save()."""

Constraints:

Pure NumPy + your autograd. No PyTorch.
init_scale = 0.02 matches GPT-2's embedding init.
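Validate ids: an integer array of any dimensionality with values in [0, num_embeddings). Out-of-range ids should raise IndexError; this includes negative ids, which NumPy indexing would otherwise wrap around silently.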
save / load must round-trip exactly (bit-equal) and preserve the autograd Parameter wrapping.
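The validation and lookup constraints above can be sketched in plain NumPy, leaving the autograd wrapping aside. The helper names `check_ids` and `lookup_rows` are hypothetical and not part of the lab's required API, and raising TypeError for a non-integer dtype is an assumption (the lab only specifies IndexError for out-of-range values).

```python
import numpy as np
from numpy.typing import NDArray


def check_ids(ids: NDArray[np.int64], num_embeddings: int) -> None:
    """Raise unless ids is an integer array with every value in [0, num_embeddings)."""
    if not np.issubdtype(ids.dtype, np.integer):
        # Assumption: the lab does not say which error a float array should raise.
        raise TypeError(f"ids must be an integer array, got dtype {ids.dtype}")
    # Check negatives explicitly: plain NumPy indexing would wrap them around.
    if ids.size and (ids.min() < 0 or ids.max() >= num_embeddings):
        raise IndexError(f"ids must lie in [0, {num_embeddings})")


def lookup_rows(E_value: NDArray[np.float64], ids: NDArray[np.int64]) -> NDArray[np.float64]:
    """Gather rows of E_value by id; the output shape is ids.shape + (d,)."""
    check_ids(ids, E_value.shape[0])
    return E_value[ids]  # integer-array indexing, no one-hot matrix is built


E_value = np.arange(12, dtype=np.float64).reshape(4, 3)  # V=4, d=3
out = lookup_rows(E_value, np.array([[0, 3], [3, 1]]))   # (B=2, T=2) ids
print(out.shape)  # (2, 2, 3)
```

In the real module the same indexing would sit inside the autograd gather op, so that the result is a Tensor that remembers `ids` for the backward pass.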

Task 2 — gradient wiring¶

The autograd's gather op must correctly produce a gradient with shape (num_embeddings, embedding_dim) from an upstream gradient with shape ids.shape + (embedding_dim,). Critical: use np.add.at, not +=, for the scatter-back.

def gather_backward(upstream_grad: NDArray, ids: NDArray, shape: tuple) -> NDArray:
    grad_E = np.zeros(shape, dtype=upstream_grad.dtype)
    np.add.at(grad_E, ids, upstream_grad)   # CRITICAL: not grad_E[ids] += upstream_grad
    return grad_E

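Why np.add.at? Consider ids = [3, 3] (the same id twice) with upstream [[1, 2], [4, 5]]. Row 3 of grad_E must end up as [1+4, 2+5] = [5, 7]. The naive grad_E[ids] += upstream_grad is buffered: NumPy reads grad_E[ids] once, adds upstream_grad to that copy, then writes both result rows back to row 3, so one write clobbers the other and a contribution is lost (in practice you get [4, 5]). np.add.at performs unbuffered in-place addition, accumulates every occurrence, and gives the correct [5, 7].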
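The ids = [3, 3] case can be reproduced directly. This is a small standalone demonstration, with the table size (5 rows) chosen arbitrarily for illustration.

```python
import numpy as np

ids = np.array([3, 3])                         # the same row requested twice
upstream = np.array([[1.0, 2.0], [4.0, 5.0]])

# Buggy version: the right-hand side is computed once from the old rows,
# then both results are written to row 3, so one contribution is lost.
buggy = np.zeros((5, 2))
buggy[ids] += upstream

# Correct version: unbuffered, so each occurrence of row 3 is accumulated.
correct = np.zeros((5, 2))
np.add.at(correct, ids, upstream)

print(buggy[3])    # only one of the two contributions survives
print(correct[3])  # [5. 7.]
```

NumPy does not promise which write survives in the buggy version, so a test should assert the summed result rather than a specific wrong value.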

Task 3 — property tests¶

In tests/test_phase13_embedding.py:

1. Shape-check. Embedding(64, 32)(np.array([1, 2, 3])) returns a Tensor of shape (3, 32).
2. Lookup correctness. Embedding(V, d)(np.array([i])) equals E.value[i:i+1].
3. Out-of-range raises. Embedding(64, 32)(np.array([100])) raises IndexError.
4. Gradient flows. Build a tiny graph: y = Embedding(64, 32)(ids).sum(); backprop; assert that E.grad is nonzero exactly at the rows in ids.
5. Duplicate-id gradient (the critical test). With ids = [3, 3] and upstream grad_y = ones((2, d)), assert E.grad[3] is [2, 2, ..., 2] (sum), not [1, 1, ..., 1] (last-write-wins).
6. Save / load round-trip. Train embeddings briefly, save, load into a fresh Embedding, assert bit-equal to the original.
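The gradient-flow and duplicate-id tests share one underlying property: for y = lookup(ids).sum(), row i of the gradient equals the number of times i appears in ids. The sketch below checks that against the Task 2 `gather_backward` using plain NumPy; `check_sum_gradient` is a hypothetical helper, and your actual tests would go through your Tensor's backward pass instead.

```python
import numpy as np
from numpy.typing import NDArray


def gather_backward(upstream_grad: NDArray, ids: NDArray, shape: tuple) -> NDArray:
    """Scatter-add upstream gradients back into a (V, d) gradient matrix."""
    grad_E = np.zeros(shape, dtype=upstream_grad.dtype)
    np.add.at(grad_E, ids, upstream_grad)
    return grad_E


def check_sum_gradient(V: int, d: int, ids: NDArray) -> NDArray:
    """With y = lookup(ids).sum(), every gathered element has upstream gradient 1."""
    upstream = np.ones(ids.shape + (d,))
    grad_E = gather_backward(upstream, ids, (V, d))
    counts = np.bincount(ids.ravel(), minlength=V)  # occurrences of each id
    # Row i must equal counts[i] in every column, so unused rows stay zero.
    assert np.array_equal(grad_E, np.repeat(counts[:, None], d, axis=1))
    return grad_E


grad = check_sum_gradient(V=8, d=4, ids=np.array([[3, 3, 1], [0, 3, 1]]))
print(grad[:, 0])  # [1. 2. 0. 3. 0. 0. 0. 0.]
```

Using a (B, T) id array here also exercises the multi-dimensional case, where the upstream gradient has shape ids.shape + (d,).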

Task 4 — benchmark¶

Time the lookup against the naive one-hot @ matmul:

import time

import numpy as np

ids = np.random.randint(0, 64, size=8192)
E = Embedding(64, 32)

t0 = time.perf_counter()
for _ in range(1000):
    out = E(ids)
t1 = time.perf_counter()

# Compare to materialised one-hot:
def slow_embed(ids, E_value):
    one_hot = np.zeros((len(ids), E_value.shape[0]), dtype=np.float64)
    one_hot[np.arange(len(ids)), ids] = 1
    return one_hot @ E_value

t2 = time.perf_counter()
for _ in range(1000):
    out_slow = slow_embed(ids, E.E.value)
t3 = time.perf_counter()

Expected: (t1 - t0) is roughly 100 to 1000× smaller than (t3 - t2). Save the timings to experiments/<date>-phase-13-embedding/lookup_timing.csv.
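One possible way to record the two timings is sketched below. The lab does not prescribe a CSV layout, so the column names and the `write_timings` helper are assumptions, and the numbers passed in are dummy placeholders rather than measurements.

```python
import csv
import pathlib
import tempfile


def write_timings(path: pathlib.Path, lookup_s: float, one_hot_s: float, iterations: int) -> float:
    """Write one row per method and return the one-hot / lookup time ratio."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["method", "iterations", "total_seconds"])  # assumed column names
        writer.writerow(["lookup", iterations, f"{lookup_s:.6f}"])
        writer.writerow(["one_hot_matmul", iterations, f"{one_hot_s:.6f}"])
    return one_hot_s / lookup_s


# Dummy values for illustration only; pass (t1 - t0) and (t3 - t2) from the benchmark.
out_dir = pathlib.Path(tempfile.mkdtemp())
speedup = write_timings(out_dir / "lookup_timing.csv", lookup_s=1.0, one_hot_s=4.0, iterations=1000)
print(f"speedup: {speedup:.0f}x")  # speedup: 4x
```

In the lab you would point `path` at the experiments directory for your run instead of a temporary directory.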

Measurements to capture¶

All 6 property tests passing.
Lookup timing vs one-hot timing.
np.add.at correctness test result.

Acceptance¶

src/minimodel/embedding.py exists; mypy --strict clean.
All 6 property tests pass.
Duplicate-id gradient test passes (most subtle bug).
Lookup is at least 100× faster than the one-hot baseline.
Save / load round-trip is bit-exact.

Pitfalls to expect¶

grad_E[ids] += g is silently wrong for duplicate ids. This is the canonical bug. The duplicate-id test (test 5 in Task 3) will catch it; do not skip that test.
Float dtype mismatch. If your autograd uses float32 but the embedding init produces float64, gradient accumulation may downcast and lose precision. Pick one dtype (probably float64 for the curriculum, float32 in real systems) and be consistent.
load() re-randomises. Easy mistake: Embedding(num, dim) re-initialises in __init__, then load is supposed to overwrite self.E.value. If you forget the overwrite, you get a fresh random matrix instead of the saved one.
JSON manifest vs binary file. Save the float matrix as .npy (binary, exact) and metadata (shape, dtype, version) as .json (text). Don't serialise the matrix itself to JSON: the file balloons, and bit-exactness then depends on every float surviving a float-to-string round trip.
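The .npy plus JSON manifest split described above can be sketched as follows, operating on a raw NumPy matrix rather than on the Embedding class. The function names, the manifest field names, and the choice to derive both file names from one base path are assumptions for illustration.

```python
import json
import pathlib
import tempfile

import numpy as np
from numpy.typing import NDArray


def save_matrix(E_value: NDArray, path: pathlib.Path) -> None:
    """Write <path>.npy (exact binary) and <path>.json (small text manifest)."""
    np.save(path.with_suffix(".npy"), E_value)
    manifest = {"version": 1, "shape": list(E_value.shape), "dtype": str(E_value.dtype)}
    path.with_suffix(".json").write_text(json.dumps(manifest))


def load_matrix(path: pathlib.Path) -> NDArray:
    """Read the pair back and check the matrix against its manifest."""
    manifest = json.loads(path.with_suffix(".json").read_text())
    E_value = np.load(path.with_suffix(".npy"))
    if list(E_value.shape) != manifest["shape"] or str(E_value.dtype) != manifest["dtype"]:
        raise ValueError("embedding file does not match its manifest")
    return E_value


original = np.random.default_rng(0).standard_normal((64, 32)) * 0.02
base = pathlib.Path(tempfile.mkdtemp()) / "embedding"
save_matrix(original, base)
restored = load_matrix(base)
print(restored.tobytes() == original.tobytes())  # True
```

Inside Embedding.load, the restored matrix would then be assigned over the randomly initialised parameter value, which is the overwrite step the load() pitfall warns about.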

Next: 01-train-cbow.md
