Lab 01 — Train CBOW embeddings on the verb-grammar corpus

Read theory/02-cbow-skipgram.md. Do not consult solutions/.

Objective

Train tiny CBOW embeddings on the Phase 12 verb-grammar corpus. With \(d = 32\) and \(k = 2\) (two tokens on each side, four context tokens in total), 20 epochs should bring the loss from about \(\log V\) down to about 2.0, and the resulting embedding matrix should encode enough structure for Lab 02's visualization to show tense / verb / language clustering.

Setup

  • src/minimodel/embedding.py from Lab 00.
  • The Phase 12 corpus and tokenizer. Encode the corpus into a flat array of token ids.
  • A new training script: scripts/phase13_train_cbow.py.

Tasks

Task 1 — build the CBOW dataset

From the encoded corpus, produce (context, center) pairs:

import numpy as np
from numpy.typing import NDArray

def make_cbow_pairs(tokens: NDArray[np.int64], window: int = 2) -> tuple[NDArray, NDArray]:
    """
    For each position t in tokens, take the `window` tokens on each side of t
    (2*window in total) as context and the token at t as the center. Skip
    positions where the full context window doesn't fit, i.e. the first and
    last `window` positions of the corpus.

    Returns:
      contexts: (N, 2*window) int64
      centers:  (N,)           int64
    """

Pad-or-skip choice: skip is cleaner for our corpus (we don't need every token to be a center).

For a corpus of \(L\) tokens with window 2: \(N = L - 4\) pairs.
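
To make the expected shapes concrete, here is a minimal sketch on a ten-token toy array. It uses a plain Python loop, and the helper name toy_cbow_pairs is hypothetical; treat it as one way to cross-check your own make_cbow_pairs, not as the reference solution.

```python
import numpy as np

def toy_cbow_pairs(tokens, window=2):
    """Loop-based pair builder: skip positions whose full window does not fit."""
    contexts, centers = [], []
    for t in range(window, len(tokens) - window):
        left = tokens[t - window:t]              # `window` tokens before t
        right = tokens[t + 1:t + window + 1]     # `window` tokens after t
        contexts.append(np.concatenate([left, right]))
        centers.append(tokens[t])
    return (np.array(contexts, dtype=np.int64),
            np.array(centers, dtype=np.int64))

tokens = np.arange(10, dtype=np.int64)           # stand-in for the encoded corpus
contexts, centers = toy_cbow_pairs(tokens, window=2)
print(contexts.shape, centers.shape)             # (6, 4) (6,)
print(contexts[0], centers[0])                   # [0 1 3 4] 2
```

With \(L = 10\) and window 2 this gives \(N = 10 - 4 = 6\) pairs, matching the formula above.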

Task 2 — model

class CBOWModel:
    def __init__(self, vocab_size: int, embedding_dim: int):
        self.embed = Embedding(vocab_size, embedding_dim)
        # Output projection: separate matrix W_out, not tied to E (Word2Vec convention).
        self.W_out = Parameter(np.random.randn(vocab_size, embedding_dim) * 0.02)
        self.b_out = Parameter(np.zeros(vocab_size))

    def __call__(self, contexts: NDArray[np.int64]) -> Tensor:
        """contexts: (B, 2*window) → logits: (B, vocab_size)"""
        h = self.embed(contexts).mean(axis=1)         # (B, d)
        logits = h @ self.W_out.T + self.b_out        # (B, vocab_size)
        return logits

Note: the output matrix W_out is separate from the input embedding E. Word2Vec doesn't tie them. (Phase 17 transformers do tie input embedding to LM head, but that's different.)

Task 3 — training loop

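Use Phase 05's cross_entropy_from_logits. If Phase 18's fused softmax-plus-cross-entropy is already available you can use that instead; otherwise write the loss inline. The loop below calls a batched variant, cross_entropy_from_logits_batch, and averages its per-example losses.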
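
If you do write the loss inline, the forward computation might look like the sketch below. It works on plain NumPy arrays only (no autograd), and the name cross_entropy_from_logits_np is hypothetical; the Phase 05 version operating on Tensor objects will differ. It also shows why the untrained loss sits near \(\log V\): uniform logits give exactly that value.

```python
import numpy as np

def cross_entropy_from_logits_np(logits, targets):
    """Per-example cross-entropy in nats, computed with a stable log-sum-exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # avoid overflow in exp
    log_z = np.log(np.exp(shifted).sum(axis=1))            # log partition, shape (B,)
    correct = shifted[np.arange(len(targets)), targets]    # logit of the true class
    return log_z - correct                                 # shape (B,)

V = 64
logits = np.zeros((8, V))            # all-equal logits: a uniform prediction
targets = np.arange(8)
loss = cross_entropy_from_logits_np(logits, targets).mean()
print(round(float(loss), 4), round(float(np.log(V)), 4))   # 4.1589 4.1589
```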

def train(model, contexts, centers, epochs=20, batch_size=64, lr=0.01, momentum=0.9):
    """SGD with momentum. Log per-epoch loss. Return per-epoch loss array."""
    n = len(contexts)
    losses = []
    for epoch in range(epochs):
        perm = np.random.permutation(n)
        epoch_loss = 0.0
        for batch_start in range(0, n, batch_size):
            idx = perm[batch_start:batch_start + batch_size]
            logits = model(contexts[idx])
            loss = cross_entropy_from_logits_batch(logits, centers[idx]).mean()
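            # Assumes sgd_step resets gradients after each update; if yours
            # does not, zero them before calling backward().
            loss.backward()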
            sgd_step(model.parameters(), lr=lr, momentum=momentum)
            epoch_loss += float(loss.value) * len(idx)
        epoch_loss /= n
        losses.append(epoch_loss)
        print(f"epoch {epoch:3d}: loss = {epoch_loss:.4f}")
    return losses

Constraints:

  • Seed everything via src/utils/seeding.py.
  • Write a run manifest recording hyperparams, corpus path, vocab size, embedding dim, and final loss. Save it to experiments/<date>-phase-13-cbow/manifest.json.
  • Save the trained Embedding (via Embedding.save) at the end.
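
One possible shape for the manifest step, as a sketch using only the standard library. The function name write_manifest, the key names, and the placeholder corpus path are assumptions for illustration; the demo writes into a throwaway directory rather than the real experiments/<date>-phase-13-cbow/ folder.

```python
import json
import tempfile
from pathlib import Path

def write_manifest(run_dir, hyperparams, corpus_path, vocab_size,
                   embedding_dim, final_loss):
    """Write manifest.json into run_dir and return its path."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "hyperparams": hyperparams,
        "corpus_path": str(corpus_path),
        "vocab_size": vocab_size,
        "embedding_dim": embedding_dim,
        "final_loss": final_loss,
    }
    path = run_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return path

# Demo with illustrative values; "path/to/corpus.txt" is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    path = write_manifest(
        tmp,
        {"epochs": 20, "batch_size": 64, "lr": 0.01, "momentum": 0.9},
        "path/to/corpus.txt", 64, 32, 2.07,
    )
    loaded = json.loads(path.read_text())
```

Sorting the keys keeps the file diff-friendly when you compare manifests across runs.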

Task 4 — sanity tests on the trained embeddings

Before the visualization lab, verify training did something:

  1. Loss curve sanity. Plot loss per epoch. It should trend downward; small epoch-to-epoch bumps from minibatch noise are fine, a sustained rise is not. Save as experiments/<date>-phase-13-cbow/loss_curve.png.
  2. Frequent vs rare token norms. Compute \(\|E[i]\|\) for every token; sort by frequency. The most-frequent tokens should have larger norm than the least-frequent ones. (This is the norm-frequency correlation we'll mitigate with cosine similarity in Lab 02.)
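  3. Top-5 cosine-nearest neighbors of work. These should include verbs that appear in similar contexts (walk, talk) and should not be dominated by punctuation. If punctuation dominates, increase epochs or check tokenization. Compute the Euclidean top-5 as well, since Task 5's summary records both.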
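
Checks 2 and 3 can be sketched on a raw embedding matrix as follows. This assumes you can get the trained embeddings as a NumPy array E of shape (V, d) with no all-zero rows; the helper names are hypothetical, and mapping ids back to token strings is left to your tokenizer.

```python
import numpy as np

def top_k_neighbors(E, query_id, k=5, metric="cosine"):
    """Ids of the k rows of E nearest to row query_id, excluding the query."""
    if metric == "cosine":
        unit = E / np.linalg.norm(E, axis=1, keepdims=True)  # assumes no zero row
        score = unit @ unit[query_id]                        # higher means closer
    elif metric == "euclidean":
        score = -np.linalg.norm(E - E[query_id], axis=1)     # negated distance
    else:
        raise ValueError(f"unknown metric: {metric}")
    score[query_id] = -np.inf                                # drop the query itself
    return np.argsort(-score, kind="stable")[:k]

def norms_by_frequency(E, tokens):
    """Row norms of E, ordered from most frequent to least frequent token id."""
    counts = np.bincount(tokens, minlength=E.shape[0])
    order = np.argsort(-counts, kind="stable")
    return np.linalg.norm(E, axis=1)[order]

# Toy matrix: row 1 points the same way as row 0 but has ten times the norm.
E = np.array([[1.0, 0.0], [10.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
print(top_k_neighbors(E, 0, k=2, metric="cosine"))     # [1 2]
print(top_k_neighbors(E, 0, k=2, metric="euclidean"))  # [2 3]
```

In the toy matrix, cosine ranks row 1 first because it ignores norm, while Euclidean distance pushes it to last place. The same effect is why the two neighbor lists can disagree on trained embeddings.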

Task 5 — record the headline metrics

Save to experiments/<date>-phase-13-cbow/summary.json:

{
  "epochs": 20,
  "embedding_dim": 32,
  "window": 2,
  "vocab_size": 64,
  "corpus_size_tokens": 2400,
  "final_loss": 2.07,
  "initial_loss": 4.16,
  "loss_drop": 2.09,
  "top5_nearest_work_cosine": ["walk", "talk", "study", "play", "watch"],
  "top5_nearest_work_euclidean": [".", ",", "the", "a", "is"]
}

(These numbers are illustrative; your actual results will vary.)

Acceptance

  • CBOW dataset built: \(N\) pairs of shape \((2k,)\) context + scalar center.
  • CBOW model trains for 20 epochs without error.
  • Loss drops from ~\(\log V \approx 4.16\) (for \(V = 64\)) to under 2.5.
  • Loss curve plot saved.
  • Top-5 cosine nearest neighbors of work are mostly verbs (not punctuation).
  • Trained embedding saved via Embedding.save.
  • Manifest committed.

Pitfalls to expect

  • Forgetting mean(axis=1) over contexts. If you sum without averaging, the magnitude of h grows with the window size and the loss becomes unstable.
  • W_out tied to E by accident. If you write self.W_out = self.embed.E, you've tied them. CBOW conventionally doesn't tie; tying changes the loss surface. Stay untied for this lab.
  • Learning rate too high. With \(V = 64, d = 32\), the loss surface is forgiving, but lr=1.0 can still diverge in the first epoch. Start at lr=0.01; tune if needed.
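  • Not skipping the edge positions. With window = 2 you must skip the first 2 and last 2 positions of the corpus. If you don't, indexing past the end raises an IndexError (or a slice silently returns a short context), and a negative index at the start silently wraps around to the end of the corpus.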
  • Logging confusion: loss in nats vs bits. We use nats throughout (per Phase 05). Don't sneak np.log2 in here.
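
As a quick diagnostic for the nats-versus-bits pitfall, the conversion is a single division by \(\ln 2\). Worked with \(V = 64\) from this lab:

\[
\text{loss}_{\text{bits}} = \frac{\text{loss}_{\text{nats}}}{\ln 2},
\qquad
\ln 64 \approx 4.16\ \text{nats} \;\Longleftrightarrow\; \log_2 64 = 6\ \text{bits}.
\]

So an initial loss near 6 rather than 4.16 suggests a base-2 logarithm has slipped in.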

Next: 02-visualize-and-probe.md