Lab 01 — Train CBOW embeddings on the verb-grammar corpus
Read theory/02-cbow-skipgram.md. Do not consult solutions/.
Objective
Train tiny CBOW embeddings on the Phase 12 verb-grammar corpus. After 20 epochs with \(d = 32, k = 2\) (window 4 total), the loss should drop from ~\(\log V\) to ~2.0 and the resulting embedding matrix should encode enough structure for Lab 02's visualization to show tense / verb / language clustering.
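The ~\(\log V\) starting point follows from an untrained model spreading probability roughly uniformly over the vocabulary. A minimal NumPy check, assuming \(V = 64\) (the illustrative vocabulary size used in Task 5) and all-zero logits as a stand-in for a near-uniform initial model:

```python
import numpy as np

V = 64  # assumed vocabulary size (illustrative, taken from the Task 5 example)

# All-zero logits give a uniform softmax: every token has probability 1/V.
logits = np.zeros(V)
log_probs = logits - np.log(np.sum(np.exp(logits)))

# Cross-entropy in nats for any target token under the uniform distribution.
target = 0
initial_loss = -log_probs[target]

print(f"uniform-model loss: {initial_loss:.4f} nats")
print(f"log V:              {np.log(V):.4f} nats")
```

If your first-epoch loss starts far above this value, the initialization scale or the loss computation is a likely place to look.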
Setup
- src/minimodel/embedding.py from Lab 00.
- The Phase 12 corpus and tokenizer. Encode the corpus into a flat array of token ids.
- A new training script: scripts/phase13_train_cbow.py.
Tasks
Task 1 — build the CBOW dataset
From the encoded corpus, produce (context, center) pairs:
import numpy as np
from numpy.typing import NDArray

def make_cbow_pairs(tokens: NDArray[np.int64], window: int = 2) -> tuple[NDArray, NDArray]:
    """
    For each position t in tokens, take the 2*window tokens around it as context
    and the token at t as the center. Skip positions where the full context window
    doesn't fit (i.e., start and end of the corpus).

    Returns:
        contexts: (N, 2*window) int64
        centers:  (N,) int64
    """
Pad-or-skip choice: skip is cleaner for our corpus (we don't need every token to be a center).
For a corpus of \(L\) tokens with window 2: \(N = L - 4\) pairs.
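To make the \(N = L - 4\) count concrete, here is a small self-check you could run against your own make_cbow_pairs output. It is a sketch, not part of the lab code: check_cbow_pairs is a hypothetical helper, and it assumes contexts are ordered left to right and pairs follow corpus order.

```python
import numpy as np

def check_cbow_pairs(tokens, contexts, centers, window=2):
    """Validate (contexts, centers) against the skip-based definition (hypothetical helper)."""
    n_expected = len(tokens) - 2 * window
    assert contexts.shape == (n_expected, 2 * window)
    assert centers.shape == (n_expected,)
    for i in range(n_expected):
        t = i + window  # corpus position of the i-th center
        assert centers[i] == tokens[t]
        left = tokens[t - window:t]
        right = tokens[t + 1:t + window + 1]
        assert np.array_equal(contexts[i], np.concatenate([left, right]))
    return True

# Toy corpus of L = 10 token ids with window = 2, so N = 10 - 4 = 6.
toy = np.arange(10, dtype=np.int64)
expected_contexts = np.array([
    [0, 1, 3, 4],
    [1, 2, 4, 5],
    [2, 3, 5, 6],
    [3, 4, 6, 7],
    [4, 5, 7, 8],
    [5, 6, 8, 9],
])
expected_centers = np.array([2, 3, 4, 5, 6, 7])

ok = check_cbow_pairs(toy, expected_contexts, expected_centers, window=2)
print(ok, expected_contexts.shape, expected_centers.shape)  # True (6, 4) (6,)
```

Swapping expected_contexts and expected_centers for the arrays your function returns on the same toy input gives a quick pass/fail signal before you run it on the real corpus.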
Task 2 — model
class CBOWModel:
    def __init__(self, vocab_size: int, embedding_dim: int):
        self.embed = Embedding(vocab_size, embedding_dim)
        # Output projection: separate matrix W_out, not tied to E (Word2Vec convention).
        self.W_out = Parameter(np.random.randn(vocab_size, embedding_dim) * 0.02)
        self.b_out = Parameter(np.zeros(vocab_size))

    def __call__(self, contexts: NDArray[np.int64]) -> Tensor:
        """contexts: (B, 2*window) → logits: (B, vocab_size)"""
        h = self.embed(contexts).mean(axis=1)   # (B, d)
        logits = h @ self.W_out.T + self.b_out  # (B, vocab_size)
        return logits
Note: the output matrix W_out is separate from the input embedding E. Word2Vec doesn't tie them. (Phase 17 transformers do tie input embedding to LM head, but that's different.)
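The shape flow through the forward pass can be traced with plain NumPy arrays. This is only a sketch: the arrays below stand in for the lab's Embedding and Parameter objects, and the sizes (B = 8, V = 64, d = 32, k = 2) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, B, k = 64, 32, 8, 2  # assumed sizes, for illustration only

E = rng.normal(scale=0.02, size=(V, d))      # stand-in for the input embedding matrix
W_out = rng.normal(scale=0.02, size=(V, d))  # separate (untied) output projection
b_out = np.zeros(V)

contexts = rng.integers(0, V, size=(B, 2 * k))  # (B, 2k) token ids

looked_up = E[contexts]           # (B, 2k, d): one embedding per context token
h = looked_up.mean(axis=1)        # (B, d): average over the context axis
logits = h @ W_out.T + b_out      # (B, V): one score per vocabulary entry

print(looked_up.shape, h.shape, logits.shape)  # (8, 4, 32) (8, 32) (8, 64)
```

Because E and W_out are allocated independently here, an update to one leaves the other untouched, which is the untied behaviour the note above describes.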
Task 3 — training loop
Use Phase 05's cross_entropy_from_logits (or the version fused with softmax in Phase 18's training code, if available). If neither is at hand, write it inline for now.
def train(model, contexts, centers, epochs=20, batch_size=64, lr=0.01, momentum=0.9):
    """SGD with momentum. Log per-epoch loss. Return per-epoch loss array."""
    n = len(contexts)
    losses = []
    for epoch in range(epochs):
        perm = np.random.permutation(n)  # reshuffle the pairs every epoch
        epoch_loss = 0.0
        for batch_start in range(0, n, batch_size):
            idx = perm[batch_start:batch_start + batch_size]
            logits = model(contexts[idx])
            loss = cross_entropy_from_logits_batch(logits, centers[idx]).mean()
            # If your autograd accumulates gradients across backward() calls,
            # zero them before this line (unless sgd_step already does so).
            loss.backward()
            sgd_step(model.parameters(), lr=lr, momentum=momentum)
            epoch_loss += float(loss.value) * len(idx)  # weight by batch size
        epoch_loss /= n  # mean loss per pair over the epoch
        losses.append(epoch_loss)
        print(f"epoch {epoch:3d}: loss = {epoch_loss:.4f}")
    return losses
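The loop above delegates the parameter update to sgd_step, whose internals are not shown in this lab. As a rough illustration of what a momentum update does, here is a sketch on plain NumPy arrays. The function name sgd_momentum_step and its signature are hypothetical, and it uses one common convention (velocity absorbs the learning rate), which may differ from your sgd_step.

```python
import numpy as np

def sgd_momentum_step(params, grads, velocities, lr=0.01, momentum=0.9):
    """In-place update: v <- momentum * v - lr * g, then p <- p + v (hypothetical helper)."""
    for p, g, v in zip(params, grads, velocities):
        v *= momentum
        v -= lr * g
        p += v

# Minimise f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
start_norm = np.linalg.norm(w)
for _ in range(200):
    sgd_momentum_step([w], [w.copy()], [v], lr=0.01, momentum=0.9)
end_norm = np.linalg.norm(w)

print(f"||w|| before: {start_norm:.4f}, after 200 steps: {end_norm:.6f}")
```

On this toy quadratic the norm of w shrinks toward zero, which is a cheap way to confirm an optimizer step moves parameters in the right direction before wiring it into the CBOW loop.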
Constraints:
- Seed everything via src/utils/seeding.py.
- Manifest the run: hyperparams, corpus path, vocab size, embedding dim, final loss. Save to experiments/<date>-phase-13-cbow/manifest.json.
- Save the trained Embedding (via Embedding.save) at the end.
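One way to write the manifest is with the standard json module. The sketch below is an assumption about layout, not a required schema: write_manifest and its key names are hypothetical, the corpus path is a placeholder, and the example writes into a temporary directory instead of the real experiments/ folder.

```python
import json
import tempfile
from pathlib import Path

def write_manifest(run_dir, hyperparams, corpus_path, vocab_size, embedding_dim, final_loss):
    """Write manifest.json into run_dir and return its path (hypothetical helper)."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "hyperparams": hyperparams,
        "corpus_path": str(corpus_path),
        "vocab_size": int(vocab_size),
        "embedding_dim": int(embedding_dim),
        "final_loss": float(final_loss),
    }
    path = run_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return path

# Example run against a temporary directory; all values are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    manifest_path = write_manifest(
        Path(tmp) / "example-run",
        hyperparams={"epochs": 20, "batch_size": 64, "lr": 0.01, "momentum": 0.9, "window": 2},
        corpus_path="path/to/corpus.txt",  # placeholder, not the real corpus location
        vocab_size=64,
        embedding_dim=32,
        final_loss=2.07,
    )
    loaded = json.loads(manifest_path.read_text())

print(sorted(loaded))
```

Casting to int and float before dumping avoids the TypeError that json raises on NumPy scalar types.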
Task 4 — sanity tests on the trained embeddings
Before the visualization lab, verify training did something:
- Loss curve sanity. Plot loss per epoch. It should trend downward; small epoch-to-epoch noise is fine. Save as experiments/<date>-phase-13-cbow/loss_curve.png.
- Frequent vs rare token norms. Compute \(\|E[i]\|\) for every token; sort by frequency. The most-frequent tokens should have larger norm than the least-frequent ones. (This is the norm-frequency correlation we'll mitigate with cosine similarity in Lab 02.)
- Top-5 cosine-nearest neighbors of work. Should include verbs in similar contexts (walk, talk) and should not be dominated by punctuation. If it is, increase epochs or check tokenization.
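The neighbor check can be sketched with NumPy alone. The helpers below are hypothetical and operate on a raw matrix with integer token ids; mapping ids to strings such as work depends on your tokenizer. The toy matrix is constructed so that cosine and Euclidean rankings disagree, which is the effect large-norm tokens can have on the Euclidean list.

```python
import numpy as np

def top_k_cosine(E, query_id, k=5):
    """Ids of the k rows most cosine-similar to row query_id, excluding itself."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    unit = E / np.clip(norms, 1e-12, None)  # guard against zero-norm rows
    sims = unit @ unit[query_id]
    sims[query_id] = -np.inf
    return np.argsort(-sims)[:k]

def top_k_euclidean(E, query_id, k=5):
    """Ids of the k rows closest to row query_id in Euclidean distance, excluding itself."""
    dists = np.linalg.norm(E - E[query_id], axis=1)
    dists[query_id] = np.inf
    return np.argsort(dists)[:k]

# Row 1 points the same way as row 0 but is 10x longer;
# row 2 is nearby in distance but points in a different direction.
E = np.array([[1.0, 0.0], [10.0, 0.0], [0.8, 0.9], [-1.0, 0.0]])

row_norms = np.linalg.norm(E, axis=1)
cos_top = top_k_cosine(E, 0, k=1)
euc_top = top_k_euclidean(E, 0, k=1)
print(cos_top, euc_top)  # [1] [2]
```

Cosine similarity ignores vector length, so it picks the row with the same direction; Euclidean distance picks the row that happens to sit closest, regardless of direction.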
Task 5 — record the headline metrics
Save to experiments/<date>-phase-13-cbow/summary.json:
{
  "epochs": 20,
  "embedding_dim": 32,
  "window": 2,
  "vocab_size": 64,
  "corpus_size_tokens": 2400,
  "final_loss": 2.07,
  "initial_loss": 4.16,
  "loss_drop": 2.09,
  "top5_nearest_work_cosine": ["walk", "talk", "study", "play", "watch"],
  "top5_nearest_work_euclidean": [".", ",", "the", "a", "is"]
}
(These numbers are illustrative; your actual results will vary.)
Acceptance
- CBOW dataset built: \(N\) pairs of shape \((2k,)\) context + scalar center.
- CBOW model trains for 20 epochs without error.
- Loss drops from ~\(\log V = 4.16\) to under 2.5.
- Loss curve plot saved.
- Top-5 cosine nearest neighbors of work are mostly verbs (not punctuation).
- Trained embedding saved via Embedding.save.
- Manifest committed.
Pitfalls to expect
- Forgetting mean(axis=1) over contexts. If you sum without averaging, the magnitudes drift with window size and the loss is unstable.
- W_out tied to E by accident. If you write self.W_out = self.embed.E, you've tied them. CBOW conventionally doesn't tie; tying changes the loss surface. Stay untied for this lab.
- Learning rate too high. With \(V = 64, d = 32\), the loss surface is forgiving, but lr=1.0 can still diverge in the first epoch. Start at lr=0.01; tune if needed.
- Forgetting to skip edge positions. When window = 2, you have to skip the first 2 and last 2 positions of the corpus. If you don't, you'll index out of bounds at the end of the corpus, and negative indices at the start will silently wrap around to the end.
- Logging confusion: loss in nats vs bits. We use nats throughout (per Phase 05). Don't sneak np.log2 in here.
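For the nats-versus-bits pitfall, the two units differ by a constant factor of \(\ln 2\). A short check, assuming \(V = 64\) as in the illustrative numbers above:

```python
import math

V = 64  # assumed vocabulary size, as in the illustrative summary

loss_nats = math.log(V)              # natural log: the unit used throughout this lab
loss_bits = loss_nats / math.log(2)  # same quantity expressed in bits

print(f"{loss_nats:.2f} nats = {loss_bits:.2f} bits")  # 4.16 nats = 6.00 bits
```

A logged initial loss near 6 instead of 4.16 is therefore a hint that a base-2 logarithm slipped into the loss computation.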