
02 — CBOW and Skip-Gram (Word2Vec at survey level)

Word2Vec (2013) planted the seed: a token is known by the company it keeps. Predict the center from the context (CBOW) or the context from the center (Skip-Gram). We implement CBOW for its more stable gradients, but what matters is the principle: train embeddings with a predictive objective over the corpus.

The two Word2Vec objectives

Mikolov et al. 2013 ("Efficient Estimation of Word Representations in Vector Space") introduced two complementary architectures for learning embeddings:

CBOW — Continuous Bag of Words

Predict the center token from a window of context tokens.

Context:   [I, work, ?, every, day]    (k = 2: two tokens on each side, 4 context tokens)
Center:    "hard"
Prediction: model says which token most likely sits in the ?

Skip-Gram

Predict the context from the center.

Center:   "hard"
Context:  [I, work, every, day]    (predict each independently)

Both train the same embedding matrix \(E\). The difference is the structure of the loss.
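To make the difference concrete, here is a minimal sketch of how the two objectives slice the same sentence into training examples. The helper names `cbow_examples` and `skipgram_examples` are hypothetical, and truncating the window at the sentence boundary is an assumption, not something this phase prescribes.

```python
def cbow_examples(tokens, k):
    """Return (context_tokens, center_token) pairs, up to k tokens per side."""
    examples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - k):i]       # truncated at the sentence start
        right = tokens[i + 1:i + 1 + k]      # truncated at the sentence end
        context = left + right
        if context:
            examples.append((context, center))
    return examples


def skipgram_examples(tokens, k):
    """Return one (center_token, context_token) pair per context position."""
    return [(center, ctx)
            for context, center in cbow_examples(tokens, k)
            for ctx in context]


sentence = ["I", "work", "hard", "every", "day"]
cbow = cbow_examples(sentence, k=2)
skip = skipgram_examples(sentence, k=2)

print(cbow[2])               # (['I', 'work', 'every', 'day'], 'hard')
print(len(cbow), len(skip))  # 5 14
```

Under this truncation rule, the five-token sentence yields 5 CBOW examples but 14 Skip-Gram pairs: the same data, cut into one averaged prediction per center versus one prediction per (center, context) pair.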

Why we implement CBOW

Three reasons:

Cleaner gradient signal. Averaging context embeddings before the projection gives a single, well-conditioned target. Skip-Gram's gradient is summed over multiple independent (context | center) pairs and tends to be noisier on small corpora.
More efficient with our tiny corpus. CBOW takes one update per center token; Skip-Gram takes up to \(2k\) updates per center token (one per context token, for window \(k\)).
The pedagogy is the same. Once Borja has the CBOW intuition, Skip-Gram is a 10-minute mental swap (and we sketch it below).

The CBOW math

Let \(E \in \mathbb{R}^{V \times d}\) be the input embedding matrix and \(W \in \mathbb{R}^{V \times d}\) be the output projection (typically untied from \(E\) in Word2Vec; we'll tie them in Phase 17 transformers). For a context window of size \(k\) (tokens on each side, so \(2k\) context tokens total) around center token \(t\):

Look up the context embeddings:

\[h = \frac{1}{2k} \sum_{j \in \text{context}(t)} E[j] \quad \in \mathbb{R}^d\]

(Average the embeddings of the context tokens.)

Project to logits:

\[z = h \cdot W^\top + b \quad \in \mathbb{R}^V\]

Softmax + cross-entropy loss:

\[\mathcal{L}_t = -\log \text{softmax}(z)_t = -z_t + \log \sum_j \exp(z_j)\]

(Phase 05's cross_entropy_from_logits.)

The loss over the corpus is the sum (or mean) over all centers \(t\).
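The three steps above (average, project, softmax cross-entropy) can be sketched in dependency-free Python. This is an illustrative sketch, not Phase 05's `cross_entropy_from_logits`: the function name `cbow_loss`, the token ids, and the small Gaussian initialisation are all assumptions. It divides by the number of context tokens actually present, which equals \(2k\) for a full window.

```python
import math
import random


def cbow_loss(E, W, b, context_ids, center_id):
    """Cross-entropy of the center token given the averaged context embedding."""
    d = len(E[0])
    # h: mean of the context rows of E
    h = [sum(E[j][c] for j in context_ids) / len(context_ids) for c in range(d)]
    # z = h W^T + b: one logit per vocabulary entry
    z = [sum(h[c] * W[v][c] for c in range(d)) + b[v] for v in range(len(W))]
    # log-sum-exp, with the max subtracted for numerical stability
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(zv - m) for zv in z))
    return -z[center_id] + log_sum_exp


V, d = 64, 32
rng = random.Random(0)
# Assumed initialisation: small Gaussian weights, zero bias.
E = [[rng.gauss(0.0, 0.01) for _ in range(d)] for _ in range(V)]
W = [[rng.gauss(0.0, 0.01) for _ in range(d)] for _ in range(V)]
b = [0.0] * V

# Hypothetical token ids for one window with k = 2.
loss = cbow_loss(E, W, b, context_ids=[3, 7, 11, 12], center_id=9)
print(loss, math.log(V))
```

With near-zero weights the logits are almost uniform, so the printed loss should sit very close to \(\log V \approx 4.16\), the starting point of training.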

The Skip-Gram math (sketch)

For comparison only — not implemented in this phase. Center \(t\), predict each context word independently:

\[\mathcal{L}_t = -\sum_{j \in \text{context}(t)} \log \text{softmax}(E[t] W^\top + b)_j\]

Each context word is a separate cross-entropy term using the same prediction \(E[t] W^\top + b\).
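As a comparison sketch only (this phase does not implement Skip-Gram), the formula above can be written so that the shared prediction is computed once and reused for every context token. The function names and the hand-picked toy values (\(V = 5\), \(d = 2\)) are assumptions for illustration.

```python
import math


def log_softmax(z):
    """Numerically stable log-softmax over a list of logits."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]


def skipgram_loss(E, W, b, center_id, context_ids):
    """Sum of cross-entropy terms that all share the logits E[center] W^T + b."""
    e = E[center_id]
    z = [sum(ec * wc for ec, wc in zip(e, row)) + bv for row, bv in zip(W, b)]
    logp = log_softmax(z)  # computed once, reused for every context token
    return -sum(logp[j] for j in context_ids)


# Toy values chosen by hand; they are not taken from the corpus.
E = [[0.1, 0.0], [0.0, 0.2], [0.3, -0.1], [-0.2, 0.1], [0.0, 0.0]]
W = [[0.0, 0.1], [0.2, 0.0], [-0.1, 0.1], [0.1, 0.1], [0.0, -0.2]]
b = [0.0] * 5

total = skipgram_loss(E, W, b, center_id=2, context_ids=[0, 1, 3, 4])
per_pair = [skipgram_loss(E, W, b, 2, [j]) for j in [0, 1, 3, 4]]
print(total, sum(per_pair))
```

The two printed numbers agree up to floating-point rounding: the window loss is the sum of independent per-pair losses, which is why Skip-Gram can equally be trained one (center, context) pair at a time.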

Skip-Gram tends to do better on rare words because every rare word gets multiple updates per occurrence. CBOW tends to do better on frequent words because the averaging stabilises common contexts. Our corpus is small and well-balanced (every verb appears equally often in templates); CBOW is the right choice.

Negative sampling (out of scope here)

The softmax in the loss has a denominator over all \(V\) tokens. For our \(V = 64\), this is trivial. At Word2Vec's original scale (\(V \approx 10^6\)), it is expensive, hence negative sampling: replace the full softmax with a binary classification of the true token against \(n\) randomly sampled "negative" tokens (we write \(n\) rather than \(k\) to avoid clashing with the window size). We do not implement negative sampling; the corpus is small enough for the full softmax.

Mentioned here so that:

When you read Word2Vec papers / code, you recognise the technique.
When you scale up later, you know which optimisation to reach for.
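For recognition purposes only, the negative-sampling loss for one CBOW example is usually written as below. The notation is adapted to this file and is an assumption: \(h\) is the averaged context embedding, \(\sigma\) is the logistic sigmoid, \(u_1, \dots, u_n\) are \(n\) tokens sampled from a noise distribution (written \(n\) to avoid clashing with the window size \(k\)), and the bias \(b\) is dropped.

\[\mathcal{L}_t^{\text{NS}} = -\log \sigma\big(W[t] \cdot h\big) - \sum_{i=1}^{n} \log \sigma\big(-W[u_i] \cdot h\big)\]

Each update touches only \(n + 1\) rows of \(W\) instead of all \(V\), so the cost per example falls from \(O(Vd)\) to \(O(nd)\).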

The window size \(k\)

We use \(k = 2\) (so 4 context tokens around each center). Choice of \(k\):

\(k\) small (1-2): captures local syntactic structure. Good for our short, templated sentences.
\(k\) large (5-10): captures more diffuse "topic" co-occurrence. Used by Word2Vec on news corpora.

For the §A13 corpus — short sentences like I work every day . — \(k = 2\) gives a context that includes the subject pronoun and a temporal marker. Larger windows would include other sentences and dilute the signal.

The embedding dimension \(d\)

The rule of thumb from PHASE_13_PLAN.md: \(d \approx V^{1/4}\) for tiny vocabs. For \(V = 64\), that's \(64^{1/4} \approx 2.8\), so \(d \approx 3\). We use \(d = 32\), roughly 10× the rule of thumb, for two reasons:

Headroom for the downstream model. Phase 17's transformer expects \(d_\text{model} = 64\), and the embedding will eventually be tied as a row of that matrix. Starting at \(d = 32\) keeps things manageable; if we need to expand to 64 later, we can re-train or zero-pad.
Visualization fidelity. UMAP projection from \(d = 3\) to \(d = 2\) is a near-trivial squash; from \(d = 32\) to \(d = 2\) gives UMAP enough room to find structure.
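The arithmetic behind these numbers is quick to check. The parameter count below assumes the untied setup from the CBOW math section (separate \(E\) and \(W\), plus the bias \(b\)); the variable names are illustrative.

```python
V = 64
d_rule = V ** 0.25   # rule-of-thumb dimension, V^(1/4)
d_used = 32          # the dimension this phase trains with

# Untied model: E is V x d, W is V x d, b has V entries.
params = V * d_used + V * d_used + V

print(round(d_rule, 2))                   # 2.83
print(round(d_used / round(d_rule), 1))   # 10.7
print(params)                             # 4160
```

Even at roughly ten times the rule of thumb, the whole model has about four thousand parameters, so the extra capacity costs essentially nothing in training time.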

The expected training dynamics

For our corpus, training CBOW with \(d = 32\), \(k = 2\), full-softmax loss, and SGD with momentum:

Epoch 1: loss drops from \(\log V \approx 4.16\) (uniform random) to ~3.5 as the model learns that some tokens (the, a, ,) are more frequent than others.
Epoch 5–10: loss drops to ~2.5 as the model learns subject-verb co-occurrence (I work, she works, etc.).
Epoch 20: loss ~2.0. The geometry is now structured enough to plot.

If your loss stays at or near \(\log V \approx 4.16\) instead of dropping, you have a bug, most likely gradients not reaching \(E\). (Lab 00's gradient test catches this.)
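One way to check that gradients actually reach \(E\) is a finite-difference comparison on a tiny model. The sketch below is not Lab 00's test; the function names, the toy sizes (\(V = 6\), \(d = 3\)), and the random values are assumptions. It uses the standard softmax cross-entropy gradient \(\partial \mathcal{L} / \partial z = \text{softmax}(z) - \text{onehot}(t)\).

```python
import math
import random


def forward(E, W, b, ctx, t):
    """Return (loss, h, softmax probabilities) for one CBOW example."""
    d = len(E[0])
    h = [sum(E[j][c] for j in ctx) / len(ctx) for c in range(d)]
    z = [sum(h[c] * W[v][c] for c in range(d)) + b[v] for v in range(len(W))]
    m = max(z)
    exps = [math.exp(zv - m) for zv in z]
    s = sum(exps)
    return -z[t] + m + math.log(s), h, [e / s for e in exps]


def grad_E(E, W, b, ctx, t):
    """Analytic dL/dE: nonzero only on the rows of the context tokens."""
    _, _, p = forward(E, W, b, ctx, t)
    d = len(E[0])
    dz = [pv - (1.0 if v == t else 0.0) for v, pv in enumerate(p)]
    dh = [sum(dz[v] * W[v][c] for v in range(len(W))) for c in range(d)]
    g = [[0.0] * d for _ in E]
    for j in ctx:
        for c in range(d):
            g[j][c] += dh[c] / len(ctx)  # the average spreads dh over the context
    return g


def numeric_grad_E(E, W, b, ctx, t, eps=1e-5):
    """Central finite differences on every entry of E."""
    g = [[0.0] * len(E[0]) for _ in E]
    for j in range(len(E)):
        for c in range(len(E[0])):
            old = E[j][c]
            E[j][c] = old + eps
            up = forward(E, W, b, ctx, t)[0]
            E[j][c] = old - eps
            down = forward(E, W, b, ctx, t)[0]
            E[j][c] = old  # restore the original value
            g[j][c] = (up - down) / (2 * eps)
    return g


rng = random.Random(0)
V, d = 6, 3
E = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]
W = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]
b = [rng.uniform(-0.5, 0.5) for _ in range(V)]
ctx, t = [0, 1, 3, 4], 2

ga = grad_E(E, W, b, ctx, t)
gn = numeric_grad_E(E, W, b, ctx, t)
max_diff = max(abs(ga[j][c] - gn[j][c]) for j in range(V) for c in range(d))
reaches_E = any(abs(x) > 1e-8 for j in ctx for x in ga[j])
print(max_diff, reaches_E)
```

A tiny `max_diff` together with `reaches_E` being true indicates that the context rows of \(E\) receive a gradient. Rows for tokens outside the window (here ids 2 and 5) correctly get exactly zero.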

If your loss drops to ~0, you've overfit — the embeddings memorise position-specific co-occurrence. For our small corpus, slight overfitting is fine because the geometry (the thing we care about) is robust to it.

What this file does NOT cover

GloVe. Another classic embedding method (Pennington et al. 2014). Out of scope.
fastText, subword embeddings. Phase 11 handled subword tokenization; we don't re-do it here.
Contextual embeddings (BERT, GPT). Phase 15+. CBOW gives one vector per token, no matter the context.
Negative sampling implementation. Mentioned above; not coded.