00 — Why dense embeddings beat one-hot

A one-hot representation is honest (the token is its identity) but geometrically flat. Dense embeddings turn that identity into learned coordinates, where "near" means "behaves similarly in my corpus". For the verbs of §A13, that should mean: tenses near tenses, persons near persons, EN/ES pairs near each other.

The setup

The Phase 12 corpus tokenises the 20 verbs × 5 tenses × 3 persons × 2 languages ≈ 600 verb forms (plus punctuation, special tokens, common pronouns). We need a way for the model to use these tokens — to feed them to the transformer in Phase 17.

The most honest representation is one-hot: each token gets a unique index \(i \in \{0, 1, \ldots, V-1\}\) and an indicator vector \(x \in \{0, 1\}^V\) with a \(1\) at position \(i\). Mathematically clean. Practically useless. Why?

Three reasons one-hot loses

1. Geometry: all tokens are equidistant

In one-hot space, every pair of distinct tokens has the same distance:

\[\|e_i - e_j\|_2 = \sqrt{2} \quad \text{for all } i \neq j\]

So work and worked are exactly as far apart as work and comma. The geometry contains no information about which tokens are similar. A model would have to learn similarity from scratch every time.
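The equidistance claim is easy to check numerically. The following is a minimal sketch, assuming NumPy and a toy vocabulary size `V = 6` chosen only for illustration (it is not the corpus vocabulary):

```python
import numpy as np

V = 6  # toy vocabulary size, illustrative only
one_hot = np.eye(V)  # row i is the one-hot vector e_i

# Pairwise Euclidean distances between all one-hot rows.
diffs = one_hot[:, None, :] - one_hot[None, :, :]
dists = np.linalg.norm(diffs, axis=-1)

# Select the off-diagonal entries, i.e. all pairs with i != j.
off_diag = dists[~np.eye(V, dtype=bool)]

# Every distinct pair is sqrt(2) apart; the diagonal is 0.
all_equidistant = bool(np.allclose(off_diag, np.sqrt(2)))
print(all_equidistant)
```

The result does not depend on `V`: two distinct one-hot vectors always differ in exactly two coordinates, each by 1.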

2. Parameter cost: linear layers are huge

A linear layer that consumes one-hot input has weight shape \(W \in \mathbb{R}^{V \times d}\). For our \(V = 64, d_\text{model} = 64\), this is 4096 parameters, which is manageable. But for GPT-2 (\(V = 50257, d = 768\)), that's about 38.6M parameters (\(50257 \times 768 = 38{,}597{,}376\)) just for the first projection. One-hot doesn't scale.

The embedding-as-lookup trick (next file) removes the wasted work: \(W^\top \cdot \text{one\_hot}(i) = W[i, :]\). The math is the same and the parameter count is unchanged, but we never materialise the one-hot vector or multiply through its zeros; we just index into \(W\).
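A quick numerical check of the identity and the parameter counts quoted above. This is a sketch assuming NumPy; the random `W` and the token id `i = 14` are placeholders, not values from the Phase 12 corpus:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 64, 64
W = rng.normal(size=(V, d))  # stand-in weights, shape (V, d)

i = 14  # hypothetical token id
x = np.zeros(V)
x[i] = 1.0  # explicit one-hot vector

via_matmul = W.T @ x  # about V * d multiply-adds
via_lookup = W[i]     # reads d numbers, no one-hot vector needed

same = bool(np.allclose(via_matmul, via_lookup))

# Parameter counts for the first projection.
params_ours = V * d        # 64 * 64
params_gpt2 = 50257 * 768  # GPT-2 vocabulary size times model width
print(same, params_ours, params_gpt2)
```

Both paths return the same row of `W`; only the amount of arithmetic differs.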

3. No transferable structure

Suppose the model learns that work should activate some attention heads in a particular way. With one-hot, that knowledge is only about index 14 (or whatever work's index is). The model has zero ability to generalise to worked (index 22, say) — they're orthogonal in one-hot space.

With dense embeddings, the model can learn \(E[\text{work}] \approx E[\text{worked}]\) — and then anything it learns about one transfers to the other.
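To make the contrast concrete, here is a small sketch assuming NumPy. The ids 14 and 22 are the illustrative ones from the text, and the dense vectors are made-up values in which worked is deliberately placed a small step away from work; a trained model would have to learn that placement:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

V, d = 64, 32
work_id, worked_id = 14, 22  # illustrative ids, as in the text

# One-hot: distinct tokens are orthogonal, so similarity is exactly 0.
identity = np.eye(V)
one_hot_sim = cosine(identity[work_id], identity[worked_id])

# Dense (hypothetical values): worked = work + a small perturbation.
rng = np.random.default_rng(0)
E_work = rng.normal(size=d)
E_worked = E_work + 0.1 * rng.normal(size=d)
dense_sim = cosine(E_work, E_worked)

print(one_hot_sim, round(dense_sim, 3))
```

With similarity near 1, whatever a downstream layer computes from one vector it computes almost identically from the other, which is the sense in which knowledge transfers.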

What dense embeddings give you

A dense embedding is a learned vector \(E[i] \in \mathbb{R}^d\) for each token id \(i\). The whole vocabulary's embeddings form a matrix \(E \in \mathbb{R}^{V \times d}\). Training adjusts \(E\) so that:

  • Tokens that appear in similar contexts have similar vectors. This is the distributional hypothesis (Harris 1954), often summarised with Firth's 1957 line "you shall know a word by the company it keeps". For us: work and walk appear in similar slots in sentences, so they should end up near each other in \(E\).
  • Compositional structure can emerge. The famous result: \(E[\text{king}] - E[\text{man}] + E[\text{woman}] \approx E[\text{queen}]\) (Mikolov et al. 2013). For us, we'd hope: \(E[\text{works}] - E[\text{work}] + E[\text{walk}] \approx E[\text{walks}]\) (the "3rd-person-singular" direction). This is not guaranteed and we'll test for it in lab 02.
  • Cross-lingual alignment may appear "for free." If work and trabajar appear in similar templated contexts in the corpus (which they do, by Phase 12's construction), their embeddings should converge. The 2D UMAP should show EN/ES pairs near each other.
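The analogy test in the second bullet can be sketched as a nearest-neighbour search by cosine similarity. The code below assumes NumPy; the helper name `analogy` and the toy embeddings are hypothetical, and the "3rd-person-singular" direction is planted by construction, so this shows how the test works rather than what trained embeddings will do:

```python
import numpy as np

def analogy(E, vocab, a, b, c):
    """Return the token nearest (by cosine) to E[a] - E[b] + E[c]."""
    idx = {tok: k for k, tok in enumerate(vocab)}
    query = E[idx[a]] - E[idx[b]] + E[idx[c]]
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    scores = unit @ (query / np.linalg.norm(query))
    for tok in (a, b, c):
        scores[idx[tok]] = -np.inf  # exclude the three input tokens
    return vocab[int(np.argmax(scores))]

# Toy embeddings: each form = verb vector (+ shared 3sg offset).
rng = np.random.default_rng(0)
d = 32
forms_3sg = {"work": "works", "walk": "walks", "eat": "eats"}
third_sg = rng.normal(size=d)  # planted "3rd-person-singular" direction

vocab, rows = [], []
for base, inflected in forms_3sg.items():
    v = rng.normal(size=d)
    vocab += [base, inflected]
    rows += [v, v + third_sg]
E = np.stack(rows)

result = analogy(E, vocab, "works", "work", "walk")
print(result)
```

Excluding the input tokens from the candidates is the usual convention for this test; without it, the nearest neighbour is often one of the inputs.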

What you should expect to see (the headline artefact)

After training CBOW for 20 epochs on the Phase 12 corpus with \(d = 32\), project to 2D with UMAP. Then look:

  1. Each verb's 5 tense forms cluster together. A small cloud per verb.
  2. Across verbs, the cloud's center varies by verb identity: work's cluster and eat's cluster are separable.
  3. A "past-tense" direction is visible. All past forms are shifted in a consistent direction relative to their infinitives.
  4. English/Spanish pairs sit near each other. work and trabajar are close, worked and trabajó are close.

If you don't see (1) and (2), the corpus or training has a bug. (3) and (4) are aspirational — they're the meaningful tests of whether the geometry encodes linguistic structure.
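A 2D projection can distort directions, so it may help to check expectation (3) in the original \(d\)-dimensional space as well. One possible check, sketched below with NumPy, is the mean pairwise cosine between the (past minus infinitive) offset vectors. The function name `direction_consistency` and the toy data are assumptions for illustration, not part of the course code:

```python
import numpy as np

def direction_consistency(E, pairs):
    """Mean pairwise cosine between offsets E[j] - E[i] for (i, j) in pairs."""
    offsets = np.stack([E[j] - E[i] for i, j in pairs])
    unit = offsets / np.linalg.norm(offsets, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(pairs)
    # Drop the n diagonal entries (each equal to 1) and average the rest.
    return float((sims.sum() - n) / (n * (n - 1)))

rng = np.random.default_rng(0)
d, n_verbs = 32, 5
infinitives = rng.normal(size=(n_verbs, d))

# Planted case: every past form shares one shift, plus small noise.
past_shift = rng.normal(size=d)
planted = infinitives + past_shift + 0.1 * rng.normal(size=(n_verbs, d))

# Control case: each past form moves in its own random direction.
control = infinitives + rng.normal(size=(n_verbs, d))

pairs = [(k, k + n_verbs) for k in range(n_verbs)]  # (infinitive, past) rows
score_planted = direction_consistency(np.vstack([infinitives, planted]), pairs)
score_control = direction_consistency(np.vstack([infinitives, control]), pairs)
print(round(score_planted, 3), round(score_control, 3))
```

A score near 1 indicates a shared direction; a score near 0 indicates none. Where trained embeddings fall between those extremes is an empirical question.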

What the embeddings will NOT teach the model

A common over-claim: "embeddings encode meaning." They encode distributional similarity in the training corpus. That's a proxy for meaning, useful but not the thing itself. Our corpus is microscopic; the embeddings won't capture nuances of meaning like "intentionality" or "telicity" that linguistic theory cares about. They'll capture: tense, person, verb identity, language. That's enough for Phase 32's tutor.

This humility is worth maintaining. When you read "GPT-4 understands X" in the press, the technical claim is: "GPT-4's embeddings + attention encode statistical regularities about X that suffice for the test we used." A weaker, more honest claim.

What this file does NOT cover

  • The CBOW training objective. File 02.
  • Why \(d = 32\) specifically. Brief mention in file 02; full discussion in PHASE_13_PLAN.md §2.
  • UMAP internals. UMAP is a black-box tool here. File 03 cites the paper.

Next: 01-embedding-as-lookup.md