
Phase 13 — Embeddings & Representation Spaces

Requires: 11 — Tokenization Theory + BPE Implementation · 12 — The Corpus: Designing the Microscopic Dataset

Teaches: embeddings · cbow · cosine-similarity · dimensionality · representation-geometry

Jump to any chapter from the phase reference index.

Chapter map

Pre-written per A12. Theory and lab statements are stable drafts; solutions are written just-in-time at phase open.

"Meaning is geometry." Embeddings are learned coordinates: tokens that appear in similar contexts end up close together in the space. Here we build them by hand on top of minigrad/minitorch and train them with CBOW on the Phase 12 verb corpus, without transformers.

Anchors: LYNX_CORTEX.md §4 / PHASE 13, LYNX_CORTEX_ADDENDUM.md §A13 (English verb grammar scope), §A12 (pre-write).

Goal

Borja writes a hand-built embedding lookup module and trains tiny CBOW-style embeddings on the Phase 12 verb-grammar corpus (~600 forms: 20 verbs × 5 tenses × 3 persons × 2 languages, English and Spanish). The phase headline artefact is a committed 2D UMAP scatter of the trained vocab in which:

All five tense forms of the same verb cluster together (e.g., work, works, worked (past), worked (participle), will work form a small cloud, distinct from eat, eats, ate, eaten, will eat).
English and Spanish paired forms sit near each other (work / trabajar, worked / trabajó).
A tense axis emerges: past forms are separable from future forms along a direction in the 2D projection.

If that geometry doesn't emerge, the corpus, the tokenizer, or the training has a bug — the visualization is a diagnostic, not eye-candy.
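One way to turn the tense-axis criterion into a check is sketched below. It assumes the 2D projection already exists as a NumPy array; the function name `tense_axis_separates`, the boolean masks, and the toy coordinates are hypothetical and not taken from the labs. It tests one specific direction (mean of future forms minus mean of past forms), so a `False` result does not prove the groups are inseparable along every direction.

```python
import numpy as np

def tense_axis_separates(points, is_past, is_future):
    """True if projecting onto (mean future - mean past) splits the two groups."""
    past, future = points[is_past], points[is_future]
    axis = future.mean(axis=0) - past.mean(axis=0)
    axis = axis / np.linalg.norm(axis)    # unit vector along the candidate tense axis
    # Separated when every past projection lies below every future projection.
    return bool((past @ axis).max() < (future @ axis).min())

# Hypothetical 2D coordinates: two past forms, then two future forms.
points = np.array([[0.0, 0.1], [0.2, -0.1], [2.0, 0.0], [2.2, 0.2]])
is_past = np.array([True, True, False, False])
is_future = ~is_past

separated = tense_axis_separates(points, is_past, is_future)
print(separated)
```

A check like this only makes the "diagnostic" reading of the scatter explicit; which forms count as past or future would come from the corpus metadata.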

What you build here

A working understanding of why dense embeddings beat one-hot for downstream learning.
A hand-built Embedding module using the Phase 7/8 autograd.
A trained CBOW embedding table on the Phase 12 corpus.
A 2D UMAP projection + cosine-similarity heatmap that show the geometry.
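The relationship between one-hot and dense lookup that the Embedding module rests on can be checked numerically. The following is a minimal NumPy sketch, not the minigrad/minitorch implementation the lab asks for; the table sizes and the names `E`, `ids`, and `upstream` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 6, 4                      # toy sizes, not the Phase 13 settings
E = rng.normal(size=(vocab_size, d))      # embedding table, one row per token id

ids = np.array([2, 5, 2])                 # a batch of token ids (id 2 repeats)

# Dense lookup: plain row indexing.
looked_up = E[ids]                        # shape (3, d)

# Same result via one-hot rows times the table.
one_hot = np.eye(vocab_size)[ids]         # shape (3, vocab_size)
via_matmul = one_hot @ E
assert np.allclose(looked_up, via_matmul)

# Backward: the gradient w.r.t. E is one_hot.T @ upstream, which amounts to
# a scatter-add of the upstream rows into the rows that were looked up.
upstream = np.ones((3, d))
grad_E = np.zeros_like(E)
np.add.at(grad_E, ids, upstream)          # accumulates repeated ids correctly
assert np.allclose(grad_E, one_hot.T @ upstream)
```

Row indexing avoids materializing the one-hot matrix, and `np.add.at` is used instead of `grad_E[ids] += upstream` because the latter would count a repeated id only once.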

What this phase does NOT cover

Transformer self-attention. Phase 15. Static embeddings here; contextual representations later.
Subword pooling for variable-length sequences. Phase 15 / Phase 17.
transformers library embeddings (AutoModel.embeddings). Forbidden by CLAUDE.md §0.4 until Phase 24.
Pretrained embeddings (GloVe, fastText). Out of scope — we train from our own corpus.
Sentence embeddings / sentence-BERT. Phase 29 (RAG / retrieval).
Vector databases. Phase 29.
Negative sampling, hierarchical softmax. Mentioned in theory; not implemented (our vocab is small enough for full softmax).

Phase 13's scope is hand-built static token embeddings trained on the Phase 12 verb-form corpus, with a visualization that proves the corpus shaped the geometry. Nothing more.

Read order

theory/00-motivation.md — why dense > one-hot for any downstream learning.
theory/01-embedding-as-lookup.md — E[i] = one_hot(i) @ E. The lookup-is-matmul identity.
theory/02-cbow-skipgram.md — Word2Vec family at the survey level + the CBOW objective we implement.
theory/03-similarity-and-visualization.md — cosine vs Euclidean; PCA vs UMAP; what to read into 2D projections (and what not to).
lab/00-implement-embedding-module.md — write src/minimodel/embedding.py. Forward + gradient + save/load.
lab/01-train-cbow.md — train tiny embeddings on the Phase 12 corpus. Hyperparams: d = 32, window = 4, epochs = 20.
lab/02-visualize-and-probe.md — UMAP to 2D, cosine-vs-Euclidean comparison, tense-axis probing.
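The CBOW objective with a full softmax, as named in the theory and lab entries above, can be sketched in a few lines. This is an illustration under stated assumptions, not the lab solution: gradients are written by hand in NumPy rather than through the course's autograd, `window = 4` is read as up to four tokens on each side, and the helper names `cbow_pairs` and `cbow_step`, the toy id sequence, the vocabulary size, and the learning rate are all hypothetical.

```python
import numpy as np

def cbow_pairs(ids, window):
    """Yield (context_ids, target_id) with up to `window` tokens on each side."""
    for t, target in enumerate(ids):
        ctx = ids[max(0, t - window):t] + ids[t + 1:t + 1 + window]
        if ctx:
            yield ctx, target

def cbow_step(E, W, ctx, target, lr):
    """One SGD step on -log softmax(W @ mean(E[ctx]))[target]; returns the loss."""
    idx = np.asarray(ctx)
    h = E[idx].mean(axis=0)               # (d,) averaged context vector
    logits = W @ h                        # (V,) scores over the whole vocab
    logits = logits - logits.max()        # shift for numerical stability
    p = np.exp(logits)
    p /= p.sum()
    loss = -np.log(p[target])
    dlogits = p.copy()
    dlogits[target] -= 1.0                # d loss / d logits
    dh = W.T @ dlogits                    # computed before W is updated
    W -= lr * np.outer(dlogits, h)        # in-place update of the output matrix
    np.subtract.at(E, idx, lr * dh / len(ctx))  # scatter into the context rows
    return loss

rng = np.random.default_rng(0)
V, d, window = 5, 32, 4                   # d and window as in lab/01; V is a toy value
E = rng.normal(scale=0.1, size=(V, d))    # input embeddings (the table we keep)
W = rng.normal(scale=0.1, size=(V, d))    # output projection
ids = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]      # hypothetical toy id sequence

losses = []
for epoch in range(20):                   # epochs = 20 as in lab/01
    pairs = cbow_pairs(ids, window)
    losses.append(sum(cbow_step(E, W, ctx, tgt, lr=0.5) for ctx, tgt in pairs))
```

With a vocabulary this small the softmax denominator is cheap, which is why negative sampling and hierarchical softmax are not needed here.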

solutions/ is empty during pre-write — populated at phase open.

Definition of Done

Briefly (see PHASE_13_PLAN.md at the repo root for the full list):

src/minimodel/embedding.py passes mypy --strict; tests cover lookup, gradient flow, and save/load.
Trained embeddings on the Phase 12 corpus committed as .npy + JSON metadata.
2D UMAP scatter shows interpretable clustering by tense / verb / language.
Cosine and Euclidean rankings of work's top-10 neighbors differ — committed comparison.
Cosine-similarity heatmap of the 20 verbs' infinitive forms committed.
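Why the cosine and Euclidean rankings can disagree is easiest to see on a tiny example. The sketch below uses three hypothetical 2D vectors chosen so that the metrics rank the neighbors in opposite order; the helper names `top_k_cosine` and `top_k_euclidean` are assumptions, not part of the lab API.

```python
import numpy as np

def top_k_cosine(E, query, k):
    """Indices of the k rows most cosine-similar to row `query`."""
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = unit @ unit[query]
    sims[query] = -np.inf                 # exclude the query itself
    return np.argsort(-sims)[:k]

def top_k_euclidean(E, query, k):
    """Indices of the k rows closest to row `query` in Euclidean distance."""
    dists = np.linalg.norm(E - E[query], axis=1)
    dists[query] = np.inf                 # exclude the query itself
    return np.argsort(dists)[:k]

# Hypothetical vectors, not trained embeddings.
E = np.array([
    [1.0, 0.0],    # 0: query
    [5.0, 0.0],    # 1: same direction as the query, much longer
    [0.8, 0.6],    # 2: close in space, rotated away from the query
])

print(top_k_cosine(E, 0, 2))      # [1 2]
print(top_k_euclidean(E, 0, 2))   # [2 1]
```

The disagreement comes entirely from vector norms: for unit-length rows, squared Euclidean distance equals 2 − 2·cos, so the two rankings coincide. A differing top-10 list therefore indicates that the trained embeddings vary in length.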

Next: theory/00-motivation.md