Phase 13: Embeddings & Representation Spaces
Requires: Phase 11 (Tokenization Theory + BPE Implementation) · Phase 12 (The Corpus: Designing the Microscopic Dataset)
Teaches: embeddings · cbow · cosine-similarity · dimensionality · representation-geometry
Jump to any chapter from the phase reference index.
Chapter map
Pre-written per A12. Theory and lab statements are stable drafts; solutions are written just-in-time at phase open.
"Meaning is geometry." Embeddings are learned coordinates: tokens that appear in similar contexts end up close together in the space. Here we build them by hand on top of `minigrad`/`minitorch` and train them with CBOW on the Phase 12 verb corpus, without `transformers`.
Anchors: LYNX_CORTEX.md §4 / PHASE 13, LYNX_CORTEX_ADDENDUM.md §A13 (English verb grammar scope), §A12 (pre-write).
Goal
Borja writes a hand-built embedding lookup module and trains tiny CBOW-style embeddings on the Phase 12 verb-grammar corpus (~600 forms: 20 verbs × 5 tenses × 3 persons × 2 languages, English and Spanish). The phase headline artefact is a committed 2D UMAP scatter of the trained vocab in which:
- All five tense forms of the same verb cluster together (e.g., `work, works, worked, worked, will work` form a small cloud, distinct from `eat, eats, ate, eaten, will eat`).
- English and Spanish paired forms sit near each other (`work / trabajar`, `worked / trabajó`).
- A tense axis emerges: past forms are separable from future forms by a direction in the 2D projection.
If that geometry doesn't emerge, the corpus, the tokenizer, or the training has a bug — the visualization is a diagnostic, not eye-candy.
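One way to make the tense-axis check concrete is to take the difference between the mean future point and the mean past point as a candidate direction, then test whether projections onto it separate the two groups. The sketch below does that on four invented 2D points (not trained output). The helper names `mean_point`, `tense_axis`, and `project` are hypothetical, and the difference-of-means direction is only one possible probe, not necessarily the one the lab uses.

```python
import math
from typing import Dict, List

Point = List[float]


def mean_point(points: List[Point]) -> Point:
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def tense_axis(past: List[Point], future: List[Point]) -> Point:
    """Unit vector pointing from the mean past point to the mean future point."""
    mp, mf = mean_point(past), mean_point(future)
    diff = [f - p for f, p in zip(mf, mp)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def project(point: Point, axis: Point) -> float:
    """Scalar projection of a point onto a unit axis."""
    return sum(a * b for a, b in zip(point, axis))


# Invented 2D coordinates, standing in for a UMAP projection.
past: Dict[str, Point] = {"worked": [-1.0, 0.2], "ate": [-0.8, -0.3]}
future: Dict[str, Point] = {"will work": [1.1, 0.1], "will eat": [0.9, -0.4]}

axis = tense_axis(list(past.values()), list(future.values()))
past_proj = {w: project(p, axis) for w, p in past.items()}
future_proj = {w: project(p, axis) for w, p in future.items()}

# Separable along the axis if every past form projects below every future form.
separable = max(past_proj.values()) < min(future_proj.values())
print(separable)
```

A failed check on real embeddings would point back at the corpus, the tokenizer, or the training run, in line with the diagnostic role described above.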
What you build here
- A working understanding of why dense embeddings beat one-hot for downstream learning.
- A hand-built `Embedding` module using the Phase 7/8 autograd.
- A trained CBOW embedding table on the Phase 12 corpus.
- A 2D UMAP projection + cosine-similarity heatmap that show the geometry.
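At its core, an embedding module is a row lookup, which gives the same result as multiplying a one-hot vector by the table; its gradient adds each upstream gradient row back into the row that was looked up. The sketch below illustrates both with plain Python lists. The Phase 7/8 autograd API is not shown on this page, so lists stand in for its tensors, and the function names (`one_hot`, `vec_matmul`, `lookup`, `lookup_backward`) are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def one_hot(i: int, vocab_size: int) -> List[float]:
    """Vector of zeros with a single 1.0 at position i."""
    return [1.0 if j == i else 0.0 for j in range(vocab_size)]


def vec_matmul(v: List[float], m: Matrix) -> List[float]:
    """Row vector times matrix: (V,) @ (V, d) -> (d,)."""
    return [sum(v[r] * m[r][c] for r in range(len(m))) for c in range(len(m[0]))]


def lookup(table: Matrix, i: int) -> List[float]:
    """Forward pass: copy row i of the embedding table."""
    return list(table[i])


def lookup_backward(vocab_size: int, dim: int,
                    indices: List[int], grad_rows: Matrix) -> Matrix:
    """Backward pass: scatter-add each upstream gradient into its source row."""
    grad_table = [[0.0] * dim for _ in range(vocab_size)]
    for i, g in zip(indices, grad_rows):
        for c in range(dim):
            grad_table[i][c] += g[c]  # repeated indices accumulate
    return grad_table


E = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # toy table: 3 tokens, d = 2

# Lookup and one-hot matmul agree.
same = lookup(E, 2) == vec_matmul(one_hot(2, 3), E)

# Token 2 was looked up twice, so its gradient row is the sum of both.
grad_E = lookup_backward(3, 2, [2, 0, 2], [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
print(same, grad_E)
```

The lookup form avoids building a vocabulary-sized vector per token, which is why it is the usual way to implement the forward pass even though the matmul form explains the gradient.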
What this phase does NOT cover
- Transformer self-attention. Phase 15. Static embeddings here; contextual representations later.
- Subword pooling for variable-length sequences. Phase 15 / Phase 17.
- `transformers` library embeddings (`AutoModel.embeddings`). Forbidden by `CLAUDE.md` §0.4 until Phase 24.
- Pretrained embeddings (GloVe, fastText). Out of scope: we train from our own corpus.
- Sentence embeddings / sentence-BERT. Phase 29 (RAG / retrieval).
- Vector databases. Phase 29.
- Negative sampling, hierarchical softmax. Mentioned in theory; not implemented (our vocab is small enough for full softmax).
Phase 13's scope is hand-built static token embeddings trained on the Phase 12 verb-form corpus, with a visualization that proves the corpus shaped the geometry. Nothing more.
Read order
- `theory/00-motivation.md`: why dense > one-hot for any downstream learning.
- `theory/01-embedding-as-lookup.md`: `E[i] = one_hot(i) @ E`. The lookup-is-matmul identity.
- `theory/02-cbow-skipgram.md`: Word2Vec family at the survey level + the CBOW objective we implement.
- `theory/03-similarity-and-visualization.md`: cosine vs Euclidean; PCA vs UMAP; what to read into 2D projections (and what not to).
- `lab/00-implement-embedding-module.md`: write `src/minimodel/embedding.py`. Forward + gradient + save/load.
- `lab/01-train-cbow.md`: train tiny embeddings on the Phase 12 corpus. Hyperparams: `d = 32, window = 4, epochs = 20`.
- `lab/02-visualize-and-probe.md`: UMAP to 2D, cosine-vs-Euclidean comparison, tense-axis probing.
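For orientation before the CBOW chapters, here is a minimal sketch of one CBOW training step with a full softmax: average the context embeddings, score every vocabulary entry, and take a gradient step on the negative log-probability of the centre token. It uses plain Python lists rather than the course autograd, a separate output matrix `W_out`, a six-token toy vocabulary, and four context ids; all of these, along with the names `cbow_step` and `init_matrix`, are assumptions for illustration. Only `d = 32` is taken from the lab hyperparameters.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def init_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small uniform random initialisation."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def cbow_step(E_in: Matrix, W_out: Matrix, context: List[int],
              target: int, lr: float) -> float:
    """One SGD step on -log p(target | context); returns the pre-update loss."""
    d, vocab = len(E_in[0]), len(W_out)
    # Hidden vector: mean of the context tokens' input embeddings.
    h = [sum(E_in[c][k] for c in context) / len(context) for k in range(d)]
    # Logits over the whole vocabulary, then a numerically stable softmax.
    logits = [sum(W_out[v][k] * h[k] for k in range(d)) for v in range(vocab)]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[target])
    # Gradient of the loss w.r.t. the logits: probs - one_hot(target).
    dlogits = probs[:]
    dlogits[target] -= 1.0
    # Gradient w.r.t. h, computed before W_out is modified.
    dh = [sum(dlogits[v] * W_out[v][k] for v in range(vocab)) for k in range(d)]
    for v in range(vocab):
        for k in range(d):
            W_out[v][k] -= lr * dlogits[v] * h[k]
    # The mean spreads dh equally over the context rows.
    for c in context:
        for k in range(d):
            E_in[c][k] -= lr * dh[k] / len(context)
    return loss


rng = random.Random(0)
VOCAB, D = 6, 32  # toy vocabulary size (assumed); d = 32 as in lab/01
E_in = init_matrix(VOCAB, D, rng)
W_out = init_matrix(VOCAB, D, rng)

# Repeating one (context, target) pair should drive its loss down.
losses = [cbow_step(E_in, W_out, [0, 1, 3, 4], 2, lr=0.5) for _ in range(50)]
print(losses[0], losses[-1])
```

The full softmax loops over every vocabulary entry on each step, which is affordable only because the vocabulary here is small.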
solutions/ is empty during pre-write — populated at phase open.
Definition of Done
Briefly (see `PHASE_13_PLAN.md` at repo root for the full list):
- `src/minimodel/embedding.py` is `mypy --strict` clean; tests cover lookup, gradient flow, save/load.
- Trained embeddings on the Phase 12 corpus committed as `.npy` + JSON metadata.
- 2D UMAP scatter shows interpretable clustering by tense / verb / language.
- Cosine and Euclidean rankings of `work`'s top-10 neighbors differ; the comparison is committed.
- Cosine-similarity heatmap of the 20 verbs' infinitive forms committed.
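The cosine-versus-Euclidean item rests on the fact that cosine similarity ignores vector length while Euclidean distance does not, so a long vector pointing the same way as the query can be the nearest neighbor under one metric and the farthest under the other. The sketch below shows this on invented 2D vectors (not trained embeddings); the function names are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Vector, b: Vector) -> float:
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Invented vectors: "works" points almost the same way as the query but is long.
query: Vector = [1.0, 0.0]  # stands in for "work"
table: Dict[str, Vector] = {
    "works": [3.0, 0.3],
    "eat": [0.8, 0.6],
    "ate": [0.0, 1.0],
}

# Highest cosine similarity first.
cosine_rank = sorted(table, key=lambda w: cosine_similarity(query, table[w]),
                     reverse=True)
# Smallest Euclidean distance first.
euclid_rank = sorted(table, key=lambda w: euclidean_distance(query, table[w]))

print(cosine_rank)
print(euclid_rank)
```

Here `works` ranks first by cosine and last by Euclidean distance. If the two rankings never differed on the trained table, that would suggest the embedding norms are nearly uniform.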
Next: theory/00-motivation.md
Further reading
Optional — enrichment, not required to pass the phase.
- 📄 Efficient Estimation of Word Representations in Vector Space (word2vec). Mikolov et al. · 2013. Where the geometry of meaning began.
- 📄 GloVe: Global Vectors for Word Representation. Pennington, Socher, Manning · 2014. The count-based counterpart to word2vec.