Phase 13 — Embeddings & Representation Spaces

Requires: 11 — Tokenization Theory + BPE Implementation · 12 — The Corpus: Designing the Microscopic Dataset

Teaches: embeddings · cbow · cosine-similarity · dimensionality · representation-geometry

Jump to any chapter from the phase reference index.

Chapter map

Pre-written per A12. Theory and lab statements are stable drafts; solutions are written just-in-time at phase open.

"Meaning is geometry." Embeddings are learned coordinates: tokens that appear in similar contexts end up close together in the space. Here we build them by hand on top of minigrad/minitorch and train them with CBOW on the Phase 12 verb corpus, with no transformers.

Anchors: LYNX_CORTEX.md §4 / PHASE 13, LYNX_CORTEX_ADDENDUM.md §A13 (English verb grammar scope), §A12 (pre-write).


Goal

Borja writes a hand-built embedding lookup module and trains tiny CBOW-style embeddings on the Phase 12 verb-grammar corpus (~600 forms: 20 verbs × 5 tenses × 3 persons × 2 languages, English and Spanish). The phase headline artefact is a committed 2D UMAP scatter of the trained vocab in which:

  • All five tense forms of the same verb cluster together (e.g., work, works, worked (past), worked (past participle), will work form a small cloud, distinct from eat, eats, ate, eaten, will eat).
  • English and Spanish paired forms sit near each other (work / trabajar, worked / trabajó).
  • A tense axis emerges — past forms separable from future forms by a direction in the 2D projection.

If that geometry doesn't emerge, the corpus, the tokenizer, or the training has a bug — the visualization is a diagnostic, not eye-candy.
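
One way to make the "tense axis" check concrete is to project the 2D points onto the difference between the group means and see whether the two groups fall on opposite sides. The sketch below is only an illustration under stated assumptions: the coordinates are made-up stand-ins for projected embeddings, and the difference-of-means direction is one simple candidate, not necessarily the probe used in the lab.

```python
import numpy as np

# Hypothetical 2D coordinates standing in for projected embeddings.
# Real values would come from the trained table after projection.
past = np.array([[-1.0, 0.2], [-0.8, -0.1], [-1.2, 0.0]])
future = np.array([[0.9, 0.1], [1.1, -0.2], [1.0, 0.3]])

# Candidate tense direction: vector from the past centroid to the future centroid.
direction = future.mean(axis=0) - past.mean(axis=0)
direction = direction / np.linalg.norm(direction)

# Scalar position of each point along that direction.
past_scores = past @ direction
future_scores = future @ direction

# If every past score is below every future score, this direction separates
# the groups. A False result only rules out this particular direction.
separable = bool(past_scores.max() < future_scores.min())
print(separable)
```

If the check fails on real embeddings, that is a prompt to inspect the corpus, tokenizer, and training run, in line with the diagnostic role described above.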

What you build here

  • A working understanding of why dense embeddings beat one-hot for downstream learning.
  • A hand-built Embedding module using the Phase 7/8 autograd.
  • A trained CBOW embedding table on the Phase 12 corpus.
  • A 2D UMAP projection + cosine-similarity heatmap that show the geometry.
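
As a rough illustration of what a hand-built embedding lookup has to do, the NumPy sketch below checks that indexing a row of the table equals multiplying a one-hot vector by the table, and shows the matching backward step as a scatter-add. It deliberately uses NumPy rather than the minigrad/minitorch API, which is not shown here. The sizes and the names `E`, `one_hot`, and `grad_E` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4                    # toy sizes, not the phase's real ones
E = rng.normal(size=(vocab_size, dim))    # embedding table, one row per token id


def one_hot(i: int, n: int) -> np.ndarray:
    """Return a length-n vector with 1.0 at position i and 0.0 elsewhere."""
    v = np.zeros(n)
    v[i] = 1.0
    return v


token_id = 3
via_lookup = E[token_id]                         # plain row indexing
via_matmul = one_hot(token_id, vocab_size) @ E   # one-hot vector times the table
lookup_matches_matmul = bool(np.allclose(via_lookup, via_matmul))

# Backward pass for a batch of ids: each upstream gradient row is added to
# the table row that was looked up. np.add.at accumulates repeated ids,
# so token 3 (looked up twice) receives two contributions.
ids = np.array([3, 1, 3])
upstream = np.ones((len(ids), dim))
grad_E = np.zeros_like(E)
np.add.at(grad_E, ids, upstream)
```

The scatter-add matters because a plain `grad_E[ids] += upstream` would apply only one update for a repeated id, silently dropping gradient.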

What this phase does NOT cover

  • Transformer self-attention. Phase 15. Static embeddings here; contextual representations later.
  • Subword pooling for variable-length sequences. Phase 15 / Phase 17.
  • transformers library embeddings (AutoModel.embeddings). Forbidden by CLAUDE.md §0.4 until Phase 24.
  • Pretrained embeddings (GloVe, fastText). Out of scope — we train from our own corpus.
  • Sentence embeddings / sentence-BERT. Phase 29 (RAG / retrieval).
  • Vector databases. Phase 29.
  • Negative sampling, hierarchical softmax. Mentioned in theory; not implemented (our vocab is small enough for full softmax).

Phase 13's scope is hand-built static token embeddings trained on the Phase 12 verb-form corpus, with a visualization that proves the corpus shaped the geometry. Nothing more.

Read order

  1. theory/00-motivation.md — why dense > one-hot for any downstream learning.
  2. theory/01-embedding-as-lookup.md — E[i] = one_hot(i) @ E. The lookup-is-matmul identity.
  3. theory/02-cbow-skipgram.md — Word2Vec family at the survey level + the CBOW objective we implement.
  4. theory/03-similarity-and-visualization.md — cosine vs Euclidean; PCA vs UMAP; what to read into 2D projections (and what not to).
  5. lab/00-implement-embedding-module.md — write src/minimodel/embedding.py. Forward + gradient + save/load.
  6. lab/01-train-cbow.md — train tiny embeddings on the Phase 12 corpus. Hyperparams: d = 32, window = 4, epochs = 20.
  7. lab/02-visualize-and-probe.md — UMAP to 2D, cosine-vs-Euclidean comparison, tense-axis probing.
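
To give a feel for the CBOW objective named in the read order, here is a minimal NumPy sketch of building (context, target) pairs and computing a full-softmax loss for one pair. It rests on several assumptions: `window` is read as the number of tokens on each side of the target, the names `cbow_pairs`, `cbow_loss`, `E_in`, and `W_out` are hypothetical, and the token ids are toy values rather than the Phase 12 corpus. Only d = 32 is taken from the lab description.

```python
import numpy as np


def cbow_pairs(ids: list[int], window: int) -> list[tuple[list[int], int]]:
    """Build (context ids, target id) pairs; window counts tokens per side."""
    pairs = []
    for pos, target in enumerate(ids):
        left = ids[max(0, pos - window):pos]
        right = ids[pos + 1:pos + 1 + window]
        context = left + right
        if context:                      # skip targets with no neighbours
            pairs.append((context, target))
    return pairs


def cbow_loss(E_in: np.ndarray, W_out: np.ndarray,
              context: list[int], target: int) -> float:
    """Cross-entropy of the target under a full softmax over the vocab."""
    hidden = E_in[context].mean(axis=0)          # average the context embeddings
    logits = hidden @ W_out                      # one score per vocab entry
    logits = logits - logits.max()               # stabilise the exponentials
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[target])


rng = np.random.default_rng(0)
vocab_size, d = 5, 32                            # d = 32 as in the lab; vocab is a toy
E_in = rng.normal(scale=0.1, size=(vocab_size, d))
W_out = np.zeros((d, vocab_size))                # zero output weights: uniform prediction

pairs = cbow_pairs([0, 1, 2, 3, 4], window=1)
context, target = pairs[1]
loss = cbow_loss(E_in, W_out, context, target)
```

With zero output weights every token is equally likely, so the loss equals log(vocab_size), which is a handy sanity check before training starts.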

solutions/ is empty during pre-write — populated at phase open.

Definition of Done

Briefly (see PHASE_13_PLAN.md at the repo root for the full list):

  • src/minimodel/embedding.py passes mypy --strict; tests cover lookup, gradient flow, save/load.
  • Trained embeddings on the Phase 12 corpus committed as .npy + JSON metadata.
  • 2D UMAP scatter shows interpretable clustering by tense / verb / language.
  • Cosine and Euclidean rankings of work's top-10 neighbors differ — committed comparison.
  • Cosine-similarity heatmap of the 20 verbs' infinitive forms committed.
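
The requirement that cosine and Euclidean rankings differ can be seen on a toy case. The sketch below uses three made-up 2D vectors (the labels "a", "b", "c" are hypothetical and are not trained embeddings) to show how a long vector pointing the same way as the query ranks first by cosine similarity but last by Euclidean distance.

```python
import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v; ignores vector length."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def euclidean_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Straight-line distance between u and v; sensitive to vector length."""
    return float(np.linalg.norm(u - v))


query = np.array([1.0, 0.0])
candidates = {
    "a": np.array([10.0, 0.0]),   # same direction as the query, much longer
    "b": np.array([1.0, 1.0]),    # 45 degrees away, close in space
    "c": np.array([0.0, 1.0]),    # orthogonal to the query
}

# Highest cosine similarity first.
by_cosine = sorted(
    candidates,
    key=lambda k: cosine_similarity(query, candidates[k]),
    reverse=True,
)
# Smallest Euclidean distance first.
by_euclidean = sorted(
    candidates,
    key=lambda k: euclidean_distance(query, candidates[k]),
)
```

On trained embeddings the same effect can appear when vector norms vary across tokens, which may be worth checking when comparing the two top-10 lists.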

Next: theory/00-motivation.md

Further reading

Optional — enrichment, not required to pass the phase.