Lab 02 — Visualize and probe the trained embeddings

Read theory/03-similarity-and-visualization.md. Do not consult solutions/.

Objective

Project the trained embeddings to 2D with UMAP and produce a scatter plot that shows clustering by tense, verb, and language. Build a cosine-similarity heatmap of the 20 verbs' infinitive forms. Probe for a linear "tense direction" in the embedding space. These visualizations are the diagnostic: if they show no structure, suspect a bug in the corpus or in training.

Setup

The saved Embedding from Lab 01.
The Phase 12 tokenizer (to map ids ↔ verb-form strings).
Libraries: umap-learn (added in pyproject.toml), matplotlib, numpy.

Tasks

Task 1 — UMAP projection

import umap

emb = Embedding.load(path)
E_matrix = emb.E.value                          # (V, d) = (64, 32)
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
E_2d = reducer.fit_transform(E_matrix)          # (V, 2)

Save the projection as experiments/<date>-phase-13-visualize/umap.npy for reuse.
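
A minimal sketch of the save step, assuming NumPy's `.npy` format. The helper names `save_projection` and `load_projection` and the `run_dir` argument are illustrative, not part of the lab code; pass your own `experiments/<date>-phase-13-visualize/` directory as `run_dir`.

```python
from pathlib import Path

import numpy as np


def save_projection(E_2d, run_dir):
    """Write the (V, 2) projection to <run_dir>/umap.npy and return the path."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)   # create the experiment folder if missing
    out_path = run_dir / "umap.npy"
    np.save(out_path, np.asarray(E_2d, dtype=np.float32))
    return out_path


def load_projection(run_dir):
    """Reload a saved projection so later tasks can skip re-running UMAP."""
    return np.load(Path(run_dir) / "umap.npy")
```

Reloading the saved array in Task 2 keeps the scatter plot tied to one fixed layout while you iterate on colors and labels.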

Task 2 — annotated scatter plot

Build a scatter plot where:

Each point is a vocabulary token.
Color encodes verb identity (one color per verb, with a separate "other" color for non-verb tokens).
Marker shape encodes tense (infinitive = circle, present = triangle, past = square, past_participle = diamond, future = star). Non-verb tokens use 'x'.
Each point is labelled with the token string (or a small subset of them for readability).

Save as experiments/<date>-phase-13-visualize/umap_scatter.png (high resolution, at least 1200×900).
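
One possible way to organise the plot, sketched under assumptions: `token_info` is a hypothetical list of `(verb, tense)` pairs that you build yourself from the tokenizer vocabulary (with `(None, None)` for non-verb tokens), and the function names are illustrative. The marker mapping follows the list above; the `tab20` colormap and gray for "other" are arbitrary choices.

```python
# Marker per tense, following the mapping in the task description.
TENSE_MARKERS = {
    "infinitive": "o",        # circle
    "present": "^",           # triangle
    "past": "s",              # square
    "past_participle": "D",   # diamond
    "future": "*",            # star
}
OTHER_MARKER = "x"            # non-verb tokens


def point_styles(token_info, verbs):
    """Return one (color_index, marker) pair per token.

    token_info: list of (verb, tense) pairs, or (None, None) for non-verb tokens.
    Non-verb tokens get color_index == len(verbs), the "other" slot.
    """
    verb_index = {v: i for i, v in enumerate(verbs)}
    styles = []
    for verb, tense in token_info:
        if verb is None:
            styles.append((len(verbs), OTHER_MARKER))
        else:
            styles.append((verb_index[verb], TENSE_MARKERS[tense]))
    return styles


def plot_scatter(E_2d, labels, styles, n_verbs, out_path):
    """Draw the annotated scatter and save it at 1800x1350 pixels."""
    import matplotlib
    matplotlib.use("Agg")                 # render to file, no display needed
    import matplotlib.pyplot as plt

    cmap = plt.get_cmap("tab20")          # 20 distinct colors, one per verb
    fig, ax = plt.subplots(figsize=(12, 9))
    for (x, y), label, (color_idx, marker) in zip(E_2d, labels, styles):
        color = "gray" if color_idx == n_verbs else cmap(color_idx % 20)
        ax.scatter(x, y, color=color, marker=marker, s=60)
        ax.annotate(label, (x, y), fontsize=7,
                    xytext=(3, 3), textcoords="offset points")
    fig.savefig(out_path, dpi=150)        # 12x9 in at 150 dpi exceeds 1200x900
    plt.close(fig)
```

Keeping the style assignment separate from the drawing code makes it easy to check the tense and verb mapping before rendering anything.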

Expected pattern:

1. The 5 tense forms of each verb sit near each other (a small cluster per verb).
2. English and Spanish forms of the same verb sit near each other (e.g., work near trabajar).
3. A tense axis is visible: all past forms are shifted in one direction relative to their infinitives.
4. Non-verb tokens (punctuation, pronouns, articles) form their own cluster, separate from the verbs.

If you don't see (1), the CBOW training didn't capture verb identity, probably because of too few epochs or a badly set learning rate. If you don't see (2), the corpus templates didn't make EN/ES forms interchangeable enough. Document either case if you hit it.

Task 3 — cosine-similarity heatmap

Compute the pairwise cosine similarity matrix for the 20 verbs' infinitive forms only (so we're comparing verb identity, not tense-mixed).

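import numpy as np

# First token id of each infinitive (assumes each verb form encodes to a single token).
infinitives = [tokenizer.encode(v)[0] for v in twenty_verbs]
E_inf = E_matrix[infinitives]                    # (20, 32)
norms = np.linalg.norm(E_inf, axis=1, keepdims=True)   # (20, 1) row norms
E_norm = E_inf / norms                           # unit-length rows
cos_sim = E_norm @ E_norm.T                      # (20, 20)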

Plot as a heatmap with the verbs as labels on both axes. Use a diverging colormap centered at 0. Save as experiments/<date>-phase-13-visualize/cosine_heatmap.png.

Expected: the diagonal is 1.0 (self-similarity). Off-diagonal entries are mostly positive (the embeddings lie mostly in the same hemisphere). Some pairs of verbs should be visibly closer than others: e.g., walk / work / play (all simple actions that appear in similar templates) should score higher with each other than walk / be (very different verbs).
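
The expectations above can be checked numerically before plotting. The sketch below is one way to do it, with illustrative function names; `RdBu_r` is just one diverging colormap, and fixing `vmin=-1`, `vmax=1` is what pins its midpoint to 0.

```python
import numpy as np


def check_cosine_matrix(cos_sim, atol=1e-6):
    """Sanity-check a cosine-similarity matrix before plotting it."""
    cos_sim = np.asarray(cos_sim)
    off_diag = cos_sim[~np.eye(len(cos_sim), dtype=bool)]
    return {
        "symmetric": bool(np.allclose(cos_sim, cos_sim.T, atol=atol)),
        "unit_diagonal": bool(np.allclose(np.diag(cos_sim), 1.0, atol=atol)),
        "frac_positive_off_diag": float((off_diag > 0).mean()),
    }


def plot_heatmap(cos_sim, verb_labels, out_path):
    """Heatmap with verbs on both axes and a colormap centered at 0."""
    import matplotlib
    matplotlib.use("Agg")                 # render to file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 9))
    # Symmetric limits put the colormap's neutral midpoint exactly at 0.
    im = ax.imshow(cos_sim, cmap="RdBu_r", vmin=-1.0, vmax=1.0)
    ticks = np.arange(len(verb_labels))
    ax.set_xticks(ticks)
    ax.set_xticklabels(verb_labels, rotation=90)
    ax.set_yticks(ticks)
    ax.set_yticklabels(verb_labels)
    fig.colorbar(im, ax=ax, label="cosine similarity")
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
```

A `frac_positive_off_diag` well below 1.0 is worth a line in your notes, since it contradicts the same-hemisphere expectation.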

Task 4 — cosine vs Euclidean: top-10 neighbors of `work`

Compute the top-10 nearest neighbors of work (the infinitive) under both metrics. Tabulate:

Rank	Cosine NN	Cosine sim	Euclidean NN	Euclidean dist
1	walk	0.78	.	0.43
2	talk	0.74	,	0.51
...	...	...	...	...

Save as experiments/<date>-phase-13-visualize/work_neighbors.csv.

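Expected: the two neighbor lists differ significantly. Cosine returns verbs that appear in similar contexts. Euclidean distance also depends on vector norm, and norm correlates with token frequency (the norm-frequency confound), so its list is dominated by high-frequency tokens such as punctuation and articles.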

Write 1-2 sentences in your lab notes explaining the difference.
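
A sketch of the neighbor computation and the CSV export, assuming `E` is the `(V, d)` embedding matrix and `id_to_token` is any mapping from token id to string that you build from the tokenizer (the name is hypothetical). The column headers mirror the table above.

```python
import csv

import numpy as np


def top_k_neighbors(E, query_id, k=10):
    """Return (cosine, euclidean) neighbor lists for one token, excluding itself.

    Each list holds (token_id, score) pairs: similarity descending for cosine,
    distance ascending for Euclidean.
    """
    E = np.asarray(E, dtype=np.float64)
    q = E[query_id]
    norms = np.linalg.norm(E, axis=1)
    cos = (E @ q) / np.maximum(norms * np.linalg.norm(q), 1e-12)
    dist = np.linalg.norm(E - q, axis=1)
    cos[query_id] = -np.inf      # never return the query as its own neighbor
    dist[query_id] = np.inf
    cos_order = np.argsort(-cos)[:k]
    euc_order = np.argsort(dist)[:k]
    return ([(int(i), float(cos[i])) for i in cos_order],
            [(int(i), float(dist[i])) for i in euc_order])


def write_neighbor_csv(path, cos_nn, euc_nn, id_to_token):
    """Write the rank table using the column layout shown above."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Rank", "Cosine NN", "Cosine sim",
                         "Euclidean NN", "Euclidean dist"])
        for rank, ((ci, cs), (ei, ed)) in enumerate(zip(cos_nn, euc_nn), start=1):
            writer.writerow([rank, id_to_token[ci], f"{cs:.2f}",
                             id_to_token[ei], f"{ed:.2f}"])
```

A three-vector toy case shows why the lists can disagree: for the query `[1, 0]`, the vector `[10, 0]` points the same way (cosine 1.0) but is 9 units away, while `[0.9, 0.9]` has a lower cosine yet is the closest in Euclidean terms.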

Task 5 — linear probe for tense direction

This is the most rigorous test. Procedure:

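import numpy as np

# encode() returns a list of ids (see Task 3); take the first so each list holds plain ints.
past_ids        = [tokenizer.encode(f)[0] for f in all_past_forms]
future_ids      = [tokenizer.encode(f)[0] for f in all_future_forms]
infinitive_ids  = [tokenizer.encode(v)[0] for v in twenty_verbs]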

mu_past   = E_matrix[past_ids].mean(axis=0)
mu_future = E_matrix[future_ids].mean(axis=0)
direction = mu_past - mu_future
direction = direction / np.linalg.norm(direction)   # normalise

# Project each verb-form token onto the direction.
projections = E_matrix @ direction                  # (V,)

# Sort tokens by projection; check that past tokens cluster at one end,
# future at the other, and infinitives in the middle.

Plot a 1D scatter of projections, colored by tense. Save as experiments/<date>-phase-13-visualize/tense_probe.png.

Pass: the means of the past, infinitive, and future groups are visibly separated, with past at one extreme, future at the other, infinitive between.

Fail: the groups overlap completely. This would suggest the embedding learned per-verb identity but not tense. Document, and note that Phase 18's training of the full mini-GPT will likely fix this.
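
The pass/fail judgment can be backed by numbers alongside the plot. The sketch below uses illustrative helper names and assumes `projections` is the `(V,)` array from the procedure above and `groups` maps tense names to lists of token ids. The ordering check relies on `direction = mu_past - mu_future`, so past tokens project highest.

```python
import numpy as np


def tense_group_means(projections, groups):
    """Mean projection per tense group. groups: dict name -> list of token ids."""
    projections = np.asarray(projections)
    return {name: float(projections[np.asarray(ids)].mean())
            for name, ids in groups.items()}


def probe_passes(means):
    """Ordering part of the pass criterion: past > infinitive > future."""
    return means["past"] > means["infinitive"] > means["future"]


def pairwise_separation(projections, high_ids, low_ids):
    """Fraction of (high, low) token pairs in the expected order.

    1.0 means the two groups do not overlap at all; 0.5 is chance level.
    """
    p = np.asarray(projections)
    high = p[np.asarray(high_ids)][:, None]
    low = p[np.asarray(low_ids)][None, :]
    return float((high > low).mean())
```

Because the direction is built from the past and future means, those two means are separated by construction. The informative parts are where the infinitive mean lands and how much the individual tokens overlap, which is what `pairwise_separation` is meant to summarise.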

Task 6 — short discussion in lab notes

In learners/borja/phase-13/notes.md, 2-3 paragraphs:

What structure did you find? (Quote specific clusters or directions.)
What surprised you? (E.g., did be cluster with the other verbs, or with the punctuation? Why?)
What would you change about the corpus or training if you ran this again?

This note becomes part of PHASE_13_REPORT.md.

Measurements to capture

UMAP projection saved.
Scatter plot saved.
Cosine heatmap saved.
Cosine vs Euclidean neighbor table saved.
Tense-probe 1D scatter saved.
Lab notes written.

Acceptance

All five artifacts saved (projection, scatter plot, heatmap, neighbor table, tense probe).
At least one verb's five tense forms visibly cluster in the UMAP scatter.
Cosine and Euclidean neighbor lists differ meaningfully.
Tense projection separates past / infinitive / future means.
Notes written.

Pitfalls to expect

UMAP non-determinism. Different random_state values give different layouts. Pin random_state=42 for reproducibility. Don't expect your layout to match the reference solution's plot exactly; only the clustering structure should match.
Small-sample UMAP failures. With only 64 tokens, UMAP can produce strange layouts (over-clustered or over-spread). If n_neighbors=15 looks bad, try n_neighbors=10, then n_neighbors=5. Document any tuning.
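Cosine similarity on near-zero-norm vectors. If a token never appears in training (some special tokens), its embedding stays at its initialization scale and may have an unusually small norm. Cosine similarity is still defined as long as the norm is nonzero, but the vector's direction is then just the random initialization, so its similarities carry no signal. An exactly zero norm produces NaN, so check the matrix for NaN before plotting.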
Probing as confirmation bias. The "tense direction" probe counts as evidence only if the direction is fixed before you look at the projection. Don't pick the direction by eyeballing the UMAP and then claim it as evidence. The procedure in Task 5 computes the direction from labels (tense membership), not from inspection; that is what makes it rigorous.
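
For the zero-norm pitfall above, one defensive option is to normalise rows with a guard and report which tokens were degenerate. This is a sketch, not the lab's required approach; `safe_unit_rows` and the `eps` threshold are illustrative choices.

```python
import numpy as np


def safe_unit_rows(E, eps=1e-8):
    """L2-normalise rows; rows with norm below eps become zeros instead of NaN.

    Returns the normalised matrix and a boolean mask of the degenerate rows.
    """
    E = np.asarray(E, dtype=np.float64)
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    degenerate = norms[:, 0] < eps
    safe_norms = np.where(norms < eps, 1.0, norms)   # avoid dividing by ~0
    E_unit = E / safe_norms
    E_unit[degenerate] = 0.0                         # similarity 0 with everything
    return E_unit, degenerate
```

Listing the tokens flagged by the mask in your notes makes it clear which rows of the heatmap or neighbor table should not be interpreted.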

Next: Phase 14 — Pre-Transformer Sequence Models (after /quiz 13 and /phase-report 13).