
03 — Similarity (cosine vs Euclidean) and visualization (PCA vs UMAP)¶

An embedding without a metric means nothing: "close" in what sense? This page explains why cosine beats Euclidean for trained embeddings, why UMAP beats PCA for visualization, and what can and cannot be read from a 2D projection.

Cosine vs Euclidean¶

Given two vectors \(u, v \in \mathbb{R}^d\), there are two natural notions of "closeness":

Euclidean distance¶

\[d_\text{euc}(u, v) = \|u - v\|_2 = \sqrt{\sum_{k=1}^d (u_k - v_k)^2}\]

Cosine similarity¶

\[\cos(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|} \in [-1, 1]\]

Cosine similarity is the cosine of the angle between \(u\) and \(v\), so it is scale-invariant: multiplying \(u\) by 100 leaves \(\cos(u, v)\) unchanged. Euclidean distance is not: the same scaling changes \(d_\text{euc}(u, v)\) dramatically.
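The scale-invariance claim is easy to check numerically. The following is a minimal sketch using only the standard library; the vectors `u` and `v` are made-up toy values, not trained embeddings, and the helper names are illustrative.

```python
import math

def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors chosen for illustration only.
u = [1.0, 2.0, 3.0]
v = [2.0, 1.0, 0.5]
u_scaled = [100.0 * a for a in u]  # same direction, 100x the norm

cos_before, cos_after = cosine(u, v), cosine(u_scaled, v)
euc_before, euc_after = euclidean(u, v), euclidean(u_scaled, v)

print(f"cosine:    {cos_before:.4f} -> {cos_after:.4f}")
print(f"euclidean: {euc_before:.4f} -> {euc_after:.4f}")
```

The cosine value should be the same before and after scaling (up to floating-point rounding), while the Euclidean distance grows by roughly two orders of magnitude.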

Why cosine usually wins for embeddings¶

After training, the embedding norms \(\|E[i]\|\) tend to correlate with token frequency. High-frequency tokens are updated many times, so they tend to drift toward a shared central direction of the embedding distribution and develop large norms. Low-frequency tokens stay near their initialisation, with small norms.

If we use Euclidean distance to find "similar" tokens:

Two high-frequency tokens that share no meaningful similarity (the, a) will be close because they both have large norm and point in vaguely the central direction.
A high-frequency token and a semantically-similar low-frequency token will appear far apart because of the norm mismatch.

Cosine factors out norm, so it measures direction in embedding space — which is what trained embeddings encode meaningfully.

Lab evidence¶

Lab 02 will demonstrate this concretely: for the verb work, find the top-10 nearest neighbors under each metric and inspect:

Euclidean top-10: likely dominated by high-frequency function words (the, ,, .).
Cosine top-10: should give verbs in similar contexts (walk, talk, study).

If the two lists don't differ, your training hasn't injected meaningful direction into the geometry yet.
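The comparison described above can be sketched as follows. This is not the lab's code: the `toy_embeddings` table is a hand-built, hypothetical 2-d example in which the rare verbs have small norms and the function words have large norms, and the `nearest` helper is an illustrative name. It only shows the mechanics of ranking the same vocabulary under both metrics.

```python
import math

def norm(u):
    return math.sqrt(sum(a * a for a in u))

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical vectors: verbs share a direction, function words have large norms.
toy_embeddings = {
    "work": [2.0, 2.0],
    "walk": [0.5, 0.6],
    "talk": [0.6, 0.5],
    "the": [3.0, 1.0],
    ",": [3.0, 0.5],
}

def nearest(query, embeddings, k, metric):
    """Return the k tokens closest to `query` under the chosen metric."""
    q = embeddings[query]
    others = [t for t in embeddings if t != query]
    if metric == "cosine":
        # Higher similarity means closer, so sort in descending order.
        ranked = sorted(others, key=lambda t: cosine(q, embeddings[t]), reverse=True)
    elif metric == "euclidean":
        # Smaller distance means closer, so sort in ascending order.
        ranked = sorted(others, key=lambda t: euclidean(q, embeddings[t]))
    else:
        raise ValueError(f"unknown metric: {metric}")
    return ranked[:k]

cosine_top = nearest("work", toy_embeddings, 2, "cosine")
euclidean_top = nearest("work", toy_embeddings, 2, "euclidean")
print("cosine:   ", cosine_top)
print("euclidean:", euclidean_top)
```

On this constructed data the two rankings disagree: cosine picks the verbs that point the same way as `work`, while Euclidean picks the large-norm function words. Real trained embeddings may or may not behave this cleanly, which is exactly what the lab checks.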

When Euclidean is the right choice¶

Euclidean is a sound choice when the representation has a fixed norm (e.g., L2-normalised activations). For unit vectors, \(\|u - v\|^2 = 2 - 2\cos(u, v)\), so \(\cos(u, v) = 1 - \tfrac{1}{2}\, d_\text{euc}(u, v)^2\) and the two metrics rank neighbors identically. In that case Euclidean is equivalent, and you can use whichever is cheaper in your tooling. For raw embeddings, cosine is the default.
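For unit-norm vectors the relationship is \(\cos(u, v) = 1 - \tfrac{1}{2}\|u - v\|^2\), which follows from expanding \(\|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2\,u \cdot v\) with \(\|u\| = \|v\| = 1\). A quick numeric check, as a sketch on random vectors (the seed and the helper name `normalize` are arbitrary choices for illustration):

```python
import math
import random

def normalize(u):
    """Rescale a vector to unit L2 norm."""
    length = math.sqrt(sum(a * a for a in u))
    return [a / length for a in u]

rng = random.Random(0)  # fixed seed so the check is repeatable
d = 32                  # same dimensionality as the embeddings discussed here
u = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])
v = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])

cos_uv = sum(a * b for a, b in zip(u, v))        # unit vectors: cosine is the dot product
sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))  # squared Euclidean distance

identity_gap = abs(cos_uv - (1.0 - sq_dist / 2.0))
print(f"cos = {cos_uv:.6f}, 1 - d^2/2 = {1.0 - sq_dist / 2.0:.6f}")
```

Because one quantity is a decreasing function of the other, sorting neighbors by either gives the same order once the vectors are normalised.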

PCA vs UMAP for 2D visualization¶

The trained embedding matrix is \(E \in \mathbb{R}^{V \times d}\) with \(d = 32\) in our case. We can't plot it directly; we need to project it down to 2 (or 3) dimensions for inspection. Two common reducers:

PCA — Principal Component Analysis¶

Linear projection onto the directions of maximum variance. Mathematically clean: mean-center \(E\) to get \(E_c\), solve an eigenvalue problem on \(E_c^\top E_c\) (the covariance matrix up to a constant factor), and take the top-2 eigenvectors. Pros:

Deterministic; no random seed.
Linear, so the projection is interpretable — you can write down the projection matrix.
Fast.

Cons:

Linear. If meaningful structure is non-linear (e.g., the "past-tense" direction curves through the manifold), PCA can't recover it.
Variance-maximising can be dominated by a "most-frequent-tokens" axis, which says little about meaning.
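To make the PCA recipe concrete, here is a stdlib-only sketch: mean-center the rows, build the covariance matrix, find the top two eigenvectors, and project. The eigenvectors are found by power iteration with deflation, which is a stand-in chosen to avoid dependencies; in practice you would more likely call an SVD or eigendecomposition routine from a numerical library. The input `toy_matrix` is random data with hypothetical per-column scales, not a trained embedding matrix.

```python
import math
import random

def center(rows):
    """Subtract the per-column mean so PCA measures variance, not offset."""
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    return [[r[j] - mean[j] for j in range(dim)] for r in rows]

def covariance(centered):
    """Sample covariance matrix (dim x dim) of mean-centered rows."""
    n, dim = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
            for i in range(dim)]

def matvec(mat, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in mat]

def top_eigenpair(mat, rng, iters=200):
    """Power iteration: repeatedly apply mat and renormalise."""
    vec = [rng.gauss(0.0, 1.0) for _ in mat]  # start vector; affects only the sign
    for _ in range(iters):
        nxt = matvec(mat, vec)
        length = math.sqrt(sum(x * x for x in nxt))
        vec = [x / length for x in nxt]
    eigenvalue = sum(a * b for a, b in zip(vec, matvec(mat, vec)))
    return eigenvalue, vec

def pca_2d(rows, seed=0):
    """Return 2-d coordinates and the two principal directions."""
    rng = random.Random(seed)
    centered = center(rows)
    cov = covariance(centered)
    components = []
    for _ in range(2):
        eigenvalue, vec = top_eigenpair(cov, rng)
        components.append(vec)
        # Deflation: remove the found component so the next pass finds the runner-up.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(len(vec))]
               for i in range(len(vec))]
    coords = [[sum(a * b for a, b in zip(row, comp)) for comp in components]
              for row in centered]
    return coords, components

# Toy data: 50 rows, 4 columns with decreasing spread (hypothetical values).
data_rng = random.Random(42)
scales = [3.0, 1.5, 0.5, 0.2]
toy_matrix = [[data_rng.gauss(0.0, s) for s in scales] for _ in range(50)]

coords, components = pca_2d(toy_matrix)
print("first point in 2-d:", coords[0])
print("first principal direction:", components[0])
```

Because the projection is just a dot product with each row of `components`, you can inspect those two vectors directly to see which original dimensions drive each plot axis.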

UMAP — Uniform Manifold Approximation and Projection¶

(McInnes et al. 2018.) A non-linear method based on building a fuzzy graph of nearest neighbors and optimising a low-dim layout that preserves the graph structure. Pros:

Captures non-linear structure.
Tends to produce visually compelling clusters that match human intuition about similarity.
Faster than t-SNE for moderate-sized data.

Cons:

Stochastic — different random seeds give different layouts.
Distances in a UMAP plot are not quantitatively meaningful. Don't trust a claim like "embedding A is 2× closer to B than to C in the UMAP plot": UMAP optimises local neighborhood structure, not global distance.
A black box — we use it but don't derive it.

What we use¶

We use UMAP for the headline scatter (because it visually reveals clustering), and cosine similarity in the original \(d = 32\) space for the quantitative claims (e.g., "the top-10 nearest neighbors of work are..."). The two methods are complementary: UMAP for seeing, cosine for measuring.

What can and cannot be read from a 2D projection¶

You can read:

Cluster structure. "These tokens form a tight cloud, those form another."
Categorical separation. "All past-tense forms sit on the left half of the plot."
Outliers. "This token is far from everything else — what's going on?"

You cannot read:

Pairwise distance magnitudes. The 2D distances are not the original distances. A token can be visually next to another but have a low cosine similarity in the original space.
Direction semantics. A 2D axis doesn't necessarily correspond to a meaningful direction in the original space. The horizontal axis isn't "tense"; you have to check.
Smoothness or continuity. UMAP can introduce or destroy local continuity in ways that don't match the underlying geometry.

When you present a UMAP plot, always pair it with a quantitative claim verified in the original embedding space (e.g., a cosine-similarity table).
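A deliberately extreme sketch of why the 2D picture needs a check in the original space. The "projection" here simply drops the third coordinate; it is a stand-in for any reducer, not UMAP or PCA, and the two vectors are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def project(u):
    """Toy 'reducer': keep the first two axes and discard the rest."""
    return u[:2]

# Two hypothetical 3-d vectors that differ only along the discarded axis.
a = [1.0, 0.2, 5.0]
b = [1.0, 0.2, -5.0]

dist_2d = euclidean(project(a), project(b))  # how far apart they look in the plot
cos_original = cosine(a, b)                  # how similar they are in the full space

print(f"2-d distance: {dist_2d:.3f}")
print(f"cosine in original space: {cos_original:.3f}")
```

The two points land on top of each other in 2D while pointing in nearly opposite directions in the original space, which is the failure mode a cosine-similarity table guards against.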

Probing for a "tense direction"¶

A more rigorous test than "look at the UMAP": probe whether a specific direction in embedding space corresponds to a linguistic feature. Procedure:

Compute the centroid of all past-tense forms: \(\mu_\text{past} = \frac{1}{N_\text{past}} \sum_{i \in \text{past}} E[i]\).
Compute the centroid of all future-tense forms: \(\mu_\text{future} = \frac{1}{N_\text{future}} \sum_{i \in \text{future}} E[i]\).
The direction \(v = \mu_\text{past} - \mu_\text{future}\) is the "past-vs-future" axis in the embedding space.
Project every token onto \(v\): \(E[i] \cdot v / \|v\|\).
Check that past-tense tokens get high projections, future-tense tokens get low projections, and infinitives sit in the middle.
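The procedure above can be sketched as follows. Everything in the data is hypothetical: the token names are placeholders, the 3-d vectors are hand-picked so that the first axis carries tense, and the `groups` table stands in for whatever tense labels the lab provides.

```python
import math

# Placeholder tokens with hand-picked vectors (not trained embeddings).
toy_embeddings = {
    "past_a": [1.0, 0.3, 0.1],
    "past_b": [0.9, -0.2, 0.0],
    "future_a": [-1.0, 0.2, 0.1],
    "future_b": [-0.8, -0.3, -0.1],
    "infinitive_a": [0.05, 0.4, 0.0],
    "infinitive_b": [-0.1, -0.3, 0.2],
}
groups = {
    "past": ["past_a", "past_b"],
    "future": ["future_a", "future_b"],
    "infinitive": ["infinitive_a", "infinitive_b"],
}

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Steps 1 and 2: one centroid per tense.
mu_past = centroid([toy_embeddings[t] for t in groups["past"]])
mu_future = centroid([toy_embeddings[t] for t in groups["future"]])

# Step 3: the past-vs-future direction.
direction = [p - f for p, f in zip(mu_past, mu_future)]
length = math.sqrt(sum(x * x for x in direction))

# Step 4: scalar projection of every token onto that direction.
projection = {
    token: sum(a * b for a, b in zip(vec, direction)) / length
    for token, vec in toy_embeddings.items()
}

# Step 5: inspect the ordering group by group.
for name, tokens in groups.items():
    values = ", ".join(f"{t}={projection[t]:+.3f}" for t in tokens)
    print(f"{name:<11} {values}")
```

On this constructed data the past forms project highest, the future forms lowest, and the infinitives fall in between. Note that the centroids here are computed from the same tokens that are then checked; holding out some verbs when building the direction and testing on the rest gives a stronger result.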

If this works, you've found a linear probe for tense — a real, interpretable structural fact about your embeddings. If it doesn't, the embedding learned something else (e.g., per-verb identity dominates and tense is sub-dominant).

This is a Phase 13 lab task (lab 02, Task 4). Linear probing is the same technique used in interpretability research at scale (e.g., Geiger et al. 2024 on causal features).

What this file does NOT cover¶

t-SNE. Older non-linear method; broadly similar to UMAP but slower and more parameter-sensitive. We use UMAP.
Higher-dim projections (3D, 4D). A 3D plot is sometimes useful; we stick with 2D for cleanliness.
Embedding spaces of other modalities (image, audio). Same principles apply; out of scope.

Next: ../lab/00-implement-embedding-module.md