04 — Token-embedding rank collapse on a small vocab¶
When the vocabulary is small (§A13: ~512 tokens) and the embedding dimension is large (d_model = 128), the embedding matrix is 512 × 128. But the effective rank, the number of directions actually used, collapses below 50 during training. This pathology has a name (rank collapse) and a remedy (init + low-rank decay). Here we show it with real numbers.
Anchors: LYNX_CORTEX.md§4 / PHASE 13; theory §01 embedding as lookup; Phase 17 (where tied embeddings interact with this).
What "rank" means here¶
The embedding table E ∈ R^(V × d) has nominal rank min(V, d). For §A13 with V = 512, d = 128, that's 128. But that's the algebraic rank.
The effective rank (Roy & Vetterli 2007) measures how much of the spectrum is concentrated in the top few singular values:
r_eff(E) = exp(-Σ_i p_i log p_i), with p_i = σ_i / Σ_j σ_j
where σ_i are the singular values of E. If all singular values are equal, r_eff = min(V, d). If only the first k matter, r_eff ≈ k.
For a healthy embedding table, r_eff ≈ min(V, d) × 0.5 or higher. For a collapsed embedding table, r_eff can be <20 even though the matrix is 512 × 128.
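The two limiting cases above can be checked directly on synthetic spectra. This is a minimal sketch rather than the lab's code: the helper name `effective_rank` and the example spectra are assumptions made for illustration.

```python
import numpy as np

def effective_rank(singular_values: np.ndarray) -> float:
    """exp of the Shannon entropy of the normalized singular values."""
    p = singular_values / singular_values.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(np.exp(-(p * np.log(p)).sum()))

# Case 1: all 128 singular values equal -> r_eff equals the nominal rank.
flat = np.ones(128)
r_flat = effective_rank(flat)

# Case 2: only the first k = 10 singular values matter, the rest are tiny.
k = 10
spiky = np.concatenate([np.ones(k), np.full(128 - k, 1e-9)])
r_spiky = effective_rank(spiky)

print(f"flat spectrum : r_eff = {r_flat:.2f}")
print(f"spiky spectrum: r_eff = {r_spiky:.2f}")
```

The helper takes singular values rather than a matrix so that the spectrum can be set by hand; for a real table, pass the output of `np.linalg.svd(E, compute_uv=False)`.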
A worked numerical illustration¶
Setup: train a Phase 13 CBOW embedding on the §A13 corpus, V = 512, d = 128, 10 epochs.
```python
import numpy as np

rng = np.random.default_rng(42)

# At init: Kaiming-uniform with bound 1/sqrt(d)
bound = 1.0 / np.sqrt(128)
E_init = rng.uniform(-bound, bound, (512, 128))

# Effective rank at init
_, s_init, _ = np.linalg.svd(E_init, full_matrices=False)
p = s_init / s_init.sum()
r_eff_init = float(np.exp(-(p * np.log(p)).sum()))
print(f"r_eff at init: {r_eff_init:.1f}")  # near the ceiling of 128 (healthy)

# After 10 epochs of CBOW training on the §A13 corpus:
# (Simulate: many of the rare tokens never get a gradient because they
# never appear in a context window. Their rows stay near init.)
E_trained = E_init.copy()

# 50 "high-frequency" tokens dominate the gradient signal and end up
# confined to 8 dominant directions (only the first 50 rows are used):
top_50_directions = rng.uniform(-0.5, 0.5, (512, 8))
E_trained[:50] = top_50_directions[:50] @ rng.uniform(-1, 1, (8, 128))

_, s_trained, _ = np.linalg.svd(E_trained, full_matrices=False)
p = s_trained / s_trained.sum()
r_eff_trained = float(np.exp(-(p * np.log(p)).sum()))
print(f"r_eff after simulated training: {r_eff_trained:.1f}")  # lower than at init
```
So in this simulation the effective rank falls once the 50 frequent rows are overwritten: those rows share only 8 directions and have much larger norms than the untouched rows, so a large share of the spectrum concentrates in 8 singular values. The size of the drop depends on the seed and on the magnitudes chosen above.
(Run the lab notebook for the real numbers from the §A13 trained embedding.)
Why this happens¶
Three mechanisms compound:
1. Frequency-imbalanced gradients¶
In a CBOW pass, the gradient on E[token] is proportional to how often token appears as a context word. The Zipf distribution of any natural-language corpus means ~10% of tokens get ~80% of the gradient signal. Rare tokens' embeddings barely move from init. This compresses the spectrum.
For §A13 specifically: out of 512 vocab slots, only ~140 are ever used (the rest are unused BPE merges). The 140 used tokens then have a Zipf-like distribution among themselves.
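To get a feel for how uneven the update budget is, the sketch below computes the expected share of CBOW context updates that go to the most frequent 10% of used tokens. It assumes an ideal Zipf law with a configurable exponent (`zipf_exponent` is a hypothetical parameter, and the real §A13 frequencies may differ), so the printed share is illustrative only.

```python
import numpy as np

n_used = 140                      # used vocab slots, as quoted above
zipf_exponent = 1.0               # assumption: ideal Zipf law, f(rank) ∝ 1 / rank**exponent
ranks = np.arange(1, n_used + 1)

freq = 1.0 / ranks ** zipf_exponent
freq /= freq.sum()

# In CBOW each occurrence as a context word contributes one gradient update
# to that token's row, so the expected update share equals the frequency share.
top_10_percent = n_used // 10     # the 14 most frequent tokens
share = float(freq[:top_10_percent].sum())
print(f"top {top_10_percent} tokens receive {share:.0%} of updates")
```

Raising `zipf_exponent` concentrates the updates further in the head of the distribution.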
2. Tied softmax pull¶
If the output layer reuses the embedding matrix W = E.T (tied embeddings — Phase 17 §03), every example's loss pulls every row of E toward a shared direction (the "logit space"). Over many examples, this funnels the embedding into a low-rank subspace.
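The "every row" part of this claim can be seen from the gradient of a single example. The sketch below assumes a plain softmax cross-entropy loss on logits `E @ h`; the hidden state `h`, the target id, and the init scale are hypothetical values chosen for illustration, and only the output-side gradient is shown (the input lookup path touches only the context rows).

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 512, 128
E = rng.normal(0.0, 0.02, (V, d))   # tied input/output embedding table
h = rng.normal(0.0, 1.0, d)         # hypothetical hidden state for one example
target = 7                          # hypothetical target token id

# Tied output layer: logits = E @ h (i.e. W = E.T applied to h).
logits = E @ h
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Cross-entropy gradient w.r.t. the logits, then w.r.t. E via the output path.
dlogits = probs.copy()
dlogits[target] -= 1.0
grad_E = np.outer(dlogits, h)       # shape (V, d)

rows_touched = int((np.linalg.norm(grad_E, axis=1) > 0).sum())
grad_rank = int(np.linalg.matrix_rank(grad_E))
print(f"rows with non-zero gradient: {rows_touched} / {V}")
print(f"rank of the gradient matrix: {grad_rank}")
```

Each example therefore contributes a rank-1 update in which every row moves along the same vector `h`, scaled by that token's softmax probability (with the opposite sign for the target row).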
3. Spectral bias of SGD¶
Even without ties, gradient descent has a known bias toward low-rank solutions in over-parameterized models. Saxe et al. 2014 ("Exact solutions to the nonlinear dynamics of learning in deep linear neural networks") derive this exactly for deep linear networks trained on squared error: the top singular directions move first; the bottom ones move slowly. Here that result is taken as a qualitative guide for softmax cross-entropy training, which the paper does not analyze directly.
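The "top directions move first" behavior is easy to reproduce in the setting Saxe et al. analyze. The following is a minimal sketch under stated assumptions: a two-layer linear network trained with plain gradient descent on squared error against a hand-picked diagonal target. The target strengths, init scale, learning rate, and step count are illustrative choices, not values from §A13.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target linear map with three singular modes of decreasing strength.
target_strengths = np.array([3.0, 1.0, 0.3])
M = np.diag(target_strengths)

# Two-layer linear network W2 @ W1 with a small random init.
W1 = 0.01 * rng.normal(size=(3, 3))
W2 = 0.01 * rng.normal(size=(3, 3))

lr, steps = 0.05, 60
for _ in range(steps):
    err = W2 @ W1 - M          # residual of the loss 0.5 * ||W2 W1 - M||_F^2
    grad_W2 = err @ W1.T
    grad_W1 = W2.T @ err
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

# Fraction of each target mode learned so far (1.0 means fully learned).
learned = np.diag(W2 @ W1) / target_strengths
for s, frac in zip(target_strengths, learned):
    print(f"mode strength {s:.1f}: learned fraction {frac:.3f}")
```

Stopping early, as here, leaves the strongest mode essentially learned while the weaker ones have barely started, which is the low-rank bias in miniature.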
Why we care¶
Rank collapse hurts downstream tasks in a measurable way:
- Less expressive token representations. Two semantically distinct tokens may end up nearly co-linear in embedding space.
- Attention saturation in Phase 15. Attention Q K^T becomes a low-rank product. The softmax distribution becomes sharper around fewer keys.
- Tied LM head limited. The output distribution can only express directions in span(E). If r_eff(E) = 22, the model can output at most 22 "distinct token preferences" per query.
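The span(E) limit on a tied LM head can be illustrated with exact rank as a stand-in for effective rank. In this sketch the table `E` is a hypothetical one whose rows all lie in a 22-dimensional subspace, and `H` is a batch of arbitrary query vectors; neither comes from the §A13 model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 512, 128, 22

# Hypothetical collapsed table: every row lies in a k-dimensional subspace.
E = rng.normal(size=(V, k)) @ rng.normal(size=(k, d))

# 1000 arbitrary query / hidden vectors scored against the tied LM head.
H = rng.normal(size=(1000, d))
logits = H @ E.T                      # shape (1000, V)

rank_E = int(np.linalg.matrix_rank(E))
rank_logits = int(np.linalg.matrix_rank(logits))
print("rank of E     :", rank_E)
print("rank of logits:", rank_logits)
```

However many queries are scored, the logit matrix cannot exceed the rank of `E`, so the 512 output scores vary along at most `k` independent directions.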
Diagnostic: how to spot it¶
Phase 13 lab 02 ("visualize and probe") includes a rank check:
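```python
import numpy as np

def effective_rank(E: np.ndarray) -> float:
    # Singular values only; the singular vectors are not needed.
    s = np.linalg.svd(E, compute_uv=False)
    s_normalized = s / s.sum()
    # The 1e-12 guards against log(0) for exactly zero singular values.
    return float(np.exp(-(s_normalized * np.log(s_normalized + 1e-12)).sum()))
```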
Run it at init and at each checkpoint. If r_eff drops below 0.3 × min(V, d), you have collapse. Phase 19 ("training dynamics") adds it to the dashboard.
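The threshold rule can be wrapped into a small check that runs over saved checkpoints. This is a sketch only: `is_collapsed` is a hypothetical helper name, and the two matrices below are synthetic stand-ins, not real §A13 checkpoints.

```python
import numpy as np

def effective_rank(E: np.ndarray) -> float:
    s = np.linalg.svd(E, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

def is_collapsed(E: np.ndarray, threshold: float = 0.3) -> bool:
    """True when r_eff falls below threshold * min(V, d)."""
    return effective_rank(E) < threshold * min(E.shape)

# Hypothetical stand-ins for saved checkpoints (not real §A13 weights).
rng = np.random.default_rng(0)
checkpoints = {
    "init": rng.normal(0.0, 0.02, (512, 128)),
    "low_rank": rng.normal(size=(512, 4)) @ rng.normal(size=(4, 128)),
}
for name, E in checkpoints.items():
    print(f"{name:>8}: r_eff = {effective_rank(E):6.1f}, collapsed = {is_collapsed(E)}")
```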
Remedies¶
Init magnitude¶
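Larger init pushes the spectrum closer to uniform. Default Kaiming-uniform with bound 1/sqrt(d) is fine. Some implementations use a GPT-style normal init with standard deviation 0.02, which is smaller: the uniform init has standard deviation 1/sqrt(3d) ≈ 0.051 at d = 128. Phase 17 will use the GPT-style init regardless.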
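One way to read the init-magnitude effect: r_eff is unchanged by rescaling a matrix, so the init scale matters only relative to the size of the low-rank component that training adds. The sketch below assumes a fixed, hypothetical 8-direction update and uses the same normal noise pattern for both scales so that only the magnitude differs.

```python
import numpy as np

def effective_rank(E: np.ndarray) -> float:
    s = np.linalg.svd(E, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

rng = np.random.default_rng(0)
V, d = 512, 128
noise = rng.normal(size=(V, d))       # unit-variance init pattern

# Hypothetical learned update confined to 8 directions (same for both runs).
update = 0.05 * rng.normal(size=(V, 8)) @ rng.normal(size=(8, d))

results = []
# Std 0.02 (GPT-style) vs the std of uniform(-1/sqrt(d), 1/sqrt(d)).
for std in (0.02, 1.0 / np.sqrt(3 * d)):
    r_init = effective_rank(std * noise)
    r_after = effective_rank(std * noise + update)
    results.append((std, r_init, r_after))
    print(f"std = {std:.3f}: r_eff init = {r_init:.1f}, after update = {r_after:.1f}")
```

Both scales start from the same r_eff; the larger init keeps more of it once the same low-rank update is added, because the update is smaller relative to the init.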
Embedding norm regularization¶
Penalize ||E||_F weakly during training. Discouraged by Phase 17 because it interacts badly with tied-LM-head softmax.
Weight decay tuned high¶
weight_decay = 0.1 on the embedding (Phase 17 standard) keeps the top singular values from running away. Empirically this is the most effective single trick.
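For reference, this is what a single decayed update does to the table. It is a minimal sketch assuming plain SGD with decoupled weight decay; the source does not state the optimizer, and the learning rate and the gradient pattern are hypothetical.

```python
import numpy as np

def sgd_step_with_weight_decay(E, grad, lr=0.1, weight_decay=0.1):
    """One SGD step with decoupled weight decay: shrink E, then apply the gradient."""
    return (1.0 - lr * weight_decay) * E - lr * grad

rng = np.random.default_rng(0)
E = rng.normal(0.0, 0.02, (512, 128))

# Hypothetical gradient: only the first 50 rows (frequent tokens) receive signal.
grad = np.zeros_like(E)
grad[:50] = rng.normal(0.0, 0.01, (50, 128))

E_next = sgd_step_with_weight_decay(E, grad)

# Rows without gradient shrink by exactly (1 - lr * weight_decay) per step.
shrink = float(np.linalg.norm(E_next[50:]) / np.linalg.norm(E[50:]))
print(f"norm ratio of untouched rows after one step: {shrink:.4f}")
```

Repeated over many steps, rows that never receive a gradient decay toward zero, while rows with gradient signal settle where the decay and the gradient balance. That would be consistent with a large number of near-zero rows for unused tokens.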
Larger vocab + smaller d¶
The rank ceiling is min(V, d). If V >> d, that ceiling is d; for a fixed V, a square embedding (d = V) raises it to V. §A13's ratio of V = 512 to d = 128 is healthy.
Sentencepiece dropout / BPE-dropout¶
Provilkov, Emelianenko & Voita 2020 (BPE-dropout): random merge dropouts at training time keep more vocab slots in play. Out of scope for §A13.
What §A13 specifically observes¶
Phase 13's lab 02 reports (for the canonical §A13 CBOW embedding):
r_eff at init : 92.4 (high)
r_eff after 10 epochs : 31.7 (collapsed)
r_eff after 10 epochs + WD=0.1 : 58.3 (much better)
Tokens with ||E[i]|| < 0.01 (effectively unused): 380 / 512
(mostly unused BPE merges)
Top-5 singular values capture : 84% of variance
The conclusion: at §A13 scale, most of the vocab is dead weight but the used tokens have healthy embeddings as long as weight decay is on. Phase 17 mini-GPT uses weight_decay = 0.1 on the embedding for this reason.
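The last two statistics in the report can be computed along the following lines. This is a sketch under assumptions: the function names are hypothetical, "variance" is taken as the share of squared singular values without mean-centering (the report does not say which convention lab 02 uses), and the table is synthetic.

```python
import numpy as np

def top_k_variance_share(E: np.ndarray, k: int = 5) -> float:
    """Share of squared singular values (variance) held by the top-k directions."""
    s = np.linalg.svd(E, compute_uv=False)   # sorted in descending order
    energy = s ** 2
    return float(energy[:k].sum() / energy.sum())

def count_dead_rows(E: np.ndarray, tol: float = 0.01) -> int:
    """Number of rows whose L2 norm is below tol."""
    return int((np.linalg.norm(E, axis=1) < tol).sum())

# Hypothetical table: 132 live rows, 380 rows shrunk to near zero.
rng = np.random.default_rng(0)
E = np.zeros((512, 128))
E[:132] = rng.normal(0.0, 0.05, (132, 128))
E[132:] = rng.normal(0.0, 1e-5, (380, 128))

dead = count_dead_rows(E)
share = top_k_variance_share(E)
print("dead rows      :", dead, "/", E.shape[0])
print("top-5 variance :", f"{share:.1%}")
```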
Citations¶
- Roy, O., Vetterli, M. 2007. "The effective rank: A measure of effective dimensionality." EUSIPCO 2007.
- Saxe, A., McClelland, J., Ganguli, S. 2014. "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks." arXiv:1312.6120.
- Gao, J., He, D., Tan, X., Qin, T., Wang, L., Liu, T. 2019. "Representation Degeneration Problem in Training Natural Language Generation Models." arXiv:1907.12009 — direct study of LM embedding collapse.
One-paragraph recap¶
A 512 × 128 embedding matrix has nominal rank 128 but, after training on a Zipfy corpus, its effective rank can collapse to <30 — most variance concentrates in 20–30 top singular directions. Mechanisms: frequency-imbalanced gradients (most tokens never update), tied-softmax pull (when output reuses input embeddings), and spectral bias of SGD. Diagnostic: r_eff = exp(-Σ p_i log p_i) from the singular spectrum. Best single remedy: weight_decay = 0.1 on the embedding. For §A13 with a microscopic vocab, the issue is amplified — but Phase 17's WD recipe keeps it under control.
Prev: 03-similarity-and-visualization.md
Next: Phase 14 (pre-transformer sequence).