English · Español
Lab 02 — SVD compression of the §A13 conjugation-count matrix¶
Goal: compress a verb conjugation-count matrix with rank-k SVD; see Eckart-Young at work on the actual data structure that motivates word embeddings.
Estimated time: 60–90 minutes.
Prereq:
theory/03-svd-and-rank.mdandtheory/04-norms-and-conditioning.mdread.
What you produce¶
A directory experiments/03-svd-compression/ containing:
build_matrix.py— synthesises the conjugation-count matrixCfrom §A13 verbs (Borja writes).compress.py— runs SVD and rank-k reconstructions.spectrum.png— singular values vs index, log-y.reconstructions.png— heatmaps ofC,C_1,C_3,C_5,C_10,C_15side-by-side.error_curve.png—||C - C_k||_F / ||C||_Fvsk.results.json— error, captured energy, storage measurements perk.manifest.json.README.md— interpretation, 2-3 paragraphs.
The §A13 matrix¶
Build C of shape (20, 15):
- Rows: the 20 verbs of §A13 — 12 regular (
work, play, walk, talk, listen, watch, study, finish, start, look, want, like) + 8 irregular (be, have, do, go, come, see, eat, write). - Columns: the 15 conjugation cells = 5 tenses × 3 persons. Order:
(infinitive, I) (infinitive, you) (infinitive, he) (present, I) (present, you) (present, he) (past, I) (past, you) (past, he) (pastPart, I) (pastPart, you) (pastPart, he) (future, I) (future, you) (future, he). - Entry
C[i, j]: the count of how often verbiappears in conjugationjin a synthetic corpus of 10,000 sentences.
Synthesising counts¶
You don't have a real corpus yet (that's Phase 13). For Phase 3, fake it with a structured generator so the spectrum is interesting:
- Sample a per-verb "frequency"
f_v ~ Zipf(s=1.2)over 20 verbs, normalised so the most common verb appears ≈10× more than the rarest. - Sample a per-tense "frequency"
t_τfrom a discrete distribution roughly[0.05, 0.45, 0.20, 0.05, 0.25]over[infinitive, present, past, pastPart, future]. - Sample a per-person "frequency"
p_πroughly[0.35, 0.25, 0.40]over[I, you, he/she/it]. - Per cell, the expected count is
10_000 · f_v · t_τ · p_π. Add Poisson noise. - For irregular verbs, slightly perturb the row to break the strict rank-1 outer-product structure — otherwise
Cwould have rank exactly 1 and the lab is trivial.
Document the generator in build_matrix.py and seed it (np.random.default_rng(42)).
TODOs¶
Block A — build the matrix¶
-
build_matrix.pywritesC.npy(shape(20, 15), dtype fp32) andverb_order.json+cell_order.json(label arrays). - Print
Cto stdout when run — eyeball that it looks sensible (common verbs have larger rows; future and past-participle columns are sparser). - Reproducibility: seeded RNG, single
manifest.jsonentry per run.
Block B — SVD + reconstructions¶
-
compress.pyloadsC.npy, computesU, S, Vt = np.linalg.svd(C, full_matrices=False). -
S.shape == (15,),U.shape == (20, 15),Vt.shape == (15, 15). - For each
k ∈ {1, 2, 3, 5, 8, 10, 12, 15}: - Reconstruct
C_k = U[:, :k] @ diag(S[:k]) @ Vt[:k, :](or use the broadcast formU[:, :k] * S[:k] @ Vt[:k, :]). - Measure
frobenius_error = np.linalg.norm(C - C_k, 'fro'). - Measure
relative_error = frobenius_error / np.linalg.norm(C, 'fro'). - Measure
captured_energy = sum(S[:k]**2) / sum(S**2). - Measure
op_norm_error = S[k] if k < 15 else 0.0(Eckart-Young in operator norm). - Verify
frobenius_error == sqrt(sum(S[k:]**2))to fp32 tolerance (the Eckart-Young identity).
Block C — visualise¶
-
spectrum.png: singular valuesσ_kvskon a log-y scale. Expect a sharp drop after the first 3-5 values. -
reconstructions.png: heatmap grid ofC, C_1, C_3, C_5, C_10, C_15. Use the same colour scale across panels. Label each withkand the relative error. -
error_curve.png:relative_errorvsk(log-y). Annotate the "knee" — the smallestkwhere relative error drops below 5%.
Block D — storage analysis¶
In README.md, compute:
- Original storage:
20 × 15 = 300fp32 numbers =1200bytes. - Rank-k SVD storage:
k × (20 + 1 + 15) = 36kfp32 numbers =144kbytes. - Crossover: at
k = 8, storage is288numbers — less than the original. Atk ≥ 9, the SVD form is no longer a win (for this tiny matrix). State the crossover.
For larger matrices the win is huge — preview Phase 28 LoRA: a (D, D) adapter at D = 4096, r = 8 is 260× compression. Mention this in the README.
Block E — interpretation questions¶
In README.md, answer:
- Where is the knee? What is the smallest
ksuch that||C - C_k||_F / ||C||_F < 5%? - What captures the dominant pattern? Plot
U[:, 0](the first left-singular vector, indexed by verb) andVt[0, :](the first right-singular vector, indexed by conjugation cell). What §A13 axes do they correspond to? (Hint: σ_1's outer productσ_1 · u_1 ⊗ v_1is the rank-1 best approximation; for natural language it usually captures the "overall frequency" axis — common verbs × common forms.) - What captures the secondary patterns? Repeat for
U[:, 1], Vt[1, :]andU[:, 2], Vt[2, :]. Look for: regular-vs-irregular split, person-vs-tense splits, etc. - Effective rank. With a noise threshold of
0.01 × σ_1, how many singular values are above the threshold? This is the effective rank ofC. Compare to the matrix's full rank (≤ 15). - Foreshadow Phase 13 (embeddings). If you treat the rank-
kfactorisationC ≈ (U[:, :k] · sqrt(S[:k])) · (sqrt(S[:k]) · Vt[:k, :])as a learned representation, the left factor is a verb embedding inR^kand the right factor is a conjugation embedding inR^k. Whatkwould you pick? Why does this matter when we build the actual embedding matrix in Phase 13?
Block F — manifest¶
Standard manifest. Include the matrix-generator seed + the count of (verbs, cells) in config.
Constraints¶
np.linalg.svdonly. Do not implement SVD by hand — that's a numerical-linear-algebra elective. Alwaysfull_matrices=False.- fp32 throughout. Cast
Cto fp32 before SVD. (The matrix has small integer counts; the dtype matters more for the singular-vector inner products than for the values themselves.) - No
sklearn.decomposition.TruncatedSVD. Use the dense full SVD; the matrix is tiny. - Reproducibility. Single RNG seed;
manifest.jsonrecords it.
Pitfalls¶
- Forgetting
full_matrices=False. Without it,U.shape == (20, 20)andVt.shape == (15, 15); the lab still works but wastes 5 columns ofU. - Off-by-one in Eckart-Young check.
||C - C_k||_F² = Σ_{i ≥ k} σ_i²— the sum starts at indexk(zero-indexedS[k]is the first σ not kept). - Sign ambiguity of singular vectors. When plotting
U[:, 0], the signs are arbitrary; for interpretation, flip so that the largest-absolute-value entry is positive. Document the convention. - Treating storage as just
k · (M + N)without+1forS[:k]. Small in the absolute sense; matters in the crossover calculation. - Generating
Cwith too clean a structure (e.g., pure rank-1 outer product). The lab needs some structure beyond rank-1 so you can see secondary singular values — that's what the Poisson noise + irregular-verb perturbation are for.
§A13 anchor recap¶
The conjugation-count matrix C is the smallest, cleanest example of a co-occurrence matrix. Phase 13 (word embeddings) will explain how Word2Vec / GloVe extract dense vector representations from co-occurrence statistics — what they do is approximately this lab, on a much larger vocabulary with a context-window kernel. SVD on C is the original embedding algorithm (Latent Semantic Analysis, 1990s). Phase 28's LoRA is the same idea applied to weight updates instead of co-occurrences.
Stop conditions¶
- Three PNGs committed (
spectrum.png,reconstructions.png,error_curve.png). results.jsonhas rows for at leastk ∈ {1, 2, 3, 5, 8, 10, 12, 15}.- Eckart-Young verified numerically:
||C - C_k||_F == sqrt(sum(S[k:]**2))to fp32 tolerance for at least one mid-rangek. - README answers all five interpretation questions.
- The first three singular vectors are interpreted in §A13 terms.
When to consult solutions/¶
After committing the three PNGs + interpretation. solutions/02-svd-compression-ref.md (at phase open) discusses how the §A13 spectrum compares to a real Spanish-corpus co-occurrence spectrum and previews the Phase 13 connection.
Next lab: lab/03-norms-by-experiment.md.