English · Español

Lab 02 — SVD compression of the §A13 conjugation-count matrix¶

Goal: compress a verb conjugation-count matrix with rank-k SVD; see Eckart-Young at work on the actual data structure that motivates word embeddings.

Estimated time: 60–90 minutes.

Prereq: theory/03-svd-and-rank.md and theory/04-norms-and-conditioning.md read.

What you produce¶

A directory experiments/03-svd-compression/ containing:

build_matrix.py — synthesises the conjugation-count matrix C from §A13 verbs (Borja writes).
compress.py — runs SVD and rank-k reconstructions.
spectrum.png — singular values vs index, log-y.
reconstructions.png — heatmaps of C, C_1, C_3, C_5, C_10, C_15 side-by-side.
error_curve.png — ||C - C_k||_F / ||C||_F vs k.
results.json — error, captured energy, storage measurements per k.
manifest.json.
README.md — interpretation, 2-3 paragraphs.

The §A13 matrix¶

Build C of shape (20, 15):

Rows: the 20 verbs of §A13 — 12 regular (work, play, walk, talk, listen, watch, study, finish, start, look, want, like) + 8 irregular (be, have, do, go, come, see, eat, write).
Columns: the 15 conjugation cells = 5 tenses × 3 persons. Order: (infinitive, I) (infinitive, you) (infinitive, he) (present, I) (present, you) (present, he) (past, I) (past, you) (past, he) (pastPart, I) (pastPart, you) (pastPart, he) (future, I) (future, you) (future, he).
Entry C[i, j]: the count of how often verb i appears in conjugation j in a synthetic corpus of 10,000 sentences.

Synthesising counts¶

You don't have a real corpus yet (that's Phase 13). For Phase 3, fake it with a structured generator so the spectrum is interesting:

Sample a per-verb "frequency" f_v ~ Zipf(s=1.2) over 20 verbs, normalised so the most common verb appears ≈10× more than the rarest.
Sample a per-tense "frequency" t_τ from a discrete distribution roughly [0.05, 0.45, 0.20, 0.05, 0.25] over [infinitive, present, past, pastPart, future].
Sample a per-person "frequency" p_π roughly [0.35, 0.25, 0.40] over [I, you, he/she/it].
Per cell, the expected count is 10_000 · f_v · t_τ · p_π. Add Poisson noise.
For irregular verbs, slightly perturb the row to break the strict rank-1 outer-product structure — otherwise C would have rank exactly 1 and the lab is trivial.

Document the generator in build_matrix.py and seed it (np.random.default_rng(42)).

TODOs¶

Block A — build the matrix¶

build_matrix.py writes C.npy (shape (20, 15), dtype fp32) and verb_order.json + cell_order.json (label arrays).
Print C to stdout when run — eyeball that it looks sensible (common verbs have larger rows; future and past-participle columns are sparser).
Reproducibility: seeded RNG, single manifest.json entry per run.

Block B — SVD + reconstructions¶

compress.py loads C.npy, computes U, S, Vt = np.linalg.svd(C, full_matrices=False).
S.shape == (15,), U.shape == (20, 15), Vt.shape == (15, 15).
For each k ∈ {1, 2, 3, 5, 8, 10, 12, 15}:
Reconstruct C_k = U[:, :k] @ diag(S[:k]) @ Vt[:k, :] (or use the broadcast form U[:, :k] * S[:k] @ Vt[:k, :]).
Measure frobenius_error = np.linalg.norm(C - C_k, 'fro').
Measure relative_error = frobenius_error / np.linalg.norm(C, 'fro').
Measure captured_energy = sum(S[:k]**2) / sum(S**2).
Measure op_norm_error = S[k] if k < 15 else 0.0 (Eckart-Young in operator norm).
Verify frobenius_error == sqrt(sum(S[k:]**2)) to fp32 tolerance (the Eckart-Young identity).

Block C — visualise¶

spectrum.png: singular values σ_k vs k on a log-y scale. Expect a sharp drop after the first 3-5 values.
reconstructions.png: heatmap grid of C, C_1, C_3, C_5, C_10, C_15. Use the same colour scale across panels. Label each with k and the relative error.
error_curve.png: relative_error vs k (log-y). Annotate the "knee" — the smallest k where relative error drops below 5%.

Block D — storage analysis¶

In README.md, compute:

Original storage: 20 × 15 = 300 fp32 numbers = 1200 bytes.
Rank-k SVD storage: k × (20 + 1 + 15) = 36k fp32 numbers = 144k bytes.
Crossover: at k = 8, storage is 288 numbers — less than the original. At k ≥ 9, the SVD form is no longer a win (for this tiny matrix). State the crossover.

For larger matrices the win is huge — preview Phase 28 LoRA: a (D, D) adapter at D = 4096, r = 8 is 260× compression. Mention this in the README.

Block E — interpretation questions¶

In README.md, answer:

Where is the knee? What is the smallest k such that ||C - C_k||_F / ||C||_F < 5%?
What captures the dominant pattern? Plot U[:, 0] (the first left-singular vector, indexed by verb) and Vt[0, :] (the first right-singular vector, indexed by conjugation cell). What §A13 axes do they correspond to? (Hint: σ_1's outer product σ_1 · u_1 ⊗ v_1 is the rank-1 best approximation; for natural language it usually captures the "overall frequency" axis — common verbs × common forms.)
What captures the secondary patterns? Repeat for U[:, 1], Vt[1, :] and U[:, 2], Vt[2, :]. Look for: regular-vs-irregular split, person-vs-tense splits, etc.
Effective rank. With a noise threshold of 0.01 × σ_1, how many singular values are above the threshold? This is the effective rank of C. Compare to the matrix's full rank (≤ 15).
Foreshadow Phase 13 (embeddings). If you treat the rank-k factorisation C ≈ (U[:, :k] · sqrt(S[:k])) · (sqrt(S[:k]) · Vt[:k, :]) as a learned representation, the left factor is a verb embedding in R^k and the right factor is a conjugation embedding in R^k. What k would you pick? Why does this matter when we build the actual embedding matrix in Phase 13?

Block F — manifest¶

Standard manifest. Include the matrix-generator seed + the count of (verbs, cells) in config.

Constraints¶

np.linalg.svd only. Do not implement SVD by hand — that's a numerical-linear-algebra elective. Always full_matrices=False.
fp32 throughout. Cast C to fp32 before SVD. (The matrix has small integer counts; the dtype matters more for the singular-vector inner products than for the values themselves.)
No sklearn.decomposition.TruncatedSVD. Use the dense full SVD; the matrix is tiny.
Reproducibility. Single RNG seed; manifest.json records it.

Pitfalls¶

Forgetting full_matrices=False. Without it, U.shape == (20, 20) and Vt.shape == (15, 15); the lab still works but wastes 5 columns of U.
Off-by-one in Eckart-Young check. ||C - C_k||_F² = Σ_{i ≥ k} σ_i² — the sum starts at index k (zero-indexed S[k] is the first σ not kept).
Sign ambiguity of singular vectors. When plotting U[:, 0], the signs are arbitrary; for interpretation, flip so that the largest-absolute-value entry is positive. Document the convention.
Treating storage as just k · (M + N) without +1 for S[:k]. Small in the absolute sense; matters in the crossover calculation.
Generating C with too clean a structure (e.g., pure rank-1 outer product). The lab needs some structure beyond rank-1 so you can see secondary singular values — that's what the Poisson noise + irregular-verb perturbation are for.

§A13 anchor recap¶

The conjugation-count matrix C is the smallest, cleanest example of a co-occurrence matrix. Phase 13 (word embeddings) will explain how Word2Vec / GloVe extract dense vector representations from co-occurrence statistics — what they do is approximately this lab, on a much larger vocabulary with a context-window kernel. SVD on C is the original embedding algorithm (Latent Semantic Analysis, 1990s). Phase 28's LoRA is the same idea applied to weight updates instead of co-occurrences.

Stop conditions¶

Three PNGs committed (spectrum.png, reconstructions.png, error_curve.png).
results.json has rows for at least k ∈ {1, 2, 3, 5, 8, 10, 12, 15}.
Eckart-Young verified numerically: ||C - C_k||_F == sqrt(sum(S[k:]**2)) to fp32 tolerance for at least one mid-range k.
README answers all five interpretation questions.
The first three singular vectors are interpreted in §A13 terms.

When to consult `solutions/`¶

After committing the three PNGs + interpretation. solutions/02-svd-compression-ref.md (at phase open) discusses how the §A13 spectrum compares to a real Spanish-corpus co-occurrence spectrum and previews the Phase 13 connection.

Next lab: lab/03-norms-by-experiment.md.