English · Español
03 — SVD as rotate-scale-rotate¶
🇪🇸 Cualquier matriz se factoriza como
A = UΣV^T: rotar, escalar, rotar. Esa descomposición unifica PCA, LoRA, compresión, condicionamiento y normas. La intuición geométrica vale más que el algoritmo. Aplicación concreta: comprimir la matriz(20 verbos × 15 formas)de cuentas de conjugación del §A13 usando loskvalores singulares más grandes.
The theorem¶
For any real matrix A of shape (M, N), there exist:
- An
(M, M)orthogonal matrixU(columns are an orthonormal basis ofR^M). - An
(M, N)"diagonal" matrixΣ(zero off the main diagonal, with non-negative singular valuesσ_1 ≥ σ_2 ≥ ... ≥ 0on the diagonal). - An
(N, N)orthogonal matrixV(columns are an orthonormal basis ofR^N).
Such that:
This is the Singular Value Decomposition. It exists for every matrix (including non-square, including rank-deficient, including all-zeros).
The geometric picture¶
U, Σ, and V^T each have a clean geometric interpretation:
V^Trotates input space. Orthogonal matrices preserve lengths and angles; they're rigid rotations (plus reflections).V^TrotatesR^Ninto a basis whereA's action is axis-aligned.Σscales along the new axes. Each axis is stretched by a factorσ_i ≥ 0. Axes withσ_i = 0are collapsed — vectors along them are sent to zero.Urotates output space. After scaling, rotate again to land inR^M.
So A is always a rotation, then a scaling, then another rotation. Three steps. That's the whole geometric content of linear maps.
input R^N ──V^T──> scaled space ──Σ──> rotated R^M ──U──> output R^M
x V^T x Σ V^T x U Σ V^T x = A x
Singular values and the spectrum¶
The σ_i are non-negative, ordered σ_1 ≥ σ_2 ≥ ... ≥ 0. The number of non-zero singular values is the rank of A. Geometrically: rank is the dimensionality of the output subspace.
A 5×5 matrix with all 5 σ > 0 is full-rank (its image is all of R^5). A 5×5 matrix with σ_5 = 0 is rank 4 (its image is a 4-D subspace).
In practice, "rank" is a finicky integer; "effective rank" — the number of σ above some noise threshold — is the useful concept. SVD lets you compute both.
Low-rank approximation — the Eckart-Young theorem¶
Truncate the SVD at the top k singular values:
Where U_{:, :k} is the first k columns of U, Σ_{:k, :k} is the top-left k × k block, and V_{:, :k} is the first k columns of V.
Eckart-Young: A_k is the best rank-k approximation of A in both the Frobenius norm and the operator norm. No other rank-k matrix has smaller error.
The error has a closed form:
This is why SVD is the theoretical foundation of:
- PCA — keep the top-k principal components (eigenvectors of the covariance matrix; equivalently, top-k left singular vectors of the centered data matrix).
- LoRA (Phase 28) — represent a weight update
ΔWas a product of two skinny matricesB @ AwhereB.shape = (D, r),A.shape = (r, D). This is exactly the rank-rfactorization. - Image / data compression — store
U_{:k},Σ_{:k},V_{:k}instead of the full matrix; reconstruct withA_k. - Spectral analysis — large σ correspond to dominant patterns; small σ correspond to noise.
The §A13 application — compressing a conjugation-count matrix¶
Construct a (20 verbs, 15 forms) matrix C:
- Rows: the 20 verbs of §A13 (12 regular + 8 irregular).
- Columns: the 15 conjugations = 5 tenses × 3 persons.
- Entry
C[i, j]: count of how often verbiappears in formjin a (synthetic, for Phase 3) corpus of 10,000 sentences.
What does SVD of C reveal?
σ_1will be large — it captures the overall "more common verbs appear in all forms more often" axis.σ_2andσ_3will capture structural patterns: e.g., "irregular verbs cluster differently from regular verbs", or "past forms are more frequent than future".- The remaining
σ_idecay fast — most ofCis captured by ~3-5 singular triples.
This is qualitatively the same observation that motivated word embeddings (Word2Vec, GloVe): co-occurrence matrices of natural language have low effective rank, so a compact embedding captures most of the structure. Phase 13 builds on this directly.
Lab 02 makes you do this compression and visualize the singular value decay.
Compression ratio¶
Storing the full (M, N) matrix: M × N numbers. Storing the rank-k SVD: M × k + k + k × N ≈ k × (M + N) numbers.
For our (20, 15) matrix at k = 3:
- Full:
300numbers. - Rank-3 SVD:
3 × (20 + 15) = 105numbers. - Compression ratio:
~2.9×.
For larger matrices, the win is bigger. Phase 28's LoRA: a (D, D) weight matrix at r = 8, D = 64 — full is 4096, LoRA is 8 × 128 = 1024. 4× compression for the adapter. For larger D = 4096, full is 16.8M, LoRA at r = 8 is 65K — 260× compression.
SVD vs eigendecomposition¶
For symmetric matrices, SVD and eigendecomposition coincide (up to signs): A = QΛQ^T where Q is orthogonal and Λ is the diagonal of eigenvalues. The singular values are |λ_i|, sorted in decreasing order.
For non-symmetric (or non-square) matrices, eigendecomposition either doesn't exist or involves complex numbers, while SVD always exists and is real-valued. For ML, SVD is the right tool 95% of the time.
The exception: PCA via the eigendecomposition of the covariance matrix is sometimes faster than SVD of the data matrix when N >> D (many samples, few features). For everything else, SVD wins on generality.
Numerical stability¶
np.linalg.svd is numerically stable for any matrix — it doesn't form the (potentially ill-conditioned) Gram matrix A^T A. Internally, it uses Householder reflections + Givens rotations to bidiagonalize, then a divide-and-conquer or QR-based bidiagonal SVD.
For Borja's purposes:
- Always call
np.linalg.svd(A, full_matrices=False)to get the "thin" SVD:Uof shape(M, K),Sof lengthK,V^Tof shape(K, N), whereK = min(M, N). Saves memory and computation. - Singular values are returned in decreasing order, so
S[0]is the largest. Don't sort them yourself. - Sign ambiguity:
(U_i, V_i)and(-U_i, -V_i)give the sameA.np.linalg.svdmakes a canonical choice (largest absolute entry ofUis positive), but different libraries may differ. Don't compare individual singular vectors across implementations without accounting for sign.
Submultiplicativity and ||Ax|| ≤ ||A|| · ||x||¶
Using SVD, derive the operator-norm inequality:
Since U is orthogonal (preserves norms): ||U y||_2 = ||y||_2 for any y. So:
Let z = V^T x. Since V is also orthogonal, ||z||_2 = ||x||_2.
Equality is achieved when z = e_1 (i.e., x = V e_1, the first right-singular vector). So the operator norm ||A||_2 is exactly σ_1, the largest singular value.
This is the canonical "submultiplicativity" inequality. Theory 04-norms-and-conditioning.md extends to matrix-matrix products.
Drill problems¶
Solutions in solutions/03-svd-and-rank-ref.md (phase-open).
- Compute by hand the SVD of
A = [[1, 0], [0, 2]]. (Trivial — it's already diagonal.) - The §A13 conjugation-count matrix
Cis(20, 15). What is the maximum possible rank? What does it mean forCto be full rank? What would it mean forCto have rank 1? - Show that if
A = UΣV^Tis the SVD ofA, thenA^T A = V Σ^2 V^T(which is the eigendecomposition of the Gram matrix). What are the eigenvalues ofA^T A? - The Frobenius norm of
Aequals√(Σ σ_i²). Show this from the SVD. - Given a rank-3 SVD approximation
A_3to a(20, 15)matrix, what fraction of||A||_F²is captured byA_3? Express as a formula in terms of σ. - LoRA represents a weight update
ΔW.shape = (D, D)asΔW = B @ AwithB.shape = (D, r),A.shape = (r, D). What doesrcorrespond to in SVD terms? What rank canΔWever have?
One-paragraph recap¶
SVD factors any matrix A as U Σ V^T: rotate input, scale along principal axes, rotate output. Singular values are the scaling factors; their count above zero is the rank; the top-k of them give the optimal rank-k approximation (Eckart-Young). SVD unifies PCA, LoRA, low-rank compression, condition-number analysis, and the proof of submultiplicativity. For the §A13 conjugation-count matrix, the spectrum reveals that natural-language co-occurrence is low-effective-rank — the foundation of embedding learning (Phase 13).
What this page does NOT cover¶
- The algorithm to compute SVD (Householder + QR). We use
np.linalg.svd. - Truncated/randomized SVD for huge matrices. Out of scope.
- Iterative methods (Lanczos, ARPACK). Out of scope.
- Connection between SVD and Schmidt decomposition in quantum information. Out of scope.
Next: theory/04-norms-and-conditioning.md.