English · Español

03 — SVD as rotate-scale-rotate¶

🇪🇸 Cualquier matriz se factoriza como A = UΣV^T: rotar, escalar, rotar. Esa descomposición unifica PCA, LoRA, compresión, condicionamiento y normas. La intuición geométrica vale más que el algoritmo. Aplicación concreta: comprimir la matriz (20 verbos × 15 formas) de cuentas de conjugación del §A13 usando los k valores singulares más grandes.

The theorem¶

For any real matrix A of shape (M, N), there exist:

An (M, M) orthogonal matrix U (columns are an orthonormal basis of R^M).
An (M, N) "diagonal" matrix Σ (zero off the main diagonal, with non-negative singular values σ_1 ≥ σ_2 ≥ ... ≥ 0 on the diagonal).
An (N, N) orthogonal matrix V (columns are an orthonormal basis of R^N).

Such that:

\[ A = U \Sigma V^T \]

This is the Singular Value Decomposition. It exists for every matrix (including non-square, including rank-deficient, including all-zeros).

The geometric picture¶

U, Σ, and V^T each have a clean geometric interpretation:

V^T rotates input space. Orthogonal matrices preserve lengths and angles; they're rigid rotations (plus reflections). V^T rotates R^N into a basis where A's action is axis-aligned.
Σ scales along the new axes. Each axis is stretched by a factor σ_i ≥ 0. Axes with σ_i = 0 are collapsed — vectors along them are sent to zero.
U rotates output space. After scaling, rotate again to land in R^M.

So A is always a rotation, then a scaling, then another rotation. Three steps. That's the whole geometric content of linear maps.

input R^N    ──V^T──>    scaled space    ──Σ──>    rotated R^M    ──U──>    output R^M
   x                       V^T x                     Σ V^T x                  U Σ V^T x = A x

Singular values and the spectrum¶

The σ_i are non-negative, ordered σ_1 ≥ σ_2 ≥ ... ≥ 0. The number of non-zero singular values is the rank of A. Geometrically: rank is the dimensionality of the output subspace.

A 5×5 matrix with all 5 σ > 0 is full-rank (its image is all of R^5). A 5×5 matrix with σ_5 = 0 is rank 4 (its image is a 4-D subspace).

In practice, "rank" is a finicky integer; "effective rank" — the number of σ above some noise threshold — is the useful concept. SVD lets you compute both.

Low-rank approximation — the Eckart-Young theorem¶

Truncate the SVD at the top k singular values:

\[ A_k = U_{:, :k} \Sigma_{:k, :k} V_{:, :k}^T \]

Where U_{:, :k} is the first k columns of U, Σ_{:k, :k} is the top-left k × k block, and V_{:, :k} is the first k columns of V.

Eckart-Young: A_k is the best rank-k approximation of A in both the Frobenius norm and the operator norm. No other rank-k matrix has smaller error.

The error has a closed form:

\[ \|A - A_k\|_F = \sqrt{\sum_{i=k+1}^{\min(M, N)} \sigma_i^2} \]

\[ \|A - A_k\|_2 = \sigma_{k+1} \]

This is why SVD is the theoretical foundation of:

PCA — keep the top-k principal components (eigenvectors of the covariance matrix; equivalently, top-k left singular vectors of the centered data matrix).
LoRA (Phase 28) — represent a weight update ΔW as a product of two skinny matrices B @ A where B.shape = (D, r), A.shape = (r, D). This is exactly the rank-r factorization.
Image / data compression — store U_{:k}, Σ_{:k}, V_{:k} instead of the full matrix; reconstruct with A_k.
Spectral analysis — large σ correspond to dominant patterns; small σ correspond to noise.

The §A13 application — compressing a conjugation-count matrix¶

Construct a (20 verbs, 15 forms) matrix C:

Rows: the 20 verbs of §A13 (12 regular + 8 irregular).
Columns: the 15 conjugations = 5 tenses × 3 persons.
Entry C[i, j]: count of how often verb i appears in form j in a (synthetic, for Phase 3) corpus of 10,000 sentences.

What does SVD of C reveal?

σ_1 will be large — it captures the overall "more common verbs appear in all forms more often" axis.
σ_2 and σ_3 will capture structural patterns: e.g., "irregular verbs cluster differently from regular verbs", or "past forms are more frequent than future".
The remaining σ_i decay fast — most of C is captured by ~3-5 singular triples.

This is qualitatively the same observation that motivated word embeddings (Word2Vec, GloVe): co-occurrence matrices of natural language have low effective rank, so a compact embedding captures most of the structure. Phase 13 builds on this directly.

Lab 02 makes you do this compression and visualize the singular value decay.

Compression ratio¶

Storing the full (M, N) matrix: M × N numbers. Storing the rank-k SVD: M × k + k + k × N ≈ k × (M + N) numbers.

For our (20, 15) matrix at k = 3:

Full: 300 numbers.
Rank-3 SVD: 3 × (20 + 15) = 105 numbers.
Compression ratio: ~2.9×.

For larger matrices, the win is bigger. Phase 28's LoRA: a (D, D) weight matrix at r = 8, D = 64 — full is 4096, LoRA is 8 × 128 = 1024. 4× compression for the adapter. For larger D = 4096, full is 16.8M, LoRA at r = 8 is 65K — 260× compression.

SVD vs eigendecomposition¶

For symmetric matrices, SVD and eigendecomposition coincide (up to signs): A = QΛQ^T where Q is orthogonal and Λ is the diagonal of eigenvalues. The singular values are |λ_i|, sorted in decreasing order.

For non-symmetric (or non-square) matrices, eigendecomposition either doesn't exist or involves complex numbers, while SVD always exists and is real-valued. For ML, SVD is the right tool 95% of the time.

The exception: PCA via the eigendecomposition of the covariance matrix is sometimes faster than SVD of the data matrix when N >> D (many samples, few features). For everything else, SVD wins on generality.

Numerical stability¶

np.linalg.svd is numerically stable for any matrix — it doesn't form the (potentially ill-conditioned) Gram matrix A^T A. Internally, it uses Householder reflections + Givens rotations to bidiagonalize, then a divide-and-conquer or QR-based bidiagonal SVD.

For Borja's purposes:

Always call np.linalg.svd(A, full_matrices=False) to get the "thin" SVD: U of shape (M, K), S of length K, V^T of shape (K, N), where K = min(M, N). Saves memory and computation.
Singular values are returned in decreasing order, so S[0] is the largest. Don't sort them yourself.
Sign ambiguity: (U_i, V_i) and (-U_i, -V_i) give the same A. np.linalg.svd makes a canonical choice (largest absolute entry of U is positive), but different libraries may differ. Don't compare individual singular vectors across implementations without accounting for sign.

Submultiplicativity and `||Ax|| ≤ ||A|| · ||x||`¶

Using SVD, derive the operator-norm inequality:

\[ \|Ax\|_2 = \|U\Sigma V^T x\|_2 \]

Since U is orthogonal (preserves norms): ||U y||_2 = ||y||_2 for any y. So:

\[ \|Ax\|_2 = \|\Sigma V^T x\|_2 \]

Let z = V^T x. Since V is also orthogonal, ||z||_2 = ||x||_2.

\[ \|Ax\|_2 = \|\Sigma z\|_2 = \sqrt{\sum_i \sigma_i^2 z_i^2} \leq \sqrt{\sigma_1^2 \sum_i z_i^2} = \sigma_1 \|z\|_2 = \sigma_1 \|x\|_2 \]

Equality is achieved when z = e_1 (i.e., x = V e_1, the first right-singular vector). So the operator norm ||A||_2 is exactly σ_1, the largest singular value.

This is the canonical "submultiplicativity" inequality. Theory 04-norms-and-conditioning.md extends to matrix-matrix products.

Drill problems¶

Solutions in solutions/03-svd-and-rank-ref.md (phase-open).

Compute by hand the SVD of A = [[1, 0], [0, 2]]. (Trivial — it's already diagonal.)
The §A13 conjugation-count matrix C is (20, 15). What is the maximum possible rank? What does it mean for C to be full rank? What would it mean for C to have rank 1?
Show that if A = UΣV^T is the SVD of A, then A^T A = V Σ^2 V^T (which is the eigendecomposition of the Gram matrix). What are the eigenvalues of A^T A?
The Frobenius norm of A equals √(Σ σ_i²). Show this from the SVD.
Given a rank-3 SVD approximation A_3 to a (20, 15) matrix, what fraction of ||A||_F² is captured by A_3? Express as a formula in terms of σ.
LoRA represents a weight update ΔW.shape = (D, D) as ΔW = B @ A with B.shape = (D, r), A.shape = (r, D). What does r correspond to in SVD terms? What rank can ΔW ever have?

One-paragraph recap¶

SVD factors any matrix A as U Σ V^T: rotate input, scale along principal axes, rotate output. Singular values are the scaling factors; their count above zero is the rank; the top-k of them give the optimal rank-k approximation (Eckart-Young). SVD unifies PCA, LoRA, low-rank compression, condition-number analysis, and the proof of submultiplicativity. For the §A13 conjugation-count matrix, the spectrum reveals that natural-language co-occurrence is low-effective-rank — the foundation of embedding learning (Phase 13).

What this page does NOT cover¶

The algorithm to compute SVD (Householder + QR). We use np.linalg.svd.
Truncated/randomized SVD for huge matrices. Out of scope.
Iterative methods (Lanczos, ARPACK). Out of scope.
Connection between SVD and Schmidt decomposition in quantum information. Out of scope.

Next: theory/04-norms-and-conditioning.md.