04 — Norms, operator norms, conditioning

A norm assigns a "size" to vectors and matrices. The choice of norm changes what "large" and "small" mean. ||Ax|| ≤ ||A||·||x|| is the central inequality; all stability analysis in AI rests on it. Example: the norm of a vector of verb-tense logits tells us whether the model is "confident" or "undecided".


Vector norms

A norm assigns a non-negative real number ||v|| to every vector v, satisfying:

  1. ||v|| ≥ 0, with equality iff v = 0.
  2. ||c v|| = |c| ||v|| for any scalar c (absolute homogeneity).
  3. ||u + v|| ≤ ||u|| + ||v|| (triangle inequality).

The standard family in ML is the p-norm, defined for p ≥ 1:

\[ \|v\|_p = \left(\sum_i |v_i|^p\right)^{1/p} \]

Special cases:

Norm                       Formula                     Interpretation
||v||_1                    Σ |v_i|                     Total absolute weight. Robust to outliers, sparsity-promoting.
||v||_2 (Euclidean)        √(Σ v_i²)                   Standard geometric length. Used everywhere in ML.
||v||_∞                    max |v_i|                   Largest component. Used in gradient clipping.
||v||_0 (not a true norm)  count of non-zero entries   Sparsity. Optimized via L1 relaxation.

Equivalences (in finite dimensions):

  • ||v||_∞ ≤ ||v||_2 ≤ ||v||_1
  • ||v||_1 ≤ √n ||v||_2
  • ||v||_2 ≤ √n ||v||_∞

These imply that "convergence in any p-norm" is equivalent to "convergence in any other p-norm" in finite dimensions — but the constants differ, which matters for numerical bounds.
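As a quick numerical sanity check (not a proof), the three inequalities above can be tested on random vectors. This is a minimal NumPy sketch; the helper name `p_norm`, the dimension n = 8, and the random seed are illustrative assumptions, not part of the curriculum code.

```python
import numpy as np

def p_norm(v, p):
    """p-norm of a 1-D array for p >= 1; p = np.inf gives the max norm.

    Hypothetical helper written for this check only.
    """
    v = np.abs(np.asarray(v, dtype=float))
    if np.isinf(p):
        return float(v.max())
    return float((v ** p).sum() ** (1.0 / p))

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 8                           # arbitrary dimension for the experiment
tol = 1e-12                     # slack for floating-point rounding

for _ in range(1000):
    v = rng.normal(size=n)
    n1, n2, ninf = p_norm(v, 1), p_norm(v, 2), p_norm(v, np.inf)
    assert ninf <= n2 + tol and n2 <= n1 + tol   # ||v||_inf <= ||v||_2 <= ||v||_1
    assert n1 <= np.sqrt(n) * n2 + tol           # ||v||_1 <= sqrt(n) ||v||_2
    assert n2 <= np.sqrt(n) * ninf + tol         # ||v||_2 <= sqrt(n) ||v||_inf

# The sqrt(n) constants are tight: the all-ones vector attains both of them.
ones = np.ones(n)
print(p_norm(ones, 1), np.sqrt(n) * p_norm(ones, 2))
print(p_norm(ones, 2), np.sqrt(n) * p_norm(ones, np.inf))
```

The all-ones vector shows why the constants grow with n: in high dimensions the gap between norms can be large even though they are formally equivalent.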

§A13 examples of vector norms

Take a length-5 tense-classification logit vector x = [1.2, 4.7, 3.1, 0.5, 2.9]:

  • ||x||_1 = 1.2 + 4.7 + 3.1 + 0.5 + 2.9 = 12.4
  • ||x||_2 = √(1.44 + 22.09 + 9.61 + 0.25 + 8.41) = √41.8 ≈ 6.47
  • ||x||_∞ = 4.7 (the "present" logit; the model's most confident class)

After softmax, the probability vector p ≈ [0.021, 0.708, 0.143, 0.011, 0.117]:

  • ||p||_1 = 1.0 (always, since the entries of a probability vector are non-negative and sum to 1)
  • ||p||_2 ≈ 0.732 (between the uniform minimum 1/√5 ≈ 0.447 and the one-hot maximum 1.0)
  • ||p||_∞ ≈ 0.708 (the model's predicted-class confidence)

||p||_∞ is a natural "confidence" metric for a classifier. ||p||_2 measures "concentration": it approaches 1.0 when the model is sure of one class and falls to 1/√5 ≈ 0.447 when the distribution is uniform. Use them in the eval harness (Phase 20).
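The quantities in this example can be recomputed with a few lines of NumPy. This is a sketch: the `softmax` helper is written here for illustration and is not taken from the curriculum's code base.

```python
import numpy as np

x = np.array([1.2, 4.7, 3.1, 0.5, 2.9])  # logits from the example above

def softmax(z):
    """Numerically stable softmax for a 1-D array (illustrative helper)."""
    z = z - z.max()          # shifting by the max does not change the result
    e = np.exp(z)
    return e / e.sum()

p = softmax(x)

for name, v in [("x", x), ("p", p)]:
    l1 = np.linalg.norm(v, 1)         # sum of absolute values
    l2 = np.linalg.norm(v, 2)         # Euclidean length
    linf = np.linalg.norm(v, np.inf)  # largest absolute entry
    print(f"{name}: L1={l1:.3f}  L2={l2:.3f}  Linf={linf:.3f}")

# Reference points for the L2 "concentration" of a 5-class distribution.
uniform = np.full(5, 0.2)   # least concentrated
one_hot = np.eye(5)[1]      # most concentrated
print(np.linalg.norm(uniform), np.linalg.norm(one_hot))
```

Comparing ||p||_2 against the uniform and one-hot reference values gives a scale for reading the concentration number.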

Matrix norms

Matrices need operator norms — measures of how much the matrix can stretch a vector. For each vector norm ||·||_p, there's an induced matrix norm:

\[ \|A\|_p = \max_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p} = \max_{\|x\|_p = 1} \|Ax\|_p \]

Special cases:

Norm                     Formula                        Interpretation
||A||_1                  max_j Σ_i |A_{ij}|             Max absolute column sum
||A||_2 (operator norm)  σ_1 (largest singular value)   Max stretching factor
||A||_∞                  max_i Σ_j |A_{ij}|             Max absolute row sum
||A||_F (Frobenius)      √(Σ |A_{ij}|²) = √(Σ σ_i²)     Element-wise L2; not an operator norm

The L2 operator norm ||A||_2 = σ_1, also called the spectral norm, is the most important. It's derived in theory/03-svd-and-rank.md and used everywhere stability is discussed (gradient clipping, weight-norm regularization, spectral normalization, Lipschitz bounds).

The Frobenius norm ||A||_F is easy to compute (sum of squares, then square root). It is not an operator norm; instead it is the L2 norm of the matrix viewed as a flat vector, and it equals √(Σ σ_i²). Because ||A||_2 ≤ ||A||_F, it is sometimes a useful upper-bound proxy when σ_1 is expensive to compute (but for our small matrices, SVD is cheap).
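The formulas for the four matrix norms can be checked against `numpy.linalg.norm`, whose `ord` argument selects the norm. A minimal sketch on an arbitrary random matrix (the shape and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))  # arbitrary example matrix

sigma = np.linalg.svd(A, compute_uv=False)  # singular values, largest first

norm_1 = np.abs(A).sum(axis=0).max()    # max absolute column sum
norm_inf = np.abs(A).sum(axis=1).max()  # max absolute row sum
norm_2 = sigma[0]                       # largest singular value
norm_F = np.sqrt((A ** 2).sum())        # L2 norm of the flattened matrix

# Each hand-written formula agrees with NumPy's built-in.
assert np.isclose(norm_1, np.linalg.norm(A, 1))
assert np.isclose(norm_inf, np.linalg.norm(A, np.inf))
assert np.isclose(norm_2, np.linalg.norm(A, 2))
assert np.isclose(norm_F, np.linalg.norm(A, "fro"))

# Frobenius via singular values, and its relation to the spectral norm.
assert np.isclose(norm_F, np.sqrt((sigma ** 2).sum()))
rank = np.linalg.matrix_rank(A)
assert norm_2 <= norm_F <= np.sqrt(rank) * norm_2 + 1e-12

print(norm_1, norm_2, norm_inf, norm_F)
```

The last assertion shows why ||A||_F can stand in for ||A||_2: it never underestimates it, and overestimates it by at most a factor of √rank(A).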

The central inequality

For any operator norm ||·||_p:

\[ \|Ax\|_p \leq \|A\|_p \|x\|_p \]

This property is often called consistency (or compatibility) of the induced matrix norm with the vector norm. Proof, for any x ≠ 0 (the case x = 0 is trivial):

\[ \frac{\|Ax\|_p}{\|x\|_p} \leq \max_{y \neq 0} \frac{\|Ay\|_p}{\|y\|_p} = \|A\|_p \]

(The definition of ||A||_p is the max of that ratio, so any particular x gives a ratio at most ||A||_p.)

For two matrices A, B:

\[ \|AB\|_p \leq \|A\|_p \|B\|_p \]

Proof: for any x, ||AB x|| = ||A (Bx)|| ≤ ||A|| ||Bx|| ≤ ||A|| ||B|| ||x||. Taking the maximum over ||x||_p = 1 gives ||AB||_p ≤ ||A||_p ||B||_p.

This is the submultiplicativity of operator norms. It implies that if you compose L linear maps each with ||A||_2 ≤ 1, the composition has operator norm ≤ 1. Spectral normalization uses this to bound the Lipschitz constant of a network.
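Both inequalities, and the claim about composing maps of norm at most 1, can be checked numerically. The sketch below uses random matrices; the shapes, the depth L = 6, the width D = 16, and the helper name `spec` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def spec(M):
    """Spectral norm ||M||_2 (largest singular value)."""
    return np.linalg.norm(M, 2)

A = rng.normal(size=(6, 5))
B = rng.normal(size=(5, 4))
x = rng.normal(size=4)
tol = 1e-12  # slack for floating-point rounding

# ||Bx||_2 <= ||B||_2 ||x||_2
assert np.linalg.norm(B @ x) <= spec(B) * np.linalg.norm(x) + tol

# ||AB||_2 <= ||A||_2 ||B||_2
assert spec(A @ B) <= spec(A) * spec(B) + tol

# Compose L linear maps, each rescaled to spectral norm exactly 1.
L, D = 6, 16
layers = [rng.normal(size=(D, D)) for _ in range(L)]
layers = [W / spec(W) for W in layers]
product = np.linalg.multi_dot(layers)

print(spec(product))  # never above 1, usually well below it
assert spec(product) <= 1.0 + 1e-9
```

The product's norm is usually strictly smaller than 1 because the top singular directions of consecutive random layers rarely line up; the bound is a worst case, not a prediction.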

Condition number

The condition number of a matrix A (with respect to the L2 norm) is:

\[ \kappa(A) = \|A\|_2 \cdot \|A^{-1}\|_2 = \frac{\sigma_1}{\sigma_n} \]

This holds when A is square and invertible. (For a non-square A, κ = σ_max / σ_min, where the minimum is taken over the non-zero singular values.)

The condition number measures how much a small relative perturbation in b can be amplified in the solution x of Ax = b: ||δx|| / ||x|| ≤ κ(A) · ||δb|| / ||b||. If κ(A) = 10^6, a relative perturbation of 10^{-7} in b can become a relative perturbation of 0.1 in x. Bad news for numerical stability.
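The amplification can be reproduced with a small constructed example. The sketch below builds a 4×4 matrix with chosen singular values (so κ = 10^6 by design) and picks the worst case: b along the top left singular vector, the perturbation along the bottom one. All sizes, seeds, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Random orthogonal factors from QR; singular values chosen by hand.
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
s = np.array([1.0, 1e-2, 1e-4, 1e-6])   # sigma_1 / sigma_n = 1e6
A = U @ np.diag(s) @ V.T

kappa = np.linalg.cond(A)  # 2-norm condition number, computed via SVD

# Worst case: b excites the largest singular value, db the smallest.
b = U[:, 0]
db = 1e-7 * U[:, -1]

x = np.linalg.solve(A, b)
x_pert = np.linalg.solve(A, b + db)

rel_in = np.linalg.norm(db) / np.linalg.norm(b)
rel_out = np.linalg.norm(x_pert - x) / np.linalg.norm(x)

print(f"kappa   = {kappa:.3e}")
print(f"rel_in  = {rel_in:.3e}")
print(f"rel_out = {rel_out:.3e}")   # close to kappa * rel_in
```

For a generic b and perturbation direction the amplification is smaller; κ is the worst-case factor.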

Matrices with σ_n ≈ 0 are nearly singular, with κ → ∞. They show up in:

  • Overparameterized linear models with collinear features.
  • Attention when many tokens have nearly identical embeddings.
  • Gradient computation when activations span many orders of magnitude.

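The fix is usually one of: regularization (add λ I to a symmetric positive semidefinite matrix such as A^T A, which raises its smallest eigenvalue by λ), preconditioning (multiply by a matrix M chosen so that the product is better conditioned), or solving via the SVD rather than forming an explicit inverse (more stable).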
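Two of these fixes can be illustrated on the collinear-features case. In the sketch below the matrix being regularized is the Gram matrix X^T X of a hypothetical design matrix X whose first two columns are nearly identical; the sizes, λ = 1e-2, and the cutoff `rcond=1e-4` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix: columns 0 and 1 are almost collinear.
f = rng.normal(size=200)
X = np.column_stack([
    f,
    f + 1e-6 * rng.normal(size=200),
    rng.normal(size=200),
])

# Fix 1: regularization. G is symmetric PSD, so adding lam * I
# shifts every eigenvalue up by lam.
G = X.T @ X
lam = 1e-2
G_reg = G + lam * np.eye(3)
cond_G = np.linalg.cond(G)
cond_G_reg = np.linalg.cond(G_reg)
print(f"cond(G)         = {cond_G:.2e}")
print(f"cond(G + lam I) = {cond_G_reg:.2e}")

# Fix 2: SVD-based least squares. lstsq treats singular values below
# rcond * sigma_max as zero instead of dividing by them.
y = X @ np.array([1.0, 0.0, 2.0])
w, _, rank, sv = np.linalg.lstsq(X, y, rcond=1e-4)
print("effective rank:", rank)
print("fit residual  :", np.linalg.norm(X @ w - y))
```

The truncated solve reports an effective rank of 2: the near-duplicate column carries no usable information, and the solver spreads its weight across the two collinear columns rather than amplifying noise.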

Frobenius norm and the trace

For real matrices, ||A||_F² = trace(A^T A). Two consequences:

  1. ||A||_F² is the sum of squared singular values: Σ σ_i². (Same as the trace of Σ².)
  2. The Frobenius inner product <A, B>_F = trace(A^T B) = Σ_{ij} A_{ij} B_{ij} is the natural inner product on the space of matrices. Used in gradient computation (the gradient of a scalar function f(A) with respect to this inner product is the matrix of partial derivatives ∂f/∂A_{ij}).

For the §A13 conjugation-count matrix C of shape (20, 15):

  • ||C||_F² = Σ_{ij} C_{ij}² — direct computation.
  • Equals Σ σ_i² from the SVD — gives an alternative computation that's also a sanity check.
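The sanity check can be sketched as follows. The actual conjugation-count matrix is not reproduced on this page, so the code substitutes a random non-negative integer matrix of the same shape (20, 15); that stand-in, and the second matrix B, are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the count matrix C: random Poisson counts, shape (20, 15).
C = rng.poisson(lam=3.0, size=(20, 15)).astype(float)

fro_direct = (C ** 2).sum()                 # sum of squared entries
fro_trace = np.trace(C.T @ C)               # trace(C^T C)
sigma = np.linalg.svd(C, compute_uv=False)  # 15 singular values
fro_svd = (sigma ** 2).sum()                # sum of squared singular values

assert np.isclose(fro_direct, fro_trace)
assert np.isclose(fro_direct, fro_svd)

# Frobenius inner product computed two ways.
B = rng.normal(size=C.shape)
inner_trace = np.trace(C.T @ B)
inner_elementwise = (C * B).sum()
assert np.isclose(inner_trace, inner_elementwise)

print(fro_direct, fro_trace, fro_svd)
```

In practice the element-wise forms are preferred: `trace(C.T @ B)` builds a full matrix product only to read its diagonal.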

Why norms matter in ML

Five concrete uses:

  1. Gradient clipping (Phase 18). When ||grad||_2 > τ, rescale grad ← τ · grad / ||grad||. Bounds the optimizer's step size; prevents loss explosions on bad batches.
  2. Weight regularization. L2 regularization adds λ ||W||_F² to the loss; L1 adds λ Σ_{ij} |W_{ij}| (the entrywise L1 norm, not the induced ||W||_1 defined above). Both penalize "large" weights, with different sparsity behaviors.
  3. Initialization (Phase 10). Xavier/Glorot init chooses Var(W) = 2/(fan_in + fan_out) so that ||W|| is neither too large nor too small — keeping activations bounded.
  4. LayerNorm / RMSNorm (Phase 10). Both rescale activations so that ||x||_2 ≈ √D before the learned gain is applied (LayerNorm also subtracts the mean first), removing the dependence on input scale.
  5. Spectral normalization. Divide W by σ_max(W) so ||W||_2 = 1. Bounds the layer's Lipschitz constant by 1. Used in some GANs and analysis work.
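Items 1 and 5 in the list above translate directly into code. The following is a minimal NumPy sketch, not the implementation used in later phases: the function names `clip_by_norm` and `spectral_normalize` are hypothetical, and the exact SVD-based σ_max is used for clarity, whereas practical spectral normalization typically estimates it with power iteration.

```python
import numpy as np

def clip_by_norm(grad, tau):
    """Rescale grad to L2 norm tau if it is larger; otherwise return it as-is.

    For a matrix-shaped grad, np.linalg.norm defaults to the Frobenius
    norm, i.e. the L2 norm of the flattened array.
    """
    norm = np.linalg.norm(grad)
    if norm > tau:
        return grad * (tau / norm)
    return grad

def spectral_normalize(W):
    """Divide W by its largest singular value so that ||W||_2 = 1."""
    sigma_max = np.linalg.norm(W, 2)
    return W / sigma_max

rng = np.random.default_rng(0)

g_big = 50.0 * rng.normal(size=1000)     # simulated gradient from a bad batch
g_small = 1e-3 * rng.normal(size=1000)   # already inside the threshold

g_big_clipped = clip_by_norm(g_big, tau=1.0)
g_small_clipped = clip_by_norm(g_small, tau=1.0)

W = rng.normal(size=(64, 32))
W_sn = spectral_normalize(W)

print(np.linalg.norm(g_big), "->", np.linalg.norm(g_big_clipped))
print(np.linalg.norm(W, 2), "->", np.linalg.norm(W_sn, 2))
```

Clipping changes only the length of the gradient, never its direction, which is why it bounds the step size without redirecting the update.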

§A13 example — bounding the change in tense logits

Suppose Borja's tense-classifier weight matrix W of shape (5, D) has been measured with ||W||_2 = 3.0. The hidden state h of shape (D,) has ||h||_2 = 2.0. Then the tense logits z = W h satisfy:

\[ \|z\|_2 \leq \|W\|_2 \|h\|_2 = 3.0 \times 2.0 = 6.0 \]

So no individual logit can exceed 6.0 in absolute value, since |z_i| ≤ ||z||_2. This limits how peaked the post-softmax probabilities can become: the gap between any two logits is at most 12.0 (a loose bound; the L2 constraint actually limits the gap to 6√2 ≈ 8.49). With exp(12) ≈ 1.6e5, softmax gives the top class probability ≤ 1.6e5 / (1.6e5 + 4) ≈ 0.99998. The bound caps how overconfident the model can be.
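The bound can be exercised numerically. The real W and h are not given here, so this sketch uses random stand-ins rescaled to the stated norms; the hidden size D = 32 and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # hypothetical hidden size

# Random stand-ins, rescaled so ||W||_2 = 3.0 and ||h||_2 = 2.0.
W = rng.normal(size=(5, D))
W *= 3.0 / np.linalg.norm(W, 2)
h = rng.normal(size=D)
h *= 2.0 / np.linalg.norm(h)

z = W @ h
print("||z||_2   =", np.linalg.norm(z))
print("max |z_i| =", np.abs(z).max())
assert np.linalg.norm(z) <= 6.0 + 1e-9   # ||Wh|| <= ||W|| ||h||
assert np.abs(z).max() <= 6.0 + 1e-9     # each logit is bounded too

# The bound is attained when h points along the top right singular vector.
_, _, Vt = np.linalg.svd(W)
h_worst = 2.0 * Vt[0]
z_worst_norm = np.linalg.norm(W @ h_worst)
print("worst case =", z_worst_norm)

# Cap on the top-class probability implied by a logit gap of at most 12.
p_top_bound = np.exp(12.0) / (np.exp(12.0) + 4.0)
print("p_top bound =", p_top_bound)
```

A typical random h lands far below the bound; only inputs aligned with the top singular direction of W come close to 6.0.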

This kind of reasoning, multiplied across every layer of MiniGPT, is how you keep training stable. Phase 18 will exercise it.

Drill problems

Solutions in solutions/04-norms-and-conditioning-ref.md (phase-open).

  1. Compute ||v||_1, ||v||_2, ||v||_∞ for v = [3, -4, 0, 1].
  2. Prove that ||v||_∞ ≤ ||v||_2 ≤ ||v||_1 for any finite-dimensional real vector.
  3. The §A13 conjugation-count matrix C.shape = (20, 15) has singular values σ_1, ..., σ_15. Express ||C||_2, ||C||_F, κ(C) in terms of σ.
  4. Derive ||AB||_2 ≤ ||A||_2 ||B||_2 from the SVDs of A and B. (Hint: use that orthogonal matrices preserve L2 norms.)
  5. For Borja's MiniGPT, suppose each layer's weight has ||W||_2 = 1.5. The model has 4 layers, each followed by a non-linearity with Lipschitz constant 1. What is the worst-case Lipschitz bound on the output with respect to the input embedding? Why does this argument fail in practice (residual connections, layer norm)? Save the second answer for Phase 10.
  6. Show that the Frobenius norm equals √(trace(A^T A)).

One-paragraph recap

Vector norms (L1, L2, L∞) measure size; matrix norms (induced operator norms, Frobenius) measure size of matrices. The operator norm ||A||_2 = σ_1 is the most important — it gives the inequality ||Ax|| ≤ ||A|| ||x||, the foundation of every stability argument in ML (gradient clipping, regularization, spectral normalization, init scaling). The condition number κ(A) = σ_1 / σ_n measures sensitivity to perturbations. SVD (theory 03) is the universal tool for computing all of them.

What this page does NOT cover

  • Schatten p-norms (norms of singular value vectors). Out of scope.
  • Nuclear norm (sum of σ). Used in some matrix completion / low-rank recovery contexts; not in this curriculum.
  • Norms on infinite-dimensional spaces. Out of scope.
  • Norm-preserving optimization (Riemannian methods). Out of scope.

Phase 3 theory complete. Next: lab/00-shapes-by-hand.md.