
00: Why a linear-algebra phase before any autograd

The core intuition: 80% of bugs in AI code are shape errors. A good linear-algebra phase teaches you to predict and reason about shapes the same way you reason about types. Without that skill, every forward pass is a lottery.

The lie textbooks tell

A linear algebra textbook tells you a matrix is "an array of numbers" and matmul is C[i,j] = Σ_k A[i,k] B[k,j]. That's the mechanics. It is not what you need.

What you need is the perspective that turns linear algebra into a debugging tool:

A vector is a typed object whose dimensionality is part of its identity.
A matrix is a linear map from one vector space to another. Matrix-vector multiplication is the application of the map. Matrix-matrix multiplication is the composition of two maps.
A tensor is the same idea generalised to multilinear maps with multiple input and output axes.
A shape is the dimensional signature of a tensor, like a type signature on a function. (B, T, V) reads as "B batches of T-step sequences over a V-element vocabulary."

When you adopt this view, two things happen. First, every shape mismatch becomes a type error — visible by inspection, not by running. Second, every operation in a neural network (linear projection, attention, normalization, embedding lookup) becomes a small, named transformation of shapes. The code reads more like dimensional analysis than algebra.
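As a minimal sketch of "shape mismatch as a type error", the helper below predicts the result shape of a 2-D matmul from the operand shapes alone, without touching any data. The name `matmul_shape` is hypothetical (it is not part of NumPy or of this project), and the sizes reuse the (5, D) classifier and 600-form vocabulary mentioned later in this page.

```python
def matmul_shape(a_shape, b_shape):
    """Predict the shape of a @ b for 2-D operands without computing anything."""
    (m, k_a), (k_b, n) = a_shape, b_shape
    if k_a != k_b:
        raise TypeError(
            f"shape mismatch: {a_shape} @ {b_shape}, inner axes {k_a} != {k_b}"
        )
    return (m, n)


# A (5, 64) map composed with a (64, 600) map type-checks.
print(matmul_shape((5, 64), (64, 600)))  # (5, 600)

# Swapping the operands is a type error, caught before any arithmetic runs.
try:
    matmul_shape((64, 600), (5, 64))
except TypeError as err:
    print(err)
```

The check is pure bookkeeping on tuples, which is the point: the error is visible from the signatures, not from the numbers.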

The thesis of Phase 3

Phase 3 trains one habit:

Before you write any tensor expression, write the shape of every operand as a comment. Before you press run, predict the output shape on paper. If your prediction is wrong, the code is wrong — investigate the type error, don't run it.
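One way the habit might look in practice is sketched below. The sizes and the names `h` and `W` are illustrative assumptions, not taken from any later phase's code; the pattern is what matters: a shape comment on every operand, a written prediction, then an assertion that checks the prediction.

```python
import numpy as np

B, T, D, C = 32, 16, 64, 5           # hypothetical sizes: batch, steps, width, classes

h = np.zeros((B, T, D))              # h: (B, T, D) hidden states
W = np.zeros((C, D))                 # W: (C, D)    tense classifier

# Prediction on paper: (B, T, D) contracted with (C, D) over d gives (B, T, C).
logits = np.einsum('btd,cd->btc', h, W)

# The assertion turns the paper prediction into an executable check.
assert logits.shape == (B, T, C)
```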

By the end of Phase 3, you should be able to read an einsum expression like 'btv,vd->btd' and instantly know:

It contracts a (B, T, V) tensor with a (V, D) tensor along the shared v axis.
The result has shape (B, T, D).
The total FLOPs are B × T × V × D × 2.
It is the embedding lookup operation: for each of B × T tokens (represented as one-hot vectors of length V), look up the corresponding D-dimensional embedding from the (V, D) embedding matrix.
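The reading above can be mechanised. The sketch below parses an explicit einsum string, binds each index letter to a size, and returns the output shape and a FLOP count. It assumes a naive dense contraction where every point in the full index space costs one multiply and one add (2 FLOPs), and it handles only explicit `->` specs without ellipses. The function name `einsum_shape_and_flops` is hypothetical.

```python
from math import prod


def einsum_shape_and_flops(spec, *shapes):
    """Return (output_shape, flops) for an explicit einsum spec like 'btv,vd->btd'."""
    inputs, output = spec.split('->')
    sizes = {}
    for labels, shape in zip(inputs.split(','), shapes):
        if len(labels) != len(shape):
            raise ValueError(f"'{labels}' does not match shape {shape}")
        for label, size in zip(labels, shape):
            # Each index letter must mean the same size everywhere it appears.
            if sizes.setdefault(label, size) != size:
                raise ValueError(f"axis '{label}' has conflicting sizes")
    out_shape = tuple(sizes[label] for label in output)
    # Assumption: one multiply-add (2 FLOPs) per point of the full index space.
    flops = 2 * prod(sizes.values())
    return out_shape, flops


B, T, V, D = 32, 16, 600, 64
print(einsum_shape_and_flops('btv,vd->btd', (B, T, V), (V, D)))
# ((32, 16, 64), 39321600)
```

The count is the dense cost. The one-hot structure means almost all of those multiplies are by zero, which is why direct indexing is the fast path.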

You'll write that exact einsum in Phase 13 and again in Phase 17. Phase 3 makes it boring.

Why "first principles": the §A13 lens

The microscopic scope of this project (§A13) is English verb grammar: 20 verbs × 5 tenses × 3 persons + Spanish pairs ≈ 600 forms. Every linear-algebra operation Borja will write in the next 30 phases acts on one of three kinds of data:

Sparse encodings of verb forms: typically one-hot vectors of length 600. A one-hot at index 47 means "the verb form walked."
Dense embeddings: a learned (600, D) matrix mapping each verb form to a D-dim vector. Embedding lookup is E[i] (Python indexing), which equals one_hot(i) @ E (vector-matrix product; E @ one_hot(i) would be a shape error, since E is (600, D)), which equals einsum('vd,v->d', E, one_hot(i)).
Small classification matrices: for example a (5, D) matrix that projects a hidden state to 5-class tense logits.

Phase 3's examples are all built from these three primitives. None of them require AI knowledge; all of them are exactly the operations Borja will use later.
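The three primitives fit in a few lines. This is a sketch with random stand-in values: the matrices `E` and `W_tense` are hypothetical placeholders for learned weights, and D = 64 is borrowed from the preview later in this page.

```python
import numpy as np

V, D, N_TENSES = 600, 64, 5
rng = np.random.default_rng(0)

E = rng.standard_normal((V, D))              # E: (V, D) embedding matrix (stand-in)
W_tense = rng.standard_normal((N_TENSES, D)) # W_tense: (5, D) classifier (stand-in)

i = 47
one_hot = np.zeros(V)                        # one_hot: (V,)
one_hot[i] = 1.0

by_index = E[i]                               # (D,)  Python indexing
by_matmul = one_hot @ E                       # (V,) @ (V, D) -> (D,)
by_einsum = np.einsum('vd,v->d', E, one_hot)  # (D,)

assert np.allclose(by_index, by_matmul)
assert np.allclose(by_index, by_einsum)

# Projecting the looked-up embedding to tense logits: (5, D) @ (D,) -> (5,)
logits = W_tense @ by_index
assert logits.shape == (N_TENSES,)
```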

Concrete preview: the embedding lookup chain

Suppose Borja's MiniGPT (Phase 17) has:

A vocabulary of V = 600 verb forms (the §A13 universe).
An embedding matrix E of shape (V, D) where D = 64.
A batch of B = 32 sequences, each T = 16 tokens long. Each token is an integer index into V.

A forward pass starts by looking up the embedding for each token:

tokens.shape       = (B, T)              # int indices
embedded.shape     = (B, T, D)           # post-lookup

The direct way to compute embedded is:

embedded = E[tokens]   # NumPy advanced indexing — what you'll actually use

The "linear algebra" way is:

# Convert tokens to one-hot: shape (B, T, V)
one_hot = np.eye(V)[tokens]   # shape (B, T, V)
# Matrix-multiply with E: shape (B, T, V) @ (V, D) = (B, T, D)
embedded = np.einsum('btv,vd->btd', one_hot, E)

Both produce identical results. The first is fast (direct indexing); the second is the linear-algebra interpretation. Phase 3 makes you write both and prove they agree. Phase 13 uses the fast one; you'll know exactly what it's doing because of Phase 3.
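A runnable version of the two snippets, with the agreement check included, might look like this. The embedding matrix and token indices are random stand-ins (the real ones come from later phases), and the variable names `embedded_fast` and `embedded_linalg` are illustrative.

```python
import numpy as np

B, T, V, D = 32, 16, 600, 64
rng = np.random.default_rng(0)

E = rng.standard_normal((V, D))              # E: (V, D), stand-in for learned weights
tokens = rng.integers(0, V, size=(B, T))     # tokens: (B, T) int indices into V

# Fast path: advanced indexing gathers one row of E per token.
embedded_fast = E[tokens]                    # (B, T, D)

# Linear-algebra path: one-hot encode, then contract over the vocabulary axis.
one_hot = np.eye(V)[tokens]                  # (B, T, V)
embedded_linalg = np.einsum('btv,vd->btd', one_hot, E)  # (B, T, D)

assert embedded_fast.shape == (B, T, D)
assert np.allclose(embedded_fast, embedded_linalg)
```

The one-hot tensor here holds B × T × V floats just to select B × T rows, which makes the cost difference between the two paths concrete.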

Why this matters specifically for the rest of the curriculum

Five claims that should make sense after Phase 3, but probably look like jargon now:

Embedding lookup is matrix-vector multiplication with a one-hot vector. Phase 13 will literally implement it both ways and show the equivalence. Phase 3 derives why.
Attention is three linear projections, two matmuls, and a softmax (Phase 15). The Q, K, V projections are (D, D_k) linear maps; the attention scores are Q @ K^T; the output is attention @ V. Each is a matmul with named shapes. Phase 15 will be a Phase-3 exercise in disguise.
LayerNorm and RMSNorm are scale-invariant rescalings (Phase 10). The math is one matmul-by-a-diagonal and one rescaling. Trivial in einsum.
LoRA (Phase 28) adds a low-rank update to a frozen D × D weight matrix: the product of two thin matrices, (D × r) @ (r × D). It's a constrained matmul with the same factored form as the rank-k truncation Borja will do in lab/02-svd-compression.md.
Quantization (Phase 26) is a per-channel scaling of a matmul. The math is (s_a · q_a) @ (s_b · q_b), where each scale s is fp32 and each q is int8. The shape arithmetic is identical to standard matmul; the precision argument is Phase 2.
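To make the attention claim concrete, here is a single-head sketch with every shape written out. It assumes the common scaled dot-product form (scores divided by sqrt(D_k)) and omits masking and multiple heads; the weights are random stand-ins and all names are illustrative rather than the Phase 15 implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


B, T, D, D_k = 2, 16, 64, 16
rng = np.random.default_rng(0)

x = rng.standard_normal((B, T, D))           # x: (B, T, D)
W_q = rng.standard_normal((D, D_k))          # each projection: (D, D_k)
W_k = rng.standard_normal((D, D_k))
W_v = rng.standard_normal((D, D_k))

Q = np.einsum('btd,dk->btk', x, W_q)         # (B, T, D_k)
K = np.einsum('btd,dk->btk', x, W_k)         # (B, T, D_k)
Vh = np.einsum('btd,dk->btk', x, W_v)        # (B, T, D_k)

# Q @ K^T per batch: query position q against key position t.
scores = np.einsum('bqk,btk->bqt', Q, K) / np.sqrt(D_k)   # (B, T, T)
attn = softmax(scores, axis=-1)                           # (B, T, T), rows sum to 1
out = np.einsum('bqt,btk->bqk', attn, Vh)                 # (B, T, D_k)

assert out.shape == (B, T, D_k)
```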

Every one of those statements is a shape + matmul argument. You will not be able to read any later phase's code until Phase 3 lands.
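The LoRA claim is a good example of a pure shape argument. The sketch below assumes the common formulation in which the low-rank product is added to a frozen weight; the matrices are random stand-ins, and the names `A`, `B_lr`, and `delta` are illustrative.

```python
import numpy as np

D, r = 64, 4
rng = np.random.default_rng(0)

W = rng.standard_normal((D, D))      # W: (D, D) frozen weight (stand-in)
A = rng.standard_normal((D, r))      # A: (D, r)
B_lr = rng.standard_normal((r, D))   # B_lr: (r, D)

delta = A @ B_lr                     # (D, r) @ (r, D) -> (D, D), rank at most r
assert np.linalg.matrix_rank(delta) <= r

x = rng.standard_normal(D)           # x: (D,)

# Two equivalent ways to apply the adapted weight to x.
y_full = (W + delta) @ x             # materialise the (D, D) update first
y_factored = W @ x + A @ (B_lr @ x)  # never form the (D, D) update
assert np.allclose(y_full, y_factored)

# Parameter count of the update: dense vs factored.
print(D * D, 2 * D * r)              # 4096 512
```

The factored path applies two thin matmuls instead of one square one, which is where the parameter and FLOP savings come from when r is much smaller than D.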

The path through Phase 3

Theory 01 introduces tensors as typed objects, with shape arithmetic and the einsum grammar. It is the flat reference exposition you will come back to throughout the curriculum.
Theory 02 does matmul in three ways — as composition of linear maps, as sums-of-outer-products, as einsum. The same operation, three angles, so Borja never sees an "I don't recognize this notation" moment in a later phase.
Theory 03 does SVD: rotate-scale-rotate decomposition, rank, low-rank approximation, Eckart-Young theorem. Applied to a (20, 15) conjugation-count matrix.
Theory 04 does norms: vector norms, matrix norms, operator norm, condition number. Submultiplicativity is derived from SVD.
Labs 00–03 make you do shape prediction, matmul performance measurement on your i5-8250U, SVD compression of a verb-form matrix, and norm-identity verification.
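As a preview of what Theory 03 and 04 lead to, the sketch below truncates a (20, 15) matrix to rank k and checks the Eckart-Young error identities numerically, then checks submultiplicativity of the operator norm. The count matrix is random Poisson data standing in for a real conjugation-count matrix, and k = 3 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a (20 verbs, 15 forms) conjugation-count matrix.
counts = rng.poisson(lam=3.0, size=(20, 15)).astype(float)

# Thin SVD: U (20, 15), s (15,), Vt (15, 15), singular values in descending order.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)

k = 3
approx = (U[:, :k] * s[:k]) @ Vt[:k, :]      # best rank-k approximation, (20, 15)

# Eckart-Young: the spectral error is the first discarded singular value,
# and the Frobenius error is the root-sum-square of all discarded ones.
spectral_err = np.linalg.norm(counts - approx, ord=2)
frob_err = np.linalg.norm(counts - approx, ord='fro')
assert np.isclose(spectral_err, s[k])
assert np.isclose(frob_err, np.sqrt(np.sum(s[k:] ** 2)))

# Submultiplicativity of the operator norm: ||A B|| <= ||A|| ||B||.
M = rng.standard_normal((15, 10))
lhs = np.linalg.norm(counts @ M, ord=2)
rhs = np.linalg.norm(counts, ord=2) * np.linalg.norm(M, ord=2)
assert lhs <= rhs + 1e-9
```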

What this phase is NOT

Phase 3 is not a numerical linear algebra course. We do not derive the QR algorithm. We do not implement SVD from scratch (the effort outweighs the educational value at this scale). We do not study iterative solvers (CG, GMRES). We do not formally treat eigenvalue problems of non-symmetric matrices.

Phase 3 is a literacy and intuition phase: by the end, you can read tensor code, predict shapes, anticipate cost, and use SVD as a thinking tool.

Stop here if

You are tempted to skip Phase 3 because "I took linear algebra in college." Don't. The college version focused on hand-computing 2×2 inverses. The AI version focuses on (B, T, D) shape arithmetic and the 1e-7 precision floor inherited from Phase 2. They're different skills. The test is: can you read einsum('bhqk,bhkd->bhqd', attn, v) and immediately say what it does, what shape comes out, and how many FLOPs it costs? If not, Phase 3 is for you.
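For reference, one way to check your answer to that test is sketched below. The axis sizes are small arbitrary values, the attention weights are random rows normalised to sum to 1 rather than real softmax output, and the FLOP count assumes one multiply-add (2 FLOPs) per point of the b, h, q, k, d index space.

```python
import numpy as np

B, H, Q, K, D = 2, 4, 16, 16, 8      # batch, heads, query steps, key steps, head width
rng = np.random.default_rng(0)

attn = rng.random((B, H, Q, K))              # attn: (B, H, Q, K)
attn /= attn.sum(axis=-1, keepdims=True)     # each row of weights sums to 1
v = rng.standard_normal((B, H, K, D))        # v: (B, H, K, D)

# Contract over k: each query position takes a weighted sum of value vectors.
out = np.einsum('bhqk,bhkd->bhqd', attn, v)  # (B, H, Q, D)

assert out.shape == (B, H, Q, D)
assert np.allclose(out, attn @ v)            # same as a batched matmul over (b, h)

flops = 2 * B * H * Q * K * D
print(flops)                                 # 32768
```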

One-paragraph recap

Linear algebra is the language of AI mechanics: tensors are shape-typed objects, matrix operations are typed transformations, and bugs are mostly type (shape) errors. Phase 3 trains you to predict shapes from einsum strings, to derive matmul performance from the Phase-1 roofline, and to use SVD as a low-rank compression and analysis tool — anchored to §A13's verb-form vocabulary (one-hot encodings of 600 conjugations) and to the conjugation-count matrices Borja's later corpus will produce.

What this phase does NOT cover

Gradients / autograd (Phase 4, 7).
A reusable src/minigrad/linalg.py module (Phase 7/8).
Numerical algorithms for SVD itself (out of scope).
GPU matmul / cuBLAS / Triton (Phase 23+).

Next: theory/01-tensors-and-shapes.md.