01 — Tensors, shapes, and einsum as a unifying grammar

A tensor is just an N-dimensional array with a shape. Learning to read einsum ("repeated indices are summed, free ones appear in the output") gives you a notation that unifies dot, outer, matmul, batched matmul, and contraction in a single grammar. This is the alphabet for the rest of the course. Examples: one-hot vectors that encode verb forms from §A13, matrices that act as lookup tables.


What "tensor" actually means here

In math, "tensor" has a precise coordinate-free meaning (multilinear map between vector spaces). In ML practice, "tensor" means an N-dimensional array of numbers, plus a shape, plus a dtype. We'll use the ML meaning throughout, but the mathematical view is worth keeping nearby — the operations we do on tensors are legitimate linear-algebra operations, expressed in a coordinate frame.

A tensor has three observable attributes:

| Attribute | Example |
| --- | --- |
| Rank (ndim) | scalar = 0, vector = 1, matrix = 2, higher = 3+ |
| Shape | a tuple of dimension sizes: (3,), (4, 5), (2, 3, 4) |
| Dtype | np.float32, np.int64, etc. |

For Phase 3, dtype is fp32 throughout. Phase 2 covered the dtype zoo.
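
As a quick illustration, NumPy exposes all three attributes directly. The array below is a made-up example, not something taken from the curriculum:

```python
import numpy as np

# Hypothetical example array: 2 blocks of 3 rows by 4 columns, all zeros.
x = np.zeros((2, 3, 4), dtype=np.float32)

rank = x.ndim      # number of axes -> 3
shape = x.shape    # tuple of dimension sizes -> (2, 3, 4)
dtype = x.dtype    # element type -> float32

print(rank, shape, dtype)
```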

The §A13 menagerie of shapes

The shapes Borja will see in this curriculum cluster around a handful of named dimensions:

| Symbol | Meaning | Typical value (§A13 scale) |
| --- | --- | --- |
| B | batch size | 32 |
| T | sequence length (tokens) | 16 |
| V | vocabulary size | 600 (§A13) |
| D | embedding / model dimension | 64 |
| H | number of attention heads | 4 |
| D_k | per-head dimension (D / H) | 16 |
| D_ff | feed-forward intermediate | 256 |
| K | top-k or rank | small (5, 8, 32) |

A few examples Borja will encounter:

  • A token batch: tokens.shape = (B, T), dtype int64, entries in [0, V).
  • The embedding matrix: E.shape = (V, D).
  • The post-embedding activations: x.shape = (B, T, D).
  • The attention scores across all heads: scores.shape = (B, H, T, T). For a single head, that is (B, T, T).
  • The output logits over the vocabulary: logits.shape = (B, T, V).

If you make the shape annotation a habit now, every later phase will be clearer.
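
The shapes listed above can be reproduced in a few lines. This is a minimal sketch using the table's sizes; the random contents, the seed, and the reuse of E as the output projection are assumptions made only for illustration:

```python
import numpy as np

# Sizes from the table above (§A13 scale).
B, T, V, D, H = 32, 16, 600, 64, 4
D_k = D // H  # per-head dimension, 16

rng = np.random.default_rng(0)  # arbitrary seed

# (B, T) int64 token ids in [0, V)
tokens = rng.integers(0, V, size=(B, T), dtype=np.int64)

# (V, D) embedding matrix
E = rng.standard_normal((V, D), dtype=np.float32)

# (B, T, D) post-embedding activations, via integer indexing
x = E[tokens]

# (B, T, V) logits; using E.T as the projection is an assumption of this sketch
logits = x @ E.T

print(tokens.shape, E.shape, x.shape, logits.shape)
```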

Shape arithmetic

Most operations have shape rules you can compute by hand:

| Operation | Input shapes | Output shape |
| --- | --- | --- |
| Elementwise (+, *, np.exp(x)) | S, S | S (with broadcasting; see below) |
| Dot product of vectors | (N,), (N,) | () (scalar) |
| Outer product | (M,), (N,) | (M, N) |
| Matmul (2D) | (M, K), (K, N) | (M, N) |
| Batched matmul | (B, M, K), (B, K, N) | (B, M, N) |
| Reduction (x.sum(axis=k)) | S | S with axis k removed |
| Transpose | (D_0, D_1, ..., D_{n-1}) | permutation of those dims |
| Reshape | S, S' (same total) | S' |

You will compose these constantly. The skill: read code and predict the resulting shape before running.
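
One way to practice is to predict each shape, then let NumPy confirm it. The sketch below walks the rows of the shape table with small, arbitrary sizes (the variable names are illustrative only):

```python
import numpy as np

M, K, N, B = 3, 4, 5, 2  # small hypothetical sizes

u, v = np.ones(M), np.ones(N)
A, C = np.ones((M, K)), np.ones((K, N))
bA, bC = np.ones((B, M, K)), np.ones((B, K, N))

shapes = {
    "elementwise": (A + A).shape,              # (M, K)
    "dot": np.dot(u, u).shape,                 # ()
    "outer": np.outer(u, v).shape,             # (M, N)
    "matmul": (A @ C).shape,                   # (M, N)
    "batched_matmul": (bA @ bC).shape,         # (B, M, N)
    "reduction": bA.sum(axis=1).shape,         # (B, K): axis 1 removed
    "transpose": bA.transpose(2, 0, 1).shape,  # (K, B, M)
    "reshape": bA.reshape(B * M, K).shape,     # (B*M, K): same total size
}

for name, shape in shapes.items():
    print(f"{name:15s} {shape}")
```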

Broadcasting — when shapes don't match

NumPy (and PyTorch) extend shape-matching with broadcasting: an axis of size 1 is virtually expanded to match the other operand's size. The rules, in order:

  1. Align the shapes on the right by left-padding the shorter one with 1s.
  2. For each axis (rightmost first), the sizes must be equal, or one of them must be 1.
  3. The result shape is the axis-wise max.

Example. x.shape = (B, T, D) and mean.shape = (B, T, 1). Computing x - mean:

  • Aligned shapes: (B, T, D) and (B, T, 1).
  • Axis-wise: B=B, T=T, D vs 1 → broadcast → D.
  • Result: (B, T, D).

Common case: x.shape = (T, D) and bias.shape = (D,). Computing x + bias:

  • Aligned shapes: (T, D) and (1, D) (left-pad).
  • Result: (T, D). Correct.

Common pitfall: x.shape = (T,) and bias.shape = (T,) add elementwise, giving shape (T,), which is not what you wanted if you meant a column vector. And if x were the column (T, 1), x + bias would silently broadcast to (T, T). Always think about (T, 1) vs (T,). Phase 6's broadcasting lab beats this in.
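
The cases above can be checked without writing any arithmetic by hand. A minimal sketch, assuming NumPy 1.20 or newer (for np.broadcast_shapes) and small arbitrary sizes:

```python
import numpy as np

B, T, D = 2, 3, 4  # small hypothetical sizes

x = np.ones((B, T, D), dtype=np.float32)
mean = x.mean(axis=-1, keepdims=True)   # (B, T, 1): keepdims keeps the size-1 axis
centered = x - mean                     # (B, T, D)

# np.broadcast_shapes applies the rules to shapes only, with no data allocated.
bias_case = np.broadcast_shapes((T, D), (D,))   # (T, D)
pitfall = np.broadcast_shapes((T, 1), (T,))     # (T, T), usually not intended

# Shapes compared from the rightmost axis: D vs T is neither equal nor 1.
try:
    np.broadcast_shapes((T, D), (T,))
    raised = False
except ValueError:
    raised = True

print(centered.shape, bias_case, pitfall, raised)
```

Keeping keepdims=True on reductions is one habit that avoids the (T, 1) vs (T,) confusion, since the reduced axis stays in place with size 1.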

Einstein summation — the unifying grammar

Einstein summation (einsum) is a notation that names every axis with a single letter, then specifies which axes are contracted (summed over) and which remain. It unifies dot, outer, matmul, batched matmul, transpose, sum, and most other linear-algebra operations under one grammar.

The two rules

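  1. An index that appears in the inputs but not in the output is summed over. The usual case is an index repeated across both inputs of a binary op, such as j in 'ij,jk->ik'.
  2. An index that appears in the output is a free dimension (one axis of the output), even if it is repeated in the inputs.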

That's it. The whole grammar.

Six canonical patterns

| Operation | Einsum string | What's contracted |
| --- | --- | --- |
| Dot product | 'i,i->' | i (sum over the shared axis) |
| Outer product | 'i,j->ij' | nothing (no shared index) |
| Matrix-vector | 'ij,j->i' | j |
| Matrix-matrix | 'ij,jk->ik' | j |
| Element-wise vec product | 'i,i->i' | nothing (i is repeated but kept in the output) |
| Frobenius inner product | 'ij,ij->' | both i and j |

The mental skill: read the indices, identify which are absent from the output (those get summed) and which are free (those become output axes), then compute the shape.
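
To make the six patterns concrete, the sketch below compares each einsum string with the NumPy call it is meant to match. The array sizes and names are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
u, w = rng.standard_normal(4), rng.standard_normal(4)
v = rng.standard_normal(5)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((3, 4))

checks = {
    "dot": np.allclose(np.einsum('i,i->', u, w), np.dot(u, w)),
    "outer": np.allclose(np.einsum('i,j->ij', u, v), np.outer(u, v)),
    "matvec": np.allclose(np.einsum('ij,j->i', A, u), A @ u),
    "matmul": np.allclose(np.einsum('ij,jk->ik', A, B), A @ B),
    "elementwise": np.allclose(np.einsum('i,i->i', u, w), u * w),
    "frobenius": np.allclose(np.einsum('ij,ij->', A, C), np.sum(A * C)),
}

print(checks)
```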

Three §A13-anchored examples

Embedding lookup as einsum. Given a one-hot vector e_v of shape (V,) and embedding matrix E of shape (V, D):

result = np.einsum('v,vd->d', e_v, E)

Read: v is repeated (in both operands), so it's summed. d is free (only in E), so it's the output axis. Shape: (D,). Value: the row of E indexed by the position of the 1 in e_v.

Batched embedding lookup. With tokens_one_hot of shape (B, T, V) and E of shape (V, D):

result = np.einsum('btv,vd->btd', tokens_one_hot, E)

v is summed; b, t, d are free. Result shape: (B, T, D). This is exactly the operation a transformer's input embedding layer performs (or rather, simulates — real code uses E[tokens] indexing).

Attention scores. With Q and K each of shape (B, H, T, D_k):

scores = np.einsum('bhqd,bhkd->bhqk', Q, K)

d (the per-head dimension) is summed; b, h, q, k are free. Result shape: (B, H, T, T). This is the raw attention score matrix for each batch and each head. Phase 15 builds on this.
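
These examples can be cross-checked against plain NumPy. The sketch below uses small made-up sizes rather than the §A13 ones, and builds the one-hot tensor with np.eye, which is just one convenient way to do it:

```python
import numpy as np

B, T, V, D, H, D_k = 2, 3, 7, 8, 2, 4  # small hypothetical sizes
rng = np.random.default_rng(0)          # arbitrary seed

tokens = rng.integers(0, V, size=(B, T))   # (B, T)
E = rng.standard_normal((V, D))            # (V, D)
tokens_one_hot = np.eye(V)[tokens]         # (B, T, V)

# (B, T, D): one-hot contraction vs. plain integer indexing
via_einsum = np.einsum('btv,vd->btd', tokens_one_hot, E)
via_index = E[tokens]

Q = rng.standard_normal((B, H, T, D_k))    # (B, H, T, D_k)
K = rng.standard_normal((B, H, T, D_k))    # (B, H, T, D_k)

# (B, H, T, T): einsum vs. matmul with the last two axes of K swapped
scores = np.einsum('bhqd,bhkd->bhqk', Q, K)
scores_matmul = Q @ np.swapaxes(K, -1, -2)

print(np.allclose(via_einsum, via_index), np.allclose(scores, scores_matmul))
```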

Ellipsis (...)

... in an einsum string means "any number of leading axes that are passed through". Useful when the same operation must work on both batched and unbatched inputs:

np.einsum('...ij,...jk->...ik', A, B)   # matmul, batched or not

Use sparingly — explicit axes are easier to read.
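
A small sketch of the same string handling both cases, with arbitrary shapes chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
A2, B2 = rng.standard_normal((3, 4)), rng.standard_normal((4, 5))
A3, B3 = rng.standard_normal((2, 3, 4)), rng.standard_normal((2, 4, 5))

pattern = '...ij,...jk->...ik'
unbatched = np.einsum(pattern, A2, B2)   # (3, 5): the ellipsis matches zero axes
batched = np.einsum(pattern, A3, B3)     # (2, 3, 5): the ellipsis matches one axis

print(unbatched.shape, batched.shape)
```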

Sum reduction via einsum

A sum over the last axis: np.einsum('ij->i', M). The j is omitted in the output, so it's summed. Equivalent to M.sum(axis=-1).

Diagonal extraction

np.einsum('ii->i', M) extracts the diagonal of a square matrix M. Repeated index on the same operand → diagonal.
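
Both single-operand patterns can be verified on a tiny matrix. The 3x3 matrix here is a made-up example:

```python
import numpy as np

# Hypothetical 3x3 matrix with entries 0..8, row by row.
M = np.arange(9, dtype=np.float32).reshape(3, 3)

row_sums = np.einsum('ij->i', M)   # values 3, 12, 21; same as M.sum(axis=-1)
diag = np.einsum('ii->i', M)       # values 0, 4, 8; same as np.diag(M)

print(row_sums, diag)
```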

Why einsum is the right grammar for ML

Three reasons:

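  1. Self-documenting shapes. The indices in the einsum string are the shape annotations. It is hard to write a matmul whose shapes don't agree: a size mismatch on a shared index typically raises an error instead of yielding a wrong-shaped result.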
  2. One operation, all backends. np.einsum, torch.einsum, jax.numpy.einsum, and even tf.einsum accept the same strings (modulo a few extensions). Code written in einsum ports across frameworks.
  3. No mental load for broadcasting + reduction combinations. A pattern like "for every head, dot Q rows with K rows" is one einsum string; in regular NumPy it becomes a chain of transpose + reshape + matmul + reshape back.

You'll write maybe 50 einsum strings before Phase 30. The first 5 will feel awkward. The rest will feel native.

Performance — does einsum cost more than matmul?

For two-tensor contractions, rarely in a way that matters here. 'ij,jk->ik' computes the same result as np.matmul. By default np.einsum runs its own C loops; with optimize=True it can hand standard patterns to BLAS-backed routines, which helps on large matrices.

For three-or-more-tensor contractions (rare in ML inference, common in MERA tensor networks), naive einsum can be much slower than a well-chosen contraction order. np.einsum(..., optimize=True) searches for a good order automatically (a greedy search by default; optimize='optimal' searches exhaustively).

For Phase 3's purposes (and most of this curriculum), einsum is close enough to the underlying matmul in speed that the difference does not matter. You gain mental clarity at essentially no cost.
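
To see what contraction order NumPy would pick, np.einsum_path can be queried before running anything. The sketch below uses a three-matrix chain with arbitrary sizes; it only illustrates the API and is not a benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
a = rng.standard_normal((8, 64))
b = rng.standard_normal((64, 64))
c = rng.standard_normal((64, 4))

# Ask for a pairwise contraction order without computing the result.
path, report = np.einsum_path('ij,jk,kl->il', a, b, c, optimize='greedy')

default = np.einsum('ij,jk,kl->il', a, b, c)                   # no order search
optimized = np.einsum('ij,jk,kl->il', a, b, c, optimize=path)  # reuse the found path

print(path)    # a list that starts with 'einsum_path', then index pairs
print(report)  # human-readable summary of the chosen order
```

Reusing a precomputed path avoids repeating the search when the same contraction runs many times.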

Shape comment convention

This convention is used by every later phase. Adopt it now:

# (B, T, D)            ← input
x = ...

# (V, D)               ← weight
W = ...

# (B, T, V)            ← output logits
logits = np.einsum('btd,vd->btv', x, W)

The shape comment lives on the line directly above the assignment. If the shape changes (transpose, reshape, contraction), the comment updates. CI doesn't enforce it; discipline does.
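
For reference, here is a runnable variant of the same convention. The zero-filled arrays and small sizes are placeholders, and the final assert is an optional addition, not part of the convention itself:

```python
import numpy as np

B, T, D, V = 2, 3, 4, 5  # small hypothetical sizes

# (B, T, D)            ← input
x = np.zeros((B, T, D), dtype=np.float32)

# (V, D)               ← weight
W = np.zeros((V, D), dtype=np.float32)

# (B, T, V)            ← output logits
logits = np.einsum('btd,vd->btv', x, W)

# Optional: an assert turns the shape comment into a checked claim.
assert logits.shape == (B, T, V)
```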

Drill problems

Solutions in solutions/01-tensors-and-shapes-ref.md (phase-open). Work these on paper.

  1. Given A.shape = (3, 4, 5), what is the shape of np.einsum('ijk->kji', A)?
  2. Given x.shape = (B, T, D) and y.shape = (B, T, D), what is the shape of np.einsum('btd,btd->bt', x, y)? What is it computing?
  3. Given e_v.shape = (V,) (one-hot, single token) and E.shape = (V, D), write the einsum that produces the embedding. Now write it for a batch of one-hot vectors tokens.shape = (B, T, V).
  4. Given scores.shape = (B, H, T, T) (attention scores) and V.shape = (B, H, T, D_k) (values), write the einsum that computes the attention output (B, H, T, D_k).
  5. What does np.einsum('ii->', M) compute on a square matrix?
  6. The MiniGPT hidden states at the final layer have shape (B, T, D). They get projected back to vocabulary space by multiplying with W_out of shape (V, D), but the output should be the logits, with shape (B, T, V). Write the einsum. What's the rule about transpose conventions?

One-paragraph recap

Tensors are N-dimensional arrays with shape and dtype; shape is a type signature you should comment everywhere. NumPy's broadcasting + einsum together give a grammar that expresses dot, outer, matmul, batched matmul, contraction, transpose, and reduction in one notation. Memorize the einsum rules (an index missing from the output is summed, an index in the output is free) until you can predict shapes by reading. Every later phase relies on this skill.

What this page does NOT cover

  • The numerical accuracy of contractions (Phase 2 + 3.4 norms).
  • Custom-broadcast tricks that fight the rules (out of scope).
  • torch.einsum differences (Phase 25).

Next: theory/02-matmul-and-shapes.md.