
00 - Motivation: from scalar to tensor

Phase 7 left you with the mechanics of autograd. Nothing new is invented here (the algorithm is identical), but the float is replaced by a NumPy ndarray, and with that come two new sources of complexity: reverse broadcasting (summing gradients along expanded axes) and the tensor derivatives of ops such as matmul and softmax. This phase teaches you not to fear them.

The shape of the new complexity

Phase 7's Value held a single Python float. Phase 8's Tensor holds a NumPy ndarray. The autograd algorithm is identical:

A Tensor is born from an op.
The op records _prev, _op, and a _backward closure.
backward() does a topo sort and reverse traversal.
_backward contributes to parents' .grad.
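The four steps above can be sketched in a few lines. This is a minimal illustration, not the lab solution: it assumes same-shape operands only (no broadcasting yet), implements just `add` and `sum`, and the constructor signature is a guess at one reasonable layout for the fields the text names.

```python
import numpy as np

class Tensor:
    """Illustrative node: fields as named in the text, only add and sum implemented."""

    def __init__(self, data, _prev=(), _op=""):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = np.zeros_like(self.data)   # same shape as data
        self._prev = tuple(_prev)              # parent tensors
        self._op = _op                         # name of the op that produced this node
        self._backward = lambda: None          # leaves have nothing to propagate

    def __add__(self, other):
        # Same-shape operands only; broadcasting is deliberately left out here.
        out = Tensor(self.data + other.data, (self, other), "add")

        def _backward():
            self.grad += out.grad              # accumulate with +=, never assign
            other.grad += out.grad
        out._backward = _backward
        return out

    def sum(self):
        out = Tensor(self.data.sum(), (self,), "sum")

        def _backward():
            # Every input element contributes with weight 1 to the sum.
            self.grad += out.grad * np.ones_like(self.data)
        out._backward = _backward
        return out

    def backward(self):
        topo, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._prev:
                    visit(parent)
                topo.append(node)              # parents are appended before children

        visit(self)
        self.grad = np.ones_like(self.data)    # seed: d(self)/d(self) = 1
        for node in reversed(topo):            # children before parents
            node._backward()


a = Tensor([1.0, 2.0, 3.0])
b = Tensor([10.0, 20.0, 30.0])
loss = (a + a + b).sum()
loss.backward()
print(a.grad, b.grad)   # [2. 2. 2.] [1. 1. 1.]
```

Because `a` is used twice, its gradient is 2 everywhere; that only comes out right because `_backward` accumulates with `+=`.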

What's new:

Shapes. Every Tensor has a data.shape. Its .grad has the same shape as .data (so optimizer step p.data -= lr · p.grad makes elementwise sense).
Broadcasting in forward. a + b where a.shape = (3,) and b.shape = (4, 3) produces a (4, 3) output. NumPy handles forward.
Broadcasting in backward. When backward runs, a should receive a (3,) gradient and b a (4, 3) gradient. NumPy doesn't do this for us. We have to sum the upstream (4, 3) gradient along the broadcast axis to get the right shape for a. This is the new source of bugs.
Per-op derivatives that aren't trivial. Matmul, softmax, cross-entropy. These need careful derivation (theory/02 and theory/03).
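Reduction ops. sum and mean reduce along axes; their backward must broadcast_to the upstream gradient back to the input shape (mean also scales it by 1/n). softmax normalizes along an axis without changing the shape, but its backward contains a reduction along that axis too.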
requires_grad. Not all tensors need gradients — input data doesn't, weights do. A flag controls graph construction.
Testing strategy escalates. Per-op cross-check against PyTorch isn't enough. We also use gradcheck (finite-difference verification) and hypothesis property tests (random shape fuzzing).
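As a preview of what a "non-trivial per-op derivative" looks like, here is a plain-NumPy sketch of the standard matmul backward for C = A @ B: dA = G @ Bᵀ and dB = Aᵀ @ G, where G is the upstream gradient. The shapes, the hidden width `H`, and the helper `loss` are assumptions chosen for illustration; the derivation itself belongs to theory/02. The result is checked against a finite-difference directional derivative.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 4                                    # hidden width, chosen arbitrarily
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, H))
G = rng.standard_normal((3, H))          # upstream gradient dL/dC for C = A @ B

# Closed-form backward for C = A @ B.
dA = G @ B.T                             # shape (3, 5), same as A
dB = A.T @ G                             # shape (5, H), same as B

def loss(A_, B_):
    # A scalar whose gradient with respect to C is exactly G.
    return float(np.sum((A_ @ B_) * G))

# Directional check: move A along a random direction D and compare the slope
# of the loss with the inner product <dA, D>. Same for B.
eps = 1e-6
D_A = rng.standard_normal(A.shape)
D_B = rng.standard_normal(B.shape)
numeric_A = (loss(A + eps * D_A, B) - loss(A - eps * D_A, B)) / (2 * eps)
numeric_B = (loss(A, B + eps * D_B) - loss(A, B - eps * D_B)) / (2 * eps)
analytic_A = float(np.sum(dA * D_A))
analytic_B = float(np.sum(dB * D_B))

print(abs(numeric_A - analytic_A) < 1e-6, abs(numeric_B - analytic_B) < 1e-6)
```

Note that each gradient has the shape of the operand it belongs to, which is the shape contract this phase keeps returning to.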

Why "build it once at tensor grain"

Same argument as Phase 7's "build at scalar grain": framework complexity has three sources, and Phase 7 isolated source #1 (the algorithm). Phase 8 now adds sources #2 (NumPy mechanics, which Phase 6 prepared us for) and #3 (per-op tensor derivatives, which we'll derive carefully).

After Phase 8, by the time Borja imports PyTorch in Phase 25, every core mechanism of PyTorch's autograd engine is something Borja has already implemented at smaller scale. The library stops being magic.

Topic anchor (§A13)

All worked shapes in this phase are drawn from the English verb grammar grid:

A (3, 5) tensor = (person × tense) logits for one verb. Reductions over axis=0 give per-tense scores; over axis=1 give per-person scores.
A (20, 3, 5) tensor = the full grammar grid (20 verbs × 3 persons × 5 tenses). Matmul against a (5, H) projection produces hidden representations.
A (B,) integer target tensor with values in {0..4} = the correct tense index for each example in a batch of size B.

The autograd code is grammar-agnostic — Tensor.matmul does not know what its axes mean. But every example chosen for theory and lab uses these shapes, so by the end of Phase 8 Borja's mental shape model and the §A13 corpus are the same thing.
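The shapes listed above can be checked directly in NumPy. The sketch below uses random values, and the hidden width `H = 8` and batch size `B = 6` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8                                        # hidden width (assumed)
B = 6                                        # batch size (assumed)

logits = rng.standard_normal((3, 5))         # (person, tense) for one verb
per_tense = logits.sum(axis=0)               # persons reduced away -> (5,)
per_person = logits.sum(axis=1)              # tenses reduced away  -> (3,)

grid = rng.standard_normal((20, 3, 5))       # (verb, person, tense)
proj = rng.standard_normal((5, H))
hidden = grid @ proj                         # contracts the tense axis -> (20, 3, H)

targets = rng.integers(0, 5, size=B)         # tense indices in {0..4}, shape (B,)

print(per_tense.shape, per_person.shape, hidden.shape, targets.shape)
# (5,) (3,) (20, 3, 8) (6,)
```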

What this phase produces

A src/minitorch/tensor.py of ~400 LOC by Borja, implementing:

The Tensor class with data, grad, _prev, _op, _backward, requires_grad.
20+ ops in three families:
Elementwise: add sub mul div neg exp log relu gelu tanh.
Reduction/shape: sum mean reshape transpose broadcast_to getitem cat stack.
High-stakes: matmul softmax cross_entropy.
backward() doing topo + reverse traversal (identical structure to Phase 7).
Cross-checks against PyTorch FP64.
gradcheck infrastructure (finite differences).
hypothesis-based property tests fuzzing random shape/op combinations.

And one toy ML experiment: a 2-layer tensor MLP trained on a grammar dataset — input one-hot(verb) ⊕ one-hot(person) (23-dim), output logits over the 5 tenses, integer targets, ~60 train / 30 val examples drawn from the §A13 conjugation grid. It will not be impressive — that's the point. The point is gradcheck passes for every op and the training works end-to-end with the autograd you wrote.
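To make the experiment's input and output shapes concrete, here is a forward-only sketch in plain NumPy. The helper `encode`, the hidden width of 16, and the weight initialization are all assumptions for illustration; the real experiment runs on the Tensor class so that gradients flow.

```python
import numpy as np

N_VERBS, N_PERSONS, N_TENSES = 20, 3, 5
HIDDEN = 16                                   # hidden width (assumed)

def encode(verb_idx, person_idx):
    """Concatenate one-hot(verb) and one-hot(person) into a 23-dim vector."""
    x = np.zeros(N_VERBS + N_PERSONS)
    x[verb_idx] = 1.0
    x[N_VERBS + person_idx] = 1.0
    return x

rng = np.random.default_rng(0)
W1 = rng.standard_normal((N_VERBS + N_PERSONS, HIDDEN)) * 0.1
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, N_TENSES)) * 0.1
b2 = np.zeros(N_TENSES)

# A batch of three (verb, person) pairs.
X = np.stack([encode(0, 0), encode(7, 2), encode(19, 1)])    # (3, 23)
hidden = np.maximum(X @ W1 + b1, 0.0)                        # ReLU layer, (3, 16)
logits = hidden @ W2 + b2                                    # (3, 5): one score per tense

print(X.shape, logits.shape)   # (3, 23) (3, 5)
```

Note that `+ b1` already broadcasts a `(16,)` bias over a `(3, 16)` batch, so even this tiny model exercises the broadcast backward path.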

The two new bugs to fear

Phase 7 had two bugs to fear: (1) forgetting += in _backward, (2) wrong topological order. Phase 8 inherits both and adds two more:

Bug 3: forgetting to sum-along-broadcast-axes

If c = a + b was a broadcast (say a.shape = (3,), b.shape = (4, 3), c.shape = (4, 3)), backward gets an upstream gradient of shape (4, 3). The contribution to a.grad must be the upstream summed along axis 0:

a_grad_contribution = upstream.sum(axis=0)  # shape (3,) — matches a.shape ✓
b_grad_contribution = upstream             # shape (4, 3) — matches b.shape ✓

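Forget the .sum(axis=0) and the contribution has shape (4, 3) instead of (3,). If a.grad is accumulated in place with +=, NumPy raises a shape error right there. Worse, if it is rebound with a.grad = a.grad + contribution, a.grad silently broadcasts up to (4, 3) and the failure surfaces much later, or never.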

This is the single most common bug in tensor autograd. Lab 02 builds the machinery to handle it generically.
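One way such generic machinery could look is a helper that sums a gradient back down to the shape of the operand it belongs to. The name `unbroadcast` and its signature are hypothetical; this is a sketch of the idea, not the Lab 02 solution.

```python
import numpy as np

def unbroadcast(grad, shape):
    """Sum `grad` down to `shape`, undoing NumPy broadcasting (hypothetical helper)."""
    # 1. Broadcasting may have prepended leading axes: sum them away.
    extra = grad.ndim - len(shape)
    if extra > 0:
        grad = grad.sum(axis=tuple(range(extra)))
    # 2. Axes that were size 1 in the operand but stretched in the output:
    #    sum them, keeping the axis so the result still has size 1 there.
    stretched = tuple(i for i, n in enumerate(shape) if n == 1 and grad.shape[i] != 1)
    if stretched:
        grad = grad.sum(axis=stretched, keepdims=True)
    return grad

upstream = np.ones((4, 3))
print(unbroadcast(upstream, (3,)))         # [4. 4. 4.]
print(unbroadcast(upstream, (4, 3)).shape) # (4, 3)
print(unbroadcast(upstream, (4, 1)).shape) # (4, 1)
```

With a helper like this, every broadcasting op's `_backward` can end with the same pattern: compute the contribution at the output shape, then reduce it to the parent's shape before accumulating.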

Bug 4: wrong axis/keepdims in reduction backward

y = sum(x, axis=0) with x.shape = (B, N) produces y.shape = (N,). Backward gets an upstream gradient of shape (N,). The contribution to x.grad must be the upstream broadcast back to (B, N):

x_grad_contribution = np.broadcast_to(upstream, (B, N))

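For axis=0 the (N,) upstream lines up with the trailing axis of x, so broadcast_to works directly. For any other axis reduced without keepdims=True, the reduced axis must first be re-inserted as a size-1 dimension. Otherwise broadcast_to either fails or, when the sizes happen to coincide, silently aligns the gradient with the wrong axis.

The shape contract: gradient of x must equal x.shape. Always. If you ever produce a gradient with a different shape, something is wrong upstream.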

The new testing tier: `gradcheck`

Per-op tests against PyTorch are necessary but not sufficient. PyTorch could itself be wrong (it isn't, but humor the paranoia), and the cross-check only verifies on the specific shapes you wrote tests for.

gradcheck is the empirical alternative: given a function f: Tensor → Tensor and input x, compute the gradient two ways:

Autograd: y = f(x); y.sum().backward(); return x.grad.
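Finite differences: for each element xᵢ, perturb it to xᵢ ± ε while holding every other element fixed, and compute (L(xᵢ + ε) - L(xᵢ - ε)) / 2ε, where L = f(x).sum() is the same scalar the autograd path differentiates. Assemble the results into an array with x's shape. This is the numerical gradient.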

The two should agree at FP64 to ~1e-5 with ε = 1e-7. Disagreement means the autograd is wrong.

Gradcheck is slow (two function evaluations per input element, so O(n) evaluations of f for an n-element input) but definitive. It catches bugs PyTorch comparison would miss, e.g. a backward that happens to be correct for the test inputs but wrong in general.

Phase 8 makes gradcheck part of the standard test toolkit and runs it on every op for at least one shape.
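The finite-difference side of gradcheck fits in a dozen lines. The sketch below is an assumption-laden stand-in: it works on plain NumPy functions instead of the Tensor class, the names `numerical_grad` and `gradcheck` and their signatures are hypothetical, and the analytic gradient is passed in as a function where the real harness would call `backward()`.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-7):
    """Central-difference gradient of the scalar f(x).sum() for each element of x."""
    x = np.array(x, dtype=np.float64)        # private FP64 copy
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        f_plus = f(x).sum()
        x[idx] = orig - eps
        f_minus = f(x).sum()
        x[idx] = orig                        # restore before moving on
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

def gradcheck(f, analytic_grad, x, eps=1e-7, atol=1e-5):
    """True if the supplied analytic gradient matches finite differences."""
    return bool(np.allclose(analytic_grad(x), numerical_grad(f, x, eps), atol=atol))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5))

correct = lambda v: 1.0 - np.tanh(v) ** 2    # d/dv tanh(v)
buggy = lambda v: 1.0 - np.tanh(v)           # forgot the square

print(gradcheck(np.tanh, correct, x))        # True
print(gradcheck(np.tanh, buggy, x))          # False
```

The loop makes the cost visible: two calls to `f` per element of `x`, which is why gradcheck is run on small shapes.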

The other new testing tier: `hypothesis` property tests

hypothesis generates random inputs. We tell it: "give me random tensor shapes (rank 1–4, dims 1–8), random ops from this list, and for all such random combinations, gradcheck must pass". When something fails, Hypothesis automatically shrinks the failing input toward a minimal counterexample.

In practice, hypothesis finds shape edge cases that hand-written tests miss (the rank-0 and zero-size cases require widening the lower bounds above to rank 0 and dim 0):

Rank-0 tensors (scalars).
Size-1 dimensions.
All-zero tensors.
Tensors with shape (1, 0, 3) (zero-size dim).

Phase 8 sets up hypothesis once; future phases reuse the same harness.
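A property test of this kind might look like the sketch below. It assumes the `hypothesis` package with its NumPy extra is installed, checks a single op (tanh) on plain NumPy arrays instead of the Tensor class, and uses smaller bounds (rank 1–3, dims 1–4) than the ones quoted above purely to keep the example fast. The test name and the helper `numerical_grad` are hypothetical.

```python
import numpy as np
from hypothesis import given, settings, strategies as st
from hypothesis.extra import numpy as hnp

def numerical_grad(f, x, eps=1e-6):
    """Central-difference gradient of the scalar f(x) for each element of x."""
    x = np.array(x, dtype=np.float64)
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        f_plus = f(x)
        x[idx] = orig - eps
        f_minus = f(x)
        x[idx] = orig
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

# Random shapes and bounded finite values; bounds shrunk for speed (assumption).
shapes = hnp.array_shapes(min_dims=1, max_dims=3, min_side=1, max_side=4)
inputs = hnp.arrays(np.float64, shapes, elements=st.floats(-2.0, 2.0))

@settings(max_examples=25, deadline=None)
@given(x=inputs)
def test_tanh_grad_matches_finite_differences(x):
    analytic = 1.0 - np.tanh(x) ** 2
    numeric = numerical_grad(lambda v: np.tanh(v).sum(), x)
    assert analytic.shape == x.shape                 # the shape contract
    assert np.allclose(analytic, numeric, atol=1e-5)

test_tanh_grad_matches_finite_differences()          # run the property directly
```

In the real harness the op would also be drawn from a strategy, for example `st.sampled_from` over the op list, so that one test covers the whole library.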

Why no in-place ops

PyTorch supports x.relu_() (in-place). minitorch.tensor will not.

The reason: in-place ops break the assumption that values captured during forward stay unchanged until backward. If _backward for c = relu(a) reads a.data to compute its mask (a.data > 0), and someone zeroes entries of a.data in place between forward and backward, the closure sees the new (zero) data and computes the wrong gradient.

PyTorch handles this with version counters and warnings. We handle it by not supporting in-place at all. Cleaner pedagogically. Slightly less memory-efficient. The trade-off is right for our scale.

(Phase 25, PyTorch internals, will explain how PyTorch makes in-place safe. For now: don't.)
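The hazard is easy to reproduce with plain NumPy closures. In the sketch below, both helper functions are hypothetical: one recomputes the ReLU mask when backward runs, the other snapshots it during forward.

```python
import numpy as np

def relu_forward_lazy(a_data):
    out = np.maximum(a_data, 0.0)
    def backward(upstream):
        # Mask is recomputed from a_data at backward time: sees later mutations.
        return upstream * (a_data > 0)
    return out, backward

def relu_forward_snapshot(a_data):
    mask = a_data > 0                  # a new array, fixed at forward time
    out = np.maximum(a_data, 0.0)
    def backward(upstream):
        return upstream * mask
    return out, backward

a_data = np.array([-1.0, 2.0, -3.0, 4.0])
upstream = np.ones_like(a_data)
_, lazy_bw = relu_forward_lazy(a_data)
_, snap_bw = relu_forward_snapshot(a_data)

expected = lazy_bw(upstream)           # before mutation: correct
a_data[a_data > 0] = 0.0               # in-place write between forward and backward
stale = lazy_bw(upstream)              # mask is now all False
safe = snap_bw(upstream)               # unaffected by the write

print(expected, stale, safe)
# [0. 1. 0. 1.] [0. 0. 0. 0.] [0. 1. 0. 1.]
```

Snapshotting rescues this particular op, but ops whose backward needs the input values themselves would have to copy whole arrays, which is why simply forbidding in-place writes is the lighter rule at this scale.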

What "Borja writes the body" looks like in Phase 8

Phase 7 was ~150 LOC. Phase 8 is ~400 LOC. The bigger size doesn't change the contract:

Claude writes BLUEPRINT, theory, lab statements, test stubs.
Borja writes tensor.py.
Solutions appear at phase open after Borja's prior decisions are visible.

Phase 8 is one of the longest in the curriculum. Plan ~25–30 study hours. Resist the urge to skip ops; the 20-op coverage is what makes the resulting library trustworthy for Phases 9–22.

One-paragraph recap

Tensor autograd is the same algorithm as scalar autograd, with two new sources of complexity bolted on: broadcasting (which must be reversed in backward by summing along expanded axes) and non-trivial per-op derivatives (matmul, softmax, cross-entropy). The testing tier escalates correspondingly: per-op PyTorch cross-checks, gradcheck for empirical verification, hypothesis for random-shape fuzzing. By the end of Phase 8, Borja owns ~400 LOC of Tensor code that does what PyTorch does at smaller scale, with every gradient verified two ways.

Next: 01-tensor-as-node.md