
Phase 08: Tensor Autograd from Scratch

Requires: Phase 07, Scalar Autograd from Scratch (minigrad)

Teaches: tensor-autograd · broadcasting-backward · gradcheck · matmul-grad · softmax-grad

Jump to any chapter from the phase reference index.

Chapter map

Pre-written per LYNX_CORTEX_ADDENDUM.md §A12: theory and lab statements are written in advance; solutions are populated just-in-time at phase open.

We lift autograd to tensors. The mechanics (DAG, per-op backward, reverse traversal) are the same as in Phase 7; what is new and tricky is reverse broadcasting: if the forward pass expanded (3,) to (4, 3), the backward pass has to sum the gradient along the expanded axes to hand back a (3,). Numerical gradcheck is what keeps you from living with silent bugs.

Goal

Lift the scalar autograd of Phase 7 to NumPy tensors and implement ≥20 ops with rigorous testing. This is also the phase in which the src/minitorch/ module is born: scalar autograd stays in minigrad, tensor autograd moves to minitorch. The split mirrors torch vs torch.nn and prepares Phase 9's minimodel to import from minitorch.

The two new sources of difficulty are:

Broadcasting reverse rule — every op that broadcasts in forward must sum its gradient along the broadcast axes in backward.
Reduction ops — sum, mean, softmax, cross_entropy need explicit handling of axis and keepdims.

The pedagogical claim: most "framework bugs" people encounter are broadcasting bugs. Build the broadcasting machinery yourself and they stop being mysterious.
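A minimal sketch of the broadcasting reverse rule, written in plain NumPy rather than against the phase's own tensor.py. The helper name `unbroadcast` is an assumption for illustration and does not come from the source; it sums an upstream gradient back down to the shape an operand had before broadcasting.

```python
import numpy as np

def unbroadcast(grad, shape):
    """Sum `grad` down to `shape`, undoing NumPy broadcasting (hypothetical helper)."""
    # 1) Broadcasting may have prepended axes: sum those away entirely.
    extra = grad.ndim - len(shape)
    if extra > 0:
        grad = grad.sum(axis=tuple(range(extra)))
    # 2) Axes that were size 1 in the operand were stretched: sum them, keeping the axis.
    stretched = tuple(i for i, n in enumerate(shape) if n == 1)
    if stretched:
        grad = grad.sum(axis=stretched, keepdims=True)
    return grad

# Forward: a (3,) bias broadcast against a (4, 3) activation.
b = np.ones(3)
upstream = np.ones((4, 3))               # gradient arriving from the output
grad_b = unbroadcast(upstream, b.shape)  # summed over the 4 broadcast rows
print(grad_b.shape, grad_b)
```

Each entry of `b` was used four times in the forward pass, so its gradient collects four contributions. The same helper would be called by every binary op's backward, once per operand.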

By phase close, Borja owns:

src/minitorch/tensor.py, ~400 LOC, his own implementation.
Per-op cross-check against PyTorch FP64 at 1e-7.
A hypothesis property-test suite that fuzzes random shape/op combinations.
A targeted broadcasting-backward test suite covering all common shape pairs.
An end-to-end MLP trained on a §A13 toy classification task — input one-hot(verb) ⊕ one-hot(person) → logits over 5 tenses — using only minitorch.tensor.

Topic anchor (§A13). Worked tensor shapes throughout theory and lab use the grammar grid: a (3, 5) tensor encodes (person × tense) for one verb; a (20, 3, 5) tensor stretches it to the whole vocabulary; cross-entropy (CE) loss is computed against integer tense indices 0..4.
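The shapes above can be made concrete with a few lines of NumPy. This is only a sketch under stated assumptions: the sizes 20, 3, and 5 are read off the (20, 3, 5) grid, ⊕ is read as concatenation, and the targets are random placeholders because the labeling rule of the toy task is not given here.

```python
import numpy as np

N_VERBS, N_PERSONS, N_TENSES = 20, 3, 5   # sizes read off the (20, 3, 5) grid

one_verb = np.zeros((N_PERSONS, N_TENSES))        # (3, 5): person x tense for one verb
vocab = np.zeros((N_VERBS, N_PERSONS, N_TENSES))  # (20, 3, 5): the whole vocabulary

# Enumerate every (verb, person) pair as one training example.
verb_idx, person_idx = np.meshgrid(
    np.arange(N_VERBS), np.arange(N_PERSONS), indexing="ij"
)
verb_idx, person_idx = verb_idx.ravel(), person_idx.ravel()

# one-hot(verb) concatenated with one-hot(person); concatenation is an assumption.
X = np.concatenate(
    [np.eye(N_VERBS)[verb_idx], np.eye(N_PERSONS)[person_idx]], axis=1
)

# Placeholder integer targets in 0..4 (the real labels come from the task definition).
rng = np.random.default_rng(0)
targets = rng.integers(0, N_TENSES, size=X.shape[0])

print(one_verb.shape, vocab.shape, X.shape, targets.shape)
```

Under these assumptions each input row has exactly two active entries, and the targets stay as integer class indices rather than one-hot vectors, which is the form cross-entropy consumes.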

Read order

theory/00-motivation.md — why tensor autograd is the right next step.
theory/01-tensor-as-node.md — Tensor = (data, grad, _prev, _op, _backward, requires_grad). The shape of the class.
theory/02-tensor-op-derivatives.md — backward of every op we'll implement, with broadcasting handled explicitly.
theory/03-matmul-and-softmax-grads.md — the two derivations every ML engineer must be able to reproduce: matmul backward and softmax-CE combined gradient.
theory/04-gradcheck-and-property-tests.md — finite-difference gradchecking, choosing ε, the U-shaped error curve, and hypothesis strategies for shape fuzzing.
lab/00-tensor-skeleton.md — class skeleton, no ops.
lab/01-elementwise-ops.md — add sub mul div neg exp log relu gelu tanh, with broadcasting reverse handled.
lab/02-reduction-and-shape-ops.md — sum mean reshape transpose broadcast_to getitem cat stack.
lab/03-matmul-softmax-ce.md — the three high-stakes ops: matmul (with batch dims), softmax (standalone), cross_entropy(logits, targets) (combined for stability).
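As a preview of the matmul and combined softmax-CE gradients named in the list above, here is a NumPy-only sketch. The function names and the mean reduction over the batch are assumptions made for illustration; the 2-D case is shown, and batch dimensions would additionally need the broadcasting reverse rule.

```python
import numpy as np

def matmul_backward(A, B, dC):
    """Gradients of C = A @ B for 2-D A (n, k) and B (k, m), given dL/dC."""
    return dC @ B.T, A.T @ dC

def softmax_ce(logits, targets):
    """Mean cross-entropy and its gradient with respect to the logits."""
    # Shift by the row max so exp() cannot overflow (log-sum-exp trick).
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    n = logits.shape[0]
    loss = -log_probs[np.arange(n), targets].mean()
    # Combined gradient: (softmax - one_hot(targets)) / n.
    grad = np.exp(log_probs)
    grad[np.arange(n), targets] -= 1.0
    return loss, grad / n

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))            # batch of 6 inputs
W = rng.normal(size=(8, 5))            # weights mapping to 5 tense logits
targets = rng.integers(0, 5, size=6)   # integer class indices 0..4

loss, dlogits = softmax_ce(X @ W, targets)
dX, dW = matmul_backward(X, W, dlogits)
print(loss, dX.shape, dW.shape)
```

Computing the softmax and the loss together is what lets the gradient collapse to the softmax output minus the one-hot target, with no division by a possibly tiny probability.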

solutions/ populated at phase open.
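For orientation before theory/04, a finite-difference gradcheck can be sketched in a few lines. The name `numerical_grad`, the test function, and the ε values are illustrative assumptions, not the phase's actual test code; the loop at the end prints the error for several ε so the U-shaped curve can be observed rather than taken on faith.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference estimate of df/dx for a scalar-valued f (hypothetical helper)."""
    grad = np.zeros_like(x, dtype=np.float64)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        f_plus = f(x)
        x[idx] = orig - eps
        f_minus = f(x)
        x[idx] = orig                      # restore the entry before moving on
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
w = rng.normal(size=(3, 4))

def f(x):
    return np.sum(w * np.tanh(x))          # scalar loss built from an elementwise op

analytic = w * (1.0 - np.tanh(x) ** 2)     # hand-derived gradient of f

# Large eps suffers truncation error, tiny eps suffers round-off error.
errors = {}
for eps in (1e-2, 1e-4, 1e-6, 1e-9, 1e-12):
    errors[eps] = np.max(np.abs(numerical_grad(f, x, eps) - analytic))
    print(f"eps={eps:.0e}  max abs error={errors[eps]:.3e}")
```

The check is O(size of x) forward evaluations, so it is meant for small test tensors in float64, not for training-sized inputs.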

Definition of Done

See PHASE_08_PLAN.md §6. Briefly:

src/minitorch/tensor.py ≥ 20 ops, cross-checked vs PyTorch at 1e-7.
Hypothesis property tests green for 100 random shape/op draws each.
Broadcasting-backward suite green on all shape pairs in the catalog.
Toy MLP > 90% val accuracy.
Borja can add silu in one session: forward + backward + test + theory blurb.
/quiz 08 ≥ 70%.
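To give a sense of what the silu item involves, here is a sketch of its forward and backward math as standalone NumPy functions, followed by a finite-difference test. The function names are hypothetical, and how the op is wired into the Tensor class is not shown in this document, so that part is left out.

```python
import numpy as np

def silu_forward(x):
    """silu(x) = x * sigmoid(x)."""
    s = 1.0 / (1.0 + np.exp(-x))
    return x * s

def silu_backward(x, upstream):
    """d silu / dx = s * (1 + x * (1 - s)), scaled by the upstream gradient."""
    s = 1.0 / (1.0 + np.exp(-x))
    return upstream * s * (1.0 + x * (1.0 - s))

# Test: compare the analytic backward with a central difference.
x = np.linspace(-3.0, 3.0, 7)
eps = 1e-6
numeric = (silu_forward(x + eps) - silu_forward(x - eps)) / (2 * eps)
analytic = silu_backward(x, np.ones_like(x))
max_err = np.max(np.abs(numeric - analytic))
print(max_err)
```

Because silu is elementwise and unary, no broadcasting reverse step is needed; the upstream gradient already has the input's shape.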

What this phase intentionally does NOT cover

nn.Module / Parameter abstractions. Phase 9.
Optimizers as classes. Phase 9; here we still hand-roll p.data -= lr * p.grad.
GPU acceleration. Phase 23+.
Mixed precision. Phase 26.
In-place ops. Functional-only. Documented in BLUEPRINT.
einsum. Tempting but the backward is hairy; deferred to Phase 17 if needed.
Sparse tensors. Not in scope.
torch.compile-style graph capture / fusion. Not in scope (Phase 25 covers PyTorch's version).
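The hand-rolled update mentioned in the list above, p.data -= lr * p.grad, looks like this in a toy loop. `Param` is a hypothetical stand-in exposing only `.data` and `.grad`; in the real module the gradient would presumably come from a backward pass, whereas here it is written out by hand for a least-squares problem.

```python
import numpy as np

class Param:
    """Hypothetical stand-in for a tensor: only .data and .grad."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = np.zeros_like(self.data)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w                           # noiseless targets for a linear model

w = Param(np.zeros(3))
lr = 0.1
losses = []
for step in range(200):
    err = X @ w.data - y
    losses.append(np.mean(err ** 2))
    # Gradient of the mean squared error, derived by hand (assumption: no autograd here).
    w.grad = 2.0 * X.T @ err / len(y)
    # The optimizer "class" is just this one line.
    w.data -= lr * w.grad

print(losses[0], losses[-1], w.data)
```

Overwriting `w.grad` each step sidesteps gradient zeroing; with accumulating backward passes, the gradient would need to be reset before every step.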

Phase 8's scope: tensor-grain DAG + broadcasting-correct backwards + ≥20 ops + property tests.