01 — Tensor as autograd node¶
The structure of Tensor: five fields, all isomorphic to those of Phase 7's Value, but with data and grad now as ndarrays. The only genuinely new piece is the requires_grad flag, which controls whether or not the node participates in the graph.
The class shape¶
from __future__ import annotations
from typing import Callable
import numpy as np

class Tensor:
    data: np.ndarray                # the actual values, dtype float32 or float64
    grad: np.ndarray | None         # same shape as data; None until backward populates it
    _prev: tuple[Tensor, ...]       # parents
    _op: str                        # op tag for debugging / visualization
    _backward: Callable[[], None]   # closure that contributes to parents' grads
    requires_grad: bool             # if False, this node is a constant (no grad tracking)

    def __init__(self, data, requires_grad=False, _prev=(), _op="") -> None: ...
    def backward(self) -> None: ...
    # ... ops as methods and via dunders ...
Identical in shape to Phase 7's Value. The differences:
- data: np.ndarray instead of float. Carries shape, dtype, strides; all of Phase 6's machinery applies.
- grad: np.ndarray | None instead of float. The None initial value lets us cheaply detect "never had a gradient yet" vs "had a gradient that was zeroed".
- requires_grad: bool, the new flag.
Why grad shape always equals data shape¶
Optimizer update is elementwise: p.data -= lr * p.grad. For this to be well-defined, p.data and p.grad must have the same shape.
If forward broadcasts a.shape = (3,) to (4, 3) and produces output c.shape = (4, 3), the upstream gradient at c has shape (4, 3) — but the contribution to a.grad must have shape (3,), matching a.data. The sum-along-broadcast-axes operation reconciles this. (Details in theory/02.)
Rule: tensor.grad.shape == tensor.data.shape. Always. Make it a unit test.
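The broadcast case above can be checked directly in NumPy. This is a minimal sketch of the (3,) to (4, 3) example only; the `upstream` array is an arbitrary stand-in for c.grad, and the general helper is left to theory/02.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])      # shape (3,)
b = np.zeros((4, 3))               # shape (4, 3)
c = a + b                          # a is broadcast along axis 0 to (4, 3)

# Arbitrary stand-in for the upstream gradient c.grad, shape (4, 3).
upstream = np.arange(12.0).reshape(4, 3)

# Each a[j] fed four outputs c[0, j] ... c[3, j], so its gradient is the
# sum of the upstream gradient over the broadcast axis.
a_grad = upstream.sum(axis=0)

print(a_grad.shape == a.shape)   # True
print(a_grad)                    # [18. 22. 26.]
```

Without the sum, the contribution would have shape (4, 3) and the elementwise update p.data -= lr * p.grad would silently broadcast or fail.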
requires_grad: who owns the graph¶
Some tensors should not have gradients:
- Input data. Features fed to the model. You don't update them.
- Constants. Things like masks, normalization constants, target labels.
- Detached tensors. A tensor you've explicitly "cut" from the graph (e.g., for evaluating the model without gradient tracking).
Some tensors must have gradients:
- Parameters. Weights and biases. The optimizer updates them via .grad.
The requires_grad flag controls graph participation:
x = Tensor(data, requires_grad=False) # input, won't get a grad
w = Tensor(data, requires_grad=True) # parameter, will get a grad
y = w @ x # y.requires_grad = True (inherited from w)
Propagation rule¶
A result tensor's requires_grad is True if any of its parents has requires_grad=True. Otherwise False.
If out.requires_grad is False, we don't even create the _backward closure; there's no point. Forward goes through, and the result is a plain Tensor with no graph attached. This is a small optimization in minitorch. PyTorch does the same when no input requires grad, and its torch.no_grad() context manager forces that behavior for a whole block of code.
Why not just always track?¶
Three reasons:
- Memory. Tracking grads means keeping intermediates alive until backward. For an inference forward pass on a 100M-param model, skipping this can save gigabytes.
- Speed. Building closures has overhead. Skip it when unnecessary.
- Correctness. Tracking gradients during evaluation can mask bugs (e.g., if you accidentally backprop through your eval pipeline). Explicit requires_grad=False makes intent clear.
In Phase 18 (training loop), Borja will use this systematically, wrapping the eval pass in a with no_grad(): block (or its minitorch equivalent).
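One way a minitorch-style no_grad() could sit alongside the propagation rule is sketched below. The module-level switch `_grad_enabled` and the helper `result_requires_grad` are hypothetical names chosen for illustration; the source does not say how Phase 18 implements this.

```python
from contextlib import contextmanager

_grad_enabled = True  # hypothetical module-level switch, not from the source


@contextmanager
def no_grad():
    """Temporarily disable graph construction (illustrative sketch)."""
    global _grad_enabled
    previous = _grad_enabled
    _grad_enabled = False
    try:
        yield
    finally:
        _grad_enabled = previous  # restore even if the body raises


def result_requires_grad(*parent_flags: bool) -> bool:
    """Propagation rule: True if any parent requires grad and tracking is on."""
    return _grad_enabled and any(parent_flags)


print(result_requires_grad(False, True))      # True
with no_grad():
    print(result_requires_grad(False, True))  # False
print(result_requires_grad(False, False))     # False
```

An op would call something like `result_requires_grad(self.requires_grad, other.requires_grad)` in place of the bare `or`, so that a single context manager can switch off tracking for a whole eval pass.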
The data dtype contract¶
Tensor.data is a NumPy array. The dtype is fp32 by default for our purposes, which matches the default floating-point dtype in PyTorch. (Production systems often drop to FP16 for speed; we keep FP32 for clarity and dtype-uniform tests.)
For testing, we use fp64. Reason: gradcheck needs FP64 precision (~16 digits) to be meaningful. At FP32, the truncation+roundoff error in finite differences is ~1e-3 and gradcheck would have to accept absurd tolerances.
Rule:
- Production-shape code in tensor.py: fp32.
- Tests in tests/test_tensor_autograd.py: fp64.
- The Tensor constructor accepts any dtype; it does not force a cast.
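The precision gap behind this rule can be seen with a plain central difference. The sketch below is not the gradcheck used in the tests; the function (sin), the step size, and the helper name `max_fd_error` are arbitrary choices for illustration, and the exact error values will vary with them.

```python
import numpy as np


def max_fd_error(dtype, h=1e-3) -> float:
    """Max abs error of a central difference for d/dx sin(x), computed in `dtype`."""
    x = np.linspace(0.1, 3.0, 200).astype(dtype)
    step = dtype(h)
    # All arithmetic below stays in `dtype`.
    numeric = (np.sin(x + step) - np.sin(x - step)) / (dtype(2) * step)
    # Reference derivative evaluated in float64 at the same points.
    exact = np.cos(x.astype(np.float64))
    return float(np.max(np.abs(numeric.astype(np.float64) - exact)))


err32 = max_fd_error(np.float32)
err64 = max_fd_error(np.float64)
print(f"fp32 max error: {err32:.2e}")
print(f"fp64 max error: {err64:.2e}")
```

The fp32 error is dominated by roundoff divided by the small step, while the fp64 error is dominated by the truncation term, which is why tolerances that are reasonable in fp64 are unreachable in fp32.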
Forward construction template¶
Every op follows the same shape:
def some_op(self, other) -> Tensor:
    # 1. Compute forward data.
    out_data = some_numpy_op(self.data, other.data)
    # 2. Decide if grad tracking is needed.
    out_requires_grad = self.requires_grad or other.requires_grad
    # 3. Construct output tensor.
    out = Tensor(
        out_data,
        requires_grad=out_requires_grad,
        _prev=(self, other) if out_requires_grad else (),
        _op="some_op",
    )
    # 4. If tracking, define the backward closure.
    if out_requires_grad:
        def _backward():
            if self.requires_grad:
                self_grad_contribution = ...  # uses upstream out.grad and local Jacobian
                # Sum along broadcast axes if needed:
                self_grad_contribution = unbroadcast(self_grad_contribution, self.data.shape)
                self.grad = (self.grad if self.grad is not None else 0) + self_grad_contribution
            if other.requires_grad:
                other_grad_contribution = ...
                other_grad_contribution = unbroadcast(other_grad_contribution, other.data.shape)
                other.grad = (other.grad if other.grad is not None else 0) + other_grad_contribution
        out._backward = _backward
    return out
Note four patterns:
- _prev is empty if not tracking. This further saves memory.
- Per-parent if requires_grad. Don't compute a gradient contribution for a parent that doesn't want one.
- unbroadcast(grad, target_shape) is a helper that sums grad along axes where it was broadcast from target_shape. We'll implement it in 02-tensor-op-derivatives.md.
- self.grad if self.grad is not None else 0. First contribution lazily allocates; subsequent contributions accumulate via +. The 0 is a Python int that NumPy will broadcast trivially. (Alternative: always allocate np.zeros_like(self.data) at construction. Memory cost = a duplicate of every parameter. Phase 8 picks lazy.)
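As a concrete instance, the template can be filled in for elementwise multiplication. This is a sketch under assumptions: the `Tensor` here is a stripped-down stand-in with only the fields the op needs, and this `unbroadcast` is one plausible version of the helper, not necessarily the one in 02-tensor-op-derivatives.md.

```python
from __future__ import annotations
import numpy as np


def unbroadcast(grad, target_shape):
    # Assumed helper: sum over axes that broadcasting added or stretched.
    while grad.ndim > len(target_shape):
        grad = grad.sum(axis=0)
    for axis, n in enumerate(target_shape):
        if n == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad


class Tensor:
    def __init__(self, data, requires_grad=False, _prev=(), _op=""):
        self.data = np.asarray(data)
        self.grad = None
        self.requires_grad = requires_grad
        self._prev = _prev
        self._op = _op
        self._backward = lambda: None  # leaves and constants: nothing to do

    def __mul__(self, other: Tensor) -> Tensor:
        out_requires_grad = self.requires_grad or other.requires_grad
        out = Tensor(
            self.data * other.data,
            requires_grad=out_requires_grad,
            _prev=(self, other) if out_requires_grad else (),
            _op="mul",
        )
        if out_requires_grad:
            def _backward():
                # d(a*b)/da = b and d(a*b)/db = a, each scaled by the upstream grad.
                if self.requires_grad:
                    g = unbroadcast(out.grad * other.data, self.data.shape)
                    self.grad = (self.grad if self.grad is not None else 0) + g
                if other.requires_grad:
                    g = unbroadcast(out.grad * self.data, other.data.shape)
                    other.grad = (other.grad if other.grad is not None else 0) + g
            out._backward = _backward
        return out


w = Tensor(np.array([1.0, 2.0, 3.0]), requires_grad=True)  # parameter, shape (3,)
x = Tensor(np.full((4, 3), 2.0))                           # input, shape (4, 3)
y = w * x
y.grad = np.ones_like(y.data)  # seed by hand; backward() would normally do this
y._backward()
print(w.grad)  # [8. 8. 8.]
print(x.grad)  # None
```

The parameter receives a gradient with its own shape (3,) even though the output is (4, 3), and the input x is skipped entirely because it does not require grad.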
The backward() method¶
Identical algorithm to Phase 7:
def backward(self) -> None:
    if not self.requires_grad:
        raise RuntimeError("backward called on tensor that doesn't require grad")
    if self.data.ndim != 0:
        raise RuntimeError("backward is only supported on scalar (0-D) tensors")
    # Build topological order.
    topo = []
    visited = set()
    def build(v):
        if v in visited:
            return
        visited.add(v)
        for p in v._prev:
            build(p)
        topo.append(v)
    build(self)
    # Seed.
    self.grad = np.ones_like(self.data)
    # Reverse walk.
    for v in reversed(topo):
        v._backward()
Differences from Phase 7:
- Seed is np.ones_like(self.data) instead of 1.0. If self.data is a scalar tensor (loss), np.ones_like returns a 0-D array containing 1. If self.data is a non-scalar tensor, np.ones_like returns an all-ones tensor, which is not what you want unless you have a specific reason. Convention: backward() is only called on scalar tensors (losses). PyTorch enforces this: tensor.backward() requires a scalar tensor or an explicit gradient= argument. We do the same: assert self.data.shape == () (or .ndim == 0).
- Defensive: requires_grad must be True on the call target.
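The recursive build works for small graphs, but CPython's default recursion limit (typically 1000) can be reached on a long chain of ops. A possible variation, not taken from Phase 7, is an iterative post-order walk. In the sketch below, `Node` is a hypothetical stand-in for Tensor carrying only `_prev`, and `topo_order` is an illustrative name.

```python
class Node:
    """Stand-in for Tensor: only the fields the traversal needs."""

    def __init__(self, name, _prev=()):
        self.name = name
        self._prev = _prev


def topo_order(root):
    """Post-order DFS without recursion; parents appear before children."""
    topo, visited = [], set()
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            topo.append(node)       # all of this node's parents are already emitted
            continue
        if node in visited:
            continue
        visited.add(node)
        stack.append((node, True))  # come back to this node after its parents
        for parent in node._prev:
            if parent not in visited:
                stack.append((parent, False))
    return topo


# Diamond: a feeds both b and c, which both feed d.
a = Node("a")
b = Node("b", (a,))
c = Node("c", (a,))
d = Node("d", (b, c))
order = [n.name for n in topo_order(d)]
print(order)  # a comes first, d comes last, and a appears exactly once

# A chain far deeper than the default recursion limit still works.
tip = Node("n0")
for i in range(1, 5000):
    tip = Node(f"n{i}", (tip,))
print(len(topo_order(tip)))  # 5000
```

Walking `reversed(topo_order(loss))` then gives the same child-before-parent order that the reverse walk in backward() relies on.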
Memory implications¶
Phase 7 had ~100-node graphs (XOR MLP). Phase 8 will have graphs with thousands of nodes (a transformer block is ~50 ops; multiply by sequence length and batch size).
Each Tensor in memory holds: the data array (large), the grad array (same size), _prev (a few pointers), _op (a small string), _backward (a closure with captured arrays).
For an MLP forward with hidden size 256 and batch size 64:
- Each hidden activation: 64 × 256 × 4 bytes = 64 KiB.
- 10-layer MLP: 10 activations × 64 KiB = 640 KiB.
- With autograd: 640 KiB for data + 640 KiB for grad = 1,280 KiB ≈ 1.25 MiB.
Trivial. Phase 8 has no memory concerns. Phase 18 will start to.
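The arithmetic can be reproduced from NumPy's own byte counts. This sketch counts only activation data and gradient arrays; closures, `_prev` tuples, and parameters are ignored.

```python
import numpy as np

batch, hidden, layers = 64, 256, 10
activation = np.zeros((batch, hidden), dtype=np.float32)

per_activation_kib = activation.nbytes / 1024        # one hidden activation
data_kib = layers * per_activation_kib               # all ten activations
with_grad_mib = 2 * data_kib / 1024                  # data plus same-shape grad

print(per_activation_kib, data_kib, with_grad_mib)   # 64.0 640.0 1.25
```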
The __hash__ / __eq__ question revisited¶
Phase 7's Value used the default object hash (identity-based) and didn't override __eq__. Same here for Tensor. If we overloaded ==, the natural meaning would be elementwise, as in NumPy: Tensor(a) == Tensor(a) would return a Tensor of booleans, not a bool. That conflicts with using tensors as set members, so we leave == alone and keep identity-based hash and equality.
Convention:
- __eq__ not overridden → default identity equality.
- For elementwise comparison, use Tensor.equal(other) or np.array_equal(a.data, b.data).
PyTorch handles this differently: it overloads == for elementwise comparison. We don't, because in Python a class that defines __eq__ without also defining __hash__ becomes unhashable, and we could no longer put a Tensor in a set (the topo sort's visited would break).
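The trade-off can be seen with two toy classes. Both are hypothetical stand-ins, not minitorch's Tensor: `Plain` follows the convention above, while `Overloaded` shows what happens when == is made elementwise without restoring a hash.

```python
import numpy as np


class Plain:
    """No __eq__ override: identity equality and identity hash."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def equal(self, other) -> bool:
        # Explicit whole-tensor comparison, kept separate from ==.
        return bool(np.array_equal(self.data, other.data))


class Overloaded:
    """Overrides __eq__ elementwise; Python then sets __hash__ to None."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __eq__(self, other):
        return self.data == other.data


a, b = Plain([1, 2]), Plain([1, 2])
print(a == b)       # False: different objects
print(a.equal(b))   # True: same values
print(len({a, b}))  # 2: both usable as set members

try:
    {Overloaded([1, 2])}
except TypeError as exc:
    print("unhashable:", exc)
```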
Pitfalls (will bite in lab)¶
- out.grad.shape mismatches out.data.shape. Add an assertion at the top of every _backward: assert self.grad is None or self.grad.shape == self.data.shape. (Costs runtime; gate behind a debug flag if needed.)
- requires_grad=True not propagated through an op. Test: (Tensor(x, requires_grad=False) + Tensor(y, requires_grad=True)).requires_grad == True.
- backward() on a non-scalar. Should raise. Test it.
- backward() on a constant (requires_grad=False) tensor. Should raise. Test it.
- Forgetting np.ones_like in seed. Using 1.0 makes self.grad = 1.0 (a Python float), and the first _backward will fail because it expects an ndarray.
One-paragraph recap¶
Tensor is isomorphic to Phase 7's Value but with data: ndarray and grad: ndarray | None. The shape contract — grad.shape == data.shape — is invariant. requires_grad controls graph participation; it propagates as the any() of parents'. The forward op template has four steps (compute, decide, construct, define _backward), and the backward() method is the same topo-sort + reverse traversal as before, with np.ones_like seeding and a scalar-only assertion. Everything structural in Phase 8 is Phase 7 with shapes; everything mathematical (covered in 02-04) is genuinely new.