01 — Tensor as autograd node

The structure of Tensor: five fields, all isomorphic to those of Phase 7's Value, but with data and grad now as ndarrays. The only genuinely new piece is the requires_grad flag, which controls whether the node participates in the graph.


The class shape

from __future__ import annotations  # lets annotations mention Tensor inside its own body

from collections.abc import Callable

import numpy as np

class Tensor:
    data: np.ndarray              # the actual values, dtype float32 or float64
    grad: np.ndarray | None       # same shape as data; None until backward populates it
    _prev: tuple[Tensor, ...]     # parents
    _op: str                      # op tag for debugging / visualization
    _backward: Callable[[], None] # closure that contributes to parents' grads
    requires_grad: bool           # if False, this node is a constant (no grad tracking)

    def __init__(self, data, requires_grad=False, _prev=(), _op="") -> None: ...
    def backward(self) -> None: ...
    # ... ops as methods and via dunders ...

Identical in shape to Phase 7's Value. The differences:

  1. data: np.ndarray instead of float. Carries shape, dtype, strides — all of Phase 6's machinery applies.
  2. grad: np.ndarray | None instead of float. The None initial value lets us cheaply detect "never had a gradient yet" vs "had a gradient that was zeroed".
  3. requires_grad: bool — the new flag.

Why grad shape always equals data shape

Optimizer update is elementwise: p.data -= lr * p.grad. For this to be well-defined, p.data and p.grad must have the same shape.

If forward broadcasts a.shape = (3,) to (4, 3) and produces output c.shape = (4, 3), the upstream gradient at c has shape (4, 3) — but the contribution to a.grad must have shape (3,), matching a.data. The sum-along-broadcast-axes operation reconciles this. (Details in theory/02.)

Rule: tensor.grad.shape == tensor.data.shape. Always. Make it a unit test.
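
The shape reconciliation can be checked with plain NumPy. The sketch below is an illustration only: the variable names are made up here, and the upstream gradient is assumed to be all ones. It broadcasts a (3,) array against a (4, 3) array, then sums the upstream gradient over the broadcast axis so the contribution matches a's shape.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])   # shape (3,)
b = np.ones((4, 3))             # shape (4, 3)
c = a + b                       # a is broadcast to (4, 3)

upstream = np.ones_like(c)      # pretend dL/dc is all ones, shape (4, 3)

# dc/da is 1 elementwise, but each a[j] was reused in 4 rows,
# so its gradient is the upstream gradient summed over axis 0.
a_grad = upstream.sum(axis=0)   # shape (3,)

assert a_grad.shape == a.shape
print(a_grad)                   # [4. 4. 4.]
```

The assertion on the last lines is the same check the rule asks for, applied to one parent.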

requires_grad: who owns the graph

Some tensors should not have gradients:

  • Input data. Features fed to the model. You don't update them.
  • Constants. Things like masks, normalization constants, target labels.
  • Detached tensors. A tensor you've explicitly "cut" from the graph (e.g., for evaluating the model without gradient tracking).

Some tensors must have gradients:

  • Parameters. Weights and biases. The optimizer updates them via .grad.

The requires_grad flag controls graph participation:

x = Tensor(data, requires_grad=False)  # input, won't get a grad
w = Tensor(data, requires_grad=True)   # parameter, will get a grad
y = w @ x                              # y.requires_grad = True (inherited from w)

Propagation rule

A result tensor's requires_grad is True if any of its parents has requires_grad=True. Otherwise False.

out_requires_grad = any(p.requires_grad for p in parents)

If out.requires_grad is False, we don't even create the _backward closure; there's no point. Forward goes through, and the result is a plain Tensor with no graph attached. This is a small optimization in minitorch. PyTorch does the same thing: an op whose inputs all have requires_grad=False produces an output with no grad_fn, and the torch.no_grad() context manager forces that behavior for every op inside the block.
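
To see the propagation rule in isolation, here is a minimal sketch. MiniTensor is a hypothetical stand-in that tracks only the flag and the parent links (no gradients), so it is not the Tensor class itself.

```python
import numpy as np

class MiniTensor:
    """Hypothetical stand-in for Tensor; tracks only requires_grad and _prev."""

    def __init__(self, data, requires_grad=False, _prev=()):
        self.data = np.asarray(data, dtype=np.float64)
        self.requires_grad = requires_grad
        self._prev = _prev

    def __add__(self, other):
        parents = (self, other)
        out_requires_grad = any(p.requires_grad for p in parents)
        # Keep parent links only when the result participates in the graph.
        return MiniTensor(
            self.data + other.data,
            requires_grad=out_requires_grad,
            _prev=parents if out_requires_grad else (),
        )

x = MiniTensor([1.0, 2.0])                      # constant
w = MiniTensor([0.5, 0.5], requires_grad=True)  # parameter

tracked = x + w      # one parent requires grad
untracked = x + x    # no parent requires grad

print(tracked.requires_grad, len(tracked._prev))      # True 2
print(untracked.requires_grad, len(untracked._prev))  # False 0
```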

Why not just always track?

Three reasons:

  1. Memory. Tracking grads means keeping intermediates alive until backward. For an inference forward pass on a 100M-param model, skipping that can save gigabytes, depending on batch size and sequence length.
  2. Speed. Building closures has overhead. Skip it when unnecessary.
  3. Correctness. Tracking gradients during evaluation can mask bugs (e.g., if you accidentally backprop through your eval pipeline). Explicit requires_grad=False makes intent clear.

In Phase 18 (training loop), Borja will use this systematically: with no_grad(): context (or its minitorch equivalent) for the eval pass.
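
One possible shape for a minitorch no_grad() is sketched below. Everything here is an assumption for illustration: the GradMode holder, the no_grad name, and the result_requires_grad helper are hypothetical, and Phase 18 may do it differently. The idea is a global switch that the propagation rule consults, restored on exit even if the block raises.

```python
from contextlib import contextmanager

class GradMode:
    """Hypothetical global switch; not part of the class shown above."""
    enabled = True

@contextmanager
def no_grad():
    """Temporarily disable graph construction; restore the old state on exit."""
    previous = GradMode.enabled
    GradMode.enabled = False
    try:
        yield
    finally:
        GradMode.enabled = previous

def result_requires_grad(*parent_flags):
    """Propagation rule, gated by the global switch."""
    return GradMode.enabled and any(parent_flags)

print(result_requires_grad(True, False))      # True
with no_grad():
    print(result_requires_grad(True, False))  # False
print(result_requires_grad(True, False))      # True
```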

The data dtype contract

Tensor.data is a NumPy array. For our purposes the default dtype is fp32, which is also PyTorch's default floating-point dtype. Production systems often run in FP16 or other reduced-precision formats; we keep FP32 for clarity and dtype-uniform tests.

For testing, we use fp64. Reason: gradcheck needs FP64 precision (~16 digits) to be meaningful. At FP32, the truncation+roundoff error in finite differences is ~1e-3 and gradcheck would have to accept absurd tolerances.

Rule:

  • Production-shape code in tensor.py: fp32.
  • Tests in tests/test_tensor_autograd.py: fp64.
  • The Tensor constructor accepts any dtype; doesn't force.
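
The precision gap is easy to observe directly. The sketch below is not the gradcheck from the tests; central_difference is a hypothetical helper, and the step sizes (1e-3 for fp32, 1e-6 for fp64) are assumptions picked for this example. It estimates the derivative of sin at x = 1 in each dtype and compares against the exact value cos(1). The exact magnitudes depend on the function and the step size.

```python
import numpy as np

def central_difference(f, x, h, dtype):
    """Estimate f'(x) as (f(x+h) - f(x-h)) / (2h), computed entirely in `dtype`."""
    x = np.asarray(x, dtype=dtype)
    h = np.asarray(h, dtype=dtype)
    return (f(x + h) - f(x - h)) / (2 * h)

exact = np.cos(1.0)  # d/dx sin(x) at x = 1, in float64

err32 = abs(float(central_difference(np.sin, 1.0, 1e-3, np.float32)) - exact)
err64 = abs(float(central_difference(np.sin, 1.0, 1e-6, np.float64)) - exact)

print(f"fp32 finite-difference error: {err32:.1e}")
print(f"fp64 finite-difference error: {err64:.1e}")
```

The fp64 error should come out several orders of magnitude smaller, which is what lets a gradcheck use a tight tolerance.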

Forward construction template

Every op follows the same shape:

def some_op(self, other) -> Tensor:
    # 1. Compute forward data.
    out_data = some_numpy_op(self.data, other.data)

    # 2. Decide if grad tracking is needed.
    out_requires_grad = self.requires_grad or other.requires_grad

    # 3. Construct output tensor.
    out = Tensor(
        out_data,
        requires_grad=out_requires_grad,
        _prev=(self, other) if out_requires_grad else (),
        _op="some_op",
    )

    # 4. If tracking, define the backward closure.
    if out_requires_grad:
        def _backward():
            if self.requires_grad:
                self_grad_contribution = ...  # uses upstream out.grad and local Jacobian
                # Sum along broadcast axes if needed:
                self_grad_contribution = unbroadcast(self_grad_contribution, self.data.shape)
                self.grad = (self.grad if self.grad is not None else 0) + self_grad_contribution
            if other.requires_grad:
                other_grad_contribution = ...
                other_grad_contribution = unbroadcast(other_grad_contribution, other.data.shape)
                other.grad = (other.grad if other.grad is not None else 0) + other_grad_contribution
        out._backward = _backward

    return out

Note four patterns:

  1. _prev is empty if not tracking. This further saves memory.
  2. Per-parent if requires_grad. Don't compute a gradient contribution for a parent that doesn't want one.
  3. unbroadcast(grad, target_shape) is a helper that sums grad along axes where it was broadcast from target_shape. We'll implement it in 02-tensor-op-derivatives.md.
  4. self.grad if self.grad is not None else 0. First contribution lazily allocates; subsequent contributions accumulate via +. The 0 is a Python int that NumPy will broadcast trivially. (Alternative: always allocate np.zeros_like(self.data) at construction. Memory cost = a duplicate of every parameter. Phase 8 picks lazy.)
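
As a concrete instance of the template, here is a hedged sketch of elementwise addition. The unbroadcast below is a hypothetical version written for this example (the real one is left to 02-tensor-op-derivatives.md), and the Tensor class is trimmed to the fields the template needs, with data forced to float64 for simplicity.

```python
import numpy as np

def unbroadcast(grad, target_shape):
    """Sum `grad` over the axes that broadcasting added or stretched."""
    # Axes that broadcasting prepended on the left.
    while grad.ndim > len(target_shape):
        grad = grad.sum(axis=0)
    # Axes that had size 1 in the target and were stretched.
    for axis, size in enumerate(target_shape):
        if size == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

class Tensor:
    def __init__(self, data, requires_grad=False, _prev=(), _op=""):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = None
        self.requires_grad = requires_grad
        self._prev = _prev
        self._op = _op
        self._backward = lambda: None  # leaves have nothing to propagate

    def __add__(self, other):
        out_data = self.data + other.data                               # 1. compute
        out_requires_grad = self.requires_grad or other.requires_grad   # 2. decide
        out = Tensor(                                                   # 3. construct
            out_data,
            requires_grad=out_requires_grad,
            _prev=(self, other) if out_requires_grad else (),
            _op="add",
        )
        if out_requires_grad:                                           # 4. closure
            def _backward():
                # The local Jacobian of addition is the identity for both parents.
                if self.requires_grad:
                    g = unbroadcast(out.grad, self.data.shape)
                    self.grad = (self.grad if self.grad is not None else 0) + g
                if other.requires_grad:
                    g = unbroadcast(out.grad, other.data.shape)
                    other.grad = (other.grad if other.grad is not None else 0) + g
            out._backward = _backward
        return out

bias = Tensor([1.0, 2.0, 3.0], requires_grad=True)  # shape (3,)
batch = Tensor(np.zeros((4, 3)))                    # constant, shape (4, 3)
out = batch + bias

out.grad = np.ones_like(out.data)  # seed by hand for the demo
out._backward()

print(bias.grad)   # [4. 4. 4.]
print(batch.grad)  # None
```

The constant parent never gets a grad array allocated, and the parameter's grad comes back with the parameter's own shape.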

The backward() method

Identical algorithm to Phase 7:

def backward(self) -> None:
    if not self.requires_grad:
        raise RuntimeError("backward called on tensor that doesn't require grad")
    if self.data.ndim != 0:
        raise RuntimeError("backward called on a non-scalar tensor")

    # Build topological order.
    topo = []
    visited = set()
    def build(v):
        if v in visited: return
        visited.add(v)
        for p in v._prev:
            build(p)
        topo.append(v)
    build(self)

    # Seed.
    self.grad = np.ones_like(self.data)

    # Reverse walk.
    for v in reversed(topo):
        v._backward()

Differences from Phase 7:

  • Seed is np.ones_like(self.data) instead of 1.0. If self.data is 0-D (a scalar loss), np.ones_like returns a 0-D array containing 1. If self.data is non-scalar, np.ones_like returns an all-ones array of the same shape, which is not what you want unless you have a specific reason. Convention: backward() is only called on scalar tensors (losses).

PyTorch enforces this: tensor.backward() requires a scalar tensor or an explicit gradient= argument. We do the same: assert self.data.shape == () (or .ndim == 0).

  • Defensive: requires_grad must be True on the call target.
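
The seeding behavior can be checked with NumPy alone. This is a small illustration with made-up values, not part of backward() itself.

```python
import numpy as np

loss = np.asarray(2.5)            # 0-D array, like a scalar loss
seed = np.ones_like(loss)
print(seed.shape, seed.ndim)      # () 0

out = np.array([[1.0, 2.0], [3.0, 4.0]])
implicit_seed = np.ones_like(out)  # what the seed would be for a non-scalar

# d(out.sum())/d(out) is 1 for every element, so an all-ones seed
# silently computes the gradient of out.sum(), not of out itself.
print(implicit_seed.shape)        # (2, 2)
```

This is why the scalar-only check matters: without it, calling backward() on a non-scalar would quietly differentiate the sum of its elements.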

Memory implications

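Phase 7 had ~100-node graphs (XOR MLP). Phase 8 will have graphs with thousands of nodes (a transformer block is ~50 ops; multiply by the number of blocks). Because each node now covers a whole array, batch size and sequence length grow the arrays rather than the node count.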

Each Tensor in memory holds: the data array (large), the grad array (same size), _prev (a few pointers), _op (a small string), _backward (a closure with captured arrays).

For an MLP forward with hidden size 256 and batch size 64 (fp32):

  • Each hidden activation: 64 × 256 × 4 bytes = 64 KiB.
  • 10-layer MLP: 10 activations × 64 KiB = 640 KiB.
  • With autograd: 640 KiB for data + 640 KiB for grad = 1,280 KiB = 1.25 MiB.

Trivial. Phase 8 has no memory concerns. Phase 18 will start to.
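
The same arithmetic as a few lines of Python, in case you want to plug in other sizes. The variable names are just for this example, and it counts only activation data and grads (not _prev, _op, or closure overhead).

```python
import numpy as np

batch_size, hidden, layers = 64, 256, 10
bytes_per_elem = np.dtype(np.float32).itemsize   # 4

activation_bytes = batch_size * hidden * bytes_per_elem
data_bytes = layers * activation_bytes
total_bytes = 2 * data_bytes                     # data + grad of the same shape

print(activation_bytes / 1024)    # 64.0   (KiB per activation)
print(data_bytes / 1024)          # 640.0  (KiB for 10 activations)
print(total_bytes / 1024**2)      # 1.25   (MiB with grads)
```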

The __hash__ / __eq__ question revisited

Phase 7's Value used the default object hash (identity-based) and didn't override __eq__. Same here for Tensor. An elementwise == that returns a Tensor of booleans (like NumPy) is tempting, but we need Tensor to stay hashable for set membership, and in Python a class that overrides __eq__ without also defining __hash__ becomes unhashable.

Convention:

  • __eq__ is not overridden, so == is default identity equality and returns a bool.
  • To compare values, use Tensor.equal(other) or np.array_equal(a.data, b.data).

PyTorch handles this differently: it overloads == for elementwise comparison and keeps tensors hashable by identity. We don't overload ==, because doing it naively would stop us from using Tensor in a set (the topo sort's visited would break).
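
The hashing behavior is standard Python and can be demonstrated without any tensor code. The two classes below are hypothetical placeholders: one leaves __eq__ alone, the other overrides it without defining __hash__.

```python
class IdentityNode:
    """No __eq__ override: default identity equality and identity hash."""

class ElementwiseNode:
    """Overrides __eq__ only; Python then sets __hash__ to None."""

    def __eq__(self, other):
        return NotImplemented  # placeholder for an elementwise comparison

a, b = IdentityNode(), IdentityNode()
visited = {a}
print(a in visited, b in visited)   # True False

try:
    {ElementwiseNode()}
except TypeError as exc:
    print("cannot go in a set:", exc)
```

With identity semantics, two distinct nodes that happen to hold equal data are still treated as different graph nodes, which is what the topo sort needs.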

Pitfalls (will bite in lab)

  1. out.grad.shape mismatches out.data.shape. Add an assertion at the top of every _backward: assert self.grad is None or self.grad.shape == self.data.shape. (Costs runtime — gate behind a debug flag if needed.)
  2. requires_grad=True not propagated through an op. Test: (Tensor(x, requires_grad=False) + Tensor(y, requires_grad=True)).requires_grad == True.
  3. backward() on a non-scalar. Should raise. Test it.
  4. backward() on a constant (requires_grad=False) tensor. Should raise. Test it.
  5. Forgetting np.ones_like in seed. Using 1.0 makes self.grad = 1.0 (a Python float), which breaks the grad.shape == data.shape contract; any _backward that relies on ndarray attributes of the upstream grad will then fail.

One-paragraph recap

Tensor is isomorphic to Phase 7's Value but with data: ndarray and grad: ndarray | None. The shape contract — grad.shape == data.shape — is invariant. requires_grad controls graph participation; it propagates as the any() of parents'. The forward op template has four steps (compute, decide, construct, define _backward), and the backward() method is the same topo-sort + reverse traversal as before, with np.ones_like seeding and a scalar-only assertion. Everything structural in Phase 8 is Phase 7 with shapes; everything mathematical (covered in 02-04) is genuinely new.


Next: 02-tensor-op-derivatives.md