Skip to content

English · Español

02 — Tensor op derivatives (with broadcasting reverse)

🇪🇸 Para cada op vamos a deducir: (a) qué hace el forward, (b) qué local Jacobian aplicar a la gradiente upstream, © cómo deshacer el broadcast si lo hubo. El "deshacer el broadcast" es la idea estrella: si forward expandió (3,) a (4, 3), backward suma a lo largo del eje expandido para devolver un (3,).


The broadcasting reverse — the key idea

Broadcasting forward expands a smaller shape to a larger one by repeating along certain axes (with stride-0 tricks under the hood). The gradient of the output with respect to the smaller input is the gradient summed along those expanded axes.

Why summation: the smaller input's element contributed to multiple output elements (one per "row" of the broadcast). By the chain rule, the gradient contribution is the sum of contributions through each output element it influenced.

Concretely: c = a + b where a.shape = (3,), b.shape = (4, 3). NumPy broadcasts a to (4, 3) (replicating along axis 0). Forward:

c[i, j] = a[j] + b[i, j]      for i ∈ [0, 4), j ∈ [0, 3)

a[j] appears in c[0, j], c[1, j], c[2, j], c[3, j]. Its gradient contribution is:

∂L/∂a[j] = ∑_i (∂L/∂c[i, j]) · (∂c[i, j]/∂a[j])
         = ∑_i (∂L/∂c[i, j]) · 1
         = (∂L/∂c).sum(axis=0)[j]

That is: sum upstream along axis 0. Result has shape (3,), matches a.shape. ✓

The general algorithm: unbroadcast(grad, target_shape)

def unbroadcast(grad: np.ndarray, target_shape: tuple[int, ...]) -> np.ndarray:
    # 1. Sum away the extra leading dims.
    n_extra = grad.ndim - len(target_shape)
    for _ in range(n_extra):
        grad = grad.sum(axis=0)
    # 2. Sum away dims of size 1 in target_shape that were broadcast in.
    for i, dim in enumerate(target_shape):
        if dim == 1 and grad.shape[i] != 1:
            grad = grad.sum(axis=i, keepdims=True)
    return grad

That function lives in minitorch/tensor.py as a helper. Every elementwise op's _backward calls it before assigning to a parent's .grad.

Elementwise ops

Add: c = a + b

  • Forward: c.data = a.data + b.data (NumPy broadcasts as needed).
  • Local Jacobians: ∂c/∂a = 1, ∂c/∂b = 1.
  • Backward:
    a_contrib = unbroadcast(c.grad, a.data.shape)
    b_contrib = unbroadcast(c.grad, b.data.shape)
    

Subtract: c = a - b

  • Forward: c.data = a.data - b.data.
  • Local Jacobians: ∂c/∂a = 1, ∂c/∂b = -1.
  • Backward:
    a_contrib = unbroadcast(c.grad, a.data.shape)
    b_contrib = unbroadcast(-c.grad, b.data.shape)
    

Multiply: c = a * b

  • Forward: c.data = a.data * b.data.
  • Local Jacobians: ∂c/∂a = b, ∂c/∂b = a.
  • Backward:
    a_contrib = unbroadcast(b.data * c.grad, a.data.shape)
    b_contrib = unbroadcast(a.data * c.grad, b.data.shape)
    

Note: b.data * c.grad is a NumPy broadcast itself — if b.data and c.grad have different shapes (which can happen when broadcasting was implicit in forward), the result has the broadcast shape. unbroadcast then sums it down to a.shape. The composition is correct.

Divide: c = a / b

  • Forward: c.data = a.data / b.data.
  • Local Jacobians: ∂c/∂a = 1/b, ∂c/∂b = -a/b².
  • Backward:
    a_contrib = unbroadcast(c.grad / b.data, a.data.shape)
    b_contrib = unbroadcast(-a.data * c.grad / (b.data ** 2), b.data.shape)
    

Negate, exp, log, relu, tanh

Unary; no broadcasting between args. But still:

  • Forward: applies NumPy primitive.
  • Backward: local Jacobian × upstream. No unbroadcast needed (shapes already match).
# c = exp(a)
def _backward():
    a.grad = (a.grad or 0) + c.data * c.grad   # since exp(a) = c

# c = log(a)
def _backward():
    a.grad = (a.grad or 0) + (1 / a.data) * c.grad

# c = relu(a)
def _backward():
    mask = (a.data > 0).astype(a.data.dtype)
    a.grad = (a.grad or 0) + mask * c.grad

# c = tanh(a)
def _backward():
    a.grad = (a.grad or 0) + (1 - c.data ** 2) * c.grad

GELU: c = gelu(a)

GELU (Gaussian Error Linear Unit) is the activation used by GPT-2, BERT, MiniGPT. Definition:

\[ \text{gelu}(x) = x \cdot \Phi(x) \]

where Φ is the CDF of the standard normal. Two common implementations:

Exact (uses erf):

import scipy.special  # we can avoid the dep; numpy has erf via math.erf elementwise
gelu = 0.5 * x * (1 + special.erf(x / np.sqrt(2)))

Tanh approximation (faster, used in original GPT-2):

gelu_approx = 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

Derivative (exact form):

\[ \frac{d}{dx}\text{gelu}(x) = \Phi(x) + x \cdot \phi(x) = \Phi(x) + \frac{x}{\sqrt{2\pi}} e^{-x^2/2} \]

where φ is the standard normal PDF. Implementation:

# c = gelu(a)
def _backward():
    pdf = (1 / np.sqrt(2 * np.pi)) * np.exp(-a.data ** 2 / 2)
    cdf = 0.5 * (1 + special.erf(a.data / np.sqrt(2)))
    a.grad = (a.grad or 0) + (cdf + a.data * pdf) * c.grad

Tanh-approx derivative: derive separately, more tedious but no erf needed. Pick one for the BLUEPRINT.

Reduction ops

These reduce a tensor along one or more axes. Their backward must broadcast the upstream gradient back to the input shape (the inverse of summing).

Sum: c = a.sum(axis=axes, keepdims=keepdims)

  • Forward: c.data = a.data.sum(axis=axes, keepdims=keepdims).
  • Local Jacobian: each input element contributes 1 to its summed output.
  • Backward:
    if not keepdims:
        # Reshape upstream to add singleton dims along reduced axes.
        shape = list(a.data.shape)
        for ax in (axes if isinstance(axes, tuple) else (axes,)):
            shape[ax] = 1
        upstream_reshaped = c.grad.reshape(shape)
    else:
        upstream_reshaped = c.grad
    a.grad = (a.grad or 0) + np.broadcast_to(upstream_reshaped, a.data.shape)
    

Tricky part: if axes is None (sum over everything), the result is a scalar and you broadcast the scalar to the input shape. If axes is an int, normalize to a tuple. Handle keepdims=True (upstream already has singleton dims; no reshape needed).

Mean: c = a.mean(axis=axes, keepdims=keepdims)

  • Forward: c.data = a.data.mean(axis=axes, keepdims=keepdims).
  • Local Jacobian: each input element contributes 1/N to the mean, where N is the count of elements reduced.
  • Backward: same as sum, but divide by N.
N = np.prod([a.data.shape[ax] for ax in axes_normalized])
# ... compute upstream_reshaped as in sum ...
a.grad = (a.grad or 0) + np.broadcast_to(upstream_reshaped / N, a.data.shape)

Shape ops

These don't change values — they only re-view the tensor. Their backward inverts the shape operation.

Reshape: c = a.reshape(new_shape)

  • Forward: c.data = a.data.reshape(new_shape). Local Jacobian: identity (after permutation).
  • Backward: a.grad += c.grad.reshape(a.data.shape).

Transpose: c = a.transpose(axes)

  • Forward: c.data = a.data.transpose(axes).
  • Backward: transpose by the inverse axis permutation.
    inverse_axes = np.argsort(axes)
    a.grad += c.grad.transpose(inverse_axes)
    

broadcast_to: c = broadcast_to(a, target_shape)

  • Forward: NumPy creates a view with stride-0 axes.
  • Backward: unbroadcast(c.grad, a.data.shape). Same machinery as elementwise ops.

getitem: c = a[indices]

  • Forward: c.data = a.data[indices].
  • Backward: scatter c.grad back into a zero-filled tensor at the indexed positions.
    contribution = np.zeros_like(a.data)
    np.add.at(contribution, indices, c.grad)   # handles repeated indices
    a.grad += contribution
    

Use np.add.at instead of contribution[indices] = c.grad to correctly accumulate when indices repeat.

cat / stack

cat: concatenate along an existing axis. stack: stack along a new axis.

Forward: NumPy primitives np.concatenate and np.stack.

Backward: split the upstream gradient along the cat/stack axis and contribute each piece to the corresponding parent.

# cat([a, b, c], axis=0), shapes (Na, K), (Nb, K), (Nc, K) → (Na+Nb+Nc, K)
def _backward():
    splits = np.split(out.grad, indices_or_sections=[Na, Na+Nb], axis=0)
    a.grad = (a.grad or 0) + splits[0]
    b.grad = (b.grad or 0) + splits[1]
    c.grad = (c.grad or 0) + splits[2]

What's missing (deferred to 03-matmul-and-softmax-grads.md)

  • matmul — the big one. Per-element index manipulation is the right derivation but lengthy.
  • softmax — Jacobian is dense (N × N for length-N input), so we need the combined cross_entropy(logits, targets) op to avoid materializing it.
  • cross_entropy — combined with softmax for numerical stability and gradient simplicity.

Those three live in the next page.

Common pitfalls

  1. Forgetting to call unbroadcast. The shape contract is violated; the next op chokes or the optimizer step silently wrong.
  2. Using += on a.grad when it's None. Lazy allocation: a.grad = a.grad + ... if a.grad is not None else .... Or always allocate np.zeros_like at first access. Phase 8 BLUEPRINT picks lazy.
  3. In-place modifying upstream c.grad. A child's _backward might be called by multiple downstream paths in the graph (the diamond case from Phase 7). If you do c.grad *= something you've changed the value other paths see. Always read c.grad and write to parents; never write back to c.grad.
  4. reduce(keepdims=False) shape bug. Forgetting to add singleton dims back before broadcast → shape mismatch.
  5. np.add.at vs contribution[indices] = c.grad. The latter overwrites on repeated indices; the former accumulates. For getitem backward, always use np.add.at.
  6. dtype mismatch between data and grad. If data is fp32 and you compute grad from a 1 / x where x is fp64, grad ends up fp64. Tests will fail intermittently. Cast explicitly.

One-paragraph recap

Every elementwise op's backward computes a local-Jacobian-times-upstream-gradient product and then calls unbroadcast(contribution, parent.shape) to handle broadcasting reverse. Reduction ops' backward broadcasts the upstream gradient back via np.broadcast_to (after reshape to add singleton dims if keepdims=False). Shape ops' backward applies the inverse shape transformation. Index-based ops use np.add.at for correct accumulation under repeated indices. The high-stakes ops — matmul, softmax, cross_entropy — live on the next page.


Next: 03-matmul-and-softmax-grads.md