02 — Op derivatives

The local derivative of every operation that Value supports: add, multiply, divide, power, exp, log, ReLU, tanh. This is the only new math in the phase. Each derivative fits on one line; together they are the raw material for all of backprop.


What you'll derive here

For each op c = f(a, b) (or c = f(a) for unary), the local derivative is ∂c/∂a (and ∂c/∂b for binary). The _backward closure for c will use these to contribute to parents:

a.grad += (∂c/∂a) * c.grad
b.grad += (∂c/∂b) * c.grad

This is the chain rule, one node at a time.
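A minimal sketch of that pattern, assuming a stripped-down Value that holds only data, grad, and a _backward closure. The class layout is hypothetical, not the minigrad.scalar implementation. It uses addition (both local derivatives are 1) and seeds c.grad by hand instead of running a full backward pass:

```python
class Value:
    """Hypothetical minimal scalar node, for illustration only."""

    def __init__(self, data):
        self.data = float(data)
        self.grad = 0.0
        self._backward = lambda: None  # leaf nodes have nothing to push

    def __add__(self, other):
        out = Value(self.data + other.data)

        def _backward():
            # contribution = local derivative (1 for both parents) * upstream grad
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad

        out._backward = _backward
        return out


a, b = Value(2.0), Value(3.0)
c = a + b
c.grad = 1.0   # seed the output: dc/dc = 1
c._backward()  # one chain-rule step, for this node only
print(c.data, a.grad, b.grad)  # 5.0 1.0 1.0
```

Every op below follows the same shape; only the local derivative inside the closure changes.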

Addition: c = a + b

  • ∂c/∂a = 1
  • ∂c/∂b = 1

_backward:

a.grad += c.grad
b.grad += c.grad

Sanity check: c = a + b = 2 + 3 = 5. If we increase a by ε, c increases by ε (so ∂c/∂a = 1). If we increase b by ε, c increases by ε (so ∂c/∂b = 1). ✓

Subtraction: c = a - b

Reuse: c = a + (-b). So:

  • ∂c/∂a = 1
  • ∂c/∂b = -1

_backward:

a.grad += c.grad
b.grad += -c.grad

Equivalent implementation: define __neg__ (which sets _backward for negation: ∂(-a)/∂a = -1), then implement __sub__ as __add__(a, -b). Either works; the latter is cleaner and reuses code.
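A sketch of the reuse route, assuming a hypothetical minimal Value that tracks its parents in a _parents tuple (an illustrative name, not necessarily the one minigrad.scalar uses). Subtraction gets no closure of its own; the gradient reaches b through the add node and then the neg node:

```python
class Value:
    """Hypothetical minimal node with just enough ops to show the reuse."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __neg__(self):
        out = Value(-self.data, (self,))

        def _backward():
            self.grad += -1.0 * out.grad  # d(-a)/da = -1

        out._backward = _backward
        return out

    def __sub__(self, other):
        return self + (-other)  # no dedicated _backward needed


a, b = Value(5.0), Value(3.0)
c = a - b
c.grad = 1.0
c._backward()              # add node: pushes 1.0 to a and to the (-b) node
c._parents[1]._backward()  # neg node: pushes -1.0 on to b
print(c.data, a.grad, b.grad)  # 2.0 1.0 -1.0
```

The cost of the reuse is one extra node in the graph per subtraction.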

Multiplication: c = a * b

  • ∂c/∂a = b
  • ∂c/∂b = a

_backward:

a.grad += b.data * c.grad
b.grad += a.data * c.grad

Sanity check: c = 2 * 3 = 6. Increase a by ε: c → (2+ε)·3 = 6 + 3ε. So ∂c/∂a = 3 = b. ✓ Same for b.

Note we read .data from the parents, not the parents themselves. The local derivative is a number (a float), not a Value. (Phase 8 will revisit this when grads become tensors.)

Division: c = a / b

Treat as c = a · b^(-1). By product rule + chain rule:

  • ∂c/∂a = b^(-1) = 1/b
  • ∂c/∂b = a · (-1) · b^(-2) = -a / b²

_backward:

a.grad += (1 / b.data) * c.grad
b.grad += (-a.data / (b.data ** 2)) * c.grad

Sanity check: c = 6/3 = 2. Increase a by ε: c → (6+ε)/3 = 2 + ε/3. So ∂c/∂a = 1/3 = 1/b. ✓ Increase b by ε: c → 6/(3+ε) ≈ 6/3 · (1 - ε/3) = 2 - 2ε/3. So ∂c/∂b = -2/3 = -a/b². ✓

Watch out: division by zero. With plain Python floats the forward pass raises ZeroDivisionError; array backends such as NumPy return inf (or nan for 0/0) instead, and backward then propagates inf/nan. Decide: clamp, or raise. The blueprint will pick a convention.
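As an illustration only, here is the division rule on plain floats with the "raise" convention, plus a central-difference check of ∂c/∂b. The helper name div_with_grads is hypothetical, and choosing "raise" here is an assumption for the sketch, not the blueprint's decision:

```python
def div_with_grads(a, b, upstream=1.0):
    """c = a / b on plain floats, plus each parent's gradient contribution.

    Hypothetical helper using the 'raise' convention for b == 0.
    """
    if b == 0.0:
        raise ZeroDivisionError("division by zero in forward pass")
    c = a / b
    da = (1.0 / b) * upstream
    db = (-a / b ** 2) * upstream
    return c, da, db


c, da, db = div_with_grads(6.0, 3.0)
print(c, da, db)

# Central-difference check of dc/db at (a, b) = (6, 3).
eps = 1e-6
numeric_db = ((6.0 / (3.0 + eps)) - (6.0 / (3.0 - eps))) / (2 * eps)
print(abs(numeric_db - db) < 1e-6)
```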

Power: c = a ** n (with n constant)

For constant exponent n (not a Value):

  • ∂c/∂a = n · a^(n-1)

_backward:

a.grad += (n * a.data ** (n - 1)) * c.grad

Why n must be a constant (Python int or float), not a Value: if n were also differentiable, you'd need ∂(a^n)/∂n = a^n · ln(a), which requires a > 0 and is more brittle. In minigrad.scalar we only support constant exponents. Document this restriction in the BLUEPRINT. PyTorch's pow supports both forms; we don't, because the educational payoff is low and the foot-gun is real.

Sanity check: c = 2³ = 8. ∂c/∂a = 3·2² = 12. Increase a to 2.01: c = 2.01³ ≈ 8.1206, so Δc/Δa ≈ 0.1206 / 0.01 ≈ 12.06, which approaches 12 as the step shrinks. ✓
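A sketch of the constant-exponent restriction, assuming a hypothetical minimal Value with only __pow__. The isinstance guard and the error message are illustrative choices; the text only requires that a Value exponent be rejected:

```python
class Value:
    """Hypothetical minimal node supporting only ** with a constant exponent."""

    def __init__(self, data):
        self.data = float(data)
        self.grad = 0.0
        self._backward = lambda: None

    def __pow__(self, n):
        if not isinstance(n, (int, float)):
            raise TypeError("exponent must be a Python int or float, not a Value")
        out = Value(self.data ** n)

        def _backward():
            # power rule: d(a^n)/da = n * a^(n-1)
            self.grad += (n * self.data ** (n - 1)) * out.grad

        out._backward = _backward
        return out


a = Value(2.0)
c = a ** 3
c.grad = 1.0
c._backward()
print(c.data, a.grad)  # 8.0 12.0
```

With this guard, a ** Value(2.0) fails at graph-construction time instead of silently producing a wrong gradient.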

Exponential: c = exp(a)

  • ∂c/∂a = exp(a) = c

_backward:

a.grad += c.data * c.grad

(We use c.data, not math.exp(a.data) — they're equal but c.data is already computed.)

Logarithm: c = log(a) (natural log)

  • ∂c/∂a = 1/a

_backward:

a.grad += (1 / a.data) * c.grad

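Watch out: a ≤ 0 is outside the domain. Python's math.log raises ValueError for a ≤ 0, while array backends such as NumPy return -inf at a = 0 and nan for a < 0. Decide: clamp a to a small ε, raise, or trust the caller. The combined cross_entropy(softmax(...)) pattern in Phase 8 avoids this entirely by never materializing log(0).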
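For reference, a small sketch of the log rule on plain floats, followed by what the standard library's math.log does with non-positive input. The helper log_with_grad is hypothetical, and the "raise" convention is just one of the three options above:

```python
import math


def log_with_grad(a, upstream=1.0):
    """c = ln(a) and a's gradient contribution, 'raise' convention for a <= 0.

    Hypothetical helper on plain floats.
    """
    if a <= 0.0:
        raise ValueError(f"log undefined for a = {a}")
    return math.log(a), (1.0 / a) * upstream


c, da = log_with_grad(2.0)
print(c, da)

# math.log itself rejects non-positive inputs with ValueError.
for bad in (0.0, -1.0):
    try:
        math.log(bad)
    except ValueError as err:
        print(bad, "->", err)
```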

ReLU: c = max(0, a)

  • ∂c/∂a = 1 if a > 0 else 0

_backward:

a.grad += (1.0 if a.data > 0 else 0.0) * c.grad

The sub-gradient question: what is ∂c/∂a at a = 0? Mathematically, ReLU is not differentiable at 0 — left derivative is 0, right derivative is 1. The sub-gradient set is [0, 1]. Frameworks pick a convention; common choices are 0 (most), 0.5 (some), or 1 (rare).

We pick 0, matching PyTorch. Document this in the BLUEPRINT and unit-test it explicitly:

a = Value(0.0)
c = a.relu()  # c.data = 0.0
c.backward()
assert a.grad == 0.0  # not 0.5, not 1.0

The choice rarely matters in practice (floats are almost never exactly zero), but tests must pin it.

Tanh: c = tanh(a)

Recall tanh(a) = (e^a - e^(-a)) / (e^a + e^(-a)). Derive:

\[ \frac{d}{da} \tanh(a) = 1 - \tanh^2(a) = 1 - c^2 \]
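One way to fill in the intermediate steps, applying the quotient rule to the definition above. Writing u = e^a - e^(-a) and v = e^a + e^(-a) (shorthand introduced only for this derivation), we have u' = v and v' = u, so:

\[ \frac{d}{da} \tanh(a) = \frac{u'v - uv'}{v^2} = \frac{v^2 - u^2}{v^2} = 1 - \left( \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} \right)^2 = 1 - \tanh^2(a) \]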

_backward:

a.grad += (1 - c.data ** 2) * c.grad

(We use c.data, not a re-computation, for the same reason as exp.)

Why tanh is in the basic op set: it's the simplest "bounded smooth nonlinearity" useful for tiny networks. The XOR MLP in this phase's experiment uses tanh as its hidden activation. Sigmoid and softmax don't add expressive power for scalar autograd and would clutter the API; we'll add them in Phase 8 where they're combined with cross-entropy.

A side note: tanh via exp, or native?

Two implementation choices:

Option A (native): tanh is a primitive op with its own _backward using 1 - c². One node in the graph.

Option B (via exp): decompose tanh(a) = (exp(2a) - 1) / (exp(2a) + 1). Build up via existing ops. ~5 nodes in the graph; visualization shows the structure; tests check the same answer.

Both are correct. Option A is faster and allocates fewer nodes. Option B is more pedagogically transparent (no special-case op, just composition).
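A quick way to convince yourself the two options agree, sketched on plain floats rather than Value nodes. Option B's derivative is assembled by hand from the exp and division rules above; the function names are hypothetical:

```python
import math


def tanh_native(a):
    """Option A: one primitive op; derivative reuses the output c."""
    c = math.tanh(a)
    return c, 1.0 - c ** 2


def tanh_via_exp(a):
    """Option B: compose (e - 1) / (e + 1) with e = exp(2a), chain rule by hand."""
    e = math.exp(2.0 * a)        # de/da = 2e
    num, den = e - 1.0, e + 1.0  # dnum/de = dden/de = 1
    c = num / den
    dc_de = 1.0 / den - num / den ** 2  # division rule applied to both parents
    return c, dc_de * 2.0 * e


for a in (-1.0, 0.0, 0.5, 2.0):
    (ca, da), (cb, db) = tanh_native(a), tanh_via_exp(a)
    print(a, abs(ca - cb) < 1e-12, abs(da - db) < 1e-12)
```

Note that exp(2a) overflows for large a, which is one more practical argument for the native form.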

Default for minigrad.scalar: Option A (native). Rationale: it's still pedagogically clear (one short closure) and matches PyTorch's structure. Option B is a fine exercise; do it once for understanding then go with A.

The BLUEPRINT records the choice. If Borja prefers B at phase open, BLUEPRINT changes.

Unary negation: c = -a

  • ∂c/∂a = -1

_backward:

a.grad += -1.0 * c.grad

Implement as __neg__ on Value, then reuse in __sub__.

Summary table

| Op | Forward | Local derivative |
|---|---|---|
| + | c = a + b | ∂c/∂a = 1, ∂c/∂b = 1 |
| - (binary) | c = a - b | ∂c/∂a = 1, ∂c/∂b = -1 |
| - (unary) | c = -a | ∂c/∂a = -1 |
| * | c = a · b | ∂c/∂a = b, ∂c/∂b = a |
| / | c = a / b | ∂c/∂a = 1/b, ∂c/∂b = -a/b² |
| ** n | c = aⁿ | ∂c/∂a = n · aⁿ⁻¹ |
| exp | c = e^a | ∂c/∂a = c |
| log | c = ln(a) | ∂c/∂a = 1/a |
| relu | c = max(0, a) | ∂c/∂a = 1 if a > 0 else 0 |
| tanh | c = tanh(a) | ∂c/∂a = 1 - c² |

Print this table. Tape it to the wall. By the end of Phase 7 you should not need to look at it.
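If you would rather verify the table than trust it, the sketch below compares every row against a central difference at a = 2, b = 3, using plain floats. The OPS list and max_error helper are hypothetical names for this check, and ** n is represented by the single case n = 3:

```python
import math

# (name, forward f(a, b), local derivatives (dc/da, dc/db)); unary ops ignore b.
OPS = [
    ("add",  lambda a, b: a + b,        lambda a, b: (1.0, 1.0)),
    ("sub",  lambda a, b: a - b,        lambda a, b: (1.0, -1.0)),
    ("neg",  lambda a, b: -a,           lambda a, b: (-1.0, 0.0)),
    ("mul",  lambda a, b: a * b,        lambda a, b: (b, a)),
    ("div",  lambda a, b: a / b,        lambda a, b: (1.0 / b, -a / b ** 2)),
    ("pow3", lambda a, b: a ** 3,       lambda a, b: (3 * a ** 2, 0.0)),
    ("exp",  lambda a, b: math.exp(a),  lambda a, b: (math.exp(a), 0.0)),
    ("log",  lambda a, b: math.log(a),  lambda a, b: (1.0 / a, 0.0)),
    ("relu", lambda a, b: max(0.0, a),  lambda a, b: (1.0 if a > 0 else 0.0, 0.0)),
    ("tanh", lambda a, b: math.tanh(a), lambda a, b: (1.0 - math.tanh(a) ** 2, 0.0)),
]


def max_error(f, grads, a=2.0, b=3.0, eps=1e-6):
    """Largest gap between the table's derivatives and central differences."""
    num_da = (f(a + eps, b) - f(a - eps, b)) / (2 * eps)
    num_db = (f(a, b + eps) - f(a, b - eps)) / (2 * eps)
    da, db = grads(a, b)
    return max(abs(num_da - da), abs(num_db - db))


errors = {name: max_error(f, g) for name, f, g in OPS}
for name, err in errors.items():
    print(f"{name:5s} max error {err:.2e}")
```

The check point a = 2 deliberately stays away from the ReLU kink at 0, where a central difference would report 0.5 and disagree with the chosen convention.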

Pitfalls (will bite in lab)

  1. Reading a.data after it changed. The closure captures a by reference; if you mutate a.data between forward and backward, the closure sees the new value. Don't mutate a node's data between the forward pass and the backward pass.
  2. Forgetting c.grad in the contribution. The local derivative is multiplied by the upstream gradient. a.grad += b.data is wrong; it must be a.grad += b.data * c.grad.
  3. Using c.data when the rule needs a.data. Be careful which .data you reach for. In mul backward, a.grad += b.data * c.grad uses b.data, not a.data. Sloppy substitution is a common bug.
  4. ReLU sub-gradient at 0. Pick a convention, document it, test it.
  5. pow with non-constant exponent. Don't support it. Raise TypeError if n is a Value.
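Pitfalls 1 and 2 are easy to reproduce. The sketch below assumes a hypothetical minimal Value with only multiplication; it mutates b.data after the forward pass and uses a non-unit upstream gradient so that a missing c.grad factor would also show up:

```python
class Value:
    """Hypothetical minimal node supporting only multiplication."""

    def __init__(self, data):
        self.data = float(data)
        self.grad = 0.0
        self._backward = lambda: None

    def __mul__(self, other):
        out = Value(self.data * other.data)

        def _backward():
            # reads .data at backward time, not at forward time
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out


a, b = Value(2.0), Value(3.0)
c = a * b        # forward saw b.data == 3.0
b.data = 10.0    # mutation between forward and backward (pitfall 1)
c.grad = 0.5     # non-unit upstream gradient makes pitfall 2 visible
c._backward()
print(a.grad)    # 5.0 = 10.0 * 0.5, not the intended 3.0 * 0.5 = 1.5
```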

One-paragraph recap

The ten ops in minigrad.scalar each have a local derivative that fits on one line. Backprop uses these via the chain rule: each node contributes local_derivative · upstream_grad to each parent's gradient. The trickiest design choices are conventions (ReLU at 0 → 0; the pow exponent must be constant; log(x ≤ 0) is the caller's responsibility). Memorize the table. From here, Phase 7's lab is just implementation — the math is settled.


Next: 03-worked-backprop.md