English · Español
04 — Gradcheck and property tests: how to know your tensor autograd is right¶
🇪🇸 Una librería de autograd que pasa los unit tests obvios y falla silenciosamente en un caso raro de broadcasting es peor que no tener autograd. Aquí montamos dos defensas: (1)
gradcheck— comparar el gradiente analítico contra una derivada numérica — y (2)hypothesis— generar formas aleatorias para fuzzar todos los caminos. Si las dos están verdes, el autograd es de fiar.
Why gradcheck¶
You wrote a _backward closure for softmax. It compiles. It runs. The output grad has the right shape. Is it correct?
Three ways to be wrong without noticing: 1. Sign error in one of the terms. 2. Forgot to multiply by a Jacobian entry. 3. Broadcasting reverse summed along the wrong axis.
The cheapest check that catches all three: compare to a numerical derivative.
The math: for a scalar-valued function f of a tensor x, the gradient at element x[i] is approximately
where e_i is the unit perturbation at index i. This is the central finite difference. It is accurate to O(ε²) for smooth f, which is why we use it instead of the forward-difference (f(x+ε) - f(x))/ε (only O(ε)).
In code, gradcheck is:
def gradcheck(f, x, eps=1e-6, atol=1e-4):
# f: callable Tensor -> scalar Tensor
# x: Tensor with requires_grad=True
# 1. Analytical gradient via autograd.
out = f(x)
out.backward()
grad_analytical = x.grad.copy()
# 2. Numerical gradient by perturbing each element.
grad_numerical = np.zeros_like(x.data)
for i in np.ndindex(x.data.shape):
x.data[i] += eps
f_plus = f(x).data
x.data[i] -= 2 * eps
f_minus = f(x).data
x.data[i] += eps # restore
grad_numerical[i] = (f_plus - f_minus) / (2 * eps)
return np.allclose(grad_analytical, grad_numerical, atol=atol)
That's the entire harness. The cost is 2 · N forward passes for a tensor of N elements. Acceptable for tensors up to ~100 elements (so don't gradcheck a (64, 64) matmul; use a (4, 5) example for the gradcheck and trust hypothesis for shape coverage).
Choosing eps¶
Two competing errors govern the gradcheck:
- Truncation error. The finite difference is only accurate to
O(ε²). Largerε→ larger truncation error. - Roundoff error. Computing
f(x + ε) - f(x - ε)subtracts two nearly-equal numbers. Catastrophic cancellation (Phase 2) means the relative error grows like1/ε.
Total error scales like ε² + δ/ε where δ is the machine epsilon. Minimised when ε ≈ δ^(1/3).
For FP64 (δ ≈ 1e-16), the optimal ε is ~1e-5. For FP32 (δ ≈ 1e-7), it's ~1e-3 — but at FP32 the roundoff already swamps any meaningful gradcheck. Always gradcheck at FP64. This is non-negotiable.
The plot of gradcheck error vs ε is U-shaped: high on the left (roundoff), high on the right (truncation), with a minimum near 1e-5-1e-6. Lab 02 has Borja produce this plot for one op.
🇪🇸 La regla de oro: si
gradcheckfalla coneps=1e-6y FP64, el gradiente está mal, no el chequeo. Es muy raro que la derivada numérica esté mal y la analítica bien.
Per-op gradcheck recipe¶
For each op:
1. Construct a small random input (shape ≤ (3, 4) typically).
2. Convert to FP64 (x = Tensor(np.array(..., dtype=np.float64), requires_grad=True)).
3. Define f(x) = op(x).sum() — gradcheck wants a scalar output.
4. Call gradcheck(f, x).
5. Assert the return is True at atol=1e-4.
For binary ops (add, mul, matmul), gradcheck both inputs independently: hold B fixed and check ∂/∂A, then swap.
Hypothesis: random-shape fuzzing¶
Gradcheck on hand-picked shapes catches most bugs. The bug it misses is the broadcasting bug at a shape you didn't test.
hypothesis is a Python property-testing library. You write a strategy (a random generator) for inputs and a property (assertion that should hold for all inputs). Hypothesis tries many random draws and shrinks failing examples to a minimal reproducer.
A shape strategy for our autograd:
from hypothesis import strategies as st
shape_strat = st.lists(st.integers(min_value=1, max_value=4), min_size=1, max_size=3).map(tuple)
# Generates shapes like (3,), (2, 4), (1, 3, 2), etc.
def array_strat(shape):
return st.lists(
st.floats(min_value=-2.0, max_value=2.0, allow_nan=False, allow_infinity=False),
min_size=int(np.prod(shape)), max_size=int(np.prod(shape)),
).map(lambda xs: np.array(xs, dtype=np.float64).reshape(shape))
A property test for add:
@given(shape=shape_strat)
def test_add_gradient(shape, data=data()):
a = data.draw(array_strat(shape))
b = data.draw(array_strat(shape))
A = Tensor(a, requires_grad=True)
B = Tensor(b, requires_grad=True)
f = lambda: (A + B).sum()
assert gradcheck_pair(f, A, B)
Hypothesis runs this with ~100 random (shape, a, b) draws by default. If any fails, it shrinks to the minimal failing case.
Broadcasting-specific strategies¶
The single highest-value hypothesis strategy is broadcastable shape pairs:
@st.composite
def broadcastable_pair(draw):
rank = draw(st.integers(min_value=1, max_value=3))
base = tuple(draw(st.integers(min_value=1, max_value=4)) for _ in range(rank))
# Now create a partner shape that broadcasts to `base`:
# - Each dim is either 1 or matches base.
# - Optionally drop leading dimensions.
partner_full = tuple(draw(st.sampled_from([d, 1])) for d in base)
drop = draw(st.integers(min_value=0, max_value=rank - 1))
partner = partner_full[drop:]
return base, partner
Each draw produces a (base_shape, partner_shape) where partner broadcasts to base. Apply elementwise ops to tensors of these shapes; assert gradcheck passes. If a single broadcast axis is summed wrong in backward, this catches it within ~10 draws.
What hypothesis cannot catch¶
Three classes of bug:
- Numerical instability that doesn't trip the small-shape test. E.g., softmax overflow for inputs of magnitude ~1000. Hypothesis defaults to bounded floats and won't generate extreme inputs unless you ask.
- Memory leaks / refcount issues. Property tests don't notice that you held onto an intermediate
Tensorand the graph grew unboundedly. - Concurrency bugs. Not relevant in Phase 8 (single-threaded).
For (1), add explicit examples for the stress tests: @example(x=np.array([1e3, 1e3, 1e3])). For (2), Phase 18 covers training-loop memory.
The two-oracle policy¶
Every op gets two tests:
- PyTorch oracle. Build the same graph in PyTorch at FP64, compute backward, compare gradients. Tolerance
1e-7. - Gradcheck oracle. Numerical differentiation as above. Tolerance
1e-4(the gap from PyTorch reflects FP roundoff in the central difference; not a bug).
If both pass, the op is right.
If PyTorch matches but gradcheck fails: you have a bug in the gradcheck, not the op. Re-derive your numerical formula.
If gradcheck passes but PyTorch fails: subtle. Sometimes a sign or transpose convention differs. Check op semantics against PyTorch docs.
If both fail: the op is wrong. Re-derive from theory/02 or 03.
Test budget¶
For Phase 8's 20 ops, the test count is:
- Per-op PyTorch cross-check: 20 tests, ~5 lines each. ~100 lines.
- Per-op gradcheck: 20 tests, ~10 lines each. ~200 lines.
- Broadcasting-pair hypothesis tests: 5 strategy-based tests covering elementwise ops × broadcastable shapes. ~50 lines.
- Edge cases (manual): softmax with extreme inputs, CE with one-hot targets at edge classes, reshape that changes rank. ~50 lines.
Total: ~400 LOC of tests. Roughly the same size as tensor.py itself. This is normal and correct. Phase 8's test:code ratio is ~1:1 because the cost of a silent autograd bug is enormous.
When to stop testing¶
When (a) PyTorch cross-check is green for all 20 ops, (b) gradcheck is green for all 20 ops at FP64, © the broadcasting hypothesis tests have run ≥ 100 draws each without failure, and (d) the experiments/08-tense-classifier/ MLP trains to > 90% accuracy. At that point the autograd is trustworthy enough to build Phase 9 on top of.
What this page does NOT cover¶
- PyTorch's own gradcheck internals (
torch.autograd.gradcheck). Functionally identical; PyTorch's adds a Jacobian-vector-product path for non-scalar outputs. - Forward-mode autodiff cross-check. Useful for high-dimensional outputs; not needed at our scale.
- Higher-order gradchecks.
gradgradcheck— verifying second derivatives. Out of scope (we don't implement second-order backward inminitorch).
Next: lab/00-tensor-skeleton.md.