Skip to content

English · Español

04 — Gradcheck and property tests: how to know your tensor autograd is right

🇪🇸 Una librería de autograd que pasa los unit tests obvios y falla silenciosamente en un caso raro de broadcasting es peor que no tener autograd. Aquí montamos dos defensas: (1) gradcheck — comparar el gradiente analítico contra una derivada numérica — y (2) hypothesis — generar formas aleatorias para fuzzar todos los caminos. Si las dos están verdes, el autograd es de fiar.


Why gradcheck

You wrote a _backward closure for softmax. It compiles. It runs. The output grad has the right shape. Is it correct?

Three ways to be wrong without noticing: 1. Sign error in one of the terms. 2. Forgot to multiply by a Jacobian entry. 3. Broadcasting reverse summed along the wrong axis.

The cheapest check that catches all three: compare to a numerical derivative.

The math: for a scalar-valued function f of a tensor x, the gradient at element x[i] is approximately

\[ \frac{\partial f}{\partial x_i} \approx \frac{f(x + \varepsilon e_i) - f(x - \varepsilon e_i)}{2\varepsilon} \]

where e_i is the unit perturbation at index i. This is the central finite difference. It is accurate to O(ε²) for smooth f, which is why we use it instead of the forward-difference (f(x+ε) - f(x))/ε (only O(ε)).

In code, gradcheck is:

def gradcheck(f, x, eps=1e-6, atol=1e-4):
    # f: callable Tensor -> scalar Tensor
    # x: Tensor with requires_grad=True
    # 1. Analytical gradient via autograd.
    out = f(x)
    out.backward()
    grad_analytical = x.grad.copy()

    # 2. Numerical gradient by perturbing each element.
    grad_numerical = np.zeros_like(x.data)
    for i in np.ndindex(x.data.shape):
        x.data[i] += eps
        f_plus = f(x).data
        x.data[i] -= 2 * eps
        f_minus = f(x).data
        x.data[i] += eps  # restore
        grad_numerical[i] = (f_plus - f_minus) / (2 * eps)

    return np.allclose(grad_analytical, grad_numerical, atol=atol)

That's the entire harness. The cost is 2 · N forward passes for a tensor of N elements. Acceptable for tensors up to ~100 elements (so don't gradcheck a (64, 64) matmul; use a (4, 5) example for the gradcheck and trust hypothesis for shape coverage).

Choosing eps

Two competing errors govern the gradcheck:

  1. Truncation error. The finite difference is only accurate to O(ε²). Larger ε → larger truncation error.
  2. Roundoff error. Computing f(x + ε) - f(x - ε) subtracts two nearly-equal numbers. Catastrophic cancellation (Phase 2) means the relative error grows like 1/ε.

Total error scales like ε² + δ/ε where δ is the machine epsilon. Minimised when ε ≈ δ^(1/3).

For FP64 (δ ≈ 1e-16), the optimal ε is ~1e-5. For FP32 (δ ≈ 1e-7), it's ~1e-3 — but at FP32 the roundoff already swamps any meaningful gradcheck. Always gradcheck at FP64. This is non-negotiable.

The plot of gradcheck error vs ε is U-shaped: high on the left (roundoff), high on the right (truncation), with a minimum near 1e-5-1e-6. Lab 02 has Borja produce this plot for one op.

🇪🇸 La regla de oro: si gradcheck falla con eps=1e-6 y FP64, el gradiente está mal, no el chequeo. Es muy raro que la derivada numérica esté mal y la analítica bien.

Per-op gradcheck recipe

For each op: 1. Construct a small random input (shape(3, 4) typically). 2. Convert to FP64 (x = Tensor(np.array(..., dtype=np.float64), requires_grad=True)). 3. Define f(x) = op(x).sum() — gradcheck wants a scalar output. 4. Call gradcheck(f, x). 5. Assert the return is True at atol=1e-4.

For binary ops (add, mul, matmul), gradcheck both inputs independently: hold B fixed and check ∂/∂A, then swap.

Hypothesis: random-shape fuzzing

Gradcheck on hand-picked shapes catches most bugs. The bug it misses is the broadcasting bug at a shape you didn't test.

hypothesis is a Python property-testing library. You write a strategy (a random generator) for inputs and a property (assertion that should hold for all inputs). Hypothesis tries many random draws and shrinks failing examples to a minimal reproducer.

A shape strategy for our autograd:

from hypothesis import strategies as st

shape_strat = st.lists(st.integers(min_value=1, max_value=4), min_size=1, max_size=3).map(tuple)
# Generates shapes like (3,), (2, 4), (1, 3, 2), etc.

def array_strat(shape):
    return st.lists(
        st.floats(min_value=-2.0, max_value=2.0, allow_nan=False, allow_infinity=False),
        min_size=int(np.prod(shape)), max_size=int(np.prod(shape)),
    ).map(lambda xs: np.array(xs, dtype=np.float64).reshape(shape))

A property test for add:

@given(shape=shape_strat)
def test_add_gradient(shape, data=data()):
    a = data.draw(array_strat(shape))
    b = data.draw(array_strat(shape))
    A = Tensor(a, requires_grad=True)
    B = Tensor(b, requires_grad=True)
    f = lambda: (A + B).sum()
    assert gradcheck_pair(f, A, B)

Hypothesis runs this with ~100 random (shape, a, b) draws by default. If any fails, it shrinks to the minimal failing case.

Broadcasting-specific strategies

The single highest-value hypothesis strategy is broadcastable shape pairs:

@st.composite
def broadcastable_pair(draw):
    rank = draw(st.integers(min_value=1, max_value=3))
    base = tuple(draw(st.integers(min_value=1, max_value=4)) for _ in range(rank))
    # Now create a partner shape that broadcasts to `base`:
    #   - Each dim is either 1 or matches base.
    #   - Optionally drop leading dimensions.
    partner_full = tuple(draw(st.sampled_from([d, 1])) for d in base)
    drop = draw(st.integers(min_value=0, max_value=rank - 1))
    partner = partner_full[drop:]
    return base, partner

Each draw produces a (base_shape, partner_shape) where partner broadcasts to base. Apply elementwise ops to tensors of these shapes; assert gradcheck passes. If a single broadcast axis is summed wrong in backward, this catches it within ~10 draws.

What hypothesis cannot catch

Three classes of bug:

  1. Numerical instability that doesn't trip the small-shape test. E.g., softmax overflow for inputs of magnitude ~1000. Hypothesis defaults to bounded floats and won't generate extreme inputs unless you ask.
  2. Memory leaks / refcount issues. Property tests don't notice that you held onto an intermediate Tensor and the graph grew unboundedly.
  3. Concurrency bugs. Not relevant in Phase 8 (single-threaded).

For (1), add explicit examples for the stress tests: @example(x=np.array([1e3, 1e3, 1e3])). For (2), Phase 18 covers training-loop memory.

The two-oracle policy

Every op gets two tests:

  1. PyTorch oracle. Build the same graph in PyTorch at FP64, compute backward, compare gradients. Tolerance 1e-7.
  2. Gradcheck oracle. Numerical differentiation as above. Tolerance 1e-4 (the gap from PyTorch reflects FP roundoff in the central difference; not a bug).

If both pass, the op is right.

If PyTorch matches but gradcheck fails: you have a bug in the gradcheck, not the op. Re-derive your numerical formula.

If gradcheck passes but PyTorch fails: subtle. Sometimes a sign or transpose convention differs. Check op semantics against PyTorch docs.

If both fail: the op is wrong. Re-derive from theory/02 or 03.

Test budget

For Phase 8's 20 ops, the test count is:

  • Per-op PyTorch cross-check: 20 tests, ~5 lines each. ~100 lines.
  • Per-op gradcheck: 20 tests, ~10 lines each. ~200 lines.
  • Broadcasting-pair hypothesis tests: 5 strategy-based tests covering elementwise ops × broadcastable shapes. ~50 lines.
  • Edge cases (manual): softmax with extreme inputs, CE with one-hot targets at edge classes, reshape that changes rank. ~50 lines.

Total: ~400 LOC of tests. Roughly the same size as tensor.py itself. This is normal and correct. Phase 8's test:code ratio is ~1:1 because the cost of a silent autograd bug is enormous.

When to stop testing

When (a) PyTorch cross-check is green for all 20 ops, (b) gradcheck is green for all 20 ops at FP64, © the broadcasting hypothesis tests have run ≥ 100 draws each without failure, and (d) the experiments/08-tense-classifier/ MLP trains to > 90% accuracy. At that point the autograd is trustworthy enough to build Phase 9 on top of.

What this page does NOT cover

  • PyTorch's own gradcheck internals (torch.autograd.gradcheck). Functionally identical; PyTorch's adds a Jacobian-vector-product path for non-scalar outputs.
  • Forward-mode autodiff cross-check. Useful for high-dimensional outputs; not needed at our scale.
  • Higher-order gradchecks. gradgradcheck — verifying second derivatives. Out of scope (we don't implement second-order backward in minitorch).

Next: lab/00-tensor-skeleton.md.