Lab 01 — Autograd by hand for nn.Linear(64, 600)

You derive by hand the gradients of a Linear(64, 600), all three: ∂L/∂x, ∂L/∂W, ∂L/∂b. Then you run loss.backward() in PyTorch and compare. The target is 1e-7 at fp32. If the numbers do not match, you fix them. This is the exercise that cements the conviction that "PyTorch autograd is exactly what we built in Phase 7/8: bigger, not different".

Objective

Derive by hand the backward formulas for y = linear(x, W, b) = x @ W.T + b followed by a scalar loss L = 0.5 * ((y - target) ** 2).sum(). Compute ∂L/∂x, ∂L/∂W, ∂L/∂b analytically, then verify against PyTorch's autograd at fp32: the element-wise error should typically sit around 1e-7 and must stay below 1e-5, the bound used under Acceptance.

Setup

  • Phase 7/8 (scalar/tensor autograd) and Phase 04 (calculus theory).
  • The forward shapes: x ∈ R^(2 × 64), W ∈ R^(600 × 64), b ∈ R^(600), y ∈ R^(2 × 600), target ∈ R^(2 × 600), L ∈ R.

The math

Forward:

\[y = x W^T + b \qquad y_{i,j} = \sum_k x_{i,k} W_{j,k} + b_j\]
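
The index form above can be checked directly against torch.nn.functional.linear. The snippet below is a minimal sketch: the seed and variable names are illustrative, and the einsum string is just one way to spell the sum over k.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, chosen only for reproducibility
x = torch.randn(2, 64)
W = torch.randn(600, 64)
b = torch.randn(600)

# Index form: y[i, j] = sum_k x[i, k] * W[j, k] + b[j]
y_index = torch.einsum("ik,jk->ij", x, W) + b
# Library form: x @ W.T + b
y_linear = F.linear(x, W, b)

print(tuple(y_linear.shape))
print("max |difference|:", (y_index - y_linear).abs().max().item())
```

Note that the contraction runs over the second index of both x and W, which is exactly why the matrix form needs W^T.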

Loss:

\[L = \tfrac{1}{2}\sum_{i,j} (y_{i,j} - t_{i,j})^2 \qquad \frac{\partial L}{\partial y} = y - t\]

Chain rule:

  • \(\dfrac{\partial L}{\partial x} = \dfrac{\partial L}{\partial y} \cdot \dfrac{\partial y}{\partial x} = (y - t) W\)   (shape (2, 64))
  • \(\dfrac{\partial L}{\partial W} = (y - t)^T x\)   (shape (600, 64))
  • \(\dfrac{\partial L}{\partial b} = \sum_i (y_i - t_i)\)   (shape (600,))
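
Writing the chain rule in index notation makes the transposes in the three formulas easy to check. The derivation below uses only the forward definition \(y_{i,j} = \sum_k x_{i,k} W_{j,k} + b_j\) and \(\partial L / \partial y_{i,j} = y_{i,j} - t_{i,j}\):

\[\frac{\partial L}{\partial x_{i,k}} = \sum_j \frac{\partial L}{\partial y_{i,j}} \frac{\partial y_{i,j}}{\partial x_{i,k}} = \sum_j (y_{i,j} - t_{i,j})\, W_{j,k} \quad\Longrightarrow\quad \frac{\partial L}{\partial x} = (y - t)\, W\]

\[\frac{\partial L}{\partial W_{j,k}} = \sum_i \frac{\partial L}{\partial y_{i,j}} \frac{\partial y_{i,j}}{\partial W_{j,k}} = \sum_i (y_{i,j} - t_{i,j})\, x_{i,k} \quad\Longrightarrow\quad \frac{\partial L}{\partial W} = (y - t)^T x\]

\[\frac{\partial L}{\partial b_j} = \sum_i \frac{\partial L}{\partial y_{i,j}} \frac{\partial y_{i,j}}{\partial b_j} = \sum_i (y_{i,j} - t_{i,j})\]

In each case the summed index is the one that does not appear on the left-hand side: j for x, and the batch index i for W and b.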

These are the formulas the AddmmBackward0 node computes internally. The lab verifies that.

Tasks

Part A — Forward + analytical backward

import torch
torch.manual_seed(42)

x = torch.randn(2, 64, requires_grad=True)
W = torch.randn(600, 64, requires_grad=True)
b = torch.randn(600, requires_grad=True)
target = torch.randn(2, 600)

y = torch.nn.functional.linear(x, W, b)
loss = 0.5 * ((y - target) ** 2).sum()

# Analytical gradients (no autograd):
with torch.no_grad():
    dy = y - target                          # (2, 600)
    dx_manual = dy @ W                       # (2, 64)
    dW_manual = dy.T @ x                     # (600, 64)
    db_manual = dy.sum(dim=0)                # (600,)

Part B — PyTorch autograd

loss.backward()

print("dx max-err:", (x.grad - dx_manual).abs().max().item())
print("dW max-err:", (W.grad - dW_manual).abs().max().item())
print("db max-err:", (b.grad - db_manual).abs().max().item())

All three max-errors should be below 1e-5 at fp32, and are typically around 1e-7 or smaller.
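
If you prefer a check that fails loudly over printed numbers, the comparison can be phrased with torch.testing.assert_close. This is a sketch, not part of the required deliverable: the dictionary layout and the ATOL name are illustrative, and rtol=0.0 is chosen here so that the check is purely element-wise and absolute, matching the 1e-5 bound.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(42)
x = torch.randn(2, 64, requires_grad=True)
W = torch.randn(600, 64, requires_grad=True)
b = torch.randn(600, requires_grad=True)
target = torch.randn(2, 600)

y = F.linear(x, W, b)
loss = 0.5 * ((y - target) ** 2).sum()
loss.backward()

# Manual gradients, computed without recording a graph.
with torch.no_grad():
    dy = y - target
    manual = {"dx": dy @ W, "dW": dy.T @ x, "db": dy.sum(dim=0)}
auto = {"dx": x.grad, "dW": W.grad, "db": b.grad}

ATOL = 1e-5  # the lab's fp32 bound; the name is illustrative
for name in manual:
    # rtol=0.0 turns this into a pure absolute, element-wise comparison.
    torch.testing.assert_close(auto[name], manual[name], rtol=0.0, atol=ATOL)
    print(name, "within", ATOL)
```

assert_close raises an AssertionError describing the worst mismatch, which is usually more informative than a single max-error value.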

Part C — Walk the grad_fn chain

node = loss.grad_fn
while node is not None:
    print(type(node).__name__, [t for t in node.next_functions])
    nexts = [t[0] for t in node.next_functions if t[0] is not None]
    node = nexts[0] if nexts else None

Expected output (approximate; names vary by version):

MulBackward0   [(SumBackward0, 0), (None, 0)] # the 0.5 *
SumBackward0   [(PowBackward0, 0)]
PowBackward0   [(SubBackward0, 0)]
SubBackward0   [(AddmmBackward0, 0), (None, 0)]
AddmmBackward0 [(AccumulateGrad, 0), (AccumulateGrad, 0), (TBackward0, 0)]

Identify:

  • AddmmBackward0: the matmul-and-bias-add node.
  • AccumulateGrad: leaf nodes that accumulate gradients into .grad.
  • TBackward0: the implicit transpose W → W^T that linear inserted.
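
The loop in Part C follows only the first non-None branch, so it never reaches the transpose node or the other two leaves. A depth-first walk visits every branch. The sketch below assumes the same tensors as Part A; the walk helper is hypothetical (not a PyTorch API) and relies only on grad_fn and next_functions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(42)
x = torch.randn(2, 64, requires_grad=True)
W = torch.randn(600, 64, requires_grad=True)
b = torch.randn(600, requires_grad=True)
target = torch.randn(2, 600)
loss = 0.5 * ((F.linear(x, W, b) - target) ** 2).sum()


def walk(node, depth=0, names=None):
    """Depth-first traversal of the backward graph; returns visited node names."""
    if names is None:
        names = []
    name = type(node).__name__
    names.append(name)
    print("  " * depth + name)
    for child, _input_index in node.next_functions:
        if child is not None:  # None marks an input that needs no gradient
            walk(child, depth + 1, names)
    return names


names = walk(loss.grad_fn)
print("AccumulateGrad nodes:", names.count("AccumulateGrad"))
```

One AccumulateGrad node should appear per leaf tensor that requires a gradient (x, W, and b here). The other node names can differ between PyTorch versions and backends.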

Part D — Verify the 1e-7 claim at fp32

torch.manual_seed(123)
errors = []
for _ in range(20):
    x = torch.randn(2, 64, requires_grad=True)
    W = torch.randn(600, 64, requires_grad=True)
    b = torch.randn(600, requires_grad=True)
    target = torch.randn(2, 600)

    y = torch.nn.functional.linear(x, W, b)
    loss = 0.5 * ((y - target) ** 2).sum()
    with torch.no_grad():
        dy = y - target
        dx_m, dW_m, db_m = dy @ W, dy.T @ x, dy.sum(dim=0)
    loss.backward()

    errors.append((
        (x.grad - dx_m).abs().max().item(),
        (W.grad - dW_m).abs().max().item(),
        (b.grad - db_m).abs().max().item(),
    ))

import numpy as np
e = np.array(errors)
print("dx: max", e[:, 0].max(), " median", float(np.median(e[:, 0])))
print("dW: max", e[:, 1].max(), " median", float(np.median(e[:, 1])))
print("db: max", e[:, 2].max(), " median", float(np.median(e[:, 2])))

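Expected: medians are around 1e-7 and maxes are below 1e-5. Where the error is not exactly 0.0 even though the formula is identical, the cause is summation order: PyTorch's backward kernels and your @ may add the same products in a different order, and floating-point addition is not associative. On some builds the manual expression dispatches to the same kernel as the backward node, in which case the error can be exactly 0.0. Document whichever you observe.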
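
A three-term example shows why the order of additions can change an fp32 result. It is a deliberately extreme illustration with hand-picked values, not a measurement of what addmm does internally.

```python
import torch

big = torch.tensor(1e8, dtype=torch.float32)
one = torch.tensor(1.0, dtype=torch.float32)

# The same three terms, grouped two ways.
# Near 1e8 adjacent fp32 values are 8 apart, so big + one rounds back to big.
add_then_cancel = (big + one) - big
cancel_then_add = (big - big) + one

print(add_then_cancel.item(), cancel_then_add.item())
```

In a matmul the effect is far smaller, because the terms have similar magnitudes, but the mechanism is the same: each partial sum is rounded, and a different order rounds different partial sums.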

Part E — Repeat at fp16, observe the degradation

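dtype = torch.float16
x = torch.randn(2, 64, dtype=dtype, requires_grad=True)
# Create W, b, and target with dtype=dtype as well, using the Part D shapes.
# ... the rest of the loop is the same as Part D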

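Expected: errors are now 1e-3 or worse. fp16 keeps only about three significant decimal digits (machine epsilon ≈ 9.8e-4), so every rounding step, including the ones that depend on summation order, costs far more than at fp32. This loss of precision, together with fp16's narrow range, is the canonical reason fp16 training relies on mixed precision and loss scaling (Phase 18 mentioned it; Phase 26 dives in).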
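
To see how coarse fp16 is before running the full experiment, you can compute a gradient in fp32 and merely store it in fp16. The sketch below isolates storage rounding only; it does not perform any arithmetic in fp16, so it is a lower bound on the trouble and not a substitute for Part E.

```python
import torch
import torch.nn.functional as F

fp16 = torch.finfo(torch.float16)
print("fp16 eps:", fp16.eps)  # gap between 1.0 and the next fp16 value
print("fp16 max:", fp16.max)  # largest finite fp16 value

torch.manual_seed(42)
x = torch.randn(2, 64)
W = torch.randn(600, 64)
b = torch.randn(600)
target = torch.randn(2, 600)

# Compute dx in fp32, then round-trip it through fp16 storage.
dy = F.linear(x, W, b) - target
dx_fp32 = dy @ W
dx_roundtrip = dx_fp32.to(torch.float16).to(torch.float32)

storage_err = (dx_fp32 - dx_roundtrip).abs().max().item()
print("largest |dx|:", dx_fp32.abs().max().item())
print("max error from fp16 storage alone:", storage_err)
```

Because the entries of dx are large for these shapes, the absolute gap between neighbouring fp16 values at that magnitude is already well above the fp32 tolerance.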

Part F — Write the report

experiments/25-autograd-by-hand/REPORT.md:

  1. The math (LaTeX-rendered, three formulas).
  2. The error table from Part D (20 runs, median + max per gradient).
  3. The fp16 result from Part E with a 2-sentence explanation of why precision matters.
  4. The grad_fn chain printout from Part C.
  5. One paragraph: "PyTorch's autograd computed the same formulas I wrote by hand. The deviation at fp32 is < 1e-5, dominated by summation-order differences. At fp16 the deviation is 1e-3, large enough to affect convergence — this is the failure mode mixed-precision training addresses."

Deliverable

experiments/25-autograd-by-hand/:

  • REPORT.md: items above.
  • errors.csv: the 20-run × 3-gradient error table.
  • manifest.json.
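
One way to produce errors.csv from the errors list built in Part D is sketched below. The lab does not prescribe a schema, so the column names and the write_errors_csv helper are assumptions, and the demo writes placeholder numbers to a temporary directory only to show the call.

```python
import csv
import tempfile
from pathlib import Path

# Assumed header; the lab does not fix the column names.
HEADER = ["run", "dx_max_err", "dW_max_err", "db_max_err"]


def write_errors_csv(errors, path):
    """Write one row per run: the run index plus the max-error of each gradient."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        for run, (dx_err, dw_err, db_err) in enumerate(errors):
            writer.writerow([run, dx_err, dw_err, db_err])
    return path


# Demo with placeholder values (NOT real results); pass the Part D `errors` instead.
with tempfile.TemporaryDirectory() as tmp:
    out = write_errors_csv([(0.0, 1e-7, 2e-7), (3e-7, 0.0, 1e-7)], Path(tmp) / "errors.csv")
    with out.open(newline="") as f:
        rows = list(csv.reader(f))
print(rows[0])
```

For the deliverable, the call would be write_errors_csv(errors, "experiments/25-autograd-by-hand/errors.csv").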

Acceptance

  • fp32: all 20 runs × 3 gradients have max-err < 1e-5.
  • fp16: at least one gradient has max-err > 1e-3 (proves the precision sensitivity).
  • The grad_fn chain printout identifies AddmmBackward0, AccumulateGrad, and the transpose.
  • The interpretation paragraph correctly attributes the fp32 discrepancy to summation order.
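
The two numeric criteria above can be checked mechanically. The helper below is a hypothetical sketch (check_acceptance is not part of the lab materials); it takes one (dx, dW, db) tuple of max-errors per run, as collected in Part D, and covers only the fp32 and fp16 bullets, not the graph printout or the written interpretation.

```python
FP32_MAX_ERR = 1e-5  # every fp32 error must be below this
FP16_MIN_ERR = 1e-3  # at least one fp16 error must exceed this


def check_acceptance(fp32_errors, fp16_errors):
    """Return (fp32_ok, fp16_ok) for lists of per-run (dx, dW, db) max-errors."""
    fp32_ok = all(err < FP32_MAX_ERR for run in fp32_errors for err in run)
    fp16_ok = any(err > FP16_MIN_ERR for run in fp16_errors for err in run)
    return fp32_ok, fp16_ok


# Illustrative inputs only; substitute the measured error lists.
print(check_acceptance([(0.0, 1e-7, 2e-6)], [(5e-2, 1e-4, 0.0)]))
```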

Pitfalls

  • Wrong transpose direction. linear(x, W, b) = x @ W.T + b. If you write x @ W + b, shapes won't match (x: (2,64), W: (600,64), can't matmul).
  • Forgetting 0.5 * in the loss. Then ∂L/∂y = 2(y - t), not (y - t). Match the constant in the loss.
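  • Running forward and backward again without zeroing .grad. Gradients accumulate, so the second pass doubles the stored answer. (Calling loss.backward() twice on the same graph raises a RuntimeError instead, unless you pass retain_graph=True.) Use x.grad.zero_() between runs, or rebuild the tensors fresh.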
  • fp16 producing inf or nan. fp16 tops out at 65504, so squared residuals and their sum can overflow when magnitudes are large. Scale target by 0.1 if you see inf.
  • Comparing dW.T to dW. PyTorch stores W.grad in the same layout as W. If your manual dW_manual is computed as x.T @ dy (shape (64, 600)), you'll need .T. Check shapes before comparing.
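
The accumulation pitfall is easy to reproduce in isolation. The sketch below uses a cut-down loss (no bias, no target) purely to keep it short; forward_backward is an illustrative helper, not part of the lab code.

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 64, requires_grad=True)
W = torch.randn(600, 64)


def forward_backward():
    # Rebuild the graph on every call; backward() then adds into x.grad.
    loss = 0.5 * (x @ W.T).pow(2).sum()
    loss.backward()


forward_backward()
first = x.grad.clone()

forward_backward()  # no zeroing: x.grad now holds the gradient twice
doubled = x.grad.clone()

x.grad.zero_()  # reset before the next measurement
forward_backward()
after_reset = x.grad.clone()

print("doubled without zeroing:", torch.equal(doubled, 2 * first))
print("correct after zero_():", torch.equal(after_reset, first))
```

If a manual gradient is off by an exact factor of two (or three), stale .grad values are the first thing to rule out.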

Stretch

  • Add a second linear layer y2 = linear(y, W2, b2) and derive the full two-layer backward. Compare to autograd.
  • Replace the squared loss with cross-entropy + softmax over the 600 grammar classes. Derive ∂L/∂y = softmax(y) - one_hot(target). This is the formula Phase 18 actually trains against.
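  • Use torch.autograd.gradcheck as an independent check. It compares autograd against finite differences, so run it on float64 inputs; fp32 is too coarse for its default tolerances.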
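
For the gradcheck item, a minimal sketch follows. It assumes small float64 tensors (the 5 and 7 dimensions are arbitrary, chosen to keep the finite-difference pass cheap) and wraps the lab's loss in a hypothetical loss_fn; the eps, atol, and rtol values are written out explicitly and can be tuned.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# float64 inputs: gradcheck perturbs each element by eps and needs the headroom.
x = torch.randn(2, 5, dtype=torch.float64, requires_grad=True)
W = torch.randn(7, 5, dtype=torch.float64, requires_grad=True)
b = torch.randn(7, dtype=torch.float64, requires_grad=True)
target = torch.randn(2, 7, dtype=torch.float64)


def loss_fn(x, W, b):
    # Same linear layer + halved squared error as the lab, at a smaller size.
    return 0.5 * ((F.linear(x, W, b) - target) ** 2).sum()


# Compares autograd's Jacobian with a finite-difference estimate.
ok = torch.autograd.gradcheck(loss_fn, (x, W, b), eps=1e-6, atol=1e-5, rtol=1e-3)
print("gradcheck passed:", ok)
```

gradcheck validates autograd against numerical differentiation, whereas Parts A to D validate it against your analytical formulas; the two checks are complementary.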

Next lab: lab/02-custom-op.md.