Lab 01: Autograd by hand for nn.Linear(64, 600)
You derive by hand the gradients of a Linear(64, 600), all three: ∂L/∂x, ∂L/∂W, ∂L/∂b. Then you run loss.backward() in PyTorch and compare. The target is agreement to about 1e-7 at fp32. If the numbers don't match, you fix the derivation. This exercise cements the conviction that "PyTorch autograd is exactly what we built in Phase 7/8: bigger, not different".
Objective
Derive by hand the backward formulas for y = linear(x, W, b) = x @ W.T + b followed by a scalar loss L = (y - target).pow(2).sum() / 2. Compute ∂L/∂x, ∂L/∂W, ∂L/∂b analytically, then verify them against PyTorch's autograd at fp32: every element should agree within 1e-5, and typically to about 1e-7.
Setup
- Prerequisites: Phase 7/8 (scalar/tensor autograd) and Phase 04 (calculus theory).
- The forward shapes: x ∈ R^(2 × 64), W ∈ R^(600 × 64), b ∈ R^(600), y ∈ R^(2 × 600), target ∈ R^(2 × 600), L ∈ R.
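The math
Forward:
\[ y = x W^T + b \]
Loss (with \(t\) denoting target):
\[ L = \tfrac{1}{2} \sum_{i,j} (y_{ij} - t_{ij})^2 \]
Chain rule, starting from \(\dfrac{\partial L}{\partial y} = y - t\) (shape (2, 600)):
- \(\dfrac{\partial L}{\partial x} = \dfrac{\partial L}{\partial y} \cdot \dfrac{\partial y}{\partial x} = (y - t) W\) (shape (2, 64))
- \(\dfrac{\partial L}{\partial W} = (y - t)^T x\) (shape (600, 64))
- \(\dfrac{\partial L}{\partial b} = \sum_i (y_i - t_i)\), summed over the batch rows \(i\) (shape (600,))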
These are the formulas the AddmmBackward0 node computes internally. The lab verifies that.
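For readers who want the intermediate step, the three formulas can be recovered from index notation. This is a sketch, not part of the original lab: the shorthand \(\delta = \partial L / \partial y = y - t\) is introduced here for convenience, with \(n\) indexing the 2 batch rows, \(j\) the 600 outputs, and \(k\) the 64 inputs.

```latex
\begin{aligned}
y_{nj} &= \sum_{k} x_{nk} W_{jk} + b_j, \qquad
L = \tfrac{1}{2} \sum_{n,j} (y_{nj} - t_{nj})^2, \qquad
\delta_{nj} \equiv \frac{\partial L}{\partial y_{nj}} = y_{nj} - t_{nj} \\
\frac{\partial L}{\partial x_{nk}} &= \sum_{j} \delta_{nj} W_{jk}
  \;\Rightarrow\; \frac{\partial L}{\partial x} = \delta W \\
\frac{\partial L}{\partial W_{jk}} &= \sum_{n} \delta_{nj} x_{nk}
  \;\Rightarrow\; \frac{\partial L}{\partial W} = \delta^T x \\
\frac{\partial L}{\partial b_j} &= \sum_{n} \delta_{nj}
\end{aligned}
```

Each line contracts \(\delta\) over the index that the parameter does not carry: outputs \(j\) for \(x\), batch rows \(n\) for \(W\) and \(b\).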
Tasks
Part A: Forward + analytical backward
import torch

torch.manual_seed(42)

# Leaf tensors that autograd will track.
x = torch.randn(2, 64, requires_grad=True)
W = torch.randn(600, 64, requires_grad=True)
b = torch.randn(600, requires_grad=True)
target = torch.randn(2, 600)

# Forward pass and scalar loss.
y = torch.nn.functional.linear(x, W, b)
loss = 0.5 * ((y - target) ** 2).sum()

# Analytical gradients (no autograd):
with torch.no_grad():
    dy = y - target            # dL/dy, shape (2, 600)
    dx_manual = dy @ W         # (2, 64)
    dW_manual = dy.T @ x       # (600, 64)
    db_manual = dy.sum(dim=0)  # (600,)
Part B: PyTorch autograd
loss.backward()
print("dx max-err:", (x.grad - dx_manual).abs().max().item())
print("dW max-err:", (W.grad - dW_manual).abs().max().item())
print("db max-err:", (b.grad - db_manual).abs().max().item())
All three max-errors should be < 1e-5 at fp32 (and typically < 1e-7).
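One way to tell a wrong formula from ordinary rounding is to repeat the comparison in float64, where a correct derivation should agree far more tightly than in float32. The sketch below is an illustration, not part of the lab's required code: the helper name check_linear_grads is hypothetical, and it uses torch.autograd.grad so that nothing accumulates into .grad between calls.

```python
import torch


def check_linear_grads(dtype, seed=42):
    """Return max abs error (dx, dW, db) between manual and autograd gradients."""
    torch.manual_seed(seed)
    x = torch.randn(2, 64, dtype=dtype, requires_grad=True)
    W = torch.randn(600, 64, dtype=dtype, requires_grad=True)
    b = torch.randn(600, dtype=dtype, requires_grad=True)
    target = torch.randn(2, 600, dtype=dtype)

    y = torch.nn.functional.linear(x, W, b)
    loss = 0.5 * ((y - target) ** 2).sum()

    # torch.autograd.grad returns the gradients directly instead of
    # accumulating them into .grad, so repeated calls cannot double-count.
    dx, dW, db = torch.autograd.grad(loss, (x, W, b))

    with torch.no_grad():
        dy = y - target
        manual = (dy @ W, dy.T @ x, dy.sum(dim=0))

    return tuple((a - m).abs().max().item() for a, m in zip((dx, dW, db), manual))


for dtype in (torch.float32, torch.float64):
    print(dtype, check_linear_grads(dtype))
```

If the float64 errors are not several orders of magnitude smaller than the float32 ones (or both are already exactly zero), the mismatch is more likely a formula or shape mistake than rounding.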
Part C: Walk the grad_fn chain
node = loss.grad_fn
while node is not None:
    # Print each node with the names of the nodes it feeds gradients to.
    children = [(type(fn).__name__ if fn is not None else None, idx)
                for fn, idx in node.next_functions]
    print(type(node).__name__, children)
    # Follow the first edge that leads to another node.
    nexts = [fn for fn, _ in node.next_functions if fn is not None]
    node = nexts[0] if nexts else None
Expected output (approximate; names vary by version):
MulBackward0 [('SumBackward0', 0), (None, 0)]  # the 0.5 *
SumBackward0 [('PowBackward0', 0)]
PowBackward0 [('SubBackward0', 0)]
SubBackward0 [('AddmmBackward0', 0), (None, 0)]
AddmmBackward0 [('AccumulateGrad', 0), ('AccumulateGrad', 0), ('TBackward0', 0)]
AccumulateGrad []
Identify:
- AddmmBackward0 — the matmul-and-bias-add node.
- AccumulateGrad — leaf nodes that accumulate gradients into .grad.
- TBackward0 — the implicit transpose W → W^T that linear inserted.
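The loop in this part follows only the first non-None edge at each node, so it stops at the first AccumulateGrad it reaches and never visits the TBackward0 branch. A recursive walk shows the whole graph. This is a minimal sketch; the function name walk is hypothetical, and the node names it prints depend on the PyTorch version.

```python
import torch


def walk(node, depth=0, lines=None):
    """Depth-first listing of every node reachable from a grad_fn."""
    if lines is None:
        lines = []
    if node is None:  # edge to an input that does not require grad
        return lines
    lines.append("  " * depth + type(node).__name__)
    for child, _input_index in node.next_functions:
        walk(child, depth + 1, lines)
    return lines


torch.manual_seed(42)
x = torch.randn(2, 64, requires_grad=True)
W = torch.randn(600, 64, requires_grad=True)
b = torch.randn(600, requires_grad=True)
target = torch.randn(2, 600)
loss = 0.5 * ((torch.nn.functional.linear(x, W, b) - target) ** 2).sum()

tree = walk(loss.grad_fn)
print("\n".join(tree))
```

The listing should end in three AccumulateGrad leaves, one each for b, x, and W, with the W leaf nested one level deeper under the transpose node.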
Part D: Verify the 1e-7 claim at fp32
import numpy as np
import torch

torch.manual_seed(123)
errors = []
for _ in range(20):
    # Fresh leaf tensors each run, so .grad never carries over between runs.
    x = torch.randn(2, 64, requires_grad=True)
    W = torch.randn(600, 64, requires_grad=True)
    b = torch.randn(600, requires_grad=True)
    target = torch.randn(2, 600)
    y = torch.nn.functional.linear(x, W, b)
    loss = 0.5 * ((y - target) ** 2).sum()
    with torch.no_grad():
        dy = y - target
        dx_m, dW_m, db_m = dy @ W, dy.T @ x, dy.sum(dim=0)
    loss.backward()
    errors.append((
        (x.grad - dx_m).abs().max().item(),
        (W.grad - dW_m).abs().max().item(),
        (b.grad - db_m).abs().max().item(),
    ))

e = np.array(errors)  # shape (20, 3): columns are dx, dW, db
print("dx: max", e[:, 0].max(), " median", float(np.median(e[:, 0])))
print("dW: max", e[:, 1].max(), " median", float(np.median(e[:, 1])))
print("db: max", e[:, 2].max(), " median", float(np.median(e[:, 2])))
Expected: medians are ~ 1e-7, maxes are < 1e-5. The reason it's not 0.0 even though the formula is identical: floating-point summation order differs between PyTorch's addmm and your @. Document this.
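The summation-order effect is easy to reproduce in isolation, because floating-point addition is not associative. The toy example below is only an illustration of that property with three hand-picked float32 values; it is not what addmm does internally.

```python
import torch

# Near 1e8 the spacing between adjacent float32 values is 8, so adding 1.0
# to -1e8 rounds straight back to -1e8.
a = torch.tensor(1e8, dtype=torch.float32)
b = torch.tensor(-1e8, dtype=torch.float32)
c = torch.tensor(1.0, dtype=torch.float32)

left_to_right = (a + b) + c   # (1e8 - 1e8) + 1 = 1
right_to_left = a + (b + c)   # b + c rounds to -1e8, so the total is 0

print(left_to_right.item(), right_to_left.item())  # 1.0 0.0
```

A 600-term dot product has many possible groupings, so two kernels that are mathematically identical can still differ in the last few bits.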
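Part E: Repeat at fp16, observe the degradation
Expected: errors are now 1e-3 or worse. fp16 has a 10-bit mantissa (about 3 decimal digits), so the rounding differences that sat near 1e-7 at fp32 grow by several orders of magnitude. This is one reason fp16 training relies on mixed precision (fp32 accumulation and master weights); loss scaling addresses the related problem of small gradients underflowing (Phase 18 mentioned it; Phase 26 dives in).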
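The lab leaves the fp16 code to the reader. One possible shape for it is sketched below, under stated assumptions: the helper names grads_for and max_err are hypothetical, the same random inputs are cast to each dtype, and the fp16 autograd result is also compared with the fp32 autograd result as a reference. Some CPU builds of PyTorch lack float16 matmul kernels, so the fp16 call is wrapped in a try/except; a GPU may be needed there.

```python
import torch


def grads_for(dtype, x0, W0, b0, t0):
    """Manual and autograd gradients of the lab's loss at the given dtype."""
    # .clone() gives fresh leaves, so the shared inputs are never modified.
    x = x0.to(dtype).clone().requires_grad_()
    W = W0.to(dtype).clone().requires_grad_()
    b = b0.to(dtype).clone().requires_grad_()
    t = t0.to(dtype)
    y = torch.nn.functional.linear(x, W, b)
    loss = 0.5 * ((y - t) ** 2).sum()
    auto = torch.autograd.grad(loss, (x, W, b))
    with torch.no_grad():
        dy = y - t
        manual = (dy @ W, dy.T @ x, dy.sum(dim=0))
    return manual, auto


def max_err(a, b):
    """Max abs difference, computed in float32 so the subtraction is not the bottleneck."""
    return (a.float() - b.float()).abs().max().item()


torch.manual_seed(123)
inputs = (torch.randn(2, 64), torch.randn(600, 64),
          torch.randn(600), torch.randn(2, 600))

manual32, auto32 = grads_for(torch.float32, *inputs)
try:
    manual16, auto16 = grads_for(torch.float16, *inputs)
except RuntimeError as exc:
    # Assumption: this build has no float16 matmul on the current device.
    print("fp16 not supported here:", exc)
else:
    for name, m16, a16, a32 in zip(("dx", "dW", "db"), manual16, auto16, auto32):
        print(name,
              "| manual16 vs autograd16:", max_err(m16, a16),
              "| autograd16 vs autograd32:", max_err(a16, a32))
```

The first column repeats the lab's comparison at fp16; the second shows how far fp16 gradients drift from the fp32 ones on identical inputs, which is usually the clearer signal. Note that the fp16 loss value itself may overflow to inf with these magnitudes even when the gradients stay finite.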
Part F: Write the report
experiments/25-autograd-by-hand/REPORT.md:
- The math (LaTeX-rendered, three formulas).
- The error table from Part D (20 runs, median + max per gradient).
- The fp16 result from Part E with a 2-sentence explanation of why precision matters.
- The grad_fn chain printout from Part C.
- One paragraph: "PyTorch's autograd computed the same formulas I wrote by hand. The deviation at fp32 is < 1e-5, dominated by summation-order differences. At fp16 the deviation is 1e-3, large enough to affect convergence; this is the failure mode mixed-precision training addresses."
Deliverable
experiments/25-autograd-by-hand/:
- REPORT.md — items above.
- errors.csv — the 20-run × 3-gradient error table.
- manifest.json.
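The lab does not prescribe a layout for errors.csv. The sketch below assumes one row per run with the columns run, dx_max_err, dW_max_err, db_max_err; those column names and the helper write_errors_csv are hypothetical, and the values written here are placeholders rather than measured results.

```python
import csv
import tempfile
from pathlib import Path


def write_errors_csv(errors, path):
    """Write one row per run: run index plus the three max-abs errors.

    `errors` is a list of (dx_err, dW_err, db_err) tuples, as built in Part D.
    The column names are an assumption; the lab does not specify them.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["run", "dx_max_err", "dW_max_err", "db_max_err"])
        for run, (dx_err, dW_err, db_err) in enumerate(errors):
            writer.writerow([run, dx_err, dW_err, db_err])
    return path


# Placeholder values for illustration only, not measured results.
example_errors = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)]

# Demonstrate with a temporary directory; in the lab, pass
# "experiments/25-autograd-by-hand/errors.csv" and the Part D `errors` list.
with tempfile.TemporaryDirectory() as tmp:
    out = write_errors_csv(example_errors, Path(tmp) / "errors.csv")
    with out.open(newline="") as f:
        rows = list(csv.reader(f))

print(rows[0])
```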
Acceptance
- fp32: all 20 runs × 3 gradients have max-err < 1e-5.
- fp16: at least one gradient has max-err > 1e-3 (proves the precision sensitivity).
- The grad_fn chain printout identifies AddmmBackward0, AccumulateGrad, and the transpose.
- The interpretation paragraph correctly attributes the fp32 discrepancy to summation order.
Pitfalls
- Wrong transpose direction. linear(x, W, b) = x @ W.T + b. If you write x @ W + b, the shapes won't match: x is (2, 64) and W is (600, 64), so the matmul fails.
- Forgetting 0.5 * in the loss. Then ∂L/∂y = 2(y - t), not (y - t). Match the constant in the loss.
- Recomputing loss.backward() without zeroing .grad. Gradients accumulate; the second call doubles the answer. Zero x.grad, W.grad, and b.grad (for example x.grad.zero_()) between runs, or rebuild the tensors fresh.
- fp16 producing nan. Large magnitudes overflow when squared or summed (the largest finite fp16 value is 65504). Scale target by 0.1 if you see inf.
- Comparing dW.T to dW. PyTorch stores W.grad in the same layout as W. If your manual dW_manual is computed as x.T @ dy (shape (64, 600)), you'll need .T. Check shapes before comparing.
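The accumulation pitfall can be seen on a three-element tensor. This is a small standalone sketch, separate from the lab's variables: the loss 0.5 * sum(w^2) is chosen so that the correct gradient is w itself.

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# First backward: the gradient of 0.5 * sum(w^2) is w.
(0.5 * (w ** 2).sum()).backward()
first = w.grad.clone()

# Second backward on a freshly built graph, without zeroing: grads add up.
(0.5 * (w ** 2).sum()).backward()
second = w.grad.clone()

# Zero, then run again: back to the single-pass gradient.
w.grad.zero_()
(0.5 * (w ** 2).sum()).backward()
third = w.grad.clone()

print(first, second, third)
```

Here second is exactly twice first, and third matches first again once .grad has been zeroed.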
Stretch
- Add a second linear layer y2 = linear(y, W2, b2) and derive the full two-layer backward. Compare to autograd.
- Replace the squared loss with cross-entropy + softmax over the 600 grammar classes. Derive ∂L/∂y = softmax(y) - one_hot(target). This is the formula Phase 18 actually trains against.
- Use torch.autograd.gradcheck, which compares autograd against a finite-difference estimate (float64 inputs recommended), instead of hand-rolled finite differences for verification.
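Two of the stretch items can be started from the sketch below. It rests on assumptions that are not in the lab: gradcheck is run on a reduced Linear(4, 5) because its numerical Jacobian grows with the number of inputs times outputs, the class labels 3 and 41 are arbitrary, and the cross-entropy uses reduction="sum" so that the gradient matches softmax(y) - one_hot(target) without a 1/N factor.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# gradcheck compares autograd's Jacobian with a finite-difference estimate,
# so float64 inputs are used to keep that estimate accurate.
x = torch.randn(2, 4, dtype=torch.float64, requires_grad=True)
W = torch.randn(5, 4, dtype=torch.float64, requires_grad=True)
b = torch.randn(5, dtype=torch.float64, requires_grad=True)
linear_ok = torch.autograd.gradcheck(F.linear, (x, W, b), eps=1e-6, atol=1e-4)
print("gradcheck on linear:", linear_ok)

# Cross-entropy over 600 classes: compare autograd with softmax - one_hot.
logits = torch.randn(2, 600, dtype=torch.float64, requires_grad=True)
labels = torch.tensor([3, 41])  # arbitrary class indices for illustration
ce = F.cross_entropy(logits, labels, reduction="sum")
(dlogits,) = torch.autograd.grad(ce, logits)

one_hot = F.one_hot(labels, num_classes=600).to(torch.float64)
manual = torch.softmax(logits.detach(), dim=1) - one_hot
ce_err = (dlogits - manual).abs().max().item()
print("cross-entropy grad max-err:", ce_err)
```

With the default reduction="mean", the manual expression would need dividing by the batch size before comparing.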
Next lab: lab/02-custom-op.md.