Lab 01 — Autograd by hand for `nn.Linear(64, 600)`¶

You derive by hand the gradients of a Linear(64, 600), all three: ∂L/∂x, ∂L/∂W, ∂L/∂b. Then you run loss.backward() in PyTorch and compare. The target threshold is 1e-7 at fp32. If the numbers do not match, you fix it. This exercise cements the conviction that "PyTorch autograd is exactly what we built in Phase 7/8: bigger, not different".

Objective¶

Derive by hand the backward formulas for y = linear(x, W, b) = x @ W.T + b followed by a scalar loss L = (y - target).pow(2).sum() / 2. Compute ∂L/∂x, ∂L/∂W, ∂L/∂b analytically, then verify them element-wise against PyTorch's autograd at fp32: the target is agreement near 1e-7, and the acceptance bound is 1e-5.

Setup¶

Prerequisites: Phase 7/8 (scalar/tensor autograd) and Phase 04 (calculus theory).
The forward shapes: x ∈ R^(2 × 64), W ∈ R^(600 × 64), b ∈ R^(600), y ∈ R^(2 × 600), target ∈ R^(2 × 600), L ∈ R.
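The shapes listed above can be confirmed before any gradient work. This is a minimal sketch using random inputs (the values are arbitrary; only the shapes matter), and it also checks that `linear` agrees with the explicit `x @ W.T + b` form.

```python
import torch

# Shapes from the Setup section: batch of 2, 64 inputs, 600 outputs.
x = torch.randn(2, 64)
W = torch.randn(600, 64)        # stored as (out_features, in_features)
b = torch.randn(600)
target = torch.randn(2, 600)

y = torch.nn.functional.linear(x, W, b)      # computes x @ W.T + b
loss = 0.5 * ((y - target) ** 2).sum()       # scalar (0-dimensional tensor)

print(y.shape)     # torch.Size([2, 600])
print(loss.shape)  # torch.Size([])
```

A scalar loss is what lets `loss.backward()` be called later without passing an explicit upstream gradient.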

The math¶

Forward:

\[y = x W^T + b \qquad y_{i,j} = \sum_k x_{i,k} W_{j,k} + b_j\]

Loss:

\[L = \tfrac{1}{2}\sum_{i,j} (y_{i,j} - t_{i,j})^2 \qquad \frac{\partial L}{\partial y} = y - t\]

Chain rule:

\(\dfrac{\partial L}{\partial x} = \dfrac{\partial L}{\partial y} \cdot \dfrac{\partial y}{\partial x} = (y - t) W\) (shape (2, 64))
\(\dfrac{\partial L}{\partial W} = (y - t)^T x\) (shape (600, 64))
\(\dfrac{\partial L}{\partial b} = \sum_i (y_i - t_i)\) (shape (600,))

These are the formulas the AddmmBackward0 node computes internally. The lab verifies that.
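As a sketch of the intermediate step, each matrix formula can be recovered from the index form of the forward pass. The shorthand \(\delta_{i,j}\) is introduced here for convenience and is not part of the lab's notation.

\[\delta_{i,j} \equiv \frac{\partial L}{\partial y_{i,j}} = y_{i,j} - t_{i,j}\]

\[\frac{\partial L}{\partial x_{i,k}} = \sum_j \delta_{i,j}\,\frac{\partial y_{i,j}}{\partial x_{i,k}} = \sum_j \delta_{i,j} W_{j,k} \quad\Rightarrow\quad \frac{\partial L}{\partial x} = \delta\, W\]

\[\frac{\partial L}{\partial W_{j,k}} = \sum_i \delta_{i,j}\,\frac{\partial y_{i,j}}{\partial W_{j,k}} = \sum_i \delta_{i,j}\, x_{i,k} \quad\Rightarrow\quad \frac{\partial L}{\partial W} = \delta^T x\]

\[\frac{\partial L}{\partial b_j} = \sum_i \delta_{i,j}\,\frac{\partial y_{i,j}}{\partial b_j} = \sum_i \delta_{i,j}\]

The sum over \(j\) in the first line and over \(i\) in the other two is what turns into a matrix product (or a column sum for the bias) in code.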

Tasks¶

Part A — Forward + analytical backward¶

```python
import torch
torch.manual_seed(42)

x = torch.randn(2, 64, requires_grad=True)
W = torch.randn(600, 64, requires_grad=True)
b = torch.randn(600, requires_grad=True)
target = torch.randn(2, 600)

y = torch.nn.functional.linear(x, W, b)
loss = 0.5 * ((y - target) ** 2).sum()

# Analytical gradients (no autograd):
with torch.no_grad():
    dy = y - target                          # (2, 600)
    dx_manual = dy @ W                       # (2, 64)
    dW_manual = dy.T @ x                     # (600, 64)
    db_manual = dy.sum(dim=0)                # (600,)
```

Part B — PyTorch autograd¶

```python
loss.backward()

print("dx max-err:", (x.grad - dx_manual).abs().max().item())
print("dW max-err:", (W.grad - dW_manual).abs().max().item())
print("db max-err:", (b.grad - db_manual).abs().max().item())
```

All three max-errors should be < 1e-5 at fp32 (and typically < 1e-7).
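If you prefer the comparison to fail loudly rather than print numbers, one option is `torch.testing.assert_close`. The sketch below repeats the Part A setup so it runs on its own; the `rtol` and `atol` values are illustrative choices, not thresholds set by the lab.

```python
import torch

torch.manual_seed(42)
x = torch.randn(2, 64, requires_grad=True)
W = torch.randn(600, 64, requires_grad=True)
b = torch.randn(600, requires_grad=True)
target = torch.randn(2, 600)

y = torch.nn.functional.linear(x, W, b)
loss = 0.5 * ((y - target) ** 2).sum()
loss.backward()

# Hand-derived gradients, computed outside the autograd graph.
with torch.no_grad():
    dy = y - target
    manual = {"dx": dy @ W, "dW": dy.T @ x, "db": dy.sum(dim=0)}
auto = {"dx": x.grad, "dW": W.grad, "db": b.grad}

for name in manual:
    # Raises AssertionError with a per-element report if the pair disagrees.
    # Tolerances here are assumptions chosen for illustration.
    torch.testing.assert_close(auto[name], manual[name], rtol=1e-5, atol=1e-5)
```

A combined relative and absolute tolerance is more forgiving for large-magnitude entries than the lab's plain absolute max-err, so keep the printed max-err as the number you report.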

Part C — Walk the `grad_fn` chain¶

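```python
node = loss.grad_fn
while node is not None:
    # Print this node's name and the (name, input index) of each parent.
    parents = [(type(fn).__name__ if fn is not None else None, idx)
               for fn, idx in node.next_functions]
    print(type(node).__name__, parents)
    # Follow only the first non-None parent, so this walks a single path.
    nexts = [fn for fn, _ in node.next_functions if fn is not None]
    node = nexts[0] if nexts else None
```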

Expected output (approximate; names vary by version):

```
MulBackward0   [(SumBackward0, 0), (None, 0)] # the 0.5 *
SumBackward0   [(PowBackward0, 0)]
PowBackward0   [(SubBackward0, 0)]
SubBackward0   [(AddmmBackward0, 0), (None, 0)]
AddmmBackward0 [(AccumulateGrad, 0), (AccumulateGrad, 0), (TBackward0, 0)]
AccumulateGrad []                             # the walk ends at the first leaf
```

Identify:

- AddmmBackward0: the matmul-and-bias-add node.
- AccumulateGrad: leaf nodes that accumulate gradients into .grad.
- TBackward0: the implicit transpose W → W^T that linear inserted.
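The loop above follows only the first parent at each step, so it never visits the transpose branch. A possible variant, sketched here with a hypothetical helper named `collect_node_names`, visits every branch and returns the node names so each of the three items can be located.

```python
import torch

def collect_node_names(root):
    """Depth-first walk over every branch of the autograd graph under `root`.

    `collect_node_names` is a hypothetical helper written for this lab,
    not a PyTorch API.
    """
    names, stack, seen = [], [root], set()
    while stack:
        node = stack.pop()
        if node is None or node in seen:
            continue
        seen.add(node)
        names.append(type(node).__name__)
        # next_functions holds (parent_node, input_index) pairs.
        stack.extend(fn for fn, _ in node.next_functions)
    return names

torch.manual_seed(42)
x = torch.randn(2, 64, requires_grad=True)
W = torch.randn(600, 64, requires_grad=True)
b = torch.randn(600, requires_grad=True)
target = torch.randn(2, 600)

loss = 0.5 * ((torch.nn.functional.linear(x, W, b) - target) ** 2).sum()
names = collect_node_names(loss.grad_fn)
print(names)
```

Because node names vary by version, treat the printed list as something to inspect rather than to match exactly; the count of AccumulateGrad entries (one per leaf: x, W, b) is the more stable thing to check.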

Part D — Verify the `1e-7` claim at fp32¶

```python
import numpy as np
import torch

torch.manual_seed(123)
errors = []
for _ in range(20):
    x = torch.randn(2, 64, requires_grad=True)
    W = torch.randn(600, 64, requires_grad=True)
    b = torch.randn(600, requires_grad=True)
    target = torch.randn(2, 600)

    y = torch.nn.functional.linear(x, W, b)
    loss = 0.5 * ((y - target) ** 2).sum()
    with torch.no_grad():
        dy = y - target
        dx_m, dW_m, db_m = dy @ W, dy.T @ x, dy.sum(dim=0)
    loss.backward()

    errors.append((
        (x.grad - dx_m).abs().max().item(),
        (W.grad - dW_m).abs().max().item(),
        (b.grad - db_m).abs().max().item(),
    ))

e = np.array(errors)
print("dx: max", e[:, 0].max(), " median", float(np.median(e[:, 0])))
print("dW: max", e[:, 1].max(), " median", float(np.median(e[:, 1])))
print("db: max", e[:, 2].max(), " median", float(np.median(e[:, 2])))
```

Expected: medians are ~ 1e-7 or smaller, maxes are < 1e-5. Any nonzero error, even though the formulas are identical, comes from floating-point summation order: the matmul kernels that the AddmmBackward0 node dispatches may accumulate terms in a different order than your @ calls. Document this.
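The summation-order effect can be seen without PyTorch. The sketch below uses plain Python floats (float64), so the gaps are much smaller than at fp32, but the mechanism is the same: the same terms added in a different order round differently.

```python
import math
import random

# Floating-point addition is not associative: grouping changes the rounding.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)   # False

# The same list summed front-to-back and back-to-front, compared against
# math.fsum, which returns a correctly rounded sum.
random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(10_000)]
forward = sum(values)
backward = sum(reversed(values))
exact = math.fsum(values)
print(abs(forward - exact), abs(backward - exact))
```

Neither ordering is "the right one"; both are valid roundings of the same mathematical sum, which is why the lab compares against a tolerance instead of expecting 0.0.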

Part E — Repeat at fp16, observe the degradation¶

```python
x = torch.randn(2, 64, dtype=torch.float16, requires_grad=True)
# Create W, b, and target with dtype=torch.float16 as well.
# ... same as Part D
```

Expected: errors are now 1e-3 or worse. fp16 carries only about three decimal digits of precision, so rounding and order-of-summation effects are amplified. Limited precision and range are the reason fp16 training needs loss-scaling and mixed precision (Phase 18 mentioned it; Phase 26 dives in).

Part F — Write the report¶

experiments/25-autograd-by-hand/REPORT.md:

The math (LaTeX-rendered, three formulas).
The error table from Part D (20 runs, median + max per gradient).
The fp16 result from Part E with a 2-sentence explanation of why precision matters.
The grad_fn chain printout from Part C.
One paragraph: "PyTorch's autograd computed the same formulas I wrote by hand. The deviation at fp32 is < 1e-5, dominated by summation-order differences. At fp16 the deviation is 1e-3, large enough to affect convergence — this is the failure mode mixed-precision training addresses."

Deliverable¶

experiments/25-autograd-by-hand/:

- REPORT.md: the items above.
- errors.csv: the 20-run × 3-gradient error table.
- manifest.json.
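For errors.csv, a minimal sketch with the standard library is shown below. The function name `write_errors_csv` and the column names are assumptions made for illustration; the lab does not prescribe a schema. It expects the `errors` list built in Part D: one (dx, dW, db) tuple per run.

```python
import csv

def write_errors_csv(path, errors):
    """Write one row per run: run index plus the max-err of dx, dW, db.

    Hypothetical helper; column names are illustrative, not required by the lab.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["run", "dx_max_err", "dW_max_err", "db_max_err"])
        for run, (dx, dW, db) in enumerate(errors):
            writer.writerow([run, dx, dW, db])
```

Usage would look like `write_errors_csv("experiments/25-autograd-by-hand/errors.csv", errors)` after the Part D loop finishes.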

Acceptance¶

fp32: all 20 runs × 3 gradients have max-err < 1e-5.
fp16: at least one gradient has max-err > 1e-3 (proves the precision sensitivity).
The grad_fn chain printout identifies AddmmBackward0, AccumulateGrad, and the transpose.
The interpretation paragraph correctly attributes the fp32 discrepancy to summation order.
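The two numeric criteria above can be checked mechanically. The sketch below uses a hypothetical function `check_acceptance` that takes the per-run error tuples from the fp32 and fp16 experiments; the default bounds are the ones stated in the criteria.

```python
def check_acceptance(fp32_errors, fp16_errors, fp32_bound=1e-5, fp16_floor=1e-3):
    """Return (fp32_ok, fp16_ok) for the two numeric acceptance criteria.

    Hypothetical helper. Each argument is a list of (dx, dW, db) max-err
    tuples, one tuple per run.
    """
    # fp32: every gradient in every run must be below the bound.
    fp32_ok = all(err < fp32_bound for run in fp32_errors for err in run)
    # fp16: at least one gradient somewhere must exceed the floor.
    fp16_ok = any(err > fp16_floor for run in fp16_errors for err in run)
    return fp32_ok, fp16_ok
```

The remaining two criteria (the grad_fn printout and the interpretation paragraph) are judged by reading the report, so they are left out of the function.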

Pitfalls¶

Wrong transpose direction. linear(x, W, b) = x @ W.T + b. If you write x @ W + b, shapes won't match (x: (2,64), W: (600,64), can't matmul).
Forgetting 0.5 * in the loss. Then ∂L/∂y = 2(y - t), not (y - t). Match the constant in the loss.
Re-running the forward pass and loss.backward() on the same leaf tensors without zeroing .grad. Gradients accumulate; the second backward doubles the answer. Use x.grad.zero_() between runs, or rebuild the tensors fresh.
fp16 producing nan or inf. The largest finite fp16 value is 65504, so large residuals can overflow once squared and summed. Scale target by 0.1 if you see inf.
Comparing dW.T to dW. PyTorch stores W.grad in the same layout as W. If your manual dW_manual is computed as x.T @ dy (shape (64, 600)), you'll need .T. Check shapes before comparing.
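The accumulation pitfall is easy to reproduce on a tiny tensor. This is a minimal sketch with a toy loss (half the sum of squares, so the gradient equals x itself); the values are chosen so every result is exact in floating point.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

def loss_fn(x):
    return 0.5 * (x ** 2).sum()   # dL/dx = x

loss_fn(x).backward()
first = x.grad.clone()            # tensor([1., 2., 3.])

loss_fn(x).backward()             # no zeroing: the new gradient is added
doubled = x.grad.clone()          # tensor([2., 4., 6.])

x.grad.zero_()                    # reset before the next run
loss_fn(x).backward()
fresh = x.grad.clone()            # tensor([1., 2., 3.])
```

Note that each call builds a new graph with a fresh forward pass; calling backward twice on the same graph would instead raise an error unless the graph was retained.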

Stretch¶

Add a second linear layer y2 = linear(y, W2, b2) and derive the full two-layer backward. Compare to autograd.
Replace the squared loss with cross-entropy + softmax over the 600 grammar classes. Derive ∂L/∂y = softmax(y) - one_hot(target). This is the formula Phase 18 actually trains against.
Use torch.autograd.gradcheck as an independent verification: it compares autograd's gradients against finite differences and expects float64 inputs.
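For the gradcheck item, a possible starting point is sketched below. It uses deliberately small float64 tensors (2 × 4 inputs, 3 outputs) rather than the lab's 64 → 600 shapes, because gradcheck perturbs every input element; the function name `linear_half_sse` and the tolerance values are assumptions for illustration.

```python
import torch
from torch.autograd import gradcheck

def linear_half_sse(x, W, b, target):
    """The lab's forward pass and loss, wrapped as a single function."""
    y = torch.nn.functional.linear(x, W, b)
    return 0.5 * ((y - target) ** 2).sum()

torch.manual_seed(0)
# gradcheck's finite differences need float64 to be reliable.
x = torch.randn(2, 4, dtype=torch.float64, requires_grad=True)
W = torch.randn(3, 4, dtype=torch.float64, requires_grad=True)
b = torch.randn(3, dtype=torch.float64, requires_grad=True)
target = torch.randn(2, 3, dtype=torch.float64)   # no grad: not checked

# Raises an error on mismatch; returns True when all gradients agree.
ok = gradcheck(linear_half_sse, (x, W, b, target), eps=1e-6, atol=1e-4)
print(ok)
```

The same wrapper can be reused for the two-layer and cross-entropy stretch items by swapping the body of the function.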

Next lab: lab/02-custom-op.md.
