English · Español
Lab 01 — Jacobian of a tiny MLP, analytical vs numerical¶
🇪🇸 Una MLP de dos capas. Calculas
∂y/∂W1de dos maneras: con regla de la cadena (matmul de Jacobianas) y con diferencias finitas. Las dos deben coincidir dentro de1e-4. Si no, lo arreglas — porque cuando Fase 7 implemente autograd, esta misma derivación tiene que producir el mismo número.
Objective¶
For a 2-layer MLP y = W₂ · relu(W₁ x + b₁) + b₂, compute the Jacobian ∂y/∂W₁ analytically (using the multivariate chain rule from theory 02) and numerically (finite differences). Verify they match within 1e-4. This is the proof of concept for Phase 7's scalar autograd.
Setup¶
- Shapes:
x ∈ R³,W₁ ∈ R^{4×3},b₁ ∈ R⁴,W₂ ∈ R^{2×4},b₂ ∈ R². Soy ∈ R². - Total params in
W₁:4 × 3 = 12. We'll compute the2 × 12Jacobian∂y/∂vec(W₁). - Use deterministic random init:
np.random.default_rng(seed=42).
Tasks¶
Part A — The MLP and forward pass¶
import numpy as np
rng = np.random.default_rng(42)
x = rng.standard_normal(3)
W1 = rng.standard_normal((4, 3))
b1 = rng.standard_normal(4)
W2 = rng.standard_normal((2, 4))
b2 = rng.standard_normal(2)
def mlp(x, W1, b1, W2, b2):
h_pre = W1 @ x + b1 # (4,)
h = np.maximum(0, h_pre) # ReLU, (4,)
y = W2 @ h + b2 # (2,)
return y, h_pre, h # return intermediates for analytical Jacobian
Part B — Analytical Jacobian¶
Apply the chain rule step by step.
∂y/∂h = W₂(shape(2, 4)).∂h/∂h_pre = diag(h_pre > 0)— ReLU is1whereh_pre > 0and0elsewhere. (Shape(4, 4).)∂h_pre/∂W₁: this is where it gets shape-careful.h_pre = W₁ x. Elementiofh_preisΣ_j W₁_{ij} x_j. So∂h_pre_i / ∂W₁_{ab} = δ_{ia} x_b. FlattenW₁tovec(W₁) ∈ R^{12}by row-major ordering: index(a, b)→3a + b. Then∂h_pre/∂vec(W₁)is a(4, 12)matrix where rowihasx_bat column3i + band zero elsewhere.
def dh_pre_dW1(x):
J = np.zeros((4, 12))
for i in range(4):
for b in range(3):
J[i, 3*i + b] = x[b]
return J
- Chain it all together:
def analytical_jacobian(x, W1, b1, W2, b2):
y, h_pre, h = mlp(x, W1, b1, W2, b2)
dy_dh = W2 # (2, 4)
dh_dh_pre = np.diag((h_pre > 0).astype(float)) # (4, 4)
dh_pre_dW1_mat = dh_pre_dW1(x) # (4, 12)
return dy_dh @ dh_dh_pre @ dh_pre_dW1_mat # (2, 12)
Part C — Numerical Jacobian (finite differences)¶
def numerical_jacobian(x, W1, b1, W2, b2, h=1e-5):
y_base, _, _ = mlp(x, W1, b1, W2, b2)
J = np.zeros((2, 12))
for k in range(12):
a, b = divmod(k, 3)
W1_plus = W1.copy(); W1_plus[a, b] += h
W1_minus = W1.copy(); W1_minus[a, b] -= h
y_plus, _, _ = mlp(x, W1_plus, b1, W2, b2)
y_minus, _, _ = mlp(x, W1_minus, b1, W2, b2)
J[:, k] = (y_plus - y_minus) / (2 * h)
return J
Part D — Compare¶
J_an = analytical_jacobian(x, W1, b1, W2, b2)
J_num = numerical_jacobian(x, W1, b1, W2, b2)
max_err = np.max(np.abs(J_an - J_num))
print(f"Max element-wise error: {max_err:.2e}")
assert max_err < 1e-4, f"Jacobians disagree: {max_err}"
If max_err < 1e-7, great — both methods agree to fp64 precision (modulo finite-difference truncation).
Part E — Edge case: ReLU at zero¶
The ReLU derivative is 1 for positive input, 0 for negative. At exactly zero, it's undefined — the convention is 0 (since np.maximum(0, x) gives 0). Test the comparison with a deliberately ReLU-zero-crossing input:
# Force h_pre[0] to be exactly 0 via a crafted W1, b1
b1_zeroed = b1.copy()
b1_zeroed[0] = -W1[0] @ x # makes h_pre[0] = 0 exactly
# Re-run the comparison
Document the result: at h_pre = 0, the analytical method says ∂h/∂h_pre = 0; finite differences gives 0 from the right and 1 from the left, averaging to ~0.5. Centred FD will give ~0.5 while analytical gives 0. Disagreement at a measure-zero set is expected and OK.
Deliverable¶
learners/borja/phase-04/lab-01-jacobian.md:
- Code listing.
- Output of max_err.
- 3-sentence reflection on the ReLU-zero edge case.
Acceptance¶
- Analytical and numerical Jacobians match within
1e-4for the random-seed input. - Match within
1e-7for typical inputs (where ReLU is not at zero). - The ReLU-zero case is documented but not "fixed" — it's a genuine non-differentiability.
Pitfalls¶
- Row-major vs column-major flattening. NumPy's
.flatten()defaults to row-major (C order). Be explicit. If yourdh_pre/dW1uses one convention and yourvec(W1)uses the other, you'll get garbage. - Forgetting to copy
W1before perturbation.W1 += h_vecmodifies the original; subsequent perturbations stack. AlwaysW1_plus = W1.copy(); W1_plus[a, b] += h. - Using forward differences. O(h) error; for
h = 1e-5, gives ~1e-5error — fails the1e-7threshold for "typical" inputs even when the math is correct. - ReLU derivative confusion. At
h_pre < 0, the derivative is0and the input doesn't influence the output. Some people write∂relu/∂x = (x > 0).astype(float)(correct); some write(x >= 0)(off-by-one at zero). Stick with strict>. - Tiny
hcausing cancellation.h = 1e-12gives nonsense due to floating-point cancellation inf(x+h) - f(x-h). Useh = 1e-5.
Stretch¶
- Extend to compute
∂y/∂b₁and∂y/∂W₂analytically and numerically. They should also match within1e-4. - Implement the full gradient
∂L/∂W₁whereL = ‖y − target‖²/2for some random target. This is what Phase 7's autograd will produce automatically.