Skip to content

English · Español

Lab 01 — Jacobian of a tiny MLP, analytical vs numerical

🇪🇸 Una MLP de dos capas. Calculas ∂y/∂W1 de dos maneras: con regla de la cadena (matmul de Jacobianas) y con diferencias finitas. Las dos deben coincidir dentro de 1e-4. Si no, lo arreglas — porque cuando Fase 7 implemente autograd, esta misma derivación tiene que producir el mismo número.

Objective

For a 2-layer MLP y = W₂ · relu(W₁ x + b₁) + b₂, compute the Jacobian ∂y/∂W₁ analytically (using the multivariate chain rule from theory 02) and numerically (finite differences). Verify they match within 1e-4. This is the proof of concept for Phase 7's scalar autograd.

Setup

  • Shapes: x ∈ R³, W₁ ∈ R^{4×3}, b₁ ∈ R⁴, W₂ ∈ R^{2×4}, b₂ ∈ R². So y ∈ R².
  • Total params in W₁: 4 × 3 = 12. We'll compute the 2 × 12 Jacobian ∂y/∂vec(W₁).
  • Use deterministic random init: np.random.default_rng(seed=42).

Tasks

Part A — The MLP and forward pass

import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(3)
W1 = rng.standard_normal((4, 3))
b1 = rng.standard_normal(4)
W2 = rng.standard_normal((2, 4))
b2 = rng.standard_normal(2)

def mlp(x, W1, b1, W2, b2):
    h_pre = W1 @ x + b1     # (4,)
    h = np.maximum(0, h_pre) # ReLU, (4,)
    y = W2 @ h + b2          # (2,)
    return y, h_pre, h        # return intermediates for analytical Jacobian

Part B — Analytical Jacobian

Apply the chain rule step by step.

  1. ∂y/∂h = W₂ (shape (2, 4)).
  2. ∂h/∂h_pre = diag(h_pre > 0) — ReLU is 1 where h_pre > 0 and 0 elsewhere. (Shape (4, 4).)
  3. ∂h_pre/∂W₁: this is where it gets shape-careful. h_pre = W₁ x. Element i of h_pre is Σ_j W₁_{ij} x_j. So ∂h_pre_i / ∂W₁_{ab} = δ_{ia} x_b. Flatten W₁ to vec(W₁) ∈ R^{12} by row-major ordering: index (a, b)3a + b. Then ∂h_pre/∂vec(W₁) is a (4, 12) matrix where row i has x_b at column 3i + b and zero elsewhere.
def dh_pre_dW1(x):
    J = np.zeros((4, 12))
    for i in range(4):
        for b in range(3):
            J[i, 3*i + b] = x[b]
    return J
  1. Chain it all together:
def analytical_jacobian(x, W1, b1, W2, b2):
    y, h_pre, h = mlp(x, W1, b1, W2, b2)
    dy_dh = W2                          # (2, 4)
    dh_dh_pre = np.diag((h_pre > 0).astype(float))  # (4, 4)
    dh_pre_dW1_mat = dh_pre_dW1(x)      # (4, 12)
    return dy_dh @ dh_dh_pre @ dh_pre_dW1_mat  # (2, 12)

Part C — Numerical Jacobian (finite differences)

def numerical_jacobian(x, W1, b1, W2, b2, h=1e-5):
    y_base, _, _ = mlp(x, W1, b1, W2, b2)
    J = np.zeros((2, 12))
    for k in range(12):
        a, b = divmod(k, 3)
        W1_plus = W1.copy(); W1_plus[a, b] += h
        W1_minus = W1.copy(); W1_minus[a, b] -= h
        y_plus, _, _ = mlp(x, W1_plus, b1, W2, b2)
        y_minus, _, _ = mlp(x, W1_minus, b1, W2, b2)
        J[:, k] = (y_plus - y_minus) / (2 * h)
    return J

Part D — Compare

J_an = analytical_jacobian(x, W1, b1, W2, b2)
J_num = numerical_jacobian(x, W1, b1, W2, b2)
max_err = np.max(np.abs(J_an - J_num))
print(f"Max element-wise error: {max_err:.2e}")
assert max_err < 1e-4, f"Jacobians disagree: {max_err}"

If max_err < 1e-7, great — both methods agree to fp64 precision (modulo finite-difference truncation).

Part E — Edge case: ReLU at zero

The ReLU derivative is 1 for positive input, 0 for negative. At exactly zero, it's undefined — the convention is 0 (since np.maximum(0, x) gives 0). Test the comparison with a deliberately ReLU-zero-crossing input:

# Force h_pre[0] to be exactly 0 via a crafted W1, b1
b1_zeroed = b1.copy()
b1_zeroed[0] = -W1[0] @ x  # makes h_pre[0] = 0 exactly
# Re-run the comparison

Document the result: at h_pre = 0, the analytical method says ∂h/∂h_pre = 0; finite differences gives 0 from the right and 1 from the left, averaging to ~0.5. Centred FD will give ~0.5 while analytical gives 0. Disagreement at a measure-zero set is expected and OK.

Deliverable

learners/borja/phase-04/lab-01-jacobian.md: - Code listing. - Output of max_err. - 3-sentence reflection on the ReLU-zero edge case.

Acceptance

  • Analytical and numerical Jacobians match within 1e-4 for the random-seed input.
  • Match within 1e-7 for typical inputs (where ReLU is not at zero).
  • The ReLU-zero case is documented but not "fixed" — it's a genuine non-differentiability.

Pitfalls

  • Row-major vs column-major flattening. NumPy's .flatten() defaults to row-major (C order). Be explicit. If your dh_pre/dW1 uses one convention and your vec(W1) uses the other, you'll get garbage.
  • Forgetting to copy W1 before perturbation. W1 += h_vec modifies the original; subsequent perturbations stack. Always W1_plus = W1.copy(); W1_plus[a, b] += h.
  • Using forward differences. O(h) error; for h = 1e-5, gives ~1e-5 error — fails the 1e-7 threshold for "typical" inputs even when the math is correct.
  • ReLU derivative confusion. At h_pre < 0, the derivative is 0 and the input doesn't influence the output. Some people write ∂relu/∂x = (x > 0).astype(float) (correct); some write (x >= 0) (off-by-one at zero). Stick with strict >.
  • Tiny h causing cancellation. h = 1e-12 gives nonsense due to floating-point cancellation in f(x+h) - f(x-h). Use h = 1e-5.

Stretch

  • Extend to compute ∂y/∂b₁ and ∂y/∂W₂ analytically and numerically. They should also match within 1e-4.
  • Implement the full gradient ∂L/∂W₁ where L = ‖y − target‖²/2 for some random target. This is what Phase 7's autograd will produce automatically.

Next: 02-optimizers-on-rosenbrock.md