Lab 00: Derive ∂ softmax/∂x and ∂ CE/∂x on paper, then verify numerically
The most important page of Phase 4. You derive by hand the softmax Jacobian (`diag(p) - p p^T`) and the clean result `softmax(x) - one_hot(y)` for cross-entropy. Then you compare against finite differences.
Objective
Two derivations on paper, then a one-screen numerical verification. By the end you should never have to "look up" ∂ CE/∂x again — it's just softmax(x) - one_hot(y), and you'll have constructed that result yourself.
Setup
- A blank notebook page.
- `numpy`, plus the log-sum-exp from Phase 05.
- A 5-element synthetic logit vector `z = np.array([2.0, 1.0, 0.5, -1.0, 0.0])`.
Tasks
Part A: Derive on paper (no code)
- Softmax derivative. Starting from `p_i = exp(z_i) / Σ_k exp(z_k)`, derive `∂p_i / ∂z_j` for both `i = j` and `i ≠ j`. Get to:

  $$\frac{\partial p_i}{\partial z_j} = p_i (\delta_{ij} - p_j)$$

  Write the matrix form explicitly: `J = diag(p) - p p^T`.
- Cross-entropy + softmax composed. Let `L = -log p_y` where `y` is the true class index. Derive `∂L / ∂z_j`. Hint: use `∂L/∂z_j = Σ_i (∂L/∂p_i)(∂p_i/∂z_j)`. Most terms vanish (`∂L/∂p_i = 0` for `i ≠ y`; `= -1/p_y` for `i = y`). After substitution and simplification, you should arrive at:

  $$\frac{\partial L}{\partial z_j} = p_j - \mathbb{1}[j = y]$$

  That is: `∂CE/∂z = softmax(z) - one_hot(y)`. Beautiful.
- Sanity check on paper. For `z = [2, 1, 0]` and `y = 0`, estimate `p = softmax(z)` by hand (just the proportions; you don't need exact exp values). Verify that `p_0 - 1 < 0` and `p_1, p_2 > 0`: a gradient-descent step therefore pushes `z_0` up and `z_1, z_2` down, which is what we want.
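
Once both derivations are on paper, the worked version below can serve as a check. It is one possible route, using only the definitions above; the shorthand $S = \sum_k e^{z_k}$ is introduced here for convenience and is not part of the lab's notation.

$$
\begin{aligned}
\frac{\partial p_i}{\partial z_j}
  &= \frac{\delta_{ij}\, e^{z_i}\, S - e^{z_i} e^{z_j}}{S^2}
   = p_i \delta_{ij} - p_i p_j
   = p_i (\delta_{ij} - p_j) \\
\frac{\partial L}{\partial z_j}
  &= \sum_i \frac{\partial L}{\partial p_i} \frac{\partial p_i}{\partial z_j}
   = -\frac{1}{p_y} \cdot p_y (\delta_{yj} - p_j)
   = p_j - \mathbb{1}[j = y]
\end{aligned}
$$

For the sanity check, `z = [2, 1, 0]` gives roughly `p ≈ [0.665, 0.245, 0.090]`, so with `y = 0` the gradient is about `[-0.335, 0.245, 0.090]`. Its entries sum to zero, as they must, because the entries of `p` sum to one.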
Part B: Numerical verification
- Implement softmax (use Phase 05's log-sum-exp):
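
A minimal sketch of this step, assuming Phase 05's log-sum-exp uses the usual max-shift. The `logsumexp` helper here is a stand-in written for this example, not the Phase 05 code; swap in your own version if its name or signature differs.

```python
import numpy as np

def logsumexp(z):
    """Stable log(sum(exp(z))): shift by max(z) so the largest exponent is 0."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

def softmax(z):
    """softmax(z) = exp(z - logsumexp(z)); entries are positive and sum to 1."""
    z = np.asarray(z, dtype=np.float64)
    return np.exp(z - logsumexp(z))

z = np.array([2.0, 1.0, 0.5, -1.0, 0.0])
p = softmax(z)
print(p, p.sum())
```

Subtracting the log-sum-exp before exponentiating keeps every exponent at or below zero, so large logits cannot overflow.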
- Compute the Jacobian analytically:
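
One way to write this step, as a sketch that follows the matrix form `J = diag(p) - p p^T` from Part A. The function name `softmax_jacobian` matches the call used in the comparison step; the compact `softmax` is included only so the block runs on its own.

```python
import numpy as np

def softmax(z):
    # Max-shifted softmax, equivalent to exp(z - logsumexp(z)).
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """Analytical Jacobian: J[i, j] = dp_i/dz_j = p_i * (delta_ij - p_j)."""
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

J = softmax_jacobian(np.array([2.0, 1.0, 0.5, -1.0, 0.0]))
print(J.shape)
```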
- Compute the Jacobian numerically via centred finite differences:
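```python
def softmax_jacobian_fd(z, h=1e-5):
    """Centred finite-difference estimate of the softmax Jacobian."""
    n = len(z)
    J = np.zeros((n, n))
    for j in range(n):
        # Perturb only coordinate j by +/- h.
        e = np.zeros(n)
        e[j] = h
        # Column j holds the derivative of every p_i with respect to z_j.
        J[:, j] = (softmax(z + e) - softmax(z - e)) / (2 * h)
    return J
```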
- Compare. For `z = [2.0, 1.0, 0.5, -1.0, 0.0]`:

```python
J_analytical = softmax_jacobian(z)
J_numerical = softmax_jacobian_fd(z)
max_err = np.max(np.abs(J_analytical - J_numerical))
assert max_err < 1e-7, f"Jacobian mismatch: {max_err}"
```
- Same exercise for CE:
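```python
def ce_loss(z, y):
    # Direct form; fine for these small logits (see Pitfalls for the stable form).
    return -np.log(softmax(z)[y])

def ce_grad(z, y):
    """Analytical gradient: softmax(z) - one_hot(y)."""
    p = softmax(z)
    g = p.copy()
    g[y] -= 1.0
    return g

def ce_grad_fd(z, y, h=1e-5):
    """Centred finite-difference estimate of the CE gradient."""
    g = np.zeros(len(z))
    for j in range(len(z)):
        e = np.zeros(len(z))
        e[j] = h
        g[j] = (ce_loss(z + e, y) - ce_loss(z - e, y)) / (2 * h)
    return g

# Check every possible true class.
for y in range(5):
    diff = np.max(np.abs(ce_grad(z, y) - ce_grad_fd(z, y)))
    assert diff < 1e-7, f"CE gradient mismatch for y={y}: {diff}"
```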
- Print the Jacobian. For visual reinforcement, print `J_analytical` as a matrix and verify by eye that the diagonal is `p_i (1 - p_i)` (max on the largest `p`) and off-diagonals are `-p_i p_j` (small negative).
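
The by-eye check can be backed by a few programmatic ones. This is a sketch that rebuilds `p` and the Jacobian inline so it runs on its own; the print precision and the specific checks are choices made for this example, not lab requirements.

```python
import numpy as np

z = np.array([2.0, 1.0, 0.5, -1.0, 0.0])
e = np.exp(z - np.max(z))
p = e / e.sum()                      # softmax(z)
J = np.diag(p) - np.outer(p, p)      # analytical Jacobian

np.set_printoptions(precision=4, suppress=True)
print(J)

# Boolean mask that selects every entry except the diagonal.
off_diag = J[~np.eye(len(z), dtype=bool)]
print("diagonal == p*(1-p):", np.allclose(np.diag(J), p * (1 - p)))
print("largest diagonal at argmax(p):", np.argmax(np.diag(J)) == np.argmax(p))
print("all off-diagonals negative:", bool(np.all(off_diag < 0)))
print("rows sum to ~0:", np.allclose(J.sum(axis=1), 0.0))
```

The zero row sums follow from `Σ_j p_i (δ_ij - p_j) = p_i - p_i Σ_j p_j = 0`: adding the same constant to every logit leaves softmax unchanged.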
Deliverable
learners/borja/phase-04/lab-00-softmax-gradient.md containing:
- A photo or transcription of the paper derivation (both parts of Part A).
- The verification script's output (the assert messages or printed maxima).
- A 3-sentence reflection: did the derivation feel mechanical or insightful? Where did you get stuck?
Acceptance
- Both derivations completed on paper before any code is written.
- `softmax_jacobian` matches finite differences within `1e-7`.
- `ce_grad` matches finite differences within `1e-7`.
- Reflection written.
Pitfalls
- Skipping the paper part. The code is trivial once the derivation is internalised; the value of this lab is the derivation. Do it on paper.
- Forward differences instead of centred. Forward is `O(h)`; centred is `O(h²)`. With `h = 1e-5`, forward gives ~`1e-5` error, centred ~`1e-10`. Use centred.
- `h` too small. `h = 1e-12` triggers catastrophic cancellation; you'll get worse gradients. `1e-5` is the sweet spot for fp64; for fp32, use `1e-3`.
- Using `np.log(softmax(z))` for cross-entropy. Numerically unstable. Use `log_softmax` (logsumexp form) from Phase 05.
- Forgetting that the Jacobian of softmax is symmetric. `diag(p) - p p^T` is symmetric; if your code produces a non-symmetric matrix, you have a bug.
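
The finite-difference and stability pitfalls can be observed directly. The sketch below assumes a `log_softmax` in max-shifted logsumexp form (written here for illustration, not taken from Phase 05) and uses hypothetical helper names `loss` and `fd_error`. It prints the worst-case gradient error for several step sizes, then compares the naive and stable log-probabilities on extreme logits.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def log_softmax(z):
    # logsumexp form: z - (m + log(sum(exp(z - m))))
    m = np.max(z)
    return z - (m + np.log(np.sum(np.exp(z - m))))

z = np.array([2.0, 1.0, 0.5, -1.0, 0.0])
y = 0
exact = softmax(z) - np.eye(len(z))[y]   # analytical CE gradient

def loss(v):
    return -log_softmax(v)[y]

def fd_error(h, centred):
    """Max abs error of a finite-difference CE gradient against the exact one."""
    g = np.zeros(len(z))
    for j in range(len(z)):
        e = np.zeros(len(z))
        e[j] = h
        if centred:
            g[j] = (loss(z + e) - loss(z - e)) / (2 * h)
        else:
            g[j] = (loss(z + e) - loss(z)) / h
    return np.max(np.abs(g - exact))

for h in (1e-3, 1e-5, 1e-8, 1e-12):
    print(f"h={h:.0e}  forward={fd_error(h, False):.1e}  centred={fd_error(h, True):.1e}")

# Extreme logits: softmax underflows to exactly 0, so the naive log gives -inf.
big = np.array([1000.0, 0.0])
with np.errstate(divide="ignore"):
    naive = np.log(softmax(big))
stable = log_softmax(big)
print(naive, stable)
```

The exact figures depend on the platform, so none are quoted here. The pattern to look for is that the centred error is far below the forward error at `h = 1e-5`, and that both get worse again as `h` approaches `1e-12`.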
Next: 01-jacobian-by-hand.md