
Lab 00 — Derive ∂ softmax/∂x and ∂ CE/∂x on paper, then verify numerically¶

The most important page of Phase 4. You derive the softmax Jacobian (diag(p) - p p^T) by hand, along with the clean result softmax(x) - one_hot(y) for cross-entropy. Then you compare both against finite differences.

Objective¶

Two derivations on paper, then a one-screen numerical verification. By the end you should never have to "look up" ∂ CE/∂x again: it is just softmax(x) - one_hot(y), and you will have constructed that result yourself. The tasks below write the logit vector as z rather than x; the two names refer to the same input.

Setup¶

A blank notebook page.
numpy, the log-sum-exp from Phase 05.
A 5-element synthetic logit vector z = np.array([2.0, 1.0, 0.5, -1.0, 0.0]).

Tasks¶

Part A — Derive on paper (no code)¶

Softmax derivative. Starting from p_i = exp(z_i) / Σ_k exp(z_k), derive ∂p_i / ∂z_j for both i = j and i ≠ j. Get to:

$$\frac{\partial p_i}{\partial z_j} = p_i (\delta_{ij} - p_j)$$

Write the matrix form explicitly: J = diag(p) - p p^T.
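Once your own attempt is on paper, one possible route to compare it against is the quotient rule. The symbol S is shorthand introduced here for the normaliser and does not appear elsewhere on this page:

$$\frac{\partial p_i}{\partial z_j} = \frac{\delta_{ij}\, e^{z_i}\, S - e^{z_i}\, e^{z_j}}{S^2} = \frac{e^{z_i}}{S}\left(\delta_{ij} - \frac{e^{z_j}}{S}\right) = p_i (\delta_{ij} - p_j), \qquad S = \sum_k e^{z_k}$$

Setting i = j gives the diagonal entries p_i (1 - p_i); for i ≠ j the Kronecker delta vanishes and the entry is -p_i p_j, which is exactly the (i, j) entry of diag(p) - p p^T.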

Cross-entropy + softmax composed. Let L = -log p_y where y is the true class index. Derive ∂L / ∂z_j. Hint: use ∂L/∂z_j = Σ_i (∂L/∂p_i)(∂p_i/∂z_j). Most terms vanish (∂L/∂p_i = 0 for i ≠ y; = -1/p_y for i = y). After substitution and simplification, you should arrive at:

$$\frac{\partial L}{\partial z_j} = p_j - \mathbb{1}[j = y]$$

That is: ∂CE/∂z = softmax(z) - one_hot(y). Beautiful.
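For checking your own working, the substitution described in the hint can be sketched in one line. Only the i = y term of the sum survives, and the factor p_y cancels:

$$\frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial p_i}\,\frac{\partial p_i}{\partial z_j} = -\frac{1}{p_y}\; p_y (\delta_{yj} - p_j) = p_j - \delta_{yj}$$

Here δ_{yj} is the same quantity as the indicator 𝟙[j = y] used above.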

Sanity check on paper. For z = [2, 1, 0] and y = 0, estimate p = softmax(z) by hand. The proportions e² : e : 1 are enough; you do not need to evaluate the exponentials. Verify that p_0 - 1 < 0 and p_1, p_2 > 0: a gradient-descent step, which moves against the gradient, therefore pushes z_0 up and z_1, z_2 down, which is what we want.
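After the hand estimate is done (not before), a few lines of numpy can confirm the signs. This is only a sketch for checking your paper result; the names z_small and grad are chosen here for illustration.

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(z - z.max())
    return e / e.sum()

z_small = np.array([2.0, 1.0, 0.0])
y = 0

p = softmax(z_small)     # proportions e^2 : e : 1, normalised
grad = p.copy()
grad[y] -= 1.0           # softmax(z) - one_hot(y)

print(np.round(p, 3))    # roughly 0.665, 0.245, 0.090
print(np.round(grad, 3)) # first entry negative, the other two positive
```

The entries of grad sum to zero, because p sums to one and the one-hot vector removes exactly one.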

Part B — Numerical verification¶

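Implement softmax, stabilised by subtracting the maximum logit (the same shift used in Phase 05's log-sum-exp):

import numpy as np

def softmax(z):
    z_max = z.max()          # shifting by the max leaves softmax unchanged
    e = np.exp(z - z_max)    # the largest exponent is now 0, so no overflow
    return e / e.sum()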

Compute the Jacobian analytically:

def softmax_jacobian(z):
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

Compute the Jacobian numerically via centred finite differences:

def softmax_jacobian_fd(z, h=1e-5):
    n = len(z)
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n); e[j] = h
        J[:, j] = (softmax(z + e) - softmax(z - e)) / (2 * h)
    return J

Compare. For z = [2.0, 1.0, 0.5, -1.0, 0.0]:

z = np.array([2.0, 1.0, 0.5, -1.0, 0.0])  # must be an ndarray: softmax calls z.max()
J_analytical = softmax_jacobian(z)
J_numerical = softmax_jacobian_fd(z)
max_err = np.max(np.abs(J_analytical - J_numerical))
assert max_err < 1e-7, f"Jacobian mismatch: {max_err}"

Same exercise for CE:

def ce_loss(z, y):
    # Stable -log softmax(z)[y], written as logsumexp(z) - z[y]
    s = z - z.max()
    return np.log(np.exp(s).sum()) - s[y]

def ce_grad(z, y):
    p = softmax(z)
    g = p.copy()
    g[y] -= 1.0
    return g

def ce_grad_fd(z, y, h=1e-5):
    g = np.zeros(len(z))
    for j in range(len(z)):
        e = np.zeros(len(z)); e[j] = h
        g[j] = (ce_loss(z + e, y) - ce_loss(z - e, y)) / (2 * h)
    return g

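for y in range(5):
    diff = np.max(np.abs(ce_grad(z, y) - ce_grad_fd(z, y)))
    assert diff < 1e-7, f"CE gradient mismatch for y={y}: {diff}"
    print(f"y={y}: max |analytical - numerical| = {diff:.2e}")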
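The two halves of Part A can also be tied together numerically. The sketch below is an optional extra check, not part of the required script: it applies the chain rule from the hint literally, multiplying the transposed softmax Jacobian by ∂L/∂p, and compares the result with p - one_hot(y). The names dL_dp, grad_chain and max_dev are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, 1.0, 0.5, -1.0, 0.0])
p = softmax(z)
J = np.diag(p) - np.outer(p, p)        # J[i, j] = dp_i / dz_j

max_dev = 0.0
for y in range(len(z)):
    dL_dp = np.zeros_like(p)
    dL_dp[y] = -1.0 / p[y]             # only the true-class entry is non-zero
    grad_chain = J.T @ dL_dp           # sum_i (dL/dp_i) * (dp_i/dz_j)
    one_hot = np.eye(len(z))[y]
    max_dev = max(max_dev, np.max(np.abs(grad_chain - (p - one_hot))))

print(f"max deviation from p - one_hot(y): {max_dev:.2e}")
```

Because J is symmetric, J @ dL_dp gives the same vector; the transpose is kept only to match the index order of the chain rule.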

Print the Jacobian. For visual reinforcement, print J_analytical as a matrix and verify by eye that the diagonal is p_i (1 - p_i) (max on the largest p) and off-diagonals are -p_i p_j (small negative).
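The by-eye checks can be backed up with a few boolean tests. This is a minimal sketch under the same z as above; the flag names (diag_ok, off_ok, sym_ok, cols_ok, largest_ok) are made up for this example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, 1.0, 0.5, -1.0, 0.0])
p = softmax(z)
J = np.diag(p) - np.outer(p, p)

np.set_printoptions(precision=4, suppress=True)
print(J)

off_diag = J[~np.eye(len(z), dtype=bool)]          # every entry off the diagonal
diag_ok = np.allclose(np.diag(J), p * (1 - p))     # diagonal is p_i (1 - p_i)
off_ok = bool(np.all(off_diag < 0))                # off-diagonals are -p_i p_j < 0
sym_ok = np.allclose(J, J.T)                       # diag(p) - p p^T is symmetric
cols_ok = np.allclose(J.sum(axis=0), 0.0)          # each column sums to zero
largest_ok = np.argmax(np.diag(J)) == np.argmax(p) # biggest diagonal sits on the biggest p

print(diag_ok, off_ok, sym_ok, cols_ok, largest_ok)
```

The zero column sums follow from the probabilities always summing to one: perturbing any single logit cannot change that total.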

Deliverable¶

learners/borja/phase-04/lab-00-softmax-gradient.md containing:

- A photo or transcription of the paper derivation (both parts of Part A).
- The verification script's output (the assert messages or printed maxima).
- A 3-sentence reflection: did the derivation feel mechanical or insightful? Where did you get stuck?

Acceptance¶

Both derivations completed on paper before any code is written.
softmax_jacobian matches finite differences within 1e-7.
ce_grad matches finite differences within 1e-7.
Reflection written.

Pitfalls¶

Skipping the paper part. The code is trivial once the derivation is internalised; the value of this lab is the derivation. Do it on paper.
Forward differences instead of centred. Forward is O(h); centred is O(h²). With h = 1e-5, forward gives ~1e-5 error, centred ~1e-10. Use centred.
h too small. h = 1e-12 triggers catastrophic cancellation; you'll get worse gradients. 1e-5 is the sweet spot for fp64; for fp32, use 1e-3.
Using np.log(softmax(z)) for cross-entropy. Numerically unstable. Use log_softmax (logsumexp form) from Phase 05.
Forgetting that the Jacobian of softmax is symmetric. diag(p) - p p^T is symmetric; if your code produces a non-symmetric matrix, you have a bug.
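Several of these pitfalls are easy to see in numbers. The sketch below is an illustration rather than part of the lab script: it sweeps the step size h for forward and centred differences on the CE gradient, then shows the naive np.log(softmax(z)) form failing on an extreme logit vector. The helper fd_grad and the vector z_extreme are assumptions made for this example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_loss(z, y):
    # Stable form: logsumexp(z) - z[y]
    s = z - z.max()
    return np.log(np.exp(s).sum()) - s[y]

def fd_grad(z, y, h, centred=True):
    g = np.zeros(len(z))
    for j in range(len(z)):
        e = np.zeros(len(z))
        e[j] = h
        if centred:
            g[j] = (ce_loss(z + e, y) - ce_loss(z - e, y)) / (2 * h)
        else:
            g[j] = (ce_loss(z + e, y) - ce_loss(z, y)) / h
    return g

z = np.array([2.0, 1.0, 0.5, -1.0, 0.0])
y = 0
exact = softmax(z)
exact[y] -= 1.0                          # analytical gradient p - one_hot(y)

errors = {}
for h in (1e-2, 1e-5, 1e-8, 1e-12):
    fwd = np.max(np.abs(fd_grad(z, y, h, centred=False) - exact))
    ctr = np.max(np.abs(fd_grad(z, y, h, centred=True) - exact))
    errors[h] = (fwd, ctr)
    print(f"h={h:.0e}  forward={fwd:.2e}  centred={ctr:.2e}")

# Naive log(softmax) on extreme logits: the small probability underflows to 0.
z_extreme = np.array([1000.0, 0.0])
with np.errstate(divide="ignore"):
    naive = -np.log(softmax(z_extreme)[1])   # log(0), so the loss is inf
stable = ce_loss(z_extreme, 1)               # finite: logsumexp(z) - z[1]
print(naive, stable)
```

In the sweep, the centred error should shrink as h decreases and then grow again once cancellation dominates, which is the reason for choosing a mid-range h in fp64.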

Next: 01-jacobian-by-hand.md
