01 — Derivatives, gradients, Jacobians¶
In one variable, the derivative is a slope; in several variables, the gradient is a vector pointing in the direction of steepest ascent; the Jacobian of a vector-valued function is the matrix that generalises both. All of backprop operates on these three objects.
Single-variable derivative¶
For f: R → R, the derivative at x is

$$
f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}
$$
Geometric: slope of the tangent line at x. Numeric: how much f changes per unit change in x, in the limit.
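The limit can be watched numerically. The following is a minimal sketch, assuming a hypothetical helper `difference_quotient` (not part of the curriculum) and the test function f(x) = x³, whose exact derivative at x = 2 is 12:

```python
def difference_quotient(f, x, h):
    """Forward difference quotient (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h


def cube(x):
    return x ** 3


# The exact derivative of x^3 at x = 2 is 3 * 2^2 = 12.
errors = []
for h in (1e-1, 1e-2, 1e-3, 1e-4):
    approx = difference_quotient(cube, 2.0, h)
    errors.append(abs(approx - 12.0))
    print(f"h={h:g}  quotient={approx:.6f}  error={errors[-1]:.2e}")
```

The error shrinks roughly in proportion to h, which is the "in the limit" part of the definition made visible.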
Useful derivatives to memorize cold (any later derivation goes faster if these are reflexive):
| f(x) | f'(x) |
|---|---|
| x^n | n x^{n-1} |
| e^x | e^x |
| log x | 1/x |
| sin x | cos x |
| cos x | -sin x |
| tanh x | 1 - tanh² x |
| relu x | 1 if x > 0 else 0 (sub-gradient at 0) |
| sigmoid x | sigmoid(x) (1 - sigmoid(x)) |
| softplus x | sigmoid x |
| 1/x | -1/x² |
The last six get hand-derived in lab 00 as warm-ups for the softmax derivation.
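Several rows of the table can be spot-checked with a central difference. This is only a sanity-check sketch: the helper names (`central_diff`, `sigmoid`, `softplus`) and the test point are assumptions made for illustration, not the lab's code.

```python
import math


def central_diff(f, x, h=1e-6):
    """Central difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    return math.log1p(math.exp(x))


# (name, f, claimed derivative) for a few rows of the table.
rows = [
    ("x^3", lambda x: x ** 3, lambda x: 3 * x ** 2),
    ("log x", math.log, lambda x: 1.0 / x),
    ("tanh x", math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    ("sigmoid x", sigmoid, lambda x: sigmoid(x) * (1.0 - sigmoid(x))),
    ("softplus x", softplus, sigmoid),
    ("1/x", lambda x: 1.0 / x, lambda x: -1.0 / x ** 2),
]

x0 = 0.7  # arbitrary positive test point, so log x and 1/x are defined
max_error = 0.0
for name, f, df in rows:
    error = abs(central_diff(f, x0) - df(x0))
    max_error = max(max_error, error)
    print(f"{name:<10} error={error:.2e}")
```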
Partial derivative¶
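For f: R^n → R, the partial derivative with respect to coordinate x_i is

$$
\frac{\partial f}{\partial x_i}(x) = \lim_{h \to 0} \frac{f(x + h e_i) - f(x)}{h}
$$

where e_i is the i-th standard basis vector.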
Treat all other coordinates as constants; differentiate as in the single-variable case.
Worked example. f(x, y) = x² y + sin(y). Then:

$$
\frac{\partial f}{\partial x} = 2xy, \qquad \frac{\partial f}{\partial y} = x^2 + \cos(y)
$$
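"Treat the other coordinates as constants" translates directly into code: perturb one argument and hold the other fixed. A small sketch for this example, assuming hypothetical helpers `partial_x` and `partial_y` and an arbitrary test point; the analytic partials it compares against are 2xy and x² + cos(y).

```python
import math


def f(x, y):
    return x ** 2 * y + math.sin(y)


def partial_x(f, x, y, h=1e-6):
    """Vary x only; y is held constant."""
    return (f(x + h, y) - f(x - h, y)) / (2.0 * h)


def partial_y(f, x, y, h=1e-6):
    """Vary y only; x is held constant."""
    return (f(x, y + h) - f(x, y - h)) / (2.0 * h)


x0, y0 = 1.5, 0.5  # arbitrary test point
err_x = abs(partial_x(f, x0, y0) - 2 * x0 * y0)
err_y = abs(partial_y(f, x0, y0) - (x0 ** 2 + math.cos(y0)))
print(err_x, err_y)
```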
Gradient¶
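The gradient of f: R^n → R is the vector of partial derivatives:

$$
\nabla f(x) = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix} \in \mathbb{R}^n
$$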
By convention a column vector.
Geometric meaning: ∇f(x) points in the direction of steepest ascent of f at x. Its magnitude is the rate of change of f per unit distance in that direction. The direction of steepest descent is -∇f(x). This is why gradient descent works: whenever ∇f(x) ≠ 0, a small enough step along -∇f(x) decreases f.
For the example above:

$$
\nabla f(x, y) = \begin{bmatrix} 2xy \\ x^2 + \cos(y) \end{bmatrix}
$$
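The steepest-ascent claim can be checked by brute force: sample many unit directions, estimate the rate of change of f along each, and compare the winner with the gradient. A sketch for f(x, y) = x² y + sin(y), assuming an arbitrary test point and a 0.1-degree angular grid:

```python
import numpy as np


def f(p):
    x, y = p
    return x ** 2 * y + np.sin(y)


def grad_f(p):
    x, y = p
    return np.array([2 * x * y, x ** 2 + np.cos(y)])


p0 = np.array([1.0, 2.0])  # arbitrary test point
g = grad_f(p0)

# Rate of change of f along 3600 unit directions (0.1 degree apart),
# estimated with a central difference of step h.
h = 1e-6
angles = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
rates = np.array([(f(p0 + h * u) - f(p0 - h * u)) / (2.0 * h) for u in directions])

best = directions[np.argmax(rates)]
print("steepest sampled direction:", best)
print("normalised gradient:       ", g / np.linalg.norm(g))
print("max rate vs ||grad||:", rates.max(), np.linalg.norm(g))
```

Up to the grid resolution, the best sampled direction matches ∇f / ||∇f|| and the best rate matches ||∇f||.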
Jacobian¶
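For a vector-valued function f: R^n → R^m, the Jacobian is the matrix of partial derivatives:

$$
J_f(x) = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}, \qquad (J_f)_{ij} = \frac{\partial f_i}{\partial x_j}
$$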
Each row is the gradient of one output coordinate, treated as a row vector. The shape m × n is "output dim by input dim."
Special cases:
- If `m = 1` (scalar output): `J_f` is a `1 × n` row vector; the transpose of `∇f` (which is `n × 1`).
- If `n = 1` (scalar input): `J_f` is an `m × 1` column vector; the derivative of each output w.r.t. the single input.
- If `m = n`: square Jacobian. If invertible, the inverse function theorem applies locally.
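The "output dim by input dim" shape is easy to see in a finite-difference Jacobian, built one input coordinate (one column) at a time. This is a sketch only: `numerical_jacobian` and the example map R² → R³ are hypothetical, chosen for illustration.

```python
import numpy as np


def numerical_jacobian(f, x, h=1e-6):
    """Central-difference Jacobian of f: R^n -> R^m, returned with shape (m, n)."""
    x = np.asarray(x, dtype=float)
    m = np.asarray(f(x)).size
    J = np.empty((m, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        # Column j holds the derivative of every output w.r.t. x_j.
        J[:, j] = (f(x + step) - f(x - step)) / (2.0 * h)
    return J


def f(x):
    # Hypothetical example map R^2 -> R^3.
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])


x0 = np.array([1.0, 2.0])
J = numerical_jacobian(f, x0)

# Analytic Jacobian: row i is the gradient of output f_i.
J_exact = np.array([
    [x0[1], x0[0]],
    [np.cos(x0[0]), 0.0],
    [0.0, 2.0 * x0[1]],
])
print(J.shape)
print(np.abs(J - J_exact).max())
```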
Linearity: the Jacobian is the linear approximation of f near x:

$$
f(x + h) \approx f(x) + J_f(x)\, h
$$
This is Taylor's theorem at first order. Every later piece of multivariate calculus is built on this.
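"First order" has a testable consequence: the error of f(x) + J_f(x) h should shrink like ||h||², so cutting the step by 10 should cut the error by about 100. A sketch with a hypothetical smooth map R² → R² and its hand-derived Jacobian:

```python
import numpy as np


def f(x):
    # Hypothetical smooth map R^2 -> R^2.
    return np.array([x[0] ** 2 * x[1], np.sin(x[0]) + x[1]])


def jacobian_f(x):
    # Analytic Jacobian of f, shape (2, 2).
    return np.array([
        [2 * x[0] * x[1], x[0] ** 2],
        [np.cos(x[0]), 1.0],
    ])


x0 = np.array([1.0, 2.0])
direction = np.array([0.6, -0.8])  # unit vector

errors = []
for scale in (1e-1, 1e-2, 1e-3):
    h = scale * direction
    linear = f(x0) + jacobian_f(x0) @ h
    errors.append(np.linalg.norm(f(x0 + h) - linear))
    print(f"||h||={scale:g}  error={errors[-1]:.2e}")
```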
Hessian¶
For f: R^n → R, the Hessian is the matrix of second partial derivatives:

$$
(H_f)_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}
$$
H_f ∈ R^{n × n}, and it's symmetric (under standard regularity assumptions — Clairaut's theorem).
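Geometric meaning: the Hessian describes the curvature of f at x. At a critical point (where ∇f(x) = 0), its eigenvalues classify the point:
- All positive → local minimum, "bowl" shape.
- All negative → local maximum.
- Mixed signs → saddle point.
- Some zero, the rest of one sign → degenerate (flat to second order in some direction; second-order information alone is inconclusive).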
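The eigenvalue test can be sketched in a few lines. The function name `classify_critical_point`, the tolerance, and the four example Hessians are assumptions made for illustration; each Hessian is taken at the origin, which is a critical point of the corresponding function.

```python
import numpy as np


def classify_critical_point(hessian, tol=1e-9):
    """Classify a critical point from the eigenvalues of a symmetric Hessian."""
    eigvals = np.linalg.eigvalsh(hessian)  # eigvalsh assumes a symmetric matrix
    has_pos = np.any(eigvals > tol)
    has_neg = np.any(eigvals < -tol)
    has_zero = np.any(np.abs(eigvals) <= tol)
    if has_pos and has_neg:
        return "saddle point"
    if has_zero:
        return "degenerate"
    return "local minimum" if has_pos else "local maximum"


# Hessians at the origin of four simple functions of (x, y).
examples = {
    "x^2 + y^2": np.array([[2.0, 0.0], [0.0, 2.0]]),
    "-x^2 - y^2": np.array([[-2.0, 0.0], [0.0, -2.0]]),
    "x^2 - y^2": np.array([[2.0, 0.0], [0.0, -2.0]]),
    "x^2": np.array([[2.0, 0.0], [0.0, 0.0]]),
}
labels = {name: classify_critical_point(H) for name, H in examples.items()}
for name, label in labels.items():
    print(f"{name:<12} {label}")
```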
Why we care for ML: the condition number of the loss landscape's Hessian (κ = λ_max / λ_min) governs how fast gradient descent converges near a minimum; the larger κ, the more steps GD needs. An anisotropic Hessian (long thin valley) means slow GD. Adam approximates a diagonal preconditioner that makes the loss surface "rounder."
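The effect of the condition number shows up even on a toy quadratic. Below is a sketch, assuming f(x) = ½ Σ λ_i x_i² (so the Hessian is diag(λ)), a step size of 1/λ_max, and a hypothetical helper `gd_steps_to_converge`; it is an illustration, not a statement about any real loss surface.

```python
import numpy as np


def gd_steps_to_converge(eigenvalues, tol=1e-6, max_steps=100_000):
    """Run GD on f(x) = 0.5 * sum(lam_i * x_i^2); return steps until ||x|| < tol."""
    lam = np.asarray(eigenvalues, dtype=float)
    x = np.ones_like(lam)
    lr = 1.0 / lam.max()  # a common safe step size for a quadratic
    for step in range(1, max_steps + 1):
        x = x - lr * lam * x  # the gradient of f is lam * x (element-wise)
        if np.linalg.norm(x) < tol:
            return step
    return max_steps


steps_round = gd_steps_to_converge([1.0, 1.0])     # condition number 1
steps_valley = gd_steps_to_converge([1.0, 100.0])  # condition number 100
print(steps_round, steps_valley)
```

The step size is capped by the steepest direction, so the flattest direction crawls: that is the "long thin valley" in numbers.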
We will not compute Hessians explicitly in this curriculum — they cost O(n²) to store and O(n³) to invert, infeasible for million-parameter models. We will reason about them conceptually.
The notation jungle¶
Different sources use different conventions:
| Convention | ∇f (scalar output) | J_f (vector output) | Chain rule |
|---|---|---|---|
| Numerator layout (Wikipedia, Stanford CS) | 1 × n row | m × n | J_{f∘g} = J_f · J_g |
| Denominator layout (econometrics) | n × 1 column | n × m | J_{f∘g} = J_g · J_f |
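This curriculum uses numerator layout for Jacobians (m × n, so the chain rule matmul reads in the same order as the composition) and writes the gradient ∇f as a column vector, the transpose of the 1 × n Jacobian. PyTorch largely sidesteps the question: a tensor's `.grad` has the same shape as the tensor itself, and `torch.autograd.functional.jacobian` returns an array shaped as the output dims followed by the input dims, i.e. `(m, n)` for vector input and output. Don't fight the convention; just pick one and stay.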
In code:
- `grad = np.array([df_dx_i for i in range(n)])`: shape `(n,)`. NumPy 1-D vectors are conventionally treated as columns or rows interchangeably; in derivations, treat as columns.
- `jacobian = np.empty((m, n))`: shape `(m, n)`, row `i` = gradient of `f_i`.
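The chain-rule column of the table can be checked numerically. In the sketch below, `f`, `g`, and their hand-derived Jacobians are hypothetical examples; the product J_f · J_g in numerator layout is compared against a finite-difference Jacobian of the composition, and the denominator-layout product is shown to be its transpose.

```python
import numpy as np


def g(x):  # R^2 -> R^3
    return np.array([x[0] * x[1], x[0] + x[1], x[1] ** 2])


def J_g(x):  # shape (3, 2)
    return np.array([[x[1], x[0]], [1.0, 1.0], [0.0, 2.0 * x[1]]])


def f(u):  # R^3 -> R^2
    return np.array([u[0] + u[1] * u[2], np.sin(u[0])])


def J_f(u):  # shape (2, 3)
    return np.array([[1.0, u[2], u[1]], [np.cos(u[0]), 0.0, 0.0]])


x0 = np.array([0.5, -1.0])

# Numerator layout: Jacobian of f(g(x)) is J_f(g(x)) @ J_g(x), shapes (2, 3) @ (3, 2).
J_chain = J_f(g(x0)) @ J_g(x0)

# Independent check with central differences on the composition.
h = 1e-6
J_fd = np.empty((2, 2))
for j in range(2):
    step = np.zeros(2)
    step[j] = h
    J_fd[:, j] = (f(g(x0 + step)) - f(g(x0 - step))) / (2.0 * h)

# Denominator layout stores the transposes, so the product order flips.
J_chain_denominator = J_g(x0).T @ J_f(g(x0)).T

print(np.abs(J_chain - J_fd).max())
```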
Worked: gradient of ||x||²¶
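Let f(x) = ||x||² = x^T x = Σ x_i². Each partial is ∂f/∂x_i = 2x_i, so

$$
\nabla f(x) = 2x
$$

This is the vector analogue of d/dx x² = 2x, and the simplest gradient rule to keep in memory. Used in L2 regularisation: ∇(λ ||θ||²) = 2λθ. Conventions differ by a factor of 2 (some write the penalty as (λ/2) ||θ||², whose gradient is λθ); pin down which your code uses.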
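The factor-of-2 issue is easy to pin down with a gradient check. A sketch comparing the two common penalty conventions, assuming a hypothetical regularisation strength `lam` and a hypothetical helper `numerical_grad`:

```python
import numpy as np

theta = np.array([0.5, -1.0, 2.0])
lam = 0.1  # hypothetical regularisation strength


def penalty_full(t):
    return lam * np.dot(t, t)        # lam * ||theta||^2, gradient 2 * lam * theta


def penalty_half(t):
    return 0.5 * lam * np.dot(t, t)  # (lam / 2) * ||theta||^2, gradient lam * theta


def numerical_grad(f, x, h=1e-6):
    """Central-difference gradient of a scalar function, shape (n,)."""
    g = np.empty_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return g


grad_full = numerical_grad(penalty_full, theta)
grad_half = numerical_grad(penalty_half, theta)
print(grad_full, 2 * lam * theta)
print(grad_half, lam * theta)
```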
Worked: Jacobian of y = Wx + b¶
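For W: (m, n), x: (n,), b: (m,), the output is y ∈ R^m. Component-wise:

$$
y_i = \sum_{j=1}^{n} W_{ij} x_j + b_i
$$

Jacobian w.r.t. x:

$$
\frac{\partial y_i}{\partial x_j} = W_{ij} \quad\Longrightarrow\quad J_y(x) = W
$$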
That's it. The Jacobian of a linear layer w.r.t. its input is the weight matrix itself. This is the single most-used Jacobian in all of deep learning.
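A finite-difference check makes the point concrete. This is a sketch with randomly drawn `W`, `b`, and `x0` (the sizes and seed are arbitrary assumptions); the Jacobian is built column by column and compared with `W`.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.normal(size=(m, n))
b = rng.normal(size=m)
x0 = rng.normal(size=n)


def layer(x):
    return W @ x + b


# Build the Jacobian column by column with central differences.
h = 1e-6
J = np.empty((m, n))
for j in range(n):
    step = np.zeros(n)
    step[j] = h
    J[:, j] = (layer(x0 + step) - layer(x0 - step)) / (2.0 * h)

print(J.shape)
print(np.abs(J - W).max())
```

Note that neither `b` nor the evaluation point `x0` affects the result, as expected for an affine map.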
Jacobian w.r.t. W is trickier — W has m × n entries, so the "Jacobian" of y: (m,) w.r.t. W: (m, n) is a rank-3 object of shape (m, m, n). Most autograd libraries express this implicitly: the chain rule produces a contribution ∂L/∂W = ∂L/∂y · x^T (outer product), without ever materialising the full rank-3 Jacobian.
We will derive this contribution explicitly in theory/02-chain-rule-and-backprop.md.
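Ahead of that derivation, the outer-product formula can at least be verified numerically. The sketch below assumes a hypothetical scalar loss L = ½ ||y||² (so ∂L/∂y = y) and random `W`, `b`, `x`; it compares `np.outer(dL_dy, x)` with an entry-by-entry finite difference, never forming the rank-3 Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.normal(size=(m, n))
b = rng.normal(size=m)
x = rng.normal(size=n)


def loss(W_):
    # Hypothetical scalar loss L = 0.5 * ||y||^2, so dL/dy = y.
    y = W_ @ x + b
    return 0.5 * np.dot(y, y)


# Outer-product formula: dL/dW = (dL/dy) x^T, shape (m, n).
dL_dy = W @ x + b
grad_outer = np.outer(dL_dy, x)

# Entry-by-entry finite-difference check.
h = 1e-6
grad_fd = np.empty_like(W)
for i in range(m):
    for j in range(n):
        step = np.zeros_like(W)
        step[i, j] = h
        grad_fd[i, j] = (loss(W + step) - loss(W - step)) / (2.0 * h)

print(grad_outer.shape)
print(np.abs(grad_outer - grad_fd).max())
```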
Worked: Jacobian of element-wise f(x)¶
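For y = f(x) element-wise (so y_i = f(x_i), e.g., f = relu, sigmoid, tanh):

$$
\frac{\partial y_i}{\partial x_j} = \begin{cases} f'(x_i) & i = j \\ 0 & i \neq j \end{cases} \quad\Longrightarrow\quad J_y(x) = \operatorname{diag}\big(f'(x_1), \ldots, f'(x_n)\big)
$$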
The Jacobian is diagonal with entries f'(x_i). In code: don't materialise the full diagonal; just multiply element-wise.
For ReLU: J_ii = 1 if x_i > 0 else 0. For tanh: J_ii = 1 - tanh²(x_i). For sigmoid: J_ii = σ(x_i)(1 - σ(x_i)). Each shows up in Phase 7's autograd ops.
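The "don't materialise the diagonal" advice can be sketched as follows. The input `x` and the vector `upstream` are hypothetical values; the code compares a dense `diag(f'(x)) @ v` product with the element-wise product `f'(x) * v` for the three activations above.

```python
import numpy as np

x = np.array([-1.5, 0.3, 2.0])
upstream = np.array([0.1, -0.2, 0.4])  # hypothetical vector multiplied by the Jacobian

s = 1.0 / (1.0 + np.exp(-x))  # sigmoid(x)
local_derivs = {
    "relu": (x > 0).astype(float),
    "tanh": 1.0 - np.tanh(x) ** 2,
    "sigmoid": s * (1.0 - s),
}

max_gap = 0.0
for name, d in local_derivs.items():
    dense = np.diag(d) @ upstream  # materialises an n x n matrix
    cheap = d * upstream           # same result with O(n) work and memory
    max_gap = max(max_gap, np.abs(dense - cheap).max())
    print(name, cheap)
```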
Drill problems¶
Solutions in solutions/01-derivatives-gradients-jacobians-ref.md at phase open.
- For `f(x, y) = x² + 3xy + y³`: compute `∇f` and `H_f`.
- For `f(x) = ||Ax - b||²` where `A: (m, n), x: (n,), b: (m,)`: derive `∇f` and `H_f`.
- Verify symmetry: for `f(x, y) = x² sin(y)`, check `∂²f/∂x∂y = ∂²f/∂y∂x`.
- For `y = softmax(x)` (length `n`): what's the shape of `J_y(x)`? Compute it for `x = [0, 0, 0]` (n=3).
- For a 2-layer MLP `y = W_2 · relu(W_1 x + b_1) + b_2` with input `x: (n,), W_1: (h, n), W_2: (m, h)`: what's the shape of `J_y(x)`?
If two or more feel wobbly, re-read.
One-paragraph recap¶
The single-variable derivative generalises to the gradient (vector of partials, for scalar-output functions) and the Jacobian (matrix of partials, for vector-output functions). The Jacobian is the linear approximation of a function near a point: f(x + h) ≈ f(x) + J_f(x) h. The Hessian is the matrix of second derivatives and describes curvature; its eigenvalues classify critical points (min/max/saddle), and its condition number governs how fast first-order optimization (GD) converges. The most-used Jacobian in deep learning is for a linear layer y = Wx + b, whose Jacobian w.r.t. x is just W. Element-wise activations have diagonal Jacobians, which is why they're cheap. Every later derivation in Phase 4 — softmax, cross-entropy, chain rule — relies on this vocabulary.
Next: theory/02-chain-rule-and-backprop.md.