01 — Derivatives, gradients, Jacobians

The derivative in one variable is a slope; the gradient in several variables is a vector that points in the direction of steepest ascent; the Jacobian of a vector-valued function is the matrix that generalises both. All of backprop operates on these three objects.


Single-variable derivative

For f: R → R, the derivative at x is

\[ f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} \]

Geometric: slope of the tangent line at x. Numeric: how much f changes per unit change in x, in the limit.
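
The limit definition can be checked numerically by choosing a small but finite h. The sketch below is only an illustration: the helper names `forward_difference` and `central_difference` and the step size 1e-6 are assumptions, not part of the curriculum.

```python
import math

def forward_difference(f, x, h=1e-6):
    """Difference quotient from the limit definition, with a finite h."""
    return (f(x + h) - f(x)) / h

def central_difference(f, x, h=1e-6):
    """Symmetric variant; its truncation error shrinks like h**2 instead of h."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.0
approx_fwd = forward_difference(math.sin, x0)
approx_ctr = central_difference(math.sin, x0)
exact = math.cos(x0)  # known derivative of sin

print(approx_fwd, approx_ctr, exact)
```

The central version is usually the better default for gradient checks, since it is more accurate at the same h.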

Useful derivatives to memorize cold (any later derivation goes faster if these are reflexive):

| f(x) | f'(x) |
| --- | --- |
| x^n | n x^{n-1} |
| e^x | e^x |
| log x | 1/x |
| sin x | cos x |
| cos x | -sin x |
| tanh x | 1 - tanh² x |
| relu x | 1 if x > 0 else 0 (sub-gradient at 0) |
| sigmoid x | sigmoid(x) (1 - sigmoid(x)) |
| softplus x | sigmoid x |
| 1/x | -1/x² |

The last six get hand-derived in lab 00 as warm-ups for the softmax derivation.
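
Before deriving these by hand, it can help to confirm a few table entries numerically. The following is a rough check, assuming NumPy is available; the `table` dictionary, the helper names, and the sample points are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))

def central_difference(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

# (function, claimed derivative) pairs taken from the table above
table = {
    "tanh": (np.tanh, lambda x: 1 - np.tanh(x) ** 2),
    "sigmoid": (sigmoid, lambda x: sigmoid(x) * (1 - sigmoid(x))),
    "softplus": (softplus, sigmoid),
    "relu": (lambda x: np.maximum(x, 0.0), lambda x: (x > 0).astype(float)),
    "1/x": (lambda x: 1 / x, lambda x: -1 / x**2),
}

# Sample points kept away from 0, where relu has a kink and 1/x is undefined
xs = np.array([-2.0, -0.5, 0.7, 1.5])

max_errors = {
    name: float(np.max(np.abs(central_difference(f, xs) - df(xs))))
    for name, (f, df) in table.items()
}
print(max_errors)
```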

Partial derivative

For f: R^n → R:

\[ \frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(\ldots, x_i + h, \ldots) - f(\ldots, x_i, \ldots)}{h} \]

Treat all other coordinates as constants; differentiate as in the single-variable case.

Worked example. f(x, y) = x² y + sin(y). Then:

\[ \frac{\partial f}{\partial x} = 2 x y, \qquad \frac{\partial f}{\partial y} = x^2 + \cos(y) \]
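
The "hold the other coordinates constant" recipe translates directly into code: perturb one argument and leave the other untouched. A minimal sketch for the worked example, with the evaluation point (1.5, 0.5) chosen arbitrarily:

```python
import math

def f(x, y):
    return x**2 * y + math.sin(y)

def partial_x(f, x, y, h=1e-6):
    # y is held fixed; only x is perturbed
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-6):
    # x is held fixed; only y is perturbed
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

x0, y0 = 1.5, 0.5
num_dx = partial_x(f, x0, y0)
num_dy = partial_y(f, x0, y0)

# Closed forms from the worked example
ana_dx = 2 * x0 * y0
ana_dy = x0**2 + math.cos(y0)

print(num_dx, ana_dx)
print(num_dy, ana_dy)
```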

Gradient

The gradient of f: R^n → R is the vector of partial derivatives:

\[ \nabla f(x) = \begin{pmatrix} \partial f / \partial x_1 \\ \vdots \\ \partial f / \partial x_n \end{pmatrix} \]

By convention a column vector.

Geometric meaning: ∇f(x) points in the direction of steepest ascent of f at x. Its magnitude is the rate of change in that direction. The direction of steepest descent is -∇f(x). This is the entire reason gradient descent works.

For the example above:

\[ \nabla f(x, y) = \begin{pmatrix} 2xy \\ x^2 + \cos y \end{pmatrix} \]
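
The steepest-ascent claim can be probed numerically: the slope of f along a unit direction u is ∇f · u, and no direction should beat the gradient's own direction. The sketch below samples random unit vectors at an arbitrary point; the sample count and seed are assumptions made for reproducibility.

```python
import numpy as np

def grad_f(p):
    x, y = p
    return np.array([2 * x * y, x**2 + np.cos(y)])

p = np.array([1.5, 0.5])
g = grad_f(p)

# 1000 random unit directions in the plane
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=1000)
units = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (1000, 2)

slopes = units @ g                               # directional derivative for each u
steepest_slope = g @ (g / np.linalg.norm(g))     # slope along the gradient direction

print(slopes.max(), steepest_slope, np.linalg.norm(g))
```

The slope along the gradient direction equals ||∇f||, and along -∇f it equals -||∇f||, the most negative slope available.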

Jacobian

For a vector-valued function f: R^n → R^m, the Jacobian is the matrix of partial derivatives:

\[ J_f(x) = \begin{pmatrix} \partial f_1 / \partial x_1 & \cdots & \partial f_1 / \partial x_n \\ \vdots & \ddots & \vdots \\ \partial f_m / \partial x_1 & \cdots & \partial f_m / \partial x_n \end{pmatrix} \in \mathbb{R}^{m \times n} \]

Each row is the gradient of one output coordinate, treated as a row vector. The shape m × n is "output dim by input dim."

Special cases:

  • If m = 1 (scalar output): J_f is a 1 × n row vector; the transpose of ∇f (which is n × 1).
  • If n = 1 (scalar input): J_f is an m × 1 column vector; the derivative of each output w.r.t. the single input.
  • If m = n: square Jacobian. If J_f(x) is invertible and f is continuously differentiable, the inverse function theorem says f is invertible near x.

Linearity: the Jacobian is the linear approximation of f near x:

\[ f(x + h) \approx f(x) + J_f(x) \cdot h \quad \text{(matrix-vector product)} \]

This is Taylor's theorem at first order. Every later piece of multivariate calculus is built on this.
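
To make the first-order approximation concrete, the sketch below builds a Jacobian column by column with central differences and compares f(x + h) against f(x) + J h. The map `f` from R² to R³ is a made-up example, and `numerical_jacobian` is a hypothetical helper, not a library function.

```python
import numpy as np

def f(x):
    # Hypothetical map R^2 -> R^3, chosen only for illustration
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def analytic_jacobian(x):
    # Row i is the gradient of output i
    return np.array([
        [x[1], x[0]],
        [np.cos(x[0]), 0.0],
        [0.0, 2 * x[1]],
    ])

def numerical_jacobian(f, x, h=1e-6):
    x = np.asarray(x, dtype=float)
    m = f(x).shape[0]
    J = np.empty((m, x.shape[0]))       # output dim by input dim
    for j in range(x.shape[0]):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)   # column j: d f / d x_j
    return J

x0 = np.array([0.8, -1.2])
J = numerical_jacobian(f, x0)

step = np.array([1e-3, -2e-3])
linear_error = np.linalg.norm(f(x0 + step) - (f(x0) + J @ step))
print(J.shape, linear_error)
```

The leftover error is second order in the step: halving `step` should cut it by roughly a factor of four.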

Hessian

For f: R^n → R, the Hessian is the matrix of second partial derivatives:

\[ H_f(x)_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} \]

H_f ∈ R^{n × n}, and it's symmetric (under standard regularity assumptions — Clairaut's theorem).

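Geometric meaning: the Hessian describes the curvature of f at x. At a critical point (∇f(x) = 0), its eigenvalues classify the point:

  • All positive → local minimum, "bowl" shape.
  • All negative → local maximum.
  • Mixed signs → saddle point.
  • No mixed signs but at least one zero → degenerate (flat to second order in some direction); the second-derivative test is inconclusive.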
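
The eigenvalue test above can be sketched in a few lines with `np.linalg.eigvalsh`, which returns the eigenvalues of a symmetric matrix. The function name `classify_critical_point` and the tolerance are assumptions for illustration; the four Hessians belong to simple quadratics whose critical point is the origin.

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Classify a critical point from the eigenvalues of a symmetric Hessian H."""
    eig = np.linalg.eigvalsh(H)
    has_pos = np.any(eig > tol)
    has_neg = np.any(eig < -tol)
    if has_pos and has_neg:
        return "saddle point"
    if np.any(np.abs(eig) <= tol):
        return "degenerate"          # second-order information is inconclusive
    return "local minimum" if has_pos else "local maximum"

bowl = np.array([[2.0, 0.0], [0.0, 6.0]])      # f = x^2 + 3y^2
dome = -bowl                                   # f = -x^2 - 3y^2
saddle = np.array([[2.0, 0.0], [0.0, -2.0]])   # f = x^2 - y^2
trough = np.array([[2.0, 0.0], [0.0, 0.0]])    # f = x^2, flat along y

for name, H in [("bowl", bowl), ("dome", dome), ("saddle", saddle), ("trough", trough)]:
    print(name, classify_critical_point(H))
```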

Why we care for ML: near a minimum, the Hessian's condition number κ = λ_max / λ_min governs how fast gradient descent converges; the larger κ, the slower the convergence. Anisotropic Hessian (long thin valley) = slow GD. Adam approximates a diagonal preconditioner that makes the loss surface "rounder."
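
A toy experiment shows the effect of conditioning. The sketch below runs plain gradient descent on a quadratic whose Hessian is diagonal, once with condition number 1 and once with condition number 100. The function name, the step size 1/λ_max, and the stopping tolerance are all assumptions chosen for illustration.

```python
import numpy as np

def gd_steps_to_converge(eigenvalues, tol=1e-6, max_steps=100_000):
    """Run GD on f(x) = 0.5 * sum(lam_i * x_i^2), whose Hessian is diag(eigenvalues)."""
    lam = np.asarray(eigenvalues, dtype=float)
    x = np.ones_like(lam)
    lr = 1.0 / lam.max()              # a stable step size for this quadratic
    for step in range(max_steps):
        if np.linalg.norm(x) < tol:
            return step
        x = x - lr * lam * x          # the gradient of f is lam * x
    return max_steps

round_steps = gd_steps_to_converge([1.0, 1.0])      # condition number 1
valley_steps = gd_steps_to_converge([100.0, 1.0])   # condition number 100
print(round_steps, valley_steps)
```

The step size is capped by the steepest direction, so the shallow direction shrinks by only a factor of (1 - λ_min / λ_max) per step.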

We will not compute Hessians explicitly in this curriculum — they cost O(n²) to store and O(n³) to invert, infeasible for million-parameter models. We will reason about them conceptually.

The notation jungle

Different sources use different conventions:

| Convention | ∇f (scalar output) | J_f (vector output) | Chain rule |
| --- | --- | --- | --- |
| Numerator layout (Wikipedia, Stanford CS) | 1 × n row | m × n | J_{f∘g} = J_f · J_g |
| Denominator layout (econometrics) | n × 1 column | n × m | J_{f∘g} = J_g · J_f |

This curriculum uses numerator layout for Jacobians (m × n, so the chain-rule matmul reads in the same order as the composition) and writes ∇f as an n × 1 column, the transpose of the 1 × n Jacobian. PyTorch sidesteps the question for scalar losses: each .grad tensor has the same shape as the tensor it belongs to. Don't fight the convention; just pick one and stay with it.

In code:

  • grad = np.array([df_dx_i for i in range(n)]) has shape (n,). A NumPy 1-D array carries no row or column orientation; in derivations, treat it as a column.
  • jacobian = np.empty((m, n)) — shape (m, n), row i = gradient of f_i.
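
With (m, n)-shaped Jacobians, the numerator-layout chain rule is a plain matrix product in composition order. The sketch below checks J_{f∘g} = J_f · J_g on two made-up maps (`g` linear from R² to R³, `f` from R³ to R²); `numerical_jacobian` is a hypothetical helper written for this check.

```python
import numpy as np

A = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]])   # shape (3, 2)

def g(x):
    # R^2 -> R^3, linear, so its Jacobian is A
    return A @ x

def f(u):
    # R^3 -> R^2
    return np.array([u[0] * u[1], np.sum(u ** 2)])

def J_f(u):
    # shape (2, 3): row i is the gradient of f_i
    return np.array([
        [u[1], u[0], 0.0],
        [2 * u[0], 2 * u[1], 2 * u[2]],
    ])

def numerical_jacobian(func, x, h=1e-6):
    cols = []
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        cols.append((func(x + e) - func(x - e)) / (2 * h))
    return np.stack(cols, axis=1)    # column j holds d(output) / d(x_j)

x0 = np.array([0.5, -1.0])
J_chain = J_f(g(x0)) @ A             # (2, 3) @ (3, 2) -> (2, 2)
J_numeric = numerical_jacobian(lambda x: f(g(x)), x0)
print(J_chain)
print(J_numeric)
```

Note that J_f is evaluated at g(x0), not at x0; forgetting this is a common source of wrong gradients.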

Worked: gradient of ||x||²

Let f(x) = ||x||² = x^T x = Σ x_i².

\[ \frac{\partial f}{\partial x_i} = 2 x_i \quad \Rightarrow \quad \nabla f = 2x \]

This is the vector analogue of d/dx x² = 2x: the gradient of the squared norm is twice the vector itself. Used in L2 regularisation: ∇(λ ||θ||²) = 2λθ. Some codebases instead penalise (λ/2) ||θ||², whose gradient is λθ; pin down which convention your code uses.
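
A quick numerical check of the rule, including the factor-of-2 question. The values of `theta` and `lam` are arbitrary, and the (λ/2) ||θ||² penalty is shown as one common alternative convention rather than as anything this curriculum prescribes.

```python
import numpy as np

def sq_norm(x):
    return x @ x

def numerical_gradient(f, x, h=1e-6):
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

theta = np.array([0.5, -2.0, 3.0])
lam = 0.1   # hypothetical regularisation strength

# Penalty lam * ||theta||^2 -> gradient 2 * lam * theta
grad_full = numerical_gradient(lambda t: lam * sq_norm(t), theta)
# Penalty (lam / 2) * ||theta||^2 -> gradient lam * theta
grad_half = numerical_gradient(lambda t: 0.5 * lam * sq_norm(t), theta)

print(grad_full, 2 * lam * theta)
print(grad_half, lam * theta)
```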

Worked: Jacobian of y = Wx + b

For W: (m, n), x: (n,), b: (m,), the output y ∈ R^m. Component-wise:

\[ y_i = \sum_{j} W_{ij} x_j + b_i \]

Jacobian w.r.t. x:

\[ \frac{\partial y_i}{\partial x_j} = W_{ij} \quad \Rightarrow \quad J_y(x) = W \]

That's it. The Jacobian of a linear layer w.r.t. its input is the weight matrix itself. This is the single most-used Jacobian in all of deep learning.
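
This is easy to confirm with finite differences. In the sketch below the sizes m = 3 and n = 4, the random seed, and the function name `layer` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.normal(size=(m, n))
b = rng.normal(size=m)
x0 = rng.normal(size=n)

def layer(x):
    return W @ x + b

# Build the Jacobian one input coordinate (one column) at a time
h = 1e-6
J = np.empty((m, n))
for j in range(n):
    e = np.zeros(n)
    e[j] = h
    J[:, j] = (layer(x0 + e) - layer(x0 - e)) / (2 * h)

print(np.max(np.abs(J - W)))   # should be at round-off level
```

The bias b drops out entirely, and the result does not depend on x0, as expected for an affine map.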

Jacobian w.r.t. W is trickier: W has m × n entries, so the "Jacobian" of y: (m,) w.r.t. W: (m, n) is a rank-3 object of shape (m, m, n), with entries ∂y_i/∂W_{kj} = δ_{ik} x_j. Most autograd libraries express this implicitly: the chain rule produces a contribution ∂L/∂W = ∂L/∂y · x^T (outer product of the length-m column ∂L/∂y with x, giving an m × n matrix), without ever materialising the full rank-3 Jacobian.

We will derive this contribution explicitly in theory/02-chain-rule-and-backprop.md.
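
Ahead of that derivation, the outer-product formula can at least be checked numerically. The sketch below assumes a made-up scalar loss L = ½ ||y||², chosen only because it makes ∂L/∂y = y; the sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
W = rng.normal(size=(m, n))
b = rng.normal(size=m)
x = rng.normal(size=n)

def loss(W_):
    y = W_ @ x + b
    return 0.5 * np.sum(y ** 2)    # hypothetical loss, so that dL/dy = y

dL_dy = W @ x + b                  # shape (m,)
dL_dW = np.outer(dL_dy, x)         # shape (m, n), same shape as W

# Finite-difference check, one entry of W at a time
h = 1e-6
numeric = np.empty_like(W)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(W)
        E[i, j] = h
        numeric[i, j] = (loss(W + E) - loss(W - E)) / (2 * h)

print(np.max(np.abs(numeric - dL_dW)))
```

The outer product needs only the two vectors ∂L/∂y and x, whereas the explicit rank-3 Jacobian would hold m · m · n numbers, most of them zero.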

Worked: Jacobian of element-wise f(x)

For y = f(x) element-wise (so y_i = f(x_i) — e.g., f = relu, sigmoid, tanh):

\[ \frac{\partial y_i}{\partial x_j} = f'(x_i) \cdot \delta_{ij} \]

The Jacobian is diagonal with entries f'(x_i). In code: don't materialise the full diagonal; just multiply element-wise.

For ReLU: J_ii = 1 if x_i > 0 else 0. For tanh: J_ii = 1 - tanh²(x_i). For sigmoid: J_ii = σ(x_i)(1 - σ(x_i)). Each shows up in Phase 7's autograd ops.
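
The advice not to materialise the diagonal can be seen directly: multiplying by diag(f'(x)) gives the same numbers as an element-wise product. In the sketch below, `upstream` stands in for a gradient arriving from later layers; its values, like those of `x`, are made up.

```python
import numpy as np

x = np.array([-1.0, 0.5, 2.0])
upstream = np.array([0.3, -0.7, 1.1])   # hypothetical dL/dy from later layers

tanh_prime = 1 - np.tanh(x) ** 2
relu_prime = (x > 0).astype(float)

# Materialised route: build the n x n diagonal Jacobian, then multiply
full = np.diag(tanh_prime) @ upstream
# Element-wise route: same result with O(n) memory instead of O(n^2)
cheap = tanh_prime * upstream

# ReLU passes the upstream gradient where x > 0 and blocks it elsewhere
relu_backward = relu_prime * upstream

print(full, cheap, relu_backward)
```

Because a diagonal matrix equals its own transpose, the same shortcut applies whether the Jacobian multiplies a vector from the left or from the right.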

Drill problems

Solutions in solutions/01-derivatives-gradients-jacobians-ref.md at phase open.

  1. For f(x, y) = x² + 3xy + y³: compute ∇f and H_f.
  2. For f(x) = ||Ax - b||² where A: (m, n), x: (n,), b: (m,): derive ∇f and H_f.
  3. Verify symmetry: for f(x, y) = x² sin(y), check ∂²f/∂x∂y = ∂²f/∂y∂x.
  4. For y = softmax(x) (length n): what's the shape of J_y(x)? Compute it for x = [0, 0, 0] (n=3).
  5. For a 2-layer MLP y = W_2 · relu(W_1 x + b_1) + b_2 with input x: (n,), W_1: (h, n), W_2: (m, h): what's the shape of J_y(x)?

If two or more feel wobbly, re-read.

One-paragraph recap

The single-variable derivative generalises to the gradient (vector of partials, for scalar-output functions) and the Jacobian (matrix of partials, for vector-output functions). The Jacobian is the linear approximation of a function near a point: f(x + h) ≈ f(x) + J_f(x) h. The Hessian is the matrix of second derivatives and describes curvature; its eigenvalues classify critical points (min/max/saddle), and its condition number governs how fast first-order optimization (GD) converges. The most-used Jacobian in deep learning is for a linear layer y = Wx + b, whose Jacobian w.r.t. x is just W. Element-wise activations have diagonal Jacobians, which is why they're cheap. Every later derivation in Phase 4 — softmax, cross-entropy, chain rule — relies on this vocabulary.


Next: theory/02-chain-rule-and-backprop.md.