

Phase 8 — Quizzes

Readable mirror of data/quizzes/phase-08-tensor-autograd.yaml. It focuses on the phase's two new problems: gradients under broadcasting, and derivatives of matmul and batched softmax cross-entropy.

Source of truth: data/quizzes/phase-08-tensor-autograd.yaml.


q-08-01 — Broadcasting in backward

a + b with a.shape = (3,), b.shape = (4, 3). Upstream gradient is (4, 3). How does a receive a (3,)-shaped gradient?

  1. Reshape (4, 3) to (3,) by dropping the first axis.
  2. Sum the upstream gradient along the broadcast (replicated) axes (here axis 0) → (3,).
  3. Multiply by a one-hot of the broadcast axis.
  4. NumPy handles it automatically.
Answer **Choice 2.** Forward broadcasting virtually replicates `a` along missing or length-1 axes, so each entry of `a` contributes to every replicated copy. Its gradient is therefore the sum of the upstream gradient over those axes, here `grad.sum(axis=0)`. NumPy broadcasts in the forward pass but does **not** perform this reduction for you in backward.
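The reduction described in the answer above can be sketched as a small helper. This is a minimal illustration, not the course's implementation: the name `unbroadcast` is hypothetical, and it assumes the upstream gradient has the broadcast output shape.

```python
import numpy as np

def unbroadcast(grad, shape):
    """Reduce `grad` to `shape` by summing over axes that broadcasting added or stretched."""
    # Sum away leading axes that broadcasting prepended.
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # Sum (keeping dims) over axes that had length 1 in the original operand.
    for axis, size in enumerate(shape):
        if size == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

a = np.arange(3.0)           # shape (3,)
b = np.ones((4, 3))          # shape (4, 3)
upstream = np.ones((4, 3))   # gradient flowing into a + b

grad_a = unbroadcast(upstream, a.shape)  # shape (3,): each entry sums 4 replicated copies
grad_b = unbroadcast(upstream, b.shape)  # shape (4, 3): nothing to reduce
```

The same helper also covers the length-1 case, for example reducing a `(4, 3)` gradient to an operand of shape `(4, 1)`.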

q-08-02 — Matmul gradients (multi-choice)

C = A @ B, A: (M, K), B: (K, N), upstream dC: (M, N).

  1. dA = dC @ B.T
  2. dB = A.T @ dC
  3. dA = B @ dC.T
  4. dB = dC @ A
  5. dA = dC * B.T (elementwise)
Answer **Choices 1 and 2.** The matmul gradient is itself a matmul: `dA = dC @ B.T`, `dB = A.T @ dC`. Shapes confirm: `(M, N) @ (N, K) → (M, K)` matches `A`, and `(K, M) @ (M, N) → (K, N)` matches `B`.
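A quick numerical check of these two formulas, as a sketch. The sizes, the random seed, and the scalar `loss` (chosen so that its gradient with respect to `C` equals `dC`) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 2, 3, 4
A = rng.standard_normal((M, K))
B = rng.standard_normal((K, N))
dC = rng.standard_normal((M, N))   # upstream gradient

dA = dC @ B.T   # (M, N) @ (N, K) -> (M, K)
dB = A.T @ dC   # (K, M) @ (M, N) -> (K, N)

def loss(A_, B_):
    # Scalar whose gradient w.r.t. C = A_ @ B_ is exactly dC.
    return np.sum(dC * (A_ @ B_))

# Central-difference estimate of dL/dA, one entry at a time.
eps = 1e-6
num_dA = np.zeros_like(A)
for i in range(M):
    for k in range(K):
        E = np.zeros_like(A)
        E[i, k] = eps
        num_dA[i, k] = (loss(A + E, B) - loss(A - E, B)) / (2 * eps)

max_err = np.abs(dA - num_dA).max()
```

Here `max_err` should be tiny, because this loss is linear in `A` and central differences are exact for it up to rounding.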

q-08-03 — Gradient of s = x.sum() (free)

x.shape = (3, 5), ds = 1. What is dx?

Answer `dx` has shape `(3, 5)` and every entry is `1` (i.e., `np.ones_like(x) * ds`). Reason: `∂s/∂x_ij = 1` for every `(i, j)`.
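As a small sketch of the answer above, the backward of `sum` copies the upstream scalar to every position. The finite-difference probe on entry `(1, 2)` is only an illustrative check.

```python
import numpy as np

x = np.arange(15.0).reshape(3, 5)
ds = 1.0                     # upstream gradient of the scalar s = x.sum()
dx = np.ones_like(x) * ds    # every entry receives ds

# Central-difference check on a single entry.
eps = 1e-6
bump = np.zeros_like(x)
bump[1, 2] = eps
numeric = ((x + bump).sum() - (x - bump).sum()) / (2 * eps)
```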

q-08-04 — Why central differences in gradcheck?

  1. Cheaper computationally.
  2. Truncation error O(ε²) (even-order Taylor terms cancel) vs O(ε) for forward differences.
  3. Forward differences need boundary derivative knowledge.
  4. Central is the only differentiable choice.
Answer **Choice 2.** Central differences cancel the leading `O(ε)` error term, leaving a truncation error of `O(ε²)`. At ε ≈ 1e-4 that gives roughly 8 correct digits, versus roughly 4 for forward differences.
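The difference in accuracy is easy to observe on a function with a known derivative. This sketch uses `sin` at `x = 1.0`, which is an arbitrary choice made for illustration.

```python
import numpy as np

f, df = np.sin, np.cos
x, eps = 1.0, 1e-4

forward = (f(x + eps) - f(x)) / eps              # truncation error O(eps)
central = (f(x + eps) - f(x - eps)) / (2 * eps)  # truncation error O(eps**2)

err_forward = abs(forward - df(x))
err_central = abs(central - df(x))
```

With these values `err_forward` should be on the order of 1e-5 and `err_central` on the order of 1e-9, consistent with the digit counts quoted in the answer.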

q-08-05 — Batched softmax CE gradient (free)

z: (B, K), y: (B,). State dL/dz.

Answer `dL/dz = (p - one_hot(y, K)) / B` where `p = softmax(z)`. This is the per-example formula `p - one_hot(y)` applied row by row, with the `/B` coming from averaging the loss over the batch.
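A minimal sketch of this formula in NumPy, assuming the loss is the mean cross-entropy over the batch and `y` holds integer class indices. The function name `softmax_ce` and the sample data are hypothetical.

```python
import numpy as np

def softmax_ce(z, y):
    """Mean cross-entropy over the batch and its gradient w.r.t. the logits z."""
    B, K = z.shape
    shifted = z - z.max(axis=1, keepdims=True)   # stabilise exp
    log_p = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_p[np.arange(B), y].mean()
    dz = np.exp(log_p)                 # p = softmax(z)
    dz[np.arange(B), y] -= 1.0         # p - one_hot(y, K)
    dz /= B                            # from the mean over the batch
    return loss, dz

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 3))
y = np.array([0, 2, 1, 2])
loss, dz = softmax_ce(z, y)
```

Each row of `dz` sums to zero, since both `p` and the one-hot vector sum to one. That makes a cheap sanity check alongside a full gradcheck.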