02 — Dashboard metrics: math behind each panel¶
Every dashboard metric has a short formula and a physical meaning. This page is the reference: derived, not approximated, with explicit computational cost.
This file is the formula sheet. Every panel's number is computed by one of the formulas below. Each formula has a cost (in operations per training step) and a precision target (Welford-stable, log-stable, etc.).
Welford's running statistics (the foundation)¶
The naive way to compute the mean and standard deviation of a stream is to keep \(\sum x\) and \(\sum x^2\) and combine at the end. This loses precision for large streams.
Welford's update maintains a numerically stable running \((\mu, \sigma^2)\):
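For each new sample \(x_n\):
\[
\delta = x_n - \mu_{n-1}, \qquad \mu_n = \mu_{n-1} + \frac{\delta}{n}, \qquad M_n = M_{n-1} + \delta\,(x_n - \mu_n), \qquad \sigma_n^2 = \frac{M_n}{n - 1}
\]
(divide by \(n\) instead of \(n - 1\) for the population variance).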
Storage: 3 scalars per stat (\(n, \mu, M\)). Update: a handful of FLOPs per sample (3 for the mean: subtract, divide, add; 3 more for \(M\)).
For dashboards: every per-layer activation magnitude, every gradient norm, every weight stat uses Welford. The hooks collect samples; Welford aggregates; at log time, the dashboard reads \((n, \mu, \sigma)\) and plots.
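The update above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the class name `Welford` and its attributes are hypothetical, and it assumes the sample variance (\(n - 1\) in the denominator).

```python
import math


class Welford:
    """Running mean/variance for a scalar stream (Welford's update)."""

    def __init__(self):
        self.n = 0        # samples seen so far
        self.mean = 0.0   # running mean, mu
        self.m2 = 0.0     # running sum of squared deviations, M

    def update(self, x):
        self.n += 1
        delta = x - self.mean               # deviation from the old mean
        self.mean += delta / self.n         # mu_n = mu_{n-1} + delta / n
        self.m2 += delta * (x - self.mean)  # second factor uses the new mean

    @property
    def variance(self):
        # Assumption: sample variance (n - 1). Use n for population variance.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    @property
    def std(self):
        return math.sqrt(self.variance)


# Large offset, small spread: the case where sum(x) and sum(x^2) lose digits.
stat = Welford()
for x in [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]:
    stat.update(x)
print(stat.n, stat.mean, stat.variance)  # 4 1000000010.0 30.0
```

The offset of \(10^9\) is there to show the point of the method: the spread (variance 30) survives intact because the update only ever works with deviations from the current mean.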
Panel 3: gradient norm¶
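Global gradient \(\ell_2\) norm across all parameters:
\[
\|g\|_2 = \sqrt{\sum_{p} \sum_{i} g_{p,i}^2}
\]
where \(g_p\) is the gradient of the loss with respect to parameter tensor \(p\) and \(i\) runs over its entries.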
Cost: linear in parameter count. For MiniGPT (~103k params), cheap (microseconds).
Stored as a per-step value (not Welford-aggregated) because we want to see individual spikes, not summaries. 2000 steps × 8 bytes = 16 KB of state per run. Trivial.
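As a rough sketch of the global norm, assuming the gradients are available as a list of NumPy arrays (the function name `global_grad_norm` and the toy gradients are illustrative, not taken from the lab code):

```python
import numpy as np


def global_grad_norm(grads):
    """l2 norm of all gradients treated as one flat vector."""
    # Sum squared entries per tensor, then take a single square root at the end.
    total_sq = sum(float(np.sum(np.square(g))) for g in grads)
    return float(np.sqrt(total_sq))


# Hypothetical gradients for two parameter tensors.
grads = [np.array([[3.0, 0.0], [0.0, 0.0]]), np.array([4.0])]

history = []  # one float per step, kept raw so individual spikes stay visible
history.append(global_grad_norm(grads))
print(history)  # [5.0]
```

Appending one float per step matches the storage estimate above: the history is a plain list, not a Welford accumulator.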
Panel 4: per-layer activation magnitudes¶
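For each layer \(\ell\) with output tensor \(A_\ell\) of shape \((B, L, d)\), capture:
\[
a_\ell = \frac{1}{B L d} \sum_{b=1}^{B} \sum_{l=1}^{L} \sum_{i=1}^{d} \left| A_\ell[b, l, i] \right|
\]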
This is the mean absolute activation. We use mean-absolute, not L2 norm, because it's easier to interpret across layers of different sizes.
Cost: one reduction over the tensor — \(O(B L d)\) FLOPs, < 0.1 ms per layer.
Welford over training steps. At plot time, draw the mean trajectory and (optionally) ±1σ band.
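The per-layer reduction is a one-liner in NumPy. The sketch below assumes the hook hands over the layer output as an array of shape \((B, L, d)\); the function name and the random test tensor are placeholders.

```python
import numpy as np


def mean_abs_activation(a):
    """Mean absolute value over all B * L * d entries of a layer output."""
    return float(np.mean(np.abs(a)))


rng = np.random.default_rng(0)
acts = rng.standard_normal((4, 16, 64))  # hypothetical (B, L, d) layer output
value = mean_abs_activation(acts)        # one scalar per layer per step
print(value)                             # close to sqrt(2 / pi) for unit-normal inputs
```

The resulting scalar is what gets fed to the running statistics, one sample per layer per training step.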
Panel 5: weight spectral norm via power iteration¶
The spectral norm \(\sigma_1(W)\) is the largest singular value of \(W\). Computing it via SVD is \(O(d^3)\), expensive for repeated calls. Power iteration approximates it:
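Initialize \(v_0\) to a random unit vector. For \(k\) iterations (\(i = 1, \dots, k\)):
\[
u_i = \frac{W v_{i-1}}{\|W v_{i-1}\|_2}, \qquad v_i = \frac{W^\top u_i}{\|W^\top u_i\|_2}
\]
After \(k\) iterations:
\[
\sigma_1(W) \approx u_k^\top W v_k
\]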
Convergence: exponential with rate \(\sigma_2 / \sigma_1\). For typical neural-network weight matrices, \(\sigma_2 / \sigma_1 \in [0.5, 0.9]\), so \(k = 10\) iterations give ~3 digits of accuracy. Good enough for a dashboard.
Cost: \(k \cdot 2 \cdot \text{numel}(W)\) FLOPs. For a (d, d) matrix with \(d = 64\), that's \(10 \cdot 2 \cdot 4096 \approx 82\,000\) FLOPs per call. Sub-millisecond.
Warm-starting: keep \(v\) from the previous step. The weights change slowly between steps, so the previous step's singular vector is a great initial guess for this step. Cuts \(k\) from 10 to ~3.
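A minimal NumPy sketch of power iteration with warm-starting follows. The function name `spectral_norm`, its signature, and the toy matrices are assumptions for illustration; the matrix is chosen with \(\sigma_2 / \sigma_1 = 0.5\) so convergence is fast.

```python
import numpy as np


def spectral_norm(w, v=None, k=10, rng=None):
    """Estimate sigma_1(w) by power iteration; returns (sigma, v) for warm starts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if v is None:
        rng = rng or np.random.default_rng(0)
        v = rng.standard_normal(w.shape[1])  # cold start: random direction
    v = v / np.linalg.norm(v)
    for _ in range(k):
        u = w @ v
        u /= np.linalg.norm(u)   # left singular vector estimate
        v = w.T @ u
        v /= np.linalg.norm(v)   # right singular vector estimate
    sigma = float(u @ (w @ v))
    return sigma, v


w = np.diag([3.0, 1.5, 0.5])                 # sigma_1 = 3, sigma_2 / sigma_1 = 0.5
sigma, v = spectral_norm(w, k=10)            # cold start, 10 iterations

w_next = w + 0.01 * np.ones((3, 3))          # weights after a small update
sigma_next, v = spectral_norm(w_next, v=v, k=3)   # warm start, 3 iterations
exact_next = np.linalg.svd(w_next, compute_uv=False)[0]
print(sigma, sigma_next, exact_next)
```

Returning `v` alongside the estimate is what makes warm-starting possible: the caller stores one vector per tracked matrix and passes it back in on the next call.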
Panel 6: dead-neuron detector¶
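For an FFN hidden activation \(H \in \mathbb{R}^{B \times L \times d_{ff}}\), define neuron \(j\) as dead when its activation never exceeds \(\epsilon\) in magnitude anywhere in the batch:
\[
\text{dead}_j = \mathbb{1}\left[ \max_{b, l} \left| H_{b, l, j} \right| < \epsilon \right]
\]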
Count \(\sum_j \text{dead}_j\).
Default \(\epsilon = 10^{-3}\) (in activation space). Tune per-corpus if the FFN's natural scale is very different.
Cost: a thresholded count per call, \(O(B L d_{ff})\). Compute every \(K\) steps (every 50 steps suffices), not every step.
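One way to realise the thresholded count in NumPy is sketched below. It assumes "dead" means the peak magnitude over the batch and sequence axes stays below \(\epsilon\); the function name and the synthetic activations are illustrative.

```python
import numpy as np


def count_dead_neurons(h, eps=1e-3):
    """Count hidden units whose |activation| stays below eps over the whole batch."""
    # Collapse batch and sequence axes: one peak magnitude per hidden unit.
    peak = np.max(np.abs(h), axis=(0, 1))
    dead = peak < eps
    return int(np.sum(dead)), dead


rng = np.random.default_rng(0)
h = np.maximum(rng.standard_normal((2, 8, 16)), 0.0)  # hypothetical ReLU output
h[0, 0, :] = 1.0           # make sure every unit fires at least once
h[:, :, [3, 7]] = 0.0      # then force units 3 and 7 to be dead
n_dead, mask = count_dead_neurons(h)
print(n_dead, np.flatnonzero(mask))  # 2 [3 7]
```

In a training loop this would be called only when `step % 50 == 0`, which is where the amortized cost in the budget comes from.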
Dead attention head¶
For a head's attention weights \(\alpha \in \mathbb{R}^{B \times L \times L}\) (after softmax), entropy per query \(q\):
\[
H_q = -\sum_{k=1}^{L} \alpha_{q,k} \log \alpha_{q,k}
\]
A query distribution concentrated on one key has \(H_q \approx 0\); uniform has \(H_q = \log L\).
Define the head as "dead" if \(\Pr_{b, l}\left( H_q < \log 2 \right) > 0.99\) — i.e., almost every query attends to effectively one key.
Cost: \(O(B L^2)\) per head. Compute every 50 steps.
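The entropy test can be sketched as follows, assuming the attention weights for one head arrive as a NumPy array of shape \((B, L, L)\) whose rows sum to 1. The function name, the `eps` guard against \(\log 0\), and the two synthetic heads are assumptions.

```python
import numpy as np


def head_is_dead(alpha, frac=0.99, eps=1e-12):
    """alpha: (B, L, L) post-softmax weights for one head; rows sum to 1."""
    # Entropy per query; eps guards log(0) for exactly-zero weights.
    entropy = -np.sum(alpha * np.log(alpha + eps), axis=-1)  # shape (B, L)
    collapsed = entropy < np.log(2.0)   # fewer than two effective keys
    return bool(np.mean(collapsed) > frac), entropy


L = 8
uniform = np.full((2, L, L), 1.0 / L)      # every query spreads evenly
one_hot = np.tile(np.eye(L), (2, 1, 1))    # every query picks exactly one key
print(head_is_dead(uniform)[0], head_is_dead(one_hot)[0])  # False True
```

The uniform head has entropy \(\log 8\) for every query and is kept; the one-hot head has entropy near zero everywhere and is flagged.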
Panel 1: per-token mean loss¶
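Standard cross-entropy reduction from Phase 18:
\[
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i)
\]
where \(N\) is the number of target tokens in the batch, \(y_i\) is the \(i\)-th target token, and \(x_i\) is its context.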
Train loss: per-step scalar. Val loss: per-val-step scalar (over the whole val set).
Cost: one reduction per batch (already computed for the optimizer step).
Train/val divergence step¶
The step after which the gap consistently widens — "overfitting onset". Computed post-hoc from the loss history; not a streaming metric.
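The text does not pin down what "consistently" means, so the sketch below picks one possible rule and labels it as an assumption: the divergence step is the first evaluation point after which the val-minus-train gap increases `patience` times in a row. The function name, the `patience` parameter, and the loss values are all hypothetical.

```python
def divergence_step(steps, train, val, patience=3):
    """First eval step after which the val-train gap rises `patience` times in a row."""
    gap = [v - t for t, v in zip(train, val)]
    for i in range(len(gap) - patience):
        # Assumed rule: every one of the next `patience` gaps is larger than the last.
        if all(gap[j + 1] > gap[j] for j in range(i, i + patience)):
            return steps[i]
    return None  # the gap never widened consistently


steps = [0, 100, 200, 300, 400, 500, 600]
train = [4.0, 3.0, 2.5, 2.0, 1.6, 1.3, 1.0]
val = [4.0, 3.1, 2.5, 2.1, 1.9, 1.8, 1.8]
onset = divergence_step(steps, train, val)
print(onset)  # 200
```

Because it scans the full history, this only makes sense after the run (or at checkpoints), in line with the post-hoc framing above. A smoothed gap or a different patience would move the reported step.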
Panel 7: per-class loss decomposition¶
The §A13-specific panel. The corpus partitions verbs into:
- \(R = \{\)work, play, walk, talk, listen, watch, study, finish, start, look, want, like\(\}\) (12 regulars)
- \(I = \{\)be, have, do, go, come, see, eat, write\(\}\) (8 irregulars)
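For each training batch, partition the examples by their verb's class. Compute:
\[
\mathcal{L}^{\text{reg}} = \frac{1}{|B_R|} \sum_{i \in B_R} \ell_i, \qquad \mathcal{L}^{\text{irr}} = \frac{1}{|B_I|} \sum_{i \in B_I} \ell_i
\]
where \(\ell_i\) is the loss of example \(i\) and \(B_R\) is the subset of the batch with regular-verb examples (similarly \(B_I\)).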
When the batch has zero examples of one class (it can happen for small batches), don't update that class's stat for this step.
Welford-aggregate across logging windows. At plot time, draw two lines.
Cost: per-batch, \(O(B)\) to partition and reduce twice. Trivial.
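A small sketch of the partition-and-reduce step, assuming each example comes with a single verb label and a per-example loss (the function name and the batch contents are hypothetical; the verb sets are the ones listed above). A class with no examples returns `None`, so the caller can skip that class's update for the step.

```python
REGULAR = {"work", "play", "walk", "talk", "listen", "watch",
           "study", "finish", "start", "look", "want", "like"}
IRREGULAR = {"be", "have", "do", "go", "come", "see", "eat", "write"}


def _mean(xs):
    return sum(xs) / len(xs) if xs else None


def per_class_loss(verbs, losses):
    """Mean loss per verb class for one batch; None for a class with no examples."""
    reg = [loss for verb, loss in zip(verbs, losses) if verb in REGULAR]
    irr = [loss for verb, loss in zip(verbs, losses) if verb in IRREGULAR]
    return _mean(reg), _mean(irr)


# Mixed batch: both classes present.
print(per_class_loss(["walk", "go", "play", "be"], [1.0, 3.0, 2.0, 4.0]))  # (1.5, 3.5)
# Batch with no irregulars: the second entry is None, so skip that update.
print(per_class_loss(["walk", "play"], [1.0, 2.0]))  # (1.5, None)
```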
The irregular-verb tax¶
Define:
\[
\tau = \mathcal{L}^{\text{irr}} - \mathcal{L}^{\text{reg}}
\]
A positive \(\tau\) means irregulars are harder. A near-zero \(\tau\) at convergence means the model has memorized the irregular forms; a non-zero \(\tau\) at convergence means there's a real difficulty gap (which is the expected outcome on a small model with limited capacity for irregulars).
Track \(\tau_t\) over the run. The expected curve: \(\tau\) starts near 0 (random model loses equally on both), grows to ~0.5-1.0 nats by step 500 (model learns the regular rule fast, irregulars lag), then narrows to ~0.2-0.5 nats by end.
Loss-spike detector¶
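For a window of \(W\) recent losses, the median and median-absolute-deviation:
\[
m_t = \operatorname{median}\left( \mathcal{L}_{t-W}, \dots, \mathcal{L}_{t-1} \right), \qquad \text{MAD}_t = \operatorname{median}_{t-W \le i < t} \left| \mathcal{L}_i - m_t \right|
\]
Spike flag at step \(t\), for a chosen threshold multiplier \(c\):
\[
\text{spike}_t = \mathbb{1}\left[ \mathcal{L}_t > m_t + c \cdot \text{MAD}_t \right]
\]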
We use median + MAD instead of mean + std because the early steps have wild variance that would inflate std and suppress real spikes later.
Suppress the detector during warmup (\(t < W\)). \(W = 50\) is a reasonable default.
Cost: \(O(W \log W)\) per step. For \(W = 50\), trivial.
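A streaming version of the detector might look like the sketch below, using only the standard library. The class name `SpikeDetector` and the threshold multiplier `c=5.0` are assumptions (the text does not fix a value), and the example uses a short window of 5 so the behaviour is easy to trace.

```python
from collections import deque
from statistics import median


class SpikeDetector:
    def __init__(self, window=50, c=5.0):
        self.window = window
        self.c = c  # hypothetical threshold multiplier, not specified in the text
        self.history = deque(maxlen=window)  # the W most recent losses

    def update(self, loss):
        """Return True if `loss` is a spike relative to the previous `window` losses."""
        spike = False
        if len(self.history) == self.window:  # suppressed during warmup
            m = median(self.history)
            mad = median(abs(x - m) for x in self.history)
            spike = loss > m + self.c * mad
        self.history.append(loss)
        return spike


det = SpikeDetector(window=5, c=5.0)
losses = [2.0, 1.9, 2.1, 2.0, 1.8, 1.9, 6.0, 1.9]
flags = [det.update(x) for x in losses]
print(flags)  # [False, False, False, False, False, False, True, False]
```

Note that the spike itself enters the window afterwards, yet the next normal loss is not flagged: a single outlier barely moves the median or the MAD, which is the reason for preferring them over mean and standard deviation.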
Layer-wise grad-to-weight ratio (diagnostic, not always plotted)¶
For each layer's weight matrix \(W_\ell\) and its gradient \(G_\ell\):
\[
\rho_\ell = \frac{\|G_\ell\|_F}{\|W_\ell\|_F}
\]
Healthy: \(\rho_\ell \in [10^{-4}, 10^{-1}]\). Below \(10^{-4}\) means the layer is barely training (vanishing grad). Above \(10^{-1}\) means the update size is dangerously large relative to the parameter scale (instability).
Stored as a per-step-per-layer scalar; plotted on demand rather than by default alongside Panels 4-5.
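As an illustration, the ratio and the health bands above can be computed like this. The function names, the `tiny` guard against division by zero, and the toy matrices are assumptions; the thresholds are the ones stated in the text.

```python
import numpy as np


def grad_to_weight_ratio(w, g, tiny=1e-12):
    """||G||_F / ||W||_F for one layer; `tiny` avoids division by zero."""
    return float(np.linalg.norm(g) / (np.linalg.norm(w) + tiny))


def classify(rho, low=1e-4, high=1e-1):
    """Map the ratio onto the three bands described in the text."""
    if rho < low:
        return "vanishing"
    if rho > high:
        return "unstable"
    return "healthy"


w = np.full((4, 4), 0.5)     # hypothetical weights, Frobenius norm 2.0
g = np.full((4, 4), 0.005)   # hypothetical gradient, Frobenius norm 0.02
rho = grad_to_weight_ratio(w, g)
print(round(rho, 6), classify(rho))  # 0.01 healthy
```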
Putting it all together: cost budget¶
Per training step, the diagnostic overhead is:
| Metric | Cost per step |
|---|---|
| Welford updates (all stats) | ~50 μs |
| Global gradient norm | ~100 μs |
| Per-layer activation mean | ~200 μs |
| Power iteration (spectral norm × 4 matrices, warm-started) | ~80 μs |
| Dead neuron detection (every 50 steps, amortized) | ~20 μs amortized |
| Per-class loss decomposition | ~10 μs |
| Total per step | ~460 μs |
A Phase-18 training step on the i5-8250U is ~30 ms (mostly Python/NumPy overhead). The diagnostic overhead is 460/30000 ≈ 1.5%. Well under the 30% budget.
If the actual lab measurement shows >30%, the implementation has a hot-loop bug — most commonly, calling Welford in a Python loop over batch elements instead of doing a vectorized reduction.
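One way to avoid the per-element loop is to reduce the whole batch with NumPy first and then merge the batch statistics into the running ones. The sketch below uses the standard pairwise merge formula for mean and sum of squared deviations (often attributed to Chan et al.), which is a swapped-in technique rather than something this page specifies; the function name `welford_merge` is hypothetical.

```python
import numpy as np


def welford_merge(n, mean, m2, batch):
    """Fold a whole batch into running (n, mean, M) with one vectorized reduction."""
    batch = np.asarray(batch, dtype=np.float64).ravel()
    nb = batch.size
    if nb == 0:
        return n, mean, m2
    mean_b = batch.mean()
    m2_b = np.sum((batch - mean_b) ** 2)   # batch's own sum of squared deviations
    delta = mean_b - mean
    total = n + nb
    new_mean = mean + delta * nb / total
    new_m2 = m2 + m2_b + delta ** 2 * n * nb / total
    return total, float(new_mean), float(new_m2)


state = (0, 0.0, 0.0)
state = welford_merge(*state, [1.0, 2.0, 3.0])
state = welford_merge(*state, [4.0, 5.0, 6.0, 7.0])
print(state)  # (7, 4.0, 28.0)
```

The result matches what seven scalar updates would give (mean 4, \(M = 28\)), but the Python interpreter is entered once per batch instead of once per element.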
Drill problems¶
- The mean activation at `block_1_out` is 12.0 at step 0, growing to 18.0 by step 100. Healthy?
- The spectral norm of the attention QKV matrix at layer 2 grows from 1.2 at init to 14.0 by step 500. Concern level?
- The `loss_regular` at step 500 is 1.8; `loss_irregular` is 3.4. What's \(\tau\)? Is it a worrying value, an expected value, or a target value?
- Welford's mean update requires three FLOPs. Show your work for the formula \(\mu_n = \mu_{n-1} + (x_n - \mu_{n-1}) / n\). What numerical issue does this prevent compared with the naive \(\mu_n = (\sum_{i=1}^n x_i) / n\)?
One-paragraph recap¶
Every dashboard metric is computed by a simple formula with explicit cost. Welford's algorithm handles all running mean/std accumulation. Spectral norm uses warm-started power iteration with 3-10 steps, sub-millisecond. Dead-neuron detection is a thresholded reduction every 50 steps. The §A13-specific per-class loss decomposes the batch into regular and irregular subsets and tracks the irregular-verb tax \(\tau = \mathcal{L}^{\text{irr}} - \mathcal{L}^{\text{reg}}\) over training. Total diagnostic overhead is ~460 μs/step, ~1.5% of the training step on this hardware — comfortably inside the 30% budget.
What this section does NOT cover¶
- The Welford derivation in full (covered in Phase 2 or 6 — Knuth's TAOCP).
- Numerical conditioning of power iteration when \(\sigma_1 \approx \sigma_2\) (rare for NN weights).
- Bayesian alternatives to per-class loss decomposition (out of scope).
Next: theory/03-three-failure-modes.md.