02 — Dashboard metrics: math behind each panel

Every dashboard metric has a short formula and a physical meaning. This page is the reference: derived, not approximated, with explicit computational cost.


This file is the formula sheet. Every panel's number is computed by one of the formulas below. Each formula has a cost (in operations per training step) and a precision target (Welford-stable, log-stable, etc.).

Welford's running statistics (the foundation)

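The naive way to compute the mean and standard deviation of a stream is to keep \(\sum x\) and \(\sum x^2\) and combine at the end. This loses precision for long streams, and badly so when the mean is large relative to the spread: the final subtraction \(\sum x^2 / n - (\sum x / n)^2\) cancels most of the significant digits.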

Welford's update maintains a numerically stable running \((\mu, \sigma^2)\):

For each new sample \(x_n\):

\[ \mu_n = \mu_{n-1} + \frac{x_n - \mu_{n-1}}{n} \]
\[ M_n = M_{n-1} + (x_n - \mu_{n-1})(x_n - \mu_n) \]
\[ \sigma_n^2 = M_n / n \]

Storage: 3 scalars per stat (\(n, \mu, M\)). Update: about 6 FLOPs per sample (a subtract, divide, and add for \(\mu\); a subtract, multiply, and add for \(M\)).
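A minimal sketch of the update above in plain Python. The class name `Welford` and its attribute names are illustrative, not taken from the lab code. The sample values are chosen with a large mean and a tiny spread so that the naive sum-of-squares formula can be compared against it.

```python
import math


class Welford:
    """Running mean and population variance (M / n) via Welford's update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean               # x_n - mu_{n-1}
        self.mean += delta / self.n         # mu_n
        self.M += delta * (x - self.mean)   # (x_n - mu_{n-1}) * (x_n - mu_n)

    @property
    def var(self):
        return self.M / self.n if self.n else 0.0

    @property
    def std(self):
        return math.sqrt(self.var)


# Large mean, small spread: the hard case for the naive formula.
samples = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]

w = Welford()
for x in samples:
    w.update(x)

# Naive version for comparison: E[x^2] - E[x]^2 in float64.
n = len(samples)
naive_var = sum(x * x for x in samples) / n - (sum(samples) / n) ** 2
```

The true variance of these samples is 22.5. Welford recovers it, while the naive result is dominated by rounding in the \(10^{18}\)-sized squares.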

For dashboards: per-layer activation magnitudes, per-class losses, and weight stats are aggregated with Welford. The one exception is the global gradient norm (Panel 3), which is stored per step. The hooks collect samples; Welford aggregates; at log time, the dashboard reads \((n, \mu, \sigma)\) and plots.

Panel 3: gradient norm

Global gradient \(\ell_2\) norm across all parameters:

\[ \|g\|_2 = \sqrt{\sum_{\theta} \|\nabla_\theta \mathcal{L}\|_F^2} \]

Cost: linear in parameter count. For MiniGPT (~103k params), cheap (microseconds).

Stored as a per-step value (not Welford-aggregated) because we want to see individual spikes, not summaries. 2000 steps × 8 bytes = 16 KB of state per run. Trivial.
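As a sketch, the global norm can be computed by summing squared entries over every gradient tensor and taking one square root at the end. The function name `global_grad_norm` and the assumption that gradients arrive as a list of NumPy arrays are illustrative.

```python
import math

import numpy as np


def global_grad_norm(grads):
    """L2 norm over all parameter gradients, treated as one flat vector."""
    total = 0.0
    for g in grads:
        total += float(np.sum(np.square(g)))  # squared Frobenius norm of one tensor
    return math.sqrt(total)


# Two toy "parameters": a 2x2 matrix and a bias vector.
grads = [np.array([[3.0, 0.0], [0.0, 0.0]]), np.array([4.0])]
norm = global_grad_norm(grads)  # sqrt(3^2 + 4^2)
```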

Panel 4: per-layer activation magnitudes

For each layer \(\ell\) with output tensor \(A_\ell\) of shape \((B, L, d)\), capture:

\[ \bar A_\ell = \frac{1}{B \cdot L \cdot d} \sum |A_\ell| \]

This is the mean absolute activation. We use mean-absolute, not L2 norm, because it's easier to interpret across layers of different sizes.

Cost: one reduction over the tensor — \(O(B L d)\) FLOPs, < 0.1 ms per layer.

Aggregate with Welford over training steps. At plot time, draw the mean trajectory and, optionally, a ±1σ band.
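A small sketch of why mean-absolute is easier to compare across layers than an L2 norm: two tensors with the same per-element scale but different widths have the same mean-absolute value, while their L2 norms differ by the square root of the size ratio. The function name and the toy shapes are assumptions for illustration.

```python
import numpy as np


def mean_abs_activation(a):
    """Mean absolute value over all (B, L, d) entries of one layer output."""
    return float(np.mean(np.abs(a)))


# Same per-element scale (0.5), different hidden widths d = 4 and d = 16.
a_narrow = np.full((2, 3, 4), 0.5)
a_wide = np.full((2, 3, 16), -0.5)

mag_narrow = mean_abs_activation(a_narrow)
mag_wide = mean_abs_activation(a_wide)

# The L2 norm grows with sqrt(number of elements), so it is not comparable.
l2_narrow = float(np.linalg.norm(a_narrow))
l2_wide = float(np.linalg.norm(a_wide))
```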

Panel 5: weight spectral norm via power iteration

The spectral norm \(\sigma_1(W)\) is the largest singular value of \(W\). Computing it via SVD is \(O(d^3)\), expensive for repeated calls. Power iteration approximates it:

Initialize \(v_0\) to a random unit vector. For \(k\) iterations:

\[ u_k = W v_{k-1} / \|W v_{k-1}\| \]
\[ v_k = W^T u_k / \|W^T u_k\| \]

After \(k\) iterations:

\[ \sigma_1(W) \approx \|W v_k\| \]

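Convergence: geometric. Each iteration (one product with \(W\), one with \(W^T\)) shrinks the error in the direction of \(v\) by a factor of \((\sigma_2 / \sigma_1)^2\). For typical neural-network weight matrices, \(\sigma_2 / \sigma_1 \in [0.5, 0.9]\), so \(k = 10\) iterations give ~3 digits of accuracy over most of that range, and fewer when the ratio is close to 0.9. Good enough for a dashboard.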

Cost: \(k \cdot 2 \cdot \text{numel}(W)\) FLOPs, counting one multiply-add as one FLOP. For a (d, d) matrix with \(d = 64\), that's \(10 \cdot 2 \cdot 4096 \approx 82\,000\) FLOPs per call. Sub-millisecond.

Warm-starting: keep \(v\) from the previous step. The weights change slowly between steps, so the previous step's estimate of the top right singular vector is a great initial guess for this step. Cuts \(k\) from 10 to ~3.
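The iteration and the warm start can be sketched as follows. The function name `spectral_norm`, its signature, and the synthetic test matrix are assumptions. The test matrix is built with a known spectrum (\(\sigma_1 = 2.0\), \(\sigma_2 / \sigma_1 = 0.7\)) so the estimate can be checked.

```python
import numpy as np


def spectral_norm(W, v=None, k=10, rng=None):
    """Estimate sigma_1(W) by power iteration.

    Returns (sigma, v); pass the returned v back in on the next training step
    to warm-start the iteration.
    """
    if v is None:
        if rng is None:
            rng = np.random.default_rng(0)
        v = rng.standard_normal(W.shape[1])
        v = v / np.linalg.norm(v)
    for _ in range(k):
        u = W @ v
        u = u / np.linalg.norm(u)
        v = W.T @ u
        v = v / np.linalg.norm(v)
    return float(np.linalg.norm(W @ v)), v


# Synthetic 64x64 matrix with known singular values.
rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.standard_normal((64, 64)))
Q2, _ = np.linalg.qr(rng.standard_normal((64, 64)))
s = np.concatenate([[2.0], np.linspace(1.4, 0.1, 63)])
W = (Q1 * s) @ Q2.T  # same as Q1 @ diag(s) @ Q2.T

# Cold start: random v, k = 10.
sigma_cold, v = spectral_norm(W, k=10, rng=rng)

# "Next training step": weights move a little, reuse v with k = 3.
W_next = W + 1e-3 * rng.standard_normal((64, 64))
sigma_warm, v = spectral_norm(W_next, v=v, k=3)
sigma_exact = float(np.linalg.svd(W_next, compute_uv=False)[0])
```

Returning `v` alongside the estimate keeps the function stateless; the caller decides where to store one vector per tracked matrix.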

Panel 6: dead-neuron detector

For an FFN hidden activation \(H \in \mathbb{R}^{B \times L \times d_{ff}}\), define:

\[ \text{dead}_j = \mathbb{1}\left[ \Pr_{b, l}\left( |H_{b,l,j}| < \epsilon \right) > 0.99 \right] \]

Count \(\sum_j \text{dead}_j\).

Default \(\epsilon = 10^{-3}\) (in activation space). Tune per-corpus if the FFN's natural scale is very different.

Cost: a thresholded count per call, \(O(B L d_{ff})\). Compute every \(K\) steps (every 50 steps suffices), not every step.
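A sketch of the thresholded count, assuming the hidden activation is available as a NumPy array of shape \((B, L, d_{ff})\). The function name and the keyword `frac` (for the 0.99 cutoff) are illustrative.

```python
import numpy as np


def count_dead_neurons(H, eps=1e-3, frac=0.99):
    """Count hidden units with |H| < eps on more than `frac` of (b, l) positions."""
    near_zero = np.abs(H) < eps                 # (B, L, d_ff) booleans
    p_near_zero = near_zero.mean(axis=(0, 1))   # per-unit fraction, shape (d_ff,)
    return int(np.sum(p_near_zero > frac))


# Toy activation: 16 hidden units, the first 3 are always exactly zero.
H = np.ones((4, 8, 16))
H[..., :3] = 0.0
n_dead = count_dead_neurons(H)
```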

Dead attention head

For a head's attention weights \(\alpha \in \mathbb{R}^{B \times L \times L}\) (after softmax), entropy per query:

\[ H_q(\alpha_{b, l}) = -\sum_{l'} \alpha_{b, l, l'} \log \alpha_{b, l, l'} \]

A query distribution concentrated on one key has \(H \approx 0\); uniform has \(H = \log L\).

Define the head as "dead" if \(\Pr_{b, l}\left( H_q < \log 2 \right) > 0.99\) — i.e., almost every query attends to effectively one key.

Cost: \(O(B L^2)\) per head. Compute every 50 steps.
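The entropy test can be sketched as below, assuming one head's post-softmax weights as a NumPy array of shape \((B, L, L)\). The function name `head_is_dead` is hypothetical. Exact zeros in \(\alpha\) are handled by treating \(0 \log 0\) as 0.

```python
import numpy as np


def head_is_dead(alpha, frac=0.99):
    """True if more than `frac` of queries have attention entropy below log 2."""
    # Replace zeros by 1 inside the log so that 0 * log(0) contributes 0, not NaN.
    safe = np.where(alpha > 0, alpha, 1.0)
    entropy = -np.sum(alpha * np.log(safe), axis=-1)   # (B, L), in nats
    return bool(np.mean(entropy < np.log(2)) > frac)


# Every query attends to exactly one key: entropy 0 everywhere.
one_hot = np.tile(np.eye(8), (2, 1, 1))
# Every query attends uniformly to 8 keys: entropy log 8.
uniform = np.full((2, 8, 8), 1.0 / 8.0)

dead_one_hot = head_is_dead(one_hot)
dead_uniform = head_is_dead(uniform)
```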

Panel 1: per-token mean loss

Standard cross-entropy reduction from Phase 18:

\[ \bar{\mathcal{L}} = \frac{1}{\sum_{b, l} M_{b, l}} \sum_{b, l} M_{b, l} \cdot \mathcal{L}_{b, l} \]

Train loss: per-step scalar. Val loss: per-val-step scalar (over the whole val set).

Cost: one reduction per batch (already computed for the optimizer step).
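A sketch of the masked reduction, under the assumption that \(M_{b,l}\) is a 0/1 mask marking which token positions count toward the loss (for example, non-padding positions). The function name is illustrative.

```python
import numpy as np


def masked_mean_loss(per_token_loss, mask):
    """Mean of per-token losses over positions where mask == 1; both are (B, L)."""
    mask = mask.astype(per_token_loss.dtype)
    return float(np.sum(mask * per_token_loss) / np.sum(mask))


# Losses at masked-out positions (100.0 here) must not affect the mean.
loss = np.array([[1.0, 2.0, 100.0],
                 [3.0, 100.0, 100.0]])
mask = np.array([[1, 1, 0],
                 [1, 0, 0]])
mean_loss = masked_mean_loss(loss, mask)  # (1 + 2 + 3) / 3
```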

Train/val divergence step

\[ t^* = \arg\min_t \left( \mathcal{L}^{\text{val}}(t) - \mathcal{L}^{\text{train}}(t) \right) \]

The step after which the gap consistently widens — "overfitting onset". Computed post-hoc from the loss history; not a streaming metric.
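A post-hoc sketch of the formula above. It assumes the train loss has been sampled (or averaged) at the same steps where the val loss was evaluated, so the two histories line up; the function name and the toy numbers are made up for illustration.

```python
import numpy as np


def divergence_step(steps, train_loss, val_loss):
    """Step at which (val - train) is smallest, per the arg-min definition."""
    gap = np.asarray(val_loss) - np.asarray(train_loss)
    return int(np.asarray(steps)[np.argmin(gap)])


# Toy history: the gap bottoms out at step 100 and widens afterwards.
steps = [0, 100, 200, 300, 400]
train = [4.00, 3.00, 2.00, 1.50, 1.00]
val = [4.10, 3.05, 2.20, 2.00, 2.10]
t_star = divergence_step(steps, train, val)
```

On noisy curves it may help to smooth both histories before taking the arg-min; that choice is left to the caller here.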

Panel 7: per-class loss decomposition

The §A13-specific panel. The corpus partitions verbs into:

  • \(R = \{\) work, play, walk, talk, listen, watch, study, finish, start, look, want, like \(\}\) (12 regulars)
  • \(I = \{\) be, have, do, go, come, see, eat, write \(\}\) (8 irregulars)

For each training batch, partition the examples by their verb's class. Compute:

\[ \bar{\mathcal{L}}^{\text{reg}}_t = \frac{1}{|B_R|} \sum_{i \in B_R} \mathcal{L}_i, \quad \bar{\mathcal{L}}^{\text{irr}}_t = \frac{1}{|B_I|} \sum_{i \in B_I} \mathcal{L}_i \]

where \(B_R\) is the subset of the batch with regular-verb examples (similarly \(B_I\)).

When the batch has zero examples of one class (it can happen for small batches), don't update that class's stat for this step.

Welford-aggregate across logging windows. At plot time, draw two lines.

Cost: per-batch, \(O(B)\) to partition and reduce twice. Trivial.

The irregular-verb tax

Define:

\[ \tau_t = \bar{\mathcal{L}}^{\text{irr}}_t - \bar{\mathcal{L}}^{\text{reg}}_t \]

A positive \(\tau\) means irregulars are harder. A near-zero \(\tau\) at convergence means the model has memorized; a non-zero \(\tau\) at convergence means there's a real difficulty gap (which is the correct outcome on a small model with limited capacity for irregulars).

Track \(\tau_t\) over the run. The expected curve: \(\tau\) starts near 0 (random model loses equally on both), grows to ~0.5-1.0 nats by step 500 (model learns the regular rule fast, irregulars lag), then narrows to ~0.2-0.5 nats by end.
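The decomposition and the tax can be sketched together. The sketch assumes each example in the batch comes tagged with its verb lemma and a scalar loss; that tagging, and the function name `per_class_loss`, are assumptions. A class with no examples in the batch yields `None`, so the caller can skip the update for that step.

```python
REGULAR = {"work", "play", "walk", "talk", "listen", "watch",
           "study", "finish", "start", "look", "want", "like"}
IRREGULAR = {"be", "have", "do", "go", "come", "see", "eat", "write"}


def per_class_loss(verbs, losses):
    """Return (mean_reg, mean_irr, tau); entries are None when a class is absent."""
    reg = [loss for v, loss in zip(verbs, losses) if v in REGULAR]
    irr = [loss for v, loss in zip(verbs, losses) if v in IRREGULAR]
    mean_reg = sum(reg) / len(reg) if reg else None
    mean_irr = sum(irr) / len(irr) if irr else None
    tau = mean_irr - mean_reg if reg and irr else None
    return mean_reg, mean_irr, tau


# Mixed batch: two regulars, two irregulars.
mixed = per_class_loss(["work", "go", "play", "be"], [1.0, 3.0, 2.0, 4.0])

# Small batch with no irregulars: skip the irregular stat and tau this step.
regular_only = per_class_loss(["work", "play"], [1.0, 2.0])
```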

Loss-spike detector

For a window of \(W\) recent losses, the median and median-absolute-deviation:

\[ \tilde{\mathcal{L}}_t = \text{median}(\mathcal{L}_{t-W:t-1}) \]
\[ \text{MAD}_t = \text{median}(|\mathcal{L}_{t-W:t-1} - \tilde{\mathcal{L}}_t|) \]

Spike flag at step \(t\):

\[ |\mathcal{L}_t - \tilde{\mathcal{L}}_t| > 5 \cdot \text{MAD}_t \]

We use median + MAD instead of mean + std because the early steps have wild variance that would inflate std and suppress real spikes later.

Suppress the detector during warmup (t < W). \(W = 50\) is a reasonable default.

Cost: \(O(W \log W)\) per step. For \(W = 50\), trivial.
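A sketch of the detector using only the standard library. The function name `is_spike` and the parameter `k` (for the factor 5) are illustrative; `history` is assumed to hold the losses of all earlier steps, oldest first.

```python
from statistics import median


def is_spike(history, loss, W=50, k=5.0):
    """Flag `loss` if it deviates from the window median by more than k * MAD."""
    if len(history) < W:          # warmup: not enough history yet
        return False
    window = history[-W:]
    med = median(window)
    mad = median(abs(x - med) for x in window)
    return abs(loss - med) > k * mad


# Steady losses between 2.00 and 2.04: median 2.02, MAD about 0.01.
history = [2.0 + 0.01 * (i % 5) for i in range(50)]

quiet = is_spike(history, 2.03)        # within 5 * MAD of the median
spike = is_spike(history, 3.0)         # far outside
warmup = is_spike(history[:10], 3.0)   # suppressed during warmup
```

Note that a perfectly flat window gives MAD = 0, in which case any change at all is flagged; a small floor on the MAD is one way to handle that if it matters in practice.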

Layer-wise grad-to-weight ratio (diagnostic, not always plotted)

For each layer's weight matrix \(W_\ell\) and its gradient \(G_\ell\):

\[ \rho_\ell = \frac{\|G_\ell\|_F}{\|W_\ell\|_F + 10^{-12}} \]

Healthy: \(\rho_\ell \in [10^{-4}, 10^{-1}]\). Below \(10^{-4}\) means the layer is barely training (vanishing grad). Above \(10^{-1}\) means the gradient is dangerously large relative to the parameter scale (instability).

Stored as a per-step, per-layer scalar; plotted occasionally, not shown by default alongside Panels 4 and 5.
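A sketch of the ratio and the healthy-range check, assuming weights and gradients are NumPy arrays. The function names and the string labels are illustrative; the thresholds are the ones stated above.

```python
import numpy as np


def grad_to_weight_ratio(W, G, eps=1e-12):
    """||G||_F / (||W||_F + eps); np.linalg.norm defaults to Frobenius for matrices."""
    return float(np.linalg.norm(G) / (np.linalg.norm(W) + eps))


def classify_ratio(rho, low=1e-4, high=1e-1):
    if rho < low:
        return "vanishing"
    if rho > high:
        return "unstable"
    return "healthy"


W = np.ones((4, 4))            # ||W||_F = 4
G = 0.01 * np.ones((4, 4))     # ||G||_F = 0.04
rho = grad_to_weight_ratio(W, G)
status = classify_ratio(rho)
```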

Putting it all together: cost budget

Per training step, the diagnostic overhead is:

| Metric | Cost per step |
|---|---|
| Welford updates (all stats) | ~50 μs |
| Global gradient norm | ~100 μs |
| Per-layer activation mean | ~200 μs |
| Power iteration (spectral norm × 4 matrices, warm-started) | ~80 μs |
| Dead neuron detection (every 50 steps, amortized) | ~20 μs amortized |
| Per-class loss decomposition | ~10 μs |
| Total per step | ~460 μs |

A Phase-18 training step on the i5-8250U is ~30 ms (mostly Python/NumPy overhead). The diagnostic overhead is 460/30000 ≈ 1.5%. Well under the 30% budget.

If the actual lab measurement shows >30%, the implementation has a hot-loop bug — most commonly, calling Welford in a Python loop over batch elements instead of doing a vectorized reduction.
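One way to avoid the per-element Python loop is to reduce the batch with NumPy first and then merge the batch summary into the running state. The sketch below uses the pairwise merge formula of Chan et al., a batched variant rather than the per-sample update given earlier; the function name and the `(n, mean, M)` tuple layout are assumptions.

```python
import numpy as np


def welford_merge_batch(state, x):
    """Merge all values in `x` into a running (n, mean, M) state in one step."""
    n_a, mean_a, M_a = state
    x = np.asarray(x, dtype=np.float64).ravel()
    n_b = x.size
    if n_b == 0:
        return state
    mean_b = x.mean()                       # vectorized batch mean
    M_b = np.sum((x - mean_b) ** 2)         # vectorized batch sum of squared deviations
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    M = M_a + M_b + delta * delta * n_a * n_b / n
    return n, float(mean), float(M)


# Feed 1000 samples in 10 batches; compare against a one-shot computation.
rng = np.random.default_rng(0)
data = rng.standard_normal(1000) + 5.0

state = (0, 0.0, 0.0)
for batch in np.split(data, 10):
    state = welford_merge_batch(state, batch)

n, mean, M = state
var = M / n
```

This costs one Python-level call per batch instead of one per element, which is the difference the paragraph above is pointing at.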

Drill problems

  1. The mean activation at block_1_out is 12.0 at step 0, growing to 18.0 by step 100. Healthy?
  2. The spectral norm of the attention QKV matrix at layer 2 grows from 1.2 at init to 14.0 by step 500. Concern level?
  3. The loss_regular at step 500 is 1.8; loss_irregular is 3.4. What's \(\tau\)? Is it a worrying value, an expected value, or a target value?
  4. Welford's mean update requires three FLOPs (subtract, divide, add). Show your work for the formula \(\mu_n = \mu_{n-1} + (x_n - \mu_{n-1}) / n\): what numerical issue does this prevent vs naive \(\mu_n = (\sum_{i=1}^n x_i) / n\)?

One-paragraph recap

Every dashboard metric is computed by a simple formula with explicit cost. Welford's algorithm handles all running mean/std accumulation. Spectral norm uses warm-started power iteration with 3-10 steps, sub-millisecond. Dead-neuron detection is a thresholded reduction every 50 steps. The §A13-specific per-class loss decomposes the batch into regular and irregular subsets and tracks the irregular-verb tax \(\tau = \mathcal{L}^{\text{irr}} - \mathcal{L}^{\text{reg}}\) over training. Total diagnostic overhead is ~460 μs/step, ~1.5% of the training step on this hardware — comfortably inside the 30% budget.

What this section does NOT cover

  • The Welford derivation in full (covered in Phase 2 or 6 — Knuth's TAOCP).
  • Numerical conditioning of power iteration when \(\sigma_1 \approx \sigma_2\) (rare for NN weights).
  • Bayesian alternatives to per-class loss decomposition (out of scope).

Next: theory/03-three-failure-modes.md.