
03 — Drift Detection: KL and PSI on verb distributions¶

Two metrics, two scales. KL divergence over histograms of verb forms quantifies how different the distributions are; PSI over scalar features (length, refusal rate, person/tense ratio) is the bucketized version with fixed thresholds. Both require a minimum sample size: a KL over 50 tokens is noise, not signal. For the microscopic corpus (20 verbs × 5 tenses × 3 persons), the token space is small enough that bucketing by training frequency matters more than the raw count.

What "drift" means here¶

The grammar tutor is trained on a distribution \(P\) of inputs — English sentences using the 20 verbs in the 5 tenses × 3 persons grid from §A13. In production, learners submit sentences from a distribution \(Q\). If \(P \approx Q\), the tutor's Phase 20 eval-time performance carries over. If \(Q\) has shifted away from \(P\) — say, learners submit sentences with verbs we didn't train on, or use the past participle in contexts we didn't see in training — the eval is no longer predictive, and the tutor may silently degrade.

Two flavors of drift matter:

Input drift. \(Q\)'s verbs, tenses, persons, sentence lengths, or syntactic patterns have shifted. Detectable from request logs alone, without ground truth.
Output drift. Even with \(P = Q\), the tutor's own outputs (refusal rate, confidence, response length) have shifted. Usually a symptom of a deploy or environmental change, not data drift.

Phase 38 focuses on input drift. Output drift is easier — Phase 34's observability stack flags it directly via metric panels.

The setup¶

We have two distributions over a finite alphabet \(V\) (the tokenizer vocabulary, \(|V|\) depending on Phase 11's BPE settings — for the English verb corpus, likely \(|V| \approx 2{,}000\), much smaller than a general-purpose tokenizer):

\(P\): the training-time token frequency, computed once at training and stored in the registry alongside eval_report.json (as train_token_histogram.json).
\(Q\): the live-traffic token frequency, computed over a sliding window (e.g., last 24 hours of inputs).

We want a single number that measures "how different are \(P\) and \(Q\)?". Two cheap, well-understood options: KL divergence and PSI.

KL divergence¶

\[D_{KL}(P \| Q) = \sum_{t \in V} P(t) \log \frac{P(t)}{Q(t)}\]

Properties:

\(D_{KL}(P \| Q) \geq 0\), with equality iff \(P = Q\).
Not symmetric. \(D_{KL}(P \| Q) \neq D_{KL}(Q \| P)\) in general. The convention "training distribution first" matters.
Unbounded. If \(Q(t) = 0\) for some \(t\) with \(P(t) > 0\), the term blows up to infinity. Smoothing is mandatory.
Units. With \(\log\) as natural log, KL is in nats. With \(\log_2\), it's bits. Pick one and stick to it; we use natural log.
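A minimal sketch of these properties on toy three-token distributions. The function name `kl_divergence` and the probability values are illustrative assumptions, not taken from the corpus or from `drift.py`.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats for two discrete distributions given as lists.

    Terms with p[i] == 0 contribute 0 (the 0 * log 0 convention);
    a zero in q where p is positive makes the result infinite.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total

# Toy distributions (illustrative values only).
p = [0.5, 0.3, 0.2]
r = [0.7, 0.2, 0.1]

kl_pp = kl_divergence(p, p)                   # identical distributions
kl_pr = kl_divergence(p, r)                   # "training first"
kl_rp = kl_divergence(r, p)                   # reversed order, different value
kl_inf = kl_divergence(p, [0.5, 0.5, 0.0])    # q misses a token p expects
kl_bits = kl_pr / math.log(2)                 # convert nats to bits
```

The unsmoothed infinite case is the reason the smoothing step is not optional.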

Why \(P\) in the front, not \(Q\)¶

The choice "training first" is convention plus a practical reason: \(P\) is known and stable (the training corpus is fixed), while \(Q\) is estimated from noisy live samples. Putting \(P\) in front means each term is weighted by \(P(t)\): we sum over the support of \(P\), the tokens we expect to see, and weight by training frequency. Drift in rare tokens contributes little; drift in common tokens contributes a lot. This is usually what we want. The reverse order, \(D_{KL}(Q \| P)\), would weight each term by the noisy estimate of \(Q\) and put \(P\) in the denominator, so any live token never seen in training (\(P(t) = 0\)) would blow up the sum. The price of putting \(P\) first is that \(Q\) sits in the denominator, so zero-frequency tokens in the estimate of \(Q\) (which happen all the time for rare tokens in small windows) must be handled by smoothing.

For the grammar tutor specifically: \(P\) has near-uniform mass over the 20 verbs (the Phase 12 corpus enumerates the grid). Common BPE merges like -ed, -s, -ing dominate by surface frequency. Drift on these high-frequency tokens (e.g., learners use -ing 3× as often as we trained on) shows up strongly; drift on rare tokens (a verb the user spelled wrong) shows up weakly. This bias is appropriate.

Smoothing¶

Both distributions need smoothing to avoid \(\log 0\):

\[P_{\text{smooth}}(t) = \frac{c_P(t) + \alpha}{N_P + \alpha |V|}, \quad \alpha = 10^{-6}\]

where \(c_P(t)\) is the count of token \(t\) in the training corpus and \(N_P = \sum_t c_P(t)\) is the total. Same for \(Q\). With \(\alpha = 10^{-6}\), smoothing barely perturbs high-frequency tokens while keeping low-frequency tokens finite.
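As a rough illustration of the smoothing formula, the sketch below applies it to a four-token toy vocabulary. The helper names `smooth` and `smoothed_kl` and the counts are assumptions made for the example.

```python
import math
from collections import Counter

ALPHA = 1e-6  # smoothing constant from the formula above

def smooth(counts, vocab, alpha=ALPHA):
    """Additive smoothing: (c(t) + alpha) / (N + alpha * |V|) for every t in vocab."""
    total = sum(counts.get(t, 0) for t in vocab)
    denom = total + alpha * len(vocab)
    return {t: (counts.get(t, 0) + alpha) / denom for t in vocab}

def smoothed_kl(p_counts, q_counts, vocab, alpha=ALPHA):
    """D_KL(P || Q) in nats after smoothing both histograms."""
    p = smooth(p_counts, vocab, alpha)
    q = smooth(q_counts, vocab, alpha)
    return sum(p[t] * math.log(p[t] / q[t]) for t in vocab)

# Toy histograms (illustrative counts only).
vocab = ["walk", "ed", "s", "ing"]
train_counts = Counter({"walk": 400, "ed": 300, "s": 200, "ing": 100})
live_counts = Counter({"walk": 45, "ed": 35, "s": 20})  # "ing" never seen live

kl = smoothed_kl(train_counts, live_counts, vocab)
```

The result is finite but large: with such a small \(\alpha\), a common training token that is absent from the live window still dominates the sum, which is the intended behavior for an alarm.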

Thresholds¶

KL has no universal threshold — it depends on \(|V|\), smoothing, and what counts as "significant" drift in the domain. For the verb corpus with \(|V| \approx 2{,}000\):

\(D_{KL} < 0.03\) → quiet. Same distribution.
\(0.03 \leq D_{KL} < 0.15\) → notable. Investigate.
\(D_{KL} \geq 0.15\) → significant. Likely retraining warranted.

These are calibrated empirically in lab 02, where Borja will perturb the corpus and measure KL at known perturbation levels (e.g., by injecting past-participle forms into a present-simple-dominated stream).

Sample-size lower bound¶

A KL computed from \(N_Q\) live tokens carries a finite-sample bias of order \(O(|V| / N_Q)\), the same order as the Miller-Madow bias term for plug-in entropy estimates. For \(|V| = 2{,}000\) and a "trustworthy" KL within 0.01 absolute, we need \(|V| / N_Q \lesssim 0.01\), i.e. \(N_Q \gtrsim 2{,}000 / 0.01 = 2 \times 10^5\) tokens.

Lab 02 sets the minimum window to 1,000 tokens for the synthetic-shift demo. In production, the window should be tuned to corpus and traffic — this is a tuning knob, not a constant.
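The effect of window size can be seen in a small simulation. This is a sketch under stated assumptions: \(P\) is taken to be uniform, the vocabulary is scaled down to 200 tokens to keep it fast, and the live window is drawn from \(P\) itself, so every nonzero KL value is sampling error rather than drift. The helper names are hypothetical.

```python
import math
import random

def plugin_kl_same_source(vocab_size, n_live, rng, alpha=1e-6):
    """KL(P || Q_hat) where Q_hat is estimated from n_live draws of P itself.

    P is uniform here (an assumption of the demo), so any nonzero result
    is pure sampling error, not drift.
    """
    p = 1.0 / vocab_size
    counts = [0] * vocab_size
    for _ in range(n_live):
        counts[rng.randrange(vocab_size)] += 1
    denom = n_live + alpha * vocab_size
    return sum(p * math.log(p / ((c + alpha) / denom)) for c in counts)

def min_window(vocab_size, tolerance):
    """Window size implied by the |V| / N_Q <= tolerance rule of thumb."""
    return math.ceil(vocab_size / tolerance)

rng = random.Random(0)
VOCAB = 200  # scaled down from 2,000 to keep the demo fast
noise_by_window = {
    n: plugin_kl_same_source(VOCAB, n, rng) for n in (500, 5_000, 50_000)
}
required = min_window(2_000, 0.01)
```

The spurious KL shrinks as the window grows, and `min_window(2_000, 0.01)` reproduces the \(2 \times 10^5\) figure from the text.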

PSI: Population Stability Index¶

PSI is a symmetrized KL divergence computed over buckets, with a fixed interpretation table. It has been used heavily in finance (credit scoring) since the 1990s. The formula:

\[\text{PSI} = \sum_{i=1}^{B} (Q_i - P_i) \log \frac{Q_i}{P_i}\]

over \(B\) bins. Each \(P_i\) is the fraction of training samples in bin \(i\); \(Q_i\) is the fraction of live samples in bin \(i\).
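A minimal sketch of the formula over aligned bin fractions. The `eps` floor for empty bins is an assumption of this example (the text does not prescribe one), and the bin values are made up.

```python
import math

def psi(p_fracs, q_fracs, eps=1e-6):
    """Population Stability Index over aligned bin fractions.

    eps floors empty bins so the log stays finite; the floor value is an
    assumption of this sketch.
    """
    total = 0.0
    for p_i, q_i in zip(p_fracs, q_fracs):
        p_i = max(p_i, eps)
        q_i = max(q_i, eps)
        total += (q_i - p_i) * math.log(q_i / p_i)
    return total

# Ten equal-frequency training bins; live traffic moves mass from bin 9 to bin 10.
train_bins = [0.10] * 10
live_bins = [0.10] * 8 + [0.05, 0.15]

psi_value = psi(train_bins, live_bins)
psi_same = psi(train_bins, train_bins)
psi_swapped = psi(live_bins, train_bins)
```

Swapping the arguments leaves the value unchanged, since each term is a product of two factors that both flip sign.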

Relationship to KL¶

\[\text{PSI} = D_{KL}(Q \| P) + D_{KL}(P \| Q)\]

PSI is the symmetrized KL. This makes thresholds direction-agnostic — handy for monitoring where you don't want to commit to "drift towards" vs "drift away from".
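The identity follows by expanding both KL terms over the same \(B\) bins (assuming no bin is empty on either side):

\[
\begin{aligned}
D_{KL}(Q \| P) + D_{KL}(P \| Q)
&= \sum_{i=1}^{B} Q_i \log \frac{Q_i}{P_i} + \sum_{i=1}^{B} P_i \log \frac{P_i}{Q_i} \\
&= \sum_{i=1}^{B} Q_i \log \frac{Q_i}{P_i} - \sum_{i=1}^{B} P_i \log \frac{Q_i}{P_i} \\
&= \sum_{i=1}^{B} (Q_i - P_i) \log \frac{Q_i}{P_i} = \text{PSI}
\end{aligned}
\]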

The canonical thresholds¶

Unlike raw KL, PSI has industry-standard thresholds, applicable across domains:

\(\text{PSI} < 0.10\) → no significant shift.
\(0.10 \leq \text{PSI} < 0.25\) → moderate shift. Monitor.
\(\text{PSI} \geq 0.25\) → significant shift. Investigate, likely retrain.

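These thresholds transfer across features because PSI is computed on a small number of bins (typically 10), not the full vocabulary. With few bins, each bin holds enough samples for its fraction to be estimated stably, so sampling noise stays well below the thresholds. PSI is still unbounded in principle: an empty bin on either side sends its term to infinity, so empty bins need a small floor.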
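The interpretation table can be written as a small lookup. The function name `psi_level` is hypothetical; the band boundaries are the ones listed above.

```python
def psi_level(value):
    """Map a PSI value to the three bands of the interpretation table."""
    if value < 0.10:
        return "no significant shift"
    if value < 0.25:
        return "moderate shift"
    return "significant shift"

levels = [psi_level(v) for v in (0.05, 0.10, 0.18, 0.25, 0.31)]
```

Both boundaries belong to the upper band, matching the \(\leq\) and \(\geq\) signs in the table.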

What goes in the bins¶

For scalar features (sentence length, refusal rate, request rate, response length, person/tense ratios): equal-frequency bins of the training distribution. Ten bins is standard.
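One way to build equal-frequency bins is to take the training deciles as cut points and reuse them for live traffic. The sketch below uses only the standard library; the synthetic feature (Gaussian, with a shifted live mean) and the helper names are assumptions for illustration.

```python
import bisect
import random
import statistics

def equal_frequency_edges(train_values, n_bins=10):
    """Interior cut points so each bin holds about 1/n_bins of the training data."""
    return statistics.quantiles(train_values, n=n_bins, method="inclusive")

def bin_fractions(values, edges):
    """Fraction of values in each of len(edges) + 1 bins; outer bins are open-ended."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]

rng = random.Random(0)
# Synthetic continuous feature (assumed values, not real traffic).
train_feature = [rng.gauss(8.0, 3.0) for _ in range(5000)]
live_feature = [rng.gauss(9.0, 3.0) for _ in range(2000)]

edges = equal_frequency_edges(train_feature)      # 9 cut points, 10 bins
train_fracs = bin_fractions(train_feature, edges)  # each close to 0.10
live_fracs = bin_fractions(live_feature, edges)    # mass moves to upper bins
```

The edges are fixed at training time; only the live fractions are recomputed per window. Integer-valued features such as word counts produce ties, so their bins will be only approximately equal in frequency.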

For the verb-token distribution: PSI over a frequency-stratified bucketing — group tokens by training frequency. For the grammar tutor's small vocabulary, three frequency strata work:

Top 20 most-frequent tokens (function words, common BPE merges) → 5 buckets of 4 tokens each.
Middle 200 tokens (verb stems, suffixes, punctuation) → 5 buckets of 40 tokens each.
Tail (the remaining ~1,780 tokens) → 5 buckets of ~356 tokens each.

A 15-bucket frequency-stratified PSI handles the verb-corpus shape better than equal-width.
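The stratification might look like the following sketch. The function names, the synthetic Zipf-like counts, and the choice to ignore tokens missing from the training vocabulary are all assumptions of the example; a real check might route unknown tokens to a dedicated bucket instead.

```python
def stratified_buckets(train_counts, strata=((20, 5), (200, 5), (None, 5))):
    """Map each token to a bucket id using its training-frequency rank.

    strata is a sequence of (stratum_size, n_buckets); None means "all
    remaining tokens". The defaults mirror the 20 / 200 / tail split above.
    """
    ranked = sorted(train_counts, key=train_counts.get, reverse=True)
    token_to_bucket = {}
    bucket_id = 0
    start = 0
    for size, n_buckets in strata:
        stop = len(ranked) if size is None else min(start + size, len(ranked))
        stratum = ranked[start:stop]
        per_bucket = max(1, -(-len(stratum) // n_buckets))  # ceiling division
        for i, token in enumerate(stratum):
            token_to_bucket[token] = bucket_id + i // per_bucket
        bucket_id += n_buckets
        start = stop
    return token_to_bucket

def bucket_fractions(counts, token_to_bucket, n_buckets=15):
    """Aggregate token counts into bucket fractions; unknown tokens are ignored."""
    totals = [0] * n_buckets
    for token, c in counts.items():
        if token in token_to_bucket:
            totals[token_to_bucket[token]] += c
    n = sum(totals)
    return [t / n for t in totals]

# Synthetic Zipf-like vocabulary of 2,000 tokens (illustrative only).
train_counts = {f"tok{i}": 10_000 // (i + 1) + 1 for i in range(2000)}
mapping = stratified_buckets(train_counts)
train_fracs = bucket_fractions(train_counts, mapping)
```

The mapping is built once from training counts and then applied unchanged to each live window, so \(P\) and \(Q\) are always compared over the same 15 buckets.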

When to prefer PSI over KL¶

Cross-team communication. "PSI = 0.31" is interpretable to ops engineers; "KL = 0.18 nats" is not.
Multiple features. PSI scores can be aggregated across features (verb frequency, person/tense distribution, sentence length, ...). KL can too but the thresholds vary.
Small samples. PSI's bucketing makes it more robust to sample noise — especially relevant for the small verb corpus.

When to prefer KL over PSI¶

High-resolution monitoring. PSI's binning loses information about which token shifted.
Comparing to a theoretical distribution. KL is the natural information-theoretic object; PSI is a heuristic.

For the grammar tutor, we compute both and report both. PSI is the alert threshold (operability); KL is the diagnostic (understanding).
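The diagnostic role of KL comes from its per-token terms, which show which token moved. A small sketch, assuming both distributions are already smoothed and share the same keys; the probabilities are made up.

```python
import math

def kl_contributions(p, q):
    """Per-token terms P(t) * log(P(t) / Q(t)); they sum to D_KL(P || Q)."""
    return {t: p[t] * math.log(p[t] / q[t]) for t in p}

# Toy distributions (illustrative values only).
p = {"walk": 0.40, "ed": 0.30, "s": 0.20, "ing": 0.10}
q = {"walk": 0.40, "ed": 0.15, "s": 0.20, "ing": 0.25}

terms = kl_contributions(p, q)
kl_total = sum(terms.values())
# Rank tokens by the magnitude of their contribution.
ranked = sorted(terms, key=lambda t: abs(terms[t]), reverse=True)
```

Individual terms can be negative (here `ing`, which became more common in live traffic) even though the total never is, so ranking by absolute value surfaces shifts in both directions.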

Per-bucket PSI for the verb-tense grid¶

A grammar tutor has a natural feature decomposition: the (tense × person) grid. PSI computed over the 5×3 = 15 grid cells (with training frequencies as the reference distribution) gives a targeted drift signal:

\[\text{PSI}_{\text{grid}} = \sum_{(\text{tense},\, \text{person})} (Q_{\text{cell}} - P_{\text{cell}}) \log \frac{Q_{\text{cell}}}{P_{\text{cell}}}\]

If learners shift toward submitting past-participle sentences (because Phase 32 emphasized them, say), the past-participle row of the grid spikes in \(Q\) while staying flat in \(P\). PSI_grid catches this even when the raw token PSI is quiet.

This is the kind of domain-aware drift metric that pays for itself relative to a generic ML drift dashboard.
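A sketch of the grid PSI on synthetic labels. The tense and person names are placeholders (the actual five tenses are defined elsewhere in the course), the uniform training grid is an assumption, and the per-cell breakdown is added here to show where the score comes from.

```python
import math
from collections import Counter
from itertools import product

# Placeholder labels; the real grid comes from the course corpus.
TENSES = ["present_simple", "past_simple", "future", "present_perfect", "past_perfect"]
PERSONS = ["first", "second", "third"]
CELLS = list(product(TENSES, PERSONS))  # 5 x 3 = 15 cells

def cell_fractions(labels, eps=1e-6):
    """Fraction of sentences per (tense, person) cell, floored at eps."""
    counts = Counter(labels)
    n = len(labels)
    return {cell: max(counts[cell] / n, eps) for cell in CELLS}

def grid_psi(p, q):
    """Return the total PSI over the grid and the per-cell contributions."""
    contributions = {c: (q[c] - p[c]) * math.log(q[c] / p[c]) for c in CELLS}
    return sum(contributions.values()), contributions

# Training: every cell enumerated equally (assumed), 100 sentences each.
train_labels = [cell for cell in CELLS for _ in range(100)]
# Live: one tense row is submitted three times as often as the rest.
live_labels = [
    cell
    for cell in CELLS
    for _ in range(300 if cell[0] == "present_perfect" else 100)
]

score, per_cell = grid_psi(cell_fractions(train_labels), cell_fractions(live_labels))
top_cell = max(per_cell, key=per_cell.get)
```

Tripling one row of the grid yields a score of roughly 0.25 in this setup, and the largest per-cell contributions all sit in the boosted row.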

What drift detection does not do¶

It does not tell you the tutor has gotten worse. Drift is a warning sign of possible degradation, not a measurement of it. The tutor might handle the shifted distribution fine; only an eval on the shifted distribution can tell you.
It does not point at a root cause. PSI on the tense grid spiked. Why? Could be a new user cohort, a corrupted upstream, a tokenization bug, an attack. Drift is the alarm, not the diagnosis.
It does not work on tiny live windows. For 50 tokens of live traffic, the math still runs but the result is sampling noise. The 1,000-token floor is a hard minimum, not a guideline: below it the nightly check is skipped, and production windows may need to be far larger.

Detection workflow¶

The Phase 38 drift workflow is a scheduled batch (default daily):

nightly:
  P          := train_token_histogram   (from the active registry entry)
  P_grid     := train_grid_histogram    (per (tense, person))
  Q          := tokenize(last_24h_inputs)
  Q_grid     := classify(last_24h_inputs) into (tense, person) cells
  if |Q| < 1000: skip, log "insufficient samples"
  compute:
    kl       = KL(P || Q)
    psi_tok  = PSI(P_buckets, Q_buckets)
    psi_grid = PSI(P_grid,    Q_grid)
  for each scalar feature f in {sentence_length, refusal_rate, request_rate}:
    psi_f = PSI(P_f_bins, Q_f_bins)
  emit drift_report/YYYY-MM-DD.json
  if any psi >= 0.25: alert (write to alerts.jsonl + log structured event)

The drift report is committed to the experiments tree, not to a separate database. Append-only files are the audit substrate.
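The nightly batch could be sketched roughly as follows. This is not the contents of `scripts/mlops/drift.py`: the function names, the report fields, the histogram keys, and the use of the union of observed keys as the vocabulary are all assumptions. Tokenization, grid classification, and file writing are left out, so the inputs are already-counted histograms.

```python
import json
import math
from datetime import date

MIN_LIVE_TOKENS = 1000  # floor from the workflow above
PSI_ALERT = 0.25        # "significant shift" threshold

def _fractions(counts, keys, eps=1e-6):
    """Normalized fractions over keys, floored at eps (assumed floor)."""
    total = sum(counts.get(k, 0) for k in keys)
    return [max(counts.get(k, 0) / total, eps) for k in keys]

def psi(p_counts, q_counts):
    keys = sorted(set(p_counts) | set(q_counts))
    p = _fractions(p_counts, keys)
    q = _fractions(q_counts, keys)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

def kl(p_counts, q_counts, alpha=1e-6):
    """Smoothed D_KL(P || Q) in nats, training distribution first."""
    keys = sorted(set(p_counts) | set(q_counts))
    n_p = sum(p_counts.values()) + alpha * len(keys)
    n_q = sum(q_counts.values()) + alpha * len(keys)
    total = 0.0
    for k in keys:
        p = (p_counts.get(k, 0) + alpha) / n_p
        q = (q_counts.get(k, 0) + alpha) / n_q
        total += p * math.log(p / q)
    return total

def drift_report(train, live, day):
    """train / live map a histogram name to counts; "tokens" is required.

    KL is computed on "tokens"; PSI on every other histogram (grid,
    token buckets, scalar-feature bins).
    """
    n_live = sum(live["tokens"].values())
    if n_live < MIN_LIVE_TOKENS:
        return {"date": day.isoformat(), "status": "skipped",
                "reason": "insufficient samples", "n_live_tokens": n_live}
    report = {"date": day.isoformat(), "status": "ok", "n_live_tokens": n_live,
              "kl": kl(train["tokens"], live["tokens"]), "psi": {}}
    for name in train:
        if name != "tokens":
            report["psi"][name] = psi(train[name], live[name])
    report["alerts"] = sorted(n for n, v in report["psi"].items() if v >= PSI_ALERT)
    return report

# Toy histograms (illustrative keys and counts only).
train = {"tokens": {"walk": 600, "ed": 250, "ing": 150},
         "grid": {"present/3rd": 500, "past/3rd": 500}}
quiet = {"tokens": {"walk": 1200, "ed": 500, "ing": 300},
         "grid": {"present/3rd": 1000, "past/3rd": 1000}}
shifted = {"tokens": {"walk": 1200, "ed": 200, "ing": 600},
           "grid": {"present/3rd": 400, "past/3rd": 1600}}
tiny = {"tokens": {"walk": 30, "ed": 20}, "grid": {"present/3rd": 50}}

reports = [drift_report(train, live, date(2025, 1, 1)) for live in (quiet, shifted, tiny)]
report_json = json.dumps(reports[1], indent=2, sort_keys=True)
```

The three toy windows exercise the three outcomes: a quiet day with no alerts, a shifted day that alerts on the grid, and a window too small to evaluate. The serialized report is what would be written to the dated JSON file.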

Where this code lives¶

scripts/mlops/drift.py, ~120 LOC. Pure NumPy + stdlib. The driver script (scripts/mlops/run_drift_check.py) is invoked by just drift-check and by a daily GitHub Action.

The drift module does not live in src/miniobserve/ — miniobserve is for online metrics streamed from the serving stack. Drift is offline batch analysis over logged traces. Keeping them separate avoids coupling the online hot path to the batch analysis logic.

Drill problems (work these before lab 02)¶

Solutions in solutions/03-drift-ref.md — written at phase open.

Show that \(D_{KL}(P \| Q) \geq 0\) using Jensen's inequality on \(\log\).
The verb-token histogram \(P\) has 2,000 entries; an attacker submits 100 requests each containing a single rare token never seen at training. Without smoothing, KL = \(\infty\). Compute the smoothed KL with \(\alpha = 10^{-6}\), \(|V| = 2{,}000\), \(N_P = 10^7\).
PSI on sentence-length: training has \(E[L] = 8\) words, \(\sigma_L = 3\); live traffic has \(E[L] = 8\), \(\sigma_L = 12\) (same mean, much fatter tails). Is PSI sensitive to this? Sketch the bins and the answer.
The grid PSI on (tense, person) is 0.18. Token PSI is 0.05. What does this combination mean operationally? Which intervention would you recommend?

One-paragraph recap¶

Drift is the gap between the training input distribution and the live input distribution. KL divergence quantifies it with information-theoretic units; PSI is its bucketized industry cousin with fixed thresholds (0.10, 0.25). Smoothing is mandatory to avoid log-of-zero; sample-size minimums are mandatory to distinguish signal from noise. For the grammar tutor, the grid-PSI (over the tense × person cells) is the most informative single number — it catches shifts the raw token PSI smooths over. Drift detection is an alarm, not a diagnosis: a PSI spike says "look", not "the tutor is broken".

Next: theory/04-traffic-and-finops.md.