04 — A/B Significance and Cost-per-Quality

Two small calculations, both crucial. A two-proportion z-test decides whether an observed difference between A and B (rate of correct conjugations) is real. Cost per unit of quality (CpQU) decides whether an improvement is worth what it costs. Without these two numbers, "better" and "more expensive" get conflated and the wrong version gets promoted.


Part 1: A/B significance

When an A/B test reports "tutor B has 3pp higher conjugation accuracy than A", the natural follow-up is: "is that real, or noise?" The answer is the two-proportion z-test.

The test

Setup:

  • \(n_A\) requests routed to A, of which \(x_A\) produced the positive outcome (correct conjugation, judged against the canonical Phase 20 table).
  • \(n_B\) requests routed to B, of which \(x_B\) produced the positive outcome.
  • \(\hat{p}_A = x_A / n_A\), \(\hat{p}_B = x_B / n_B\).
  • Under \(H_0: p_A = p_B\), the pooled proportion is \(\hat{p} = (x_A + x_B) / (n_A + n_B)\).

The z-statistic:

\[z = \frac{\hat{p}_A - \hat{p}_B}{\sqrt{\hat{p}(1 - \hat{p}) \left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}\]

Under \(H_0\), \(z \sim \mathcal{N}(0, 1)\). Reject at \(\alpha = 0.05\) (two-tailed) if \(|z| > 1.96\).
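The statistic above can be sketched in a few lines of standard-library Python. The function name `two_proportion_z` and the example counts are hypothetical, chosen only for illustration; the p-value uses the identity \(2(1 - \Phi(|z|)) = \operatorname{erfc}(|z| / \sqrt{2})\).

```python
import math

def two_proportion_z(x_a: int, n_a: int, x_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-tailed p-value) for H0: p_A = p_B, using the pooled proportion."""
    p_a = x_a / n_a
    p_b = x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    # Standard error under H0; this is zero if every outcome is identical (pooled = 0 or 1).
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-tailed p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: A correct on 1500 of 2000, B correct on 1560 of 2000 (a 3pp gap).
z, p = two_proportion_z(1500, 2000, 1560, 2000)
print(f"z = {z:.2f}, p = {p:.4f}, reject H0 at 0.05: {abs(z) > 1.96}")
```

Because the statistic is \(\hat{p}_A - \hat{p}_B\) over the standard error, a negative z means B scored higher than A.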

Sample size

The minimum sample size per arm needed to detect an absolute difference \(\delta\) (the minimum detectable effect, MDE) at 80% power and \(\alpha = 0.05\):

\[n_{\min} \approx 2 \left(\frac{z_{\alpha/2} + z_{\beta}}{\delta}\right)^2 p(1-p)\]

where \(\delta\) is the absolute difference you want to detect, \(p\) is the baseline rate, \(z_{\alpha/2} = 1.96\), \(z_\beta = 0.84\). For \(p = 0.75\) (typical Phase 20 grammar-tutor accuracy) and \(\delta = 0.03\) (3pp improvement), \(n_{\min} = 2 \cdot (2.8 / 0.03)^2 \cdot 0.1875 \approx 3{,}270\) per arm.

For the microscopic verb corpus specifically: the total Phase 20 eval set has on the order of 600 forms (20 verbs × 5 tenses × 3 persons × 2 voices/aspects/etc.). At 3,270 samples per arm, you'd have to repeat each form ~5 times on average, which is fine for grader-deterministic tasks like conjugation but inflates the noise floor for any user-judgment metric. Sub-1k-sample A/Bs on conjugation accuracy with this corpus are noise theater.
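As a check on the arithmetic, the sample-size formula can be evaluated directly. This is a minimal sketch; the function name `n_min_per_arm` is hypothetical, and the 600-form corpus size is the order-of-magnitude figure quoted above.

```python
import math

def n_min_per_arm(p: float, delta: float,
                  z_alpha_2: float = 1.96, z_beta: float = 0.84) -> int:
    """Approximate samples per arm to detect an absolute difference `delta`
    from baseline rate `p` (defaults: two-tailed alpha = 0.05, 80% power)."""
    n = 2 * ((z_alpha_2 + z_beta) / delta) ** 2 * p * (1 - p)
    return math.ceil(n)

n = n_min_per_arm(p=0.75, delta=0.03)
corpus_forms = 600  # order-of-magnitude size of the Phase 20 eval set
print(f"n_min per arm: {n}")
print(f"average repeats per form: {n / corpus_forms:.1f}")
```

Halving \(\delta\) roughly quadruples the required sample size, since \(\delta\) enters the formula squared.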

Why z, not chi-square or Fisher's exact

The z-test for two proportions, the chi-square test, and Fisher's exact test are equivalent for 2×2 tables in the large-sample limit. We use z because: (a) the signed statistic carries direction, (b) confidence intervals fall out directly, (c) the formula is interpretable.

For very small samples (\(n_A\) or \(n_B\) < 30), Fisher's exact is correct. We never have small samples in production A/B, so we ignore that regime.
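To illustrate point (b), here is a sketch of a 95% confidence interval for the difference in proportions. It assumes the standard Wald interval with an unpooled standard error, which is a common choice but not one the text above specifies; the function name `diff_ci` and the counts are hypothetical. Note the sign: this sketch reports \(\hat{p}_B - \hat{p}_A\), so a positive interval means B is better.

```python
import math

def diff_ci(x_a: int, n_a: int, x_b: int, n_b: int,
            z_crit: float = 1.96) -> tuple[float, float]:
    """Wald confidence interval for p_B - p_A (assumption: unpooled standard error)."""
    p_a = x_a / n_a
    p_b = x_b / n_b
    diff = p_b - p_a
    # Unpooled SE: each arm contributes its own variance estimate.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff - z_crit * se, diff + z_crit * se

# Hypothetical counts: A correct on 1500 of 2000, B correct on 1560 of 2000.
lo, hi = diff_ci(1500, 2000, 1560, 2000)
print(f"B - A: 95% CI = [{lo * 100:.1f}pp, {hi * 100:.1f}pp]")
```

An interval that excludes zero agrees with a rejection at \(\alpha = 0.05\) in most cases, and its width shows how precisely the effect size is known.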

What the test does not give you

  • Operational significance. A z = 3.2 (p < 0.005) on a 0.3pp difference is statistically significant but operationally invisible. Always report effect size and CI alongside the test. For the grammar tutor: a 0.3pp gain on conjugation accuracy is well below the 2pp CI guard band of eval_baseline.json; promoting on it would be noise.
  • Multiple-comparisons correction. If you run 10 A/Bs simultaneously (say, one per tense), the chance of at least one false positive at \(\alpha = 0.05\) is about 40% (\(1 - 0.95^{10} \approx 0.40\), assuming independent tests). Bonferroni at \(\alpha / k\) is the conservative fix.
  • Sequential testing safety. Repeatedly checking the test during the experiment ("can we stop yet?") inflates false-positive rates. Either pre-commit to a sample size or use a proper sequential method (Wald's SPRT, mSPRT). Phase 38 just pre-commits.
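The multiple-comparisons point can be made concrete with a short calculation. This sketch assumes the k tests are independent, which is an idealization (per-tense A/Bs on the same traffic are likely correlated); both function names are hypothetical.

```python
def familywise_error_rate(k: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k

def bonferroni_threshold(k: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold that keeps the familywise rate at or below alpha."""
    return alpha / k

for k in (1, 5, 10):
    fwer = familywise_error_rate(k)
    print(f"k = {k:2d}: familywise error = {fwer:.3f}, "
          f"Bonferroni per-test alpha = {bonferroni_threshold(k):.4f}")
```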

Quality A/Bs vs latency A/Bs for the tutor

The grammar tutor's correctness signal is deterministic (compare against canonical conjugation table). We do not run online correctness A/Bs — instead, we capture both A's and B's corrections from shadow traffic and grade them offline against Phase 20's labels. The z-test then runs on the offline grading.

This is why shadow + offline-eval is the primary validation, not online A/B (theory/02). The exception: latency A/Bs are run online (latency doesn't require ground truth — the wall clock is the ground truth).

Part 2: Cost-per-Quality

The "is it worth it?" question has a clean form:

\[\text{CpQU} = \frac{\text{cost per 1k tokens}}{\text{quality score}}\]

Two definitions of "quality score" matter for the grammar tutor:

  • Conjugation accuracy — from Phase 20's eval harness. The fraction of (verb, tense, person) triples for which the tutor's predicted form matches the canonical table.
  • Pass-rate-at-1 — a tighter binary: did the tutor's first proposed correction match the canonical form exactly (no second-guess required).

The "cost per 1k tokens" is:

\[\text{cost\_per\_1k} = \frac{\text{rate}_{\$/\text{hr}}}{\text{tps} \cdot 3.6}\]

where \(\text{tps}\) is the tokens-per-second throughput on the serving hardware and the factor 3.6 converts tokens per second into thousands of tokens per hour (\(3600\,\text{sec/hr} \div 1000\,\text{tokens/k} = 3.6\)). For CPU-only Phase 22+ paths, the rate is the amortized hardware + electricity; for Phase 23+ cloud paths it's the actual rental.
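A minimal sketch of the cost formula follows. The function name `cost_per_1k_tokens` is hypothetical, and the throughput of 20 tokens/sec is an assumed value for illustration only; the $0.05/hr rate is the notional CPU-only figure used elsewhere in these notes.

```python
def cost_per_1k_tokens(rate_usd_per_hr: float, tps: float) -> float:
    """Dollars per 1,000 generated tokens, given an hourly rate and tokens/sec throughput."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    # tps * 3600 tokens per hour, divided by 1000, gives thousands of tokens per hour.
    k_tokens_per_hr = tps * 3.6
    return rate_usd_per_hr / k_tokens_per_hr

# Notional $0.05/hr rate; 20 tokens/sec is an assumed throughput.
print(f"${cost_per_1k_tokens(0.05, 20.0):.6f} per 1k tokens")
```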

Lower is better

CpQU is dollars per unit of quality. A tutor with 75% conjugation accuracy at $0.012/1k tokens has CpQU = 0.012 / 0.75 = $0.016, with accuracy expressed as a fraction in [0, 1]. (Expressing accuracy in percentage points instead divides every CpQU by 100; units matter, so pick a convention and stick to it.) A tutor with 78% accuracy at $0.030/1k tokens has CpQU = 0.030 / 0.78 ≈ $0.038. The second is "better" but 2.4× more expensive per unit quality. Whether the 3pp improvement is worth that depends on the application, but the CpQU surfaces the question.
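The comparison above reduces to one division per tutor. In this sketch the function name `cpqu` is hypothetical, and quality is assumed to be a fraction in (0, 1] rather than percentage points.

```python
def cpqu(cost_per_1k: float, quality: float) -> float:
    """Cost per quality unit: dollars per 1k tokens divided by a quality score in (0, 1]."""
    if not 0 < quality <= 1:
        raise ValueError("quality must be a fraction in (0, 1]")
    return cost_per_1k / quality

tutor_a = cpqu(0.012, 0.75)  # 75% accuracy at $0.012 per 1k tokens
tutor_b = cpqu(0.030, 0.78)  # 78% accuracy at $0.030 per 1k tokens
print(f"A: {tutor_a:.4f}  B: {tutor_b:.4f}  ratio B/A: {tutor_b / tutor_a:.2f}x")
```

The ratio is what carries across unit conventions: rescaling the quality score rescales both CpQU values equally and leaves B/A unchanged.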

What CpQU does not encode

  • Latency. Two tutors with identical CpQU might have wildly different p99 latency. Latency lives in the serving metrics, not the cost ratio.
  • Risk. A tutor with marginally better CpQU but a higher refusal-rate variance is operationally riskier. Risk lives in Phase 37 / Phase 40 territory.
  • Per-bucket quality. A tutor whose CpQU is good on average but terrible on past-participle conjugations might be worse than a uniformly-mediocre tutor for a learner who's specifically studying past participles. Aggregate CpQU is a starting point, not the whole picture.
  • Operator preference. Some teams prefer the cheaper model even at modest quality cost. The CpQU number doesn't make the decision; it informs it.

The CpQU table

The deliverable for Phase 38 lab 03 is docs/COSTS.md containing one row per registry entry:

| SHA (8) | Semver | Conjugation accuracy | $/1k tokens | CpQU | Notes |
|---|---|---|---|---|---|
| a1b2c3d4 | v0.1.0 | 0.752 | 0.0098 | 0.0130 | FP32 baseline (Phase 18) |
| e5f6a7b8 | v0.2.0 | 0.748 | 0.0042 | 0.0056 | INT8 (Phase 26), 0.4pp drop |
| c9d0e1f2 | v0.3.0 | 0.781 | 0.0114 | 0.0146 | LoRA grammar tutor (Phase 28) |

Three rows, three deployment decisions made obvious. The INT8 variant is operationally cheaper per quality point; the LoRA improves quality but at a price. (Numbers above are illustrative — Borja's actual numbers come from his measurements.)
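One way the table rows could be generated is sketched below, recomputing the CpQU column from the illustrative accuracy and cost values. The function name `render_costs_table` and the pipe-table layout are assumptions; the lab may prescribe a different format for docs/COSTS.md.

```python
# Illustrative rows from the table: (sha8, semver, accuracy, usd_per_1k, notes).
ROWS = [
    ("a1b2c3d4", "v0.1.0", 0.752, 0.0098, "FP32 baseline (Phase 18)"),
    ("e5f6a7b8", "v0.2.0", 0.748, 0.0042, "INT8 (Phase 26), 0.4pp drop"),
    ("c9d0e1f2", "v0.3.0", 0.781, 0.0114, "LoRA grammar tutor (Phase 28)"),
]

def render_costs_table(rows) -> str:
    """Render registry rows as a Markdown table, computing CpQU = cost / accuracy."""
    lines = [
        "| SHA (8) | Semver | Conjugation accuracy | $/1k tokens | CpQU | Notes |",
        "|---|---|---|---|---|---|",
    ]
    for sha, semver, acc, cost, notes in rows:
        lines.append(
            f"| {sha} | {semver} | {acc:.3f} | {cost:.4f} | {cost / acc:.4f} | {notes} |"
        )
    return "\n".join(lines)

table = render_costs_table(ROWS)
print(table)

# Rank by CpQU (lower is better) to see which entry is cheapest per quality point.
cheapest = min(ROWS, key=lambda r: r[3] / r[2])
print("lowest CpQU:", cheapest[1])
```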

Cost discipline (recap from §5.5)

LYNX_CORTEX.md §5.5 already requires every GPU-touching script to print estimated cost. Phase 38 lifts this from per-script to per-registry-entry: every registered model has its CpQU computed once at promotion time and stored.

For the CPU-only laptop deployment, the rate is the notional $0.05/hr (matches Phase 34). The CpQU column is internally comparable even if the absolute dollar number is fictional — promote based on rank, not on absolute spend.

Combining the two

A shadow rollout reports B is 3pp better than A. The z-test says: significant (z = 3.4, p < 0.001). The CpQU says: B is 1.8× more expensive per quality point. The decision:

  • Promote B if the 3pp absolute accuracy gain justifies the cost increase. (Possibly: yes for the curriculum's tutor; no for a high-volume product where cost dominates.)
  • Don't promote B otherwise.

The numbers don't make the choice. They separate the choice from the inertia that would otherwise be "newer is better, ship it".

Why we keep both numbers

The pair (significance, CpQU) is the bare-minimum decision packet. Either alone is misleading:

  • Significance alone says "B is reliably better" but says nothing about cost. A team that decides on significance alone migrates to ever-more-expensive models forever.
  • CpQU alone says "B is cheaper per quality point" but says nothing about whether the underlying quality difference is real or noise. A team that decides on CpQU alone may promote a noisy "better" model that's actually equivalent.

Together they bound the decision: significant + CpQU-favorable → promote. Significant but CpQU-unfavorable → operator judgement. Non-significant → don't promote, regardless of CpQU.
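The three-way rule above can be written as a small function. This is a sketch under stated assumptions: "CpQU-favorable" is read as B's CpQU being no higher than A's, and a significant result in the wrong direction (B worse than A) is treated as "don't promote", which the rule implies but does not state. The names `Decision` and `decide` are hypothetical.

```python
from enum import Enum

class Decision(Enum):
    PROMOTE = "promote"
    OPERATOR_JUDGEMENT = "operator judgement"
    DO_NOT_PROMOTE = "don't promote"

def decide(acc_gain_b_over_a: float, p_value: float,
           cpqu_a: float, cpqu_b: float, alpha: float = 0.05) -> Decision:
    """Combine significance and CpQU into a promotion decision for candidate B."""
    # Non-significant, or significant but B is worse (assumption): never promote.
    if p_value >= alpha or acc_gain_b_over_a <= 0:
        return Decision.DO_NOT_PROMOTE
    # Significant gain and B is no more expensive per quality point (assumption).
    if cpqu_b <= cpqu_a:
        return Decision.PROMOTE
    # Significant gain, but B costs more per quality point: a human decides.
    return Decision.OPERATOR_JUDGEMENT

# The shadow-rollout scenario: 3pp gain, p < 0.001, B is 1.8x A's CpQU.
print(decide(0.03, 0.0007, cpqu_a=0.016, cpqu_b=0.016 * 1.8).value)
```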

Drill problems (work these before lab 03)

Solutions in solutions/04-finops-ref.md — written at phase open.

  1. A has \(\hat{p}_A = 0.75\), \(n_A = 2{,}000\). B has \(\hat{p}_B = 0.77\), \(n_B = 2{,}000\). Compute z. Significant at 0.05? Sample size needed to detect a 1pp improvement at 80% power?
  2. Model M1 has cost-per-1k-tokens = $0.012, conjugation accuracy = 0.73. Model M2 has cost = $0.018, accuracy = 0.76. Compute CpQU for both. Which is operationally cheaper per quality point?
  3. Two A/B tests run in parallel: one on overall conjugation accuracy, one on refusal rate. Both report p = 0.04. Should you promote? Compute the Bonferroni-corrected threshold.
  4. The CpQU of the LoRA variant (v0.3.0) is 12% higher than the FP32 baseline (v0.1.0), but the LoRA conjugation accuracy is 3pp higher. What additional data would push you to promote the LoRA into the default serving slot vs keeping it as a premium-tier variant?

One-paragraph recap

The two-proportion z-test decides if an A/B difference in conjugation accuracy is real, with a clear formula and a sample-size floor (about 3,270 per arm to detect a 3pp effect against a 75% baseline; smaller effects need more). Cost-per-quality-unit is the cost divided by the quality score, summarized in docs/COSTS.md per registered model. Together they answer "is it real?" and "is it worth it?". Operational decisions are made between these two numbers, not from either alone.

Next: theory/05-capacity-and-scaling.md.