02 — Traffic Strategies: A/B vs Shadow vs Canary

Three strategies, three different questions. A/B = which one corrects better in production? Shadow = is the new version safe, without the learner noticing? Canary = can I roll out progressively with fast rollback? Confusing them sends unvalidated grammar corrections to real learners, and for a grammar tutor a wrong correction is worse than none.


The three strategies, in one table

| Strategy | What learner sees | What is logged | Decision criterion | Risk profile |
|---|---|---|---|---|
| A/B | A or B (random hash split) | Both responses + offline grading later | Statistical test on grading | High: both versions touch real learners |
| Shadow | Always A (production) | A served + B computed but discarded; offline diff | Offline comparison; no online metric | Low: learner never sees B's correction |
| Canary | A or B (X% to B, ramping) | All responses with assignment | Online metric on B subset; auto-rollback if metric drops | Medium: limited blast radius |

Each strategy answers a different question. Borja must internalize that the strategies are not interchangeable: picking the wrong one is a recognized class of production incident in industry.

A/B testing

Question: "Given two candidate grammar tutors A and B in production, which produces a better outcome?"

Mechanism: every incoming English sentence is assigned to A or B by hashing (request_id, salt) mod 2. Both versions serve live traffic, each to its own share of users. A user-side metric (was the correction accepted, did the user re-submit a revised sentence, did the user dismiss the suggestion) is logged. After enough samples, a statistical test (two-proportion z-test for binary outcomes, t-test for continuous ones) decides.
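The decision step for a binary outcome can be sketched as follows. This is a minimal illustration of a pooled two-proportion z-test using only the standard library; the function name and the counts in the usage note are hypothetical, not part of the course code.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided two-proportion z-test with a pooled standard error.

    Returns (z, p_value). A positive z means arm B has the higher rate.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Happens when every outcome is identical (all 0s or all 1s).
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z_test(100, 1000, 200, 1000)` compares a 10% acceptance rate on A with a 20% rate on B. The sample size and the significance threshold should be fixed before looking at the result, as the weaknesses below explain.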

Strengths: unambiguous causal signal. The randomization is the design.

Weaknesses:

  1. Both versions ship. Bugs in B are observed by real learners. Not appropriate for validating a new tutor — appropriate for choosing between two already-validated tutors.
  2. Statistical-test prerequisites. The metric must be measurable per-request, the sample size large enough, and the analysis pre-specified (not p-hacked).
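  3. Multiple-test issues. If you A/B many things at once, the false-positive rate compounds, so a multiple-comparison correction such as Bonferroni is needed. Checking results repeatedly before the planned sample size is reached inflates false positives in the same way and calls for a sequential-testing correction.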
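To make the compounding in item 3 concrete, here is a small arithmetic sketch. It assumes the tests are independent, which real experiments only approximate; the function names are illustrative.

```python
def family_wise_error_rate(alpha, num_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(alpha, num_tests):
    """Per-test significance threshold that keeps the family-wise rate <= alpha."""
    return alpha / num_tests
```

With `alpha = 0.05` and 10 simultaneous tests, the chance of at least one false positive is about 40%, and the Bonferroni per-test threshold is 0.005.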

For the grammar tutor, the user metric is hard to define online — "was the correction right?" requires reference to the canonical conjugation table, not a learner click. A learner might dismiss a correct correction because they disagree, or accept an incorrect one because they trust the tool. A/B is therefore mainly useful for latency and availability comparisons here, not for correctness.

Shadow traffic

Question: "Is candidate tutor B safe to promote? Are its corrections at parity with production A, plus or minus some tolerable diff?"

Mechanism: every incoming sentence goes to A and is also computed by B in parallel. A's correction is served to the learner. B's correction is logged for offline analysis. The learner never sees B.
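A minimal sketch of that request path, under simplifying assumptions: `model_a` and `model_b` are plain callables, `log` is an in-memory list, and B runs sequentially after A for readability. A real deployment would run B off the serving path so it cannot add latency. None of these names come from the course code.

```python
def handle_shadow(sentence, model_a, model_b, log):
    """Serve A's correction; compute B's correction only for the log."""
    served = model_a(sentence)
    try:
        shadow = model_b(sentence)
        shadow_error = None
    except Exception as exc:
        # B may be arbitrarily broken; its failure must never reach the learner.
        shadow = None
        shadow_error = repr(exc)
    log.append(
        {
            "sentence": sentence,
            "served": served,       # what the learner saw (A)
            "shadow": shadow,       # what B would have said
            "shadow_error": shadow_error,
        }
    )
    return served
```

The key property is that the return value depends only on A. Everything B produces, including exceptions, ends up in the log for offline analysis.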

Strengths:

  1. Zero user-facing risk. B can be arbitrarily broken — the learner only ever sees A.
  2. Diffability. You see exactly where A and B disagree. For grammar correction, that's a list: "on these 47 input sentences, A said 'goed → went', B said 'goed → goen'." Easier to triage than a metric drop.
  3. No statistical-test machinery required. Just exact diff + grading against the canonical Phase 20 conjugation table.

Weaknesses:

  1. 2× compute. Every request runs through both models. For an inference stack already running near capacity, shadowing doubles the cost.
  2. No online signal. You can verify that B disagrees with A 7% of the time, but you can't tell whether those disagreements are improvements or regressions without offline reference comparison (which the Phase 20 eval set provides — we always have ground truth for grammar).
  3. State drift. If A and B have different tokenizers (e.g., B is a LoRA on a different base tokenizer), the diff includes noise from tokenization, not just model behavior. We enforce same-tokenizer for A/B/shadow pairs.
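The offline step that turns a raw disagreement rate into improvements and regressions can be sketched like this. The record layout (`sentence`, `served`, `shadow`) and the `reference` mapping from sentence to canonical correction are assumptions for illustration; the actual Phase 20 data format is not shown in this document.

```python
def triage_diffs(records, reference):
    """Bucket shadow log records by how A and B compare to ground truth."""
    buckets = {
        "agree": [],         # A and B gave the same correction
        "improvement": [],   # B matches the reference, A does not
        "regression": [],    # A matches the reference, B does not
        "both_wrong": [],    # they disagree and neither matches
        "no_reference": [],  # they disagree but no ground truth is available
    }
    for record in records:
        a, b = record["served"], record["shadow"]
        if a == b:
            buckets["agree"].append(record)
            continue
        truth = reference.get(record["sentence"])
        if truth is None:
            buckets["no_reference"].append(record)
        elif b == truth:
            buckets["improvement"].append(record)
        elif a == truth:
            buckets["regression"].append(record)
        else:
            buckets["both_wrong"].append(record)
    return buckets
```

Reviewing the `regression` bucket first keeps triage focused on the sentences where promoting B would make learners worse off.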

For the grammar tutor, shadow is the default for new model evaluation. Serving a wrong conjugation correction to a learner trying to learn is a real harm; 2× compute is cheap by comparison.

Canary deploys

Question: "Can I progressively roll out tutor B with a fast rollback if something goes wrong?"

Mechanism: start with 99% A, 1% B. After T₁ minutes with no metric regression, move to 95/5. Then 80/20, then 50/50, then 100% B. At every step, an automated metric guard (refusal rate up? p99 latency up? error rate up?) watches B; if it trips, it triggers an automated rollback that flips traffic back to 99% A.
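The ramp can be sketched as a tiny state machine. The stage percentages and the rollback target (1% B, that is, 99% A) follow the text above; the function and parameter names are hypothetical, and the guard evaluation is reduced to a single boolean.

```python
# Percent of traffic on B at each stage, as listed in the text.
STAGES = [1, 5, 20, 50, 100]


def next_b_percent(current, guards_ok, stages=STAGES, rollback_to=1):
    """Return the next B traffic percentage after one observation window.

    `current` must be one of `stages`; list.index raises ValueError otherwise.
    """
    if not guards_ok:
        # Any tripped guard flips traffic back to the rollback share.
        return rollback_to
    position = stages.index(current)
    # Advance one stage, staying at the final stage once it is reached.
    return stages[min(position + 1, len(stages) - 1)]
```

For example, `next_b_percent(5, True)` advances to 20, while `next_b_percent(50, False)` returns 1.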

Strengths:

  1. Bounded blast radius. At 1%, a catastrophic bug in B affects 1% of learners. By 50%, the bug should have been caught.
  2. Reversibility. Rollback is a config flip, not a redeploy.
  3. Online signal. Unlike shadow, the rollout produces real user-facing metrics on B (refusal rate, latency, error rate).

Weaknesses:

  1. Requires online metrics that don't need ground truth. For the grammar tutor, the online metrics are: refusal rate, latency, error rate, average response length. Correctness still requires offline grading. Canary catches operational regressions, not correctness regressions.
  2. Slow. A typical canary takes 30–120 minutes per stage. For frequent deploys, this becomes a bottleneck.
  3. Confounded with time. A spike in latency during the canary might be due to B or due to the load that hour. Holding A as a control during the canary mitigates this.

For the grammar tutor, canary is the production rollout mechanism after shadow has validated B. It is not a substitute for shadow.

The combined playbook

The intended flow for the grammar tutor (and most real systems) is:

1. Offline: train + Phase 20 eval. Pass → candidate.
2. CI deploy gate: per-bucket conjugation accuracy doesn't regress vs eval_baseline.json. Pass → promote to "candidate" semver.
3. Shadow: 100% of production traffic mirrored to B for ≥ 1 week. Offline diff against the canonical conjugation table.
4. Offline diff review. Pass → ready-to-canary.
5. Canary: 1% → 5% → 20% → 50% → 100% over hours/days, with automated metric guards (latency, refusal, error rate).
6. Promote: B is now production. Old A enters the registry as historical.
7. A/B (optional): for *choice* between two equally-validated tutors, run an A/B on a clear online metric (latency, refusal rate, or learner-side acceptance proxy).

This pipeline catches different failure modes at each stage:

  • Stage 1 catches eval-set regressions.
  • Stage 2 catches bucket-level regressions invisible to aggregate metrics ("overall accuracy held, but accuracy on the 3rd-person-singular past participle dropped 8pp").
  • Stage 3 catches production-distribution regressions invisible to eval (the eval set weights all 20 verbs equally; live traffic may be skewed toward "go", "have", "be").
  • Stage 5 catches operational regressions (memory leaks, latency spikes, infrastructure incompatibilities).
  • Stage 7 catches user-preference differences.

Skipping any stage means a failure mode goes unmonitored.
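As an illustration of why stage 2 checks buckets rather than the aggregate, here is a minimal sketch of a per-bucket regression check. It assumes both the baseline and the candidate results are flat mappings from bucket name to accuracy; the real schema of eval_baseline.json is not shown here, so treat the layout and the names as hypothetical.

```python
def bucket_regressions(baseline, candidate, tolerance=0.0):
    """Return the buckets where the candidate falls below the baseline.

    The result maps bucket name to (baseline_accuracy, candidate_accuracy).
    A bucket missing from the candidate counts as a failure.
    """
    failures = {}
    for bucket, base_acc in baseline.items():
        cand_acc = candidate.get(bucket)
        if cand_acc is None or cand_acc < base_acc - tolerance:
            failures[bucket] = (base_acc, cand_acc)
    return failures
```

With a baseline of `{"go": 0.90, "have": 0.92}` and a candidate of `{"go": 0.99, "have": 0.84}`, the mean accuracy rises, yet the check still reports `have` as a regression.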

Routing mechanics

All three strategies need a router — code that decides, per request, which model handles it. We add this in src/miniserve/traffic.py (extending the Phase 33 module — see PHASE_38_PLAN.md §3 on the no-new-module constraint). Design constraints:

  1. Stateless. The router decision is hash(request_id, salt) mod denominator. No database, no session state. This makes the router trivially replicable across N inference workers.
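  2. Sticky. A given request_id always routes to the same model within a session. This only holds if the hashed identifier stays the same across all of a session's requests; an identifier that changes on every request gives deterministic routing but not stickiness. Otherwise A/B comparisons are polluted by users seeing both.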
  3. Configurable per route. Different endpoints can have different strategies. /v1/grammar/correct runs shadow during the rollout window; /v1/latency-test runs A/B for benchmark studies.
  4. Observable. Every router decision is logged with the trace ID. Phase 34's miniobserve adds a traffic.arm span attribute and a traffic_assignments_total{arm,strategy} counter.

The router does not do load balancing, retries, or circuit-breaking — those belong in src/miniserve/'s existing request pipeline (Phase 33). The router only decides which model handles a request that has already been routed to a worker.
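A minimal sketch of such a stateless decision, not the actual contents of src/miniserve/traffic.py: the function name, the `b_numerator` parameter, and the choice of SHA-256 are assumptions. It uses hashlib rather than Python's built-in `hash()`, because string hashing with `hash()` is randomized per process by default and would not agree across workers.

```python
import hashlib


def route(request_id, salt, b_numerator, denominator=100):
    """Pick an arm from hash(request_id, salt) mod denominator.

    Requests whose bucket falls below `b_numerator` go to B, the rest to A.
    The result depends only on the arguments, so every worker agrees.
    """
    digest = hashlib.sha256(f"{salt}:{request_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % denominator
    return "B" if bucket < b_numerator else "A"
```

Under this sketch, `b_numerator=50` gives an even A/B split and `b_numerator=1` gives a 1% canary. Shadow needs no split at all, since every request is served by A.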

Anti-pattern: "soft launch"

A common industry mistake: "we'll just deploy B to everyone for 5 minutes and see if anything breaks". This is neither A/B (no comparison), nor shadow (users see B), nor canary (no progressive rollout). It is a prayer. Phase 38's playbook explicitly rules this out. The CI gate (theory/05) is the structural defense — promotion happens through CI, not through a manual deploy. There is no "just deploy and watch" path.

What "metric guards" actually mean

The automated rollback for canary requires guards — predicates over a metric stream that, if violated, trigger a flip. Examples for the grammar tutor:

  • p99_latency_increase > 30% over a 5-minute window vs the A baseline → rollback.
  • error_rate_5xx > 0.5% over a 1-minute window → rollback immediately.
  • refusal_rate_delta > 5pp (percentage points) over 10 minutes → rollback. A spike in "I can't determine the correct form" responses is a signal that B has lost capability.
  • avg_response_length_delta > 50% over 10 minutes → rollback. Wildly longer responses may indicate a generation loop.

The guards are codified in a canary_config.yaml (loaded by src/miniserve/) and not negotiated mid-rollout. They are committed before the rollout starts and reviewed in the same PR that bumps the candidate semver.
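The example guards above can be expressed as plain predicates. This sketch hard-codes them in Python for illustration; it does not reflect the real layout of canary_config.yaml, and the class and function names are hypothetical. Rates and deltas are written as fractions, so 5pp is 0.05.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guard:
    metric: str
    threshold: float      # rollback when the observed value exceeds this
    window_minutes: int   # window over which the metric is measured


GUARDS = (
    Guard("p99_latency_increase", 0.30, 5),
    Guard("error_rate_5xx", 0.005, 1),
    Guard("refusal_rate_delta", 0.05, 10),
    Guard("avg_response_length_delta", 0.50, 10),
)


def violated_guards(guards, observed):
    """Return the names of guards that should trigger a rollback.

    `observed` maps metric name to the value measured over that guard's
    window. A missing metric counts as a violation (fail closed); that is
    a design choice of this sketch, not something the text specifies.
    """
    failed = []
    for guard in guards:
        value = observed.get(guard.metric)
        if value is None or value > guard.threshold:
            failed.append(guard.metric)
    return failed
```

A rollback is triggered whenever the returned list is non-empty. Because the dataclass is frozen, the thresholds cannot be changed in code once the rollout has started.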

Stickiness and learner experience

A grammar tutor often sees the same learner submit several sentences in a session. Sticky assignment ensures one learner always sees the same tutor — otherwise the learner might see "I went" accepted by A on sentence 1, then "I goed" accepted by B on sentence 2, then "I went" again on sentence 3. The inconsistency is worse than either tutor alone.

The salt is per-deploy: prod-2026-MM. Bumping the salt re-shuffles assignment, which is sometimes desired (e.g., to break correlation between a specific learner and a specific arm in repeated A/Bs). It is not desired during a single experiment.
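The effect of a salt bump can be checked with a short sketch. The hashing scheme and the salt strings below are placeholders for illustration, and `key` stands for whatever identifier the router hashes.

```python
import hashlib


def arm(key, salt):
    """Deterministic 50/50 assignment from a salted SHA-256 hash."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    return "B" if digest[0] % 2 else "A"


def reshuffled_fraction(keys, old_salt, new_salt):
    """Fraction of keys whose arm changes when the salt changes."""
    keys = list(keys)
    moved = sum(arm(k, old_salt) != arm(k, new_salt) for k in keys)
    return moved / len(keys)
```

With an unchanged salt the fraction is exactly 0, which is the stickiness a single experiment relies on. With a new salt, roughly half of the keys should switch arms in a 50/50 split, which is what breaks the correlation between a learner and an arm across repeated experiments.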

Drill problems (work these before lab 01)

Solutions in solutions/02-traffic-ref.md — written at phase open.

  1. A new LoRA grammar tutor (Phase 28) has 5pp better conjugation accuracy on the Phase 20 eval. Which strategy do you use to validate it in production, and in what sequence? Justify.
  2. A bug in src/miniserve/traffic.py flips the hash, so learners sometimes see A and sometimes B in the same session. What metric is corrupted, and how do you detect this from logs?
  3. A canary at 5% reports a 2pp drop in refusal rate. Is this a regression or an improvement? What additional data do you need to decide?
  4. You're running shadow with A as production and B as candidate. On a particular sentence "She has went to the store", A says "correct" (wrong — should flag "has went" → "has gone"), B says "She has gone to the store" (right). The diff log captures this. What changes about your promotion decision relative to a case where both models agree?

One-paragraph recap

A/B answers "which is better?"; shadow answers "is the candidate safe?"; canary answers "can I roll out with rollback?". They are not interchangeable. The combined playbook is CI gate → shadow → canary → promote, optionally followed by A/B for choices. The router (in src/miniserve/traffic.py) is stateless, sticky, and configurable per route. Metric guards on canary are pre-committed predicates, not negotiations during rollout. The grammar tutor's correctness signal is fundamentally offline (against the canonical conjugation table) — canary catches operational regressions, shadow catches correctness regressions.

Next: theory/03-drift-detection.md.