06: Build / deploy / rollback for the §A13 grammar tutor; the CI matrix
An MLOps pipeline is not a magic piece of infrastructure: it is the procedure that turns a commit into a running binary, with a way back. This page describes that procedure for the §A13 tutor specifically, including what CI runs on every PR. Without it, Phase 40 has nothing to write a postmortem about.
The deploy flow in five steps
For the §A13 grammar tutor specifically (a small FastAPI app + a 500k-param model + the ground-truth verification table):
1. Build. Snapshot the repo at a commit SHA. Lint + type-check (`just check`). Run the full test suite (`just test`). Produce two artifacts:
   - `model.safetensors`: the weights (immutable, content-addressed).
   - `tutor-<sha>.tar`: the OCI image with FastAPI + verifier + model symlinked in.
2. Stage. Push image + model to the registry (`artifacts/` for the curriculum; a real OCI registry in production). Tag with the commit SHA and the date.
3. Pre-deploy gate. Run the eval suite (`tests/eval/` + `data/exams/phase-39-capstone.yaml`) against the staged artifact. If the eval score drops by > 2% vs the previous release → block.
4. Deploy. Atomic switch: either blue/green (two replicas, traffic-flip) or rolling update (gradual replacement). The grammar tutor is small enough that blue/green is the default; the cost of running two copies is negligible.
5. Rollback path. Always present. The previous image's tag stays in the registry; one command (`just rollback`) flips traffic back. No special procedure for "the rollback case": it's the same atomic switch in reverse.
The non-negotiable: step 5 must exist before step 4 happens. A deploy without a rollback path is broken-by-design.
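The rule that step 5 must exist before step 4 can be expressed as a precondition on the deploy itself. The following is a minimal sketch, not the tutor's actual tooling: `Registry`, `deploy`, and `NoRollbackTarget` are hypothetical names, and the registry is an in-memory stand-in.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class NoRollbackTarget(RuntimeError):
    """Raised when a deploy would leave nothing to roll back to."""


@dataclass
class Registry:
    """Hypothetical in-memory stand-in for the image registry."""
    tags: List[str] = field(default_factory=list)  # staged image tags
    live: Optional[str] = None                     # tag currently serving traffic


def deploy(registry: Registry, new_tag: str) -> Optional[str]:
    """Switch traffic to new_tag and return the tag to roll back to.

    Refuses to switch if the currently live image is no longer in the
    registry, because the rollback path would then be missing.
    """
    if new_tag not in registry.tags:
        raise ValueError(f"{new_tag} was never staged")
    if registry.live is not None and registry.live not in registry.tags:
        raise NoRollbackTarget(f"{registry.live} is gone from the registry")
    previous = registry.live
    registry.live = new_tag
    return previous


# Rollback is the same switch, pointed at the tag deploy() returned.
reg = Registry(tags=["sha-a", "sha-b"], live="sha-a")
previous = deploy(reg, "sha-b")
deploy(reg, previous)
```

Because rollback reuses `deploy`, there is no second code path that can rot unnoticed.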
Step-by-step for the §A13 tutor
Step 1: Build (CI on PR)
# .github/workflows/build.yml (sketch; actual file may differ)
on: pull_request
jobs:
  build:
    runs-on: ubuntu-latest  # assumed runner; `just` must be installed on it
    steps:
      - uses: actions/checkout@v4
      - run: just lint       # cheap checks first
      - run: just typecheck
      - run: just test
      - run: just eval --gate-threshold 0.95
      - run: just build-image
      - run: just push artifacts/tutor-${{ github.sha }}.tar
Order matters: cheap checks (lint, type) run before expensive ones (eval). The CI matrix in §"Per-PR CI matrix" below details what runs.
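The cheap-before-expensive ordering can be sketched as a fail-fast runner. This is an illustration only: `Stage` and `run_fail_fast` are hypothetical names, the checks are stubbed as callables, and the costs are the approximate figures from the CI matrix on this page.

```python
from typing import Callable, List, NamedTuple, Tuple


class Stage(NamedTuple):
    name: str
    cost_s: float                 # approximate wall-clock cost in seconds
    check: Callable[[], bool]     # returns True when the stage passes


def run_fail_fast(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages cheapest first and stop at the first failure.

    Returns (all_passed, names_of_stages_that_ran).
    """
    ran = []
    for stage in sorted(stages, key=lambda s: s.cost_s):
        ran.append(stage.name)
        if not stage.check():
            return False, ran
    return True, ran


stages = [
    Stage("eval", 90, lambda: True),
    Stage("lint", 5, lambda: True),
    Stage("typecheck", 15, lambda: False),  # simulated failure
    Stage("test", 60, lambda: True),
]
ok, ran = run_fail_fast(stages)
print(ok, ran)
```

With the simulated type error, the run stops after two stages and the 90-second eval never starts.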
Step 2: Stage
The build artifact is pushed to a registry. For the curriculum, this is artifacts/ in the repo (per docs/phase-38-mlops/lab/00-registry-roundtrip.md). For production, a content-addressed OCI registry.
The registry stores at minimum:
- the image (sha256-digest as primary key);
- the model weights (sha256-digest as well; symlinked from the image);
- a manifest mapping commit-SHA → image-digest → model-digest;
- the eval scores at the time of build.
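A manifest entry of this kind might look like the sketch below. The field names and the helper functions `sha256_digest`, `build_manifest`, and `verify_artifact` are assumptions made for illustration; the lab's actual manifest layout may differ.

```python
import hashlib
import json
from typing import Dict


def sha256_digest(data: bytes) -> str:
    """Content address of an artifact, in the common 'sha256:<hex>' form."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def build_manifest(commit_sha: str, image: bytes, model: bytes,
                   eval_scores: Dict[str, float]) -> dict:
    """Map commit SHA -> image digest -> model digest, plus eval scores."""
    return {
        "commit": commit_sha,
        "image_digest": sha256_digest(image),
        "model_digest": sha256_digest(model),
        "eval_scores": dict(eval_scores),
    }


def verify_artifact(manifest: dict, key: str, data: bytes) -> bool:
    """True if the bytes on disk still match the digest recorded at build."""
    return manifest[key] == sha256_digest(data)


# Placeholder bytes stand in for the real image tarball and weights file.
manifest = build_manifest("abc123", b"image-bytes", b"weight-bytes",
                          {"capstone": 0.97})
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Keying on the digest rather than on a mutable tag means a rollback target can be checked byte-for-byte before traffic is switched to it.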
Step 3: Pre-deploy gate (the eval suite)
The pre-deploy gate is the most important MLOps practice we teach. It is the deterministic substitute for "we'll watch the dashboards and revert if something goes wrong" — it catches regressions before users see them.
For the §A13 tutor, the eval runs:
- data/exams/phase-39-capstone.yaml — the capstone exam (current 5-ish items, plus the extra YAML).
- tests/eval/regression_corpus.jsonl — a frozen ~300-sentence regression set built across phases.
- tests/eval/prompt_injection_set.jsonl — the Phase 37 injection payloads with expected refusals/verifications.
If the aggregate eval score drops > 2% relative to the previous release → the PR is blocked. The 2% threshold is calibrated for this tutor: smaller models tend to give noisier eval scores and larger ones steadier scores, so tune it per project.
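The gate rule can be written down in a few lines. This sketch assumes the 2% is a relative drop (the score difference divided by the previous score) and that the three suites have already been combined into one aggregate number; `relative_drop` and `gate_passes` are hypothetical names.

```python
def relative_drop(previous: float, candidate: float) -> float:
    """Fractional drop of candidate versus previous (negative = improvement)."""
    if previous <= 0:
        raise ValueError("previous score must be positive")
    return (previous - candidate) / previous


def gate_passes(previous: float, candidate: float,
                max_drop: float = 0.02) -> bool:
    """True if the candidate may proceed; False if the PR is blocked."""
    return relative_drop(previous, candidate) <= max_drop


# 0.950 -> 0.940 is a drop of about 1.05%: allowed.
print(gate_passes(0.950, 0.940))
# 0.950 -> 0.925 is a drop of about 2.63%: blocked.
print(gate_passes(0.950, 0.925))
```

If the project reads the threshold as percentage points instead, only `relative_drop` changes: it would return `previous - candidate`.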
Step 4: Deploy (blue/green for the §A13 tutor)
Blue/green for a stateless service:
- tutor-blue running version N, taking 100% of traffic.
- tutor-green brought up running version N+1, healthy, taking 0% traffic.
- Health check + smoke test on green.
- Atomic traffic flip: load balancer points to green, blue stays warm.
- Wait the bake period (10 minutes for the tutor — enough for one quiz session at the portal).
- If no alerts in the bake → blue becomes the new spare. Otherwise → flip back.
For models with state (RAG indexes, fine-tune adapters): the data plane needs to handle the version skew during traffic flip. The tutor has no such state, so blue/green is clean.
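The blue/green sequence above reduces to a small state machine. The sketch below is an illustration under simplifying assumptions: `BlueGreen` and `release` are hypothetical names, the health check and the bake-period alert count are passed in as plain values, and no real load balancer is involved.

```python
from dataclasses import dataclass


@dataclass
class BlueGreen:
    live: str    # colour taking 100% of traffic
    spare: str   # colour kept warm at 0% of traffic

    def flip(self) -> None:
        """Atomic traffic flip: the two colours swap roles."""
        self.live, self.spare = self.spare, self.live


def release(lb: BlueGreen, candidate_healthy: bool,
            alerts_during_bake: int) -> str:
    """Run one release and report the outcome."""
    if not candidate_healthy:
        return "aborted"        # smoke test failed; traffic never moved
    lb.flip()                   # candidate now live, old version stays warm
    if alerts_during_bake > 0:
        lb.flip()               # same switch in reverse
        return "rolled_back"
    return "promoted"           # old version becomes the new spare


lb = BlueGreen(live="blue", spare="green")
print(release(lb, candidate_healthy=True, alerts_during_bake=0), lb.live)
```

The rollback branch calls the same `flip` as the deploy, which is what keeps the path exercised on every release.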
Step 5: Rollback
just rollback flips traffic back to the previous version. The previous image + weights are still in the registry, tagged by SHA. No special "rollback build": rollback is just deploy-but-pointing-at-an-older-tag.
The cost of rollback being available is one extra replica running (or warm in the registry). The cost of rollback being unavailable is documented in break/00-break-no-rollback.md.
Per-PR CI matrix
What runs on every PR. The matrix is intentionally small — bloated CI is its own anti-pattern.
| Stage | Runs on | Blocks merge? | Cost |
|---|---|---|---|
| `just lint` | every PR | yes | ~5 s |
| `just typecheck` (mypy --strict on src/) | every PR | yes | ~15 s |
| `just test` | every PR | yes | ~60 s |
| `just security` (bandit + pip-audit) | every PR (also nightly) | yes | ~20 s |
| `just eval --gate-threshold 0.95` | every PR touching src/minitutor/ or src/minimoe/ or models/ | yes | ~90 s |
| `just docs` | PRs touching docs/ | no (warns) | ~10 s |
| `just bench --quick` | nightly + opt-in | no (publishes) | ~5 min |
| Full eval suite (slow) | nightly + on release tag | yes (on release) | ~10 min |
| Pretrained-baseline drift check | weekly | no (alerts) | ~5 min |
The split between PR-blocking (cheap, deterministic) and nightly-monitoring (slow, indicative) is intentional. PR-blocking checks should total ≤ 3 min so that a small change feels fast; longer runs stay off the critical path, in the nightly and weekly jobs.
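To see how the matrix compares with the 3-minute target, the merge-blocking rows can be summed per PR. The sketch below copies the approximate costs and path triggers from the table; `blocking_stages` and `blocking_cost` are hypothetical helpers, and the example file paths are invented for illustration.

```python
from typing import List, Tuple

# (stage, approx. cost in seconds, path prefixes that trigger it).
# An empty tuple means the stage runs on every PR.
BLOCKING_STAGES = [
    ("just lint", 5, ()),
    ("just typecheck", 15, ()),
    ("just test", 60, ()),
    ("just security", 20, ()),
    ("just eval --gate-threshold 0.95", 90,
     ("src/minitutor/", "src/minimoe/", "models/")),
]
BUDGET_S = 180  # the 3-minute target for merge-blocking checks


def blocking_stages(changed_paths: List[str]) -> List[Tuple[str, int]]:
    """Merge-blocking stages that a PR touching these paths would trigger."""
    selected = []
    for name, cost, prefixes in BLOCKING_STAGES:
        if not prefixes or any(p.startswith(prefixes) for p in changed_paths):
            selected.append((name, cost))
    return selected


def blocking_cost(changed_paths: List[str]) -> int:
    """Total approximate seconds of merge-blocking CI for this PR."""
    return sum(cost for _, cost in blocking_stages(changed_paths))


print(blocking_cost(["docs/index.md"]))          # hypothetical docs-only PR
print(blocking_cost(["src/minitutor/app.py"]))   # hypothetical tutor PR
```

With the table's figures, a PR that does not trigger the eval gate costs about 100 s of blocking CI, while one that does costs about 190 s, slightly above the 180 s target if the stages run sequentially.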
The §A13-specific tests on every PR
Every PR that touches the tutor must run:
- Conjugation table integrity: the ground-truth table (`data/verbs.yaml`) hash matches what `verify.py` expects.
- Tokenizer round-trip: encode → decode → equal on a fixed Spanish/English corpus.
- Tutor end-to-end smoke: POST a fixed sentence to a local server, get the expected JSON shape.
- Injection-set replay: the Phase 37 payloads, expect the verifier to reject the wrong ones.
- Latency budget: the p50 of a fixed 50-request set must stay within ±20% of the previous release's p50.
This is the minimum set. Adding more is fine; removing any of them is a one-way door: once you stop running a check, you stop knowing whether it would have caught the regression.
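Three of these checks are simple enough to sketch as pure functions. Everything below is illustrative: the function names are hypothetical, SHA-256 is an assumed choice of hash (the one `verify.py` uses is not stated here), and the byte-level tokenizer is a toy stand-in for the tutor's real one.

```python
import hashlib
from statistics import median
from typing import Callable, Iterable, List, Sequence


def table_is_intact(raw_table: bytes, expected_sha256: str) -> bool:
    """Conjugation table integrity: file bytes hash to the expected value."""
    return hashlib.sha256(raw_table).hexdigest() == expected_sha256


def roundtrip_ok(encode: Callable[[str], List[int]],
                 decode: Callable[[List[int]], str],
                 corpus: Iterable[str]) -> bool:
    """Tokenizer round-trip: decode(encode(s)) == s for every sentence."""
    return all(decode(encode(s)) == s for s in corpus)


def latency_within_budget(latencies_ms: Sequence[float],
                          previous_p50_ms: float,
                          tolerance: float = 0.20) -> bool:
    """Latency budget: p50 stays within +/- tolerance of the previous p50."""
    p50 = median(latencies_ms)
    return abs(p50 - previous_p50_ms) <= tolerance * previous_p50_ms


def toy_encode(s: str) -> List[int]:
    """Toy byte-level tokenizer, used only to exercise the check."""
    return list(s.encode("utf-8"))


def toy_decode(ids: List[int]) -> str:
    return bytes(ids).decode("utf-8")


print(roundtrip_ok(toy_encode, toy_decode, ["El niño habló.", "She spoke."]))
print(latency_within_budget([38, 41, 45, 39, 42], previous_p50_ms=40.0))
```

Note that the ±20% band is two-sided, so a p50 that suddenly gets much faster also fails; that is often a sign the server is short-circuiting rather than doing the work.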
What goes wrong when rollback is missing
See break/00-break-no-rollback.md for the worked exercise: a bad release reaches users, and recovery time is dominated by the rebuild-from-scratch path because no prior image was preserved.
The short version: the cost of rollback being unavailable is measured in user-facing minutes of broken service, not in CI seconds. A 5-minute outage at the portal during a class is worse than 10 minutes of CI per release.
What this chapter does NOT cover
- Canary deployments (gradual traffic shift, e.g. 1% → 5% → 25% → 100%). Real, used at large scale; for the §A13 tutor, blue/green is simpler.
- Feature flags for in-app A/B (different from blue/green; ships both versions and toggles per user). Phase 38 lab 01 covers shadow-A/B at a lighter level.
- Multi-region deployment — out of §A13 scope.
- Data versioning for fine-tunes (the DVC + MLflow story); cross-ref `docs/phase-38-mlops/lab/00-registry-roundtrip.md`.
Reference
- Forsgren, Humble, Kim, "Accelerate: The Science of Lean Software and DevOps" (2018). The four key DORA metrics (deploy frequency, lead time, MTTR, change failure rate) operationalize the deploy flow above.
- Google SRE Book, ch. "Release Engineering". Blue/green and rollback discipline.
- Sculley et al., "Hidden Technical Debt in Machine Learning Systems" (NeurIPS 2015). Why MLOps differs from ops; the "small piece of glue" warning.
Next: ../break/00-break-no-rollback.md and the lab 04-ci-deploy-gate.md.