00 - Motivation: from demo to service

The thesis of this phase: what separates a demo from a service is not scale; it is traceability and regression blocking. If you cannot answer "which checkpoint corrected that conjugation?", "what corpus was it trained on?", "how much does each correction cost?", "why did this model reach production and that one did not?", you do not have a service; you have a script that got lucky.

The question this phase answers

After Phase 37 closes, Borja has a working grammar-tutor stack: BPE tokenizer trained on English verb sentences (Phase 11), Mini-GPT trained on the 20-verb × 5-tense × 3-person grid (Phase 18), INT8 + LoRA variants (Phases 26 + 28), inference server (src/miniserve/, Phase 33), observability stack (src/miniobserve/, Phase 34), security audit (Phase 37). All of it runs. None of it is operable in the sense a real service requires.

The Phase 38 question is the operational one:

"Six months from now, when a learner reports that the tutor accepted 'he goed' as correct, can you reconstruct what happened?"

The answer requires five artifacts that no prior phase produced:

A registry — every grammar-tutor model that ever served traffic, indexed by content SHA.
A lineage — for each registered model, the chain back to the verb-corpus DVC hash, code git SHA, hyperparameters, seed.
A traffic strategy — a deliberate, named way of moving from one tutor to another (A/B, shadow, canary).
A drift signal — a number that tells you when the verbs and tenses learners actually submit no longer look like what you trained on.
A CI gate — automation that refuses to promote a checkpoint whose Phase 20 conjugation accuracy regresses.

Plus a sixth, optional but cheap:

A cost-per-quality number — so "newer is better" can be falsified by "newer is also more expensive than the quality bump warrants".
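
To make the drift signal in the list above concrete, here is a minimal sketch of the Population Stability Index (PSI, one of the two drift metrics this phase uses) over categorical buckets such as tenses. The function name, the bucket labels, the counts, and the eps floor are illustrative assumptions, not the curriculum's actual drift module.

```python
import math
from collections import Counter


def psi(train_counts, live_counts, eps=1e-6):
    """Population Stability Index between two categorical count tables.

    Hypothetical helper: sums (q - p) * ln(q / p) over every bucket,
    where p is the training share and q is the live-traffic share.
    """
    buckets = set(train_counts) | set(live_counts)
    train_total = sum(train_counts.values())
    live_total = sum(live_counts.values())
    score = 0.0
    for bucket in buckets:
        # eps keeps the logarithm finite when a bucket is empty on one side.
        p = max(train_counts.get(bucket, 0) / train_total, eps)
        q = max(live_counts.get(bucket, 0) / live_total, eps)
        score += (q - p) * math.log(q / p)
    return score


# Made-up tense distributions, for illustration only.
train = Counter({"past": 400, "present": 400, "future": 200})
same_mix = Counter({"past": 40, "present": 40, "future": 20})
shifted = Counter({"past": 80, "present": 15, "future": 5})

print(psi(train, same_mix))  # identical proportions give a score of zero
print(psi(train, shifted))   # a skew toward past tense gives a positive score
```

A common rule of thumb reads a PSI above roughly 0.25 as a significant shift, but the threshold that triggers a report is a configuration choice.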

Why this is infrastructure, not framework

There is an entire industry of MLOps tools that promise to solve this for you: MLflow, Weights & Biases, Kubeflow, SageMaker Model Registry, Vertex AI Model Registry, Determined.ai, ClearML, DVC Studio, Comet, Neptune, and a long tail of startups. Most of them work. Phase 38 wraps the two we already pinned (MLflow + DVC, per §A8) and builds the minimal glue around them. Two reasons:

The semantic gap. "What is a registered model?" has a different answer in every tool. Some treat the artifact as a tarball; some as a directory; some as a manifest. Some track lineage automatically; some require explicit annotations. MLflow's model registry is convenient but allows non-deterministic SHAs across re-registration (its hash is over the on-disk layout, not canonical content). Until Borja has written one himself, even as a 150-LOC wrapper, he can't tell when the chosen tool is silently lying about reproducibility.
The microscopic-scope rule. This curriculum's spec (§10 anti-goal) excludes langchain and friends because abstraction without first-principles understanding produces theater, not skill. The same principle applies to MLOps. Build the 150-LOC registry wrapper; understand what every line does; then (post-curriculum, if Borja chooses) adopt an industry tool with eyes open.

The crucial discipline: we use MLflow as the artifact backend (it has the storage abstraction; we don't need to rewrite that). We do not use MLflow's model registry as the source of identity. The canonical entry SHA is computed by scripts/mlops/registry.py over a canonical bundle (safetensors weights + sorted-keys JSON config + parent SHA), and that SHA is the primary key for everything downstream.
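
As a rough illustration of what "canonical" means here, the sketch below hashes the three parts of the bundle in a fixed order. It is not the actual contents of scripts/mlops/registry.py: the function name, the SHA-256 choice, and the length-prefix framing are assumptions, and it presumes the safetensors file bytes are themselves stable for identical weights.

```python
import hashlib
import json
from typing import Optional


def canonical_sha(weights_bytes: bytes, config: dict, parent_sha: Optional[str]) -> str:
    """Hypothetical canonical hash over (weights, config, parent SHA)."""
    # Sorted keys and fixed separators make the JSON independent of dict
    # insertion order and of whitespace choices.
    config_bytes = json.dumps(config, sort_keys=True, separators=(",", ":")).encode("utf-8")
    parent_bytes = (parent_sha or "").encode("ascii")

    digest = hashlib.sha256()
    for part in (weights_bytes, config_bytes, parent_bytes):
        # Length-prefix each part so bytes cannot slide between fields.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


# Placeholder bytes stand in for a real safetensors file.
base = canonical_sha(b"weights", {"n_layer": 4, "d_model": 128}, None)
reordered = canonical_sha(b"weights", {"d_model": 128, "n_layer": 4}, None)
child = canonical_sha(b"weights", {"n_layer": 4, "d_model": 128}, base)
print(base == reordered, base == child)
```

Because the hash depends only on content, re-registering the same bundle yields the same key, while changing the parent yields a different one.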

What "operability" looks like, concretely

Imagine Borja's Phase 39 capstone has been running for two weeks. A user reports: "The tutor said 'I goed to the store' was correct yesterday. That's wrong — it should be 'I went'."

A demo can't answer follow-up questions. A service can answer all of these:

Which checkpoint produced that response? — traces[<request_id>].model_sha.
What was the input distribution that day? — drift_reports/2026-XX-YY.json.
What corpus was that checkpoint trained on? — lineage(<sha>).corpus_dvc_hash resolves to a specific commit of the verb-corpus.
Did the corpus include "go → went" in the irregular set? Fetch the corpus file at the commit that hash resolves to (for example, dvc get . data/processed/train.jsonl --rev <corpus_commit>) and inspect it.
What was the eval score before deploy? — registry.get(<sha>).eval_report.conjugation_accuracy_by_verb["go"].
Was the response served by the production model, a shadow, or a canary? — traces[<request_id>].traffic_assignment.
Did this checkpoint pass the CI gate? — gh run view <run_id> --log shows the eval pass-rate at promotion time.
Can we roll back? — just rollback <previous_sha>.

Each of those answers is a row in a manifest, a tag on an MLflow run, or a call into scripts/mlops/. None is bolted on after the fact — every phase from 11 onward has been writing manifests precisely because Phase 38 reads them. Phase 12's corpus_manifest.json (the DVC-tracked corpus version) is the root of the lineage walk; Phase 18's MLflow run is the next hop; Phase 28's LoRA adapter has its own MLflow run with parent_run_id pointing at Phase 18's; Phase 38 wraps the whole chain behind one lineage(<sha>) call.
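
The lineage walk described above can be pictured as following parent pointers until the root entry. The sketch below uses an in-memory dict in place of the real registry; the Entry fields, the placeholder SHAs, and the function signature are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical registry row; field names are illustrative."""
    sha: str
    parent_sha: Optional[str]
    corpus_dvc_hash: str
    git_sha: str


def lineage(registry: Dict[str, Entry], sha: str) -> List[Entry]:
    """Return the chain from the given entry back to its root ancestor."""
    chain: List[Entry] = []
    seen = set()
    current: Optional[str] = sha
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in lineage at {current}")
        seen.add(current)
        entry = registry[current]  # KeyError means a broken parent link
        chain.append(entry)
        current = entry.parent_sha
    return chain


# Placeholder entries: a base model and a LoRA adapter derived from it.
registry = {
    "sha-base": Entry("sha-base", None, "corpus-v1", "git-aaa"),
    "sha-lora": Entry("sha-lora", "sha-base", "corpus-v1", "git-bbb"),
}
for entry in lineage(registry, "sha-lora"):
    print(entry.sha, entry.corpus_dvc_hash, entry.git_sha)
```

The cycle check and the KeyError on a missing parent are deliberate: a lineage that cannot be walked to its root should fail loudly rather than return a partial answer.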

Why CI matters here specifically

A grammar tutor that occasionally accepts "he goed" is worse than no tutor at all — the user trusts the wrong correction. The floor for promotion has to be enforced by code, not memory. Concretely:

The .github/workflows/deploy-grammar-tutor.yml workflow runs the full Phase 20 eval against the candidate checkpoint every time.
It compares per-bucket conjugation accuracy against eval_baseline.json, a committed file.
If any bucket regresses by more than the configured tolerance (default 2pp), the workflow fails.
Promotion in the registry happens only as the last successful step of that workflow. Production usage has no manual register.py --force path; that flag exists only for local development.

The result: every model in production has, by construction, passed the eval gate. The CI run is the audit trail; the registry is the index; the trace is the per-request answer.
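
The comparison at the heart of the gate could look roughly like the sketch below. It assumes eval_baseline.json maps bucket names to accuracies between 0 and 1, which is a guess at the file's shape; the function name and the sample numbers are likewise hypothetical.

```python
def regressed_buckets(baseline, candidate, tolerance_pp=2.0):
    """Return buckets whose accuracy dropped by more than tolerance_pp points.

    Hypothetical helper. A bucket missing from the candidate report is
    treated as a failure and mapped to None.
    """
    failures = {}
    for bucket, base_acc in baseline.items():
        cand_acc = candidate.get(bucket)
        if cand_acc is None:
            failures[bucket] = None
            continue
        drop_pp = (base_acc - cand_acc) * 100
        # The small epsilon stops float noise from failing a drop that is
        # exactly at the tolerance.
        if drop_pp > tolerance_pp + 1e-9:
            failures[bucket] = round(drop_pp, 2)
    return failures


# Made-up per-verb accuracies, for illustration only.
baseline = {"go": 0.98, "be": 0.95, "walk": 0.99}
candidate = {"go": 0.93, "be": 0.94, "walk": 0.99}

failures = regressed_buckets(baseline, candidate)
print(failures)  # only "go" exceeds the 2pp tolerance in this example
```

In a workflow step, a non-empty result would translate into a non-zero exit code, which is what makes the job fail and keeps the promotion step from running.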

What this phase explicitly is not

It is not:

An autoscaling guide. (Vocabulary only — see theory/05.)
A FinOps consulting deck. (One number per registry entry — see theory/04.)
A Kubernetes deployment manual. (infra/k8s/ is optional, off the critical path.)
A complete drift-detection survey. (Two metrics, KL and PSI, are sufficient for the corpus we're operating against.)
A new src/<module>/. Phase 38 lives in scripts/mlops/, src/miniserve/, src/miniobserve/, and .github/workflows/.

It is the smallest possible MLOps spine that makes the Phase 39 capstone reproducible by someone who is not Borja, and refuses to deploy a worse model.

What this phase does NOT cover

A bespoke registry storage backend. MLflow's artifact store is sufficient — we wrap it for canonical hashing, not replace it.
A bespoke pipeline DAG runner. The Justfile is the orchestrator. If a workflow exceeds ~15 recipes, revisit at Phase 40.
Alerting / paging integration. Drift detection writes a report; deciding what to do about a high-PSI alert is operator territory, not Phase 38's. PagerDuty / Alertmanager / Slack hooks are future work.
Multi-region anything. Single deployment, single MLflow tracking store. Multi-region MLOps is its own multi-month project.
GPU-sharing benchmarks. No CUDA on the laptop.
Tuning the eval baseline. eval_baseline.json is committed; revising it is a PR with justification, not an automatic regeneration.

The pedagogical move

Borja has spent 37 phases building components. Phase 38 forces him to think about the system: how those components compose, fail together, and need to be replaced piecewise without disrupting service. This is the inflection point where the curriculum stops being "build the algorithm" and becomes "operate the algorithm without breaking it on Tuesday".

This shift matters for Phase 39 (capstone integration) and Phase 40 (postmortem). Without the Phase 38 spine, the capstone is a video demo; with it, the capstone is a system that another learner (or a future Borja) can rebuild from the git history alone.

Recap

The motivation is operability + regression-blocking. Operability is built from six artifacts: registry, lineage, traffic strategy, drift signal, CI gate, cost-per-quality. The next five theory files derive each artifact's mechanics. The five labs then wire them all together over the existing Phase 33 server and Phase 34 observability stack — no new src/<module>/.

Next: theory/01-registry-and-lineage.md.