

05 — CI Deploy Gates and Capacity Vocabulary

This phase teaches two distinct things in the same file: (1) the CI gate, which is the only legitimate way to promote a tutor checkpoint to production and carries its own comparison logic against a versioned baseline, and (2) a vocabulary tour on capacity (HPA on tokens/sec, MIG/MPS, spot vs on-demand), which appears throughout the literature even though Borja does not exercise it locally.


Part 1: CI as the only path to production

The thesis

A grammar tutor that occasionally accepts "he goed" is worse than no tutor at all — a learner trusts the wrong correction and internalizes it. The floor for promotion has to be enforced by code, not memory or convention. Phase 38's structural defense is: promotion to the production semver tag happens only as the last successful step of a CI workflow that ran the Phase 20 eval against the candidate and compared per-bucket accuracy against a committed baseline.

There is no --force flag. There is no "I'll just promote it locally and check later". The local registry.register_pending(run_id) exists for development; registry.promote(canonical_sha, semver) refuses to run when LYNX_ENV=prod unless the caller is the CI workflow's GitHub Actions identity.
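A minimal sketch of what such a guard could look like. The names may_promote and PromotionForbidden are hypothetical, and detecting the CI identity through the GITHUB_ACTIONS environment variable (which GitHub sets to "true" on its runners) is an assumption; the real registry may rely on a stronger signal. Restricting non-prod promotion to "-dev" tags is also an assumption made for illustration.

```python
import os
from typing import Mapping, Optional


class PromotionForbidden(RuntimeError):
    """Raised when a promotion is attempted from a disallowed environment."""


def may_promote(semver: str, env: Mapping[str, str]) -> bool:
    """Decide whether this caller may promote `semver`.

    Assumption: CI is recognised by GITHUB_ACTIONS == "true".
    """
    in_prod = env.get("LYNX_ENV") == "prod"
    in_ci = env.get("GITHUB_ACTIONS") == "true"
    if in_prod:
        return in_ci  # production: the CI identity only
    # Outside production: only dev-namespaced tags (hypothetical rule).
    return semver.endswith("-dev")


def promote(canonical_sha: str, semver: str,
            env: Optional[Mapping[str, str]] = None) -> str:
    """Placeholder for registry.promote; returns a string instead of writing."""
    env = os.environ if env is None else env
    if not may_promote(semver, env):
        raise PromotionForbidden(f"refusing to promote {semver} from this environment")
    return f"{semver} -> {canonical_sha}"
```

Passing the environment as a parameter keeps the guard testable without mutating os.environ.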

The workflow shape

.github/workflows/deploy-grammar-tutor.yml has six stages:

flowchart LR
    A[trigger: PR labeled\n'deploy-candidate'] --> B[checkout + uv sync]
    B --> C[dvc pull eval set + corpus]
    C --> D[mlflow download artifact\n by run_id]
    D --> E[run Phase 20 eval]
    E --> F{per-bucket\nregression\n> 2pp?}
    F -->|no| G[registry.promote\n+ semver tag]
    F -->|yes| H[fail workflow\n+ comment on PR]
    G --> I[notify channel:\n'promoted v0.X.Y']

Each stage is shell + a single Python script:

  1. Trigger. A PR carrying the label deploy-candidate and a comment mlflow_run_id=<id> semver=v0.3.1 triggers the workflow. The label is gated by CODEOWNERS — only Borja can label.
  2. Checkout + sync. Standard. uv sync --frozen to install the locked deps; no fresh resolution.
  3. DVC pull. dvc pull data/eval/phase-20.jsonl.dvc and the live corpus. The eval-set DVC hash is asserted against the value in eval_baseline.json's eval_set_dvc_hash field — mismatch fails immediately. This is the audit trail's guarantee that you're grading against the same eval set every time.
  4. MLflow download. mlflow.artifacts.download_artifacts(run_id=...) pulls the candidate bundle into ./candidate/.
  5. Eval. python -m minieval --bundle ./candidate/ --eval-set data/eval/phase-20.jsonl --output candidate_eval.json. Writes per-bucket accuracy.
  6. Compare and decide. python -m scripts.mlops.compare_baseline --candidate candidate_eval.json --baseline eval_baseline.json --tolerance 0.02. Exit code 0 → promote; non-zero → fail.
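The trigger comment in stage 1 has to be turned into a run ID and a semver before anything else runs. One way to parse it is sketched below; parse_trigger_comment is a hypothetical helper, and the character set accepted for the run ID is an assumption. The lookahead rejects suffixed tags such as v0.3.1-dev, on the assumption that CI should never promote a dev-namespaced version.

```python
import re
from typing import Tuple

# Matches: mlflow_run_id=<id> semver=vX.Y.Z  (no suffix allowed after the version)
TRIGGER_RE = re.compile(
    r"mlflow_run_id=(?P<run_id>[0-9A-Za-z]+)\s+"
    r"semver=(?P<semver>v\d+\.\d+\.\d+)(?!\S)"
)


def parse_trigger_comment(body: str) -> Tuple[str, str]:
    """Extract (run_id, semver) from a PR comment, or raise ValueError."""
    match = TRIGGER_RE.search(body)
    if match is None:
        raise ValueError("comment must contain 'mlflow_run_id=<id> semver=vX.Y.Z'")
    return match["run_id"], match["semver"]
```

Failing loudly on a malformed comment keeps the workflow from starting with a half-specified candidate.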

Promotion is the last step. Failure at any earlier stage means no registry change happens. This is the no-half-state contract.
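The eval-set assertion in stage 3 can be sketched as follows. This assumes the recorded eval_set_dvc_hash is the plain MD5 of the eval file's bytes; if the pipeline stores a different digest, the hashing function would need to change. Both function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """MD5 of a file, read in chunks so large eval sets are not loaded at once."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_eval_set_matches(eval_set: Path, baseline_file: Path) -> None:
    """Exit non-zero if the pulled eval set is not the one the baseline was measured on."""
    expected = json.loads(baseline_file.read_text())["eval_set_dvc_hash"]
    actual = file_md5(eval_set)
    if actual != expected:
        raise SystemExit(f"eval set hash mismatch: expected {expected}, got {actual}")
```

Raising SystemExit with a message yields a non-zero exit code, which is enough for the workflow to stop before the eval stage.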

The baseline file

eval_baseline.json lives at the repo root, alongside LYNX_CORTEX.md. Its shape:

{
  "eval_set_dvc_hash": "...",
  "baseline_sha": "<previous_promoted_canonical_sha>",
  "tolerance_pp": 0.02,
  "buckets": {
    "tense.present_simple": {"accuracy": 0.81, "n": 240},
    "tense.past_simple":    {"accuracy": 0.77, "n": 240},
    "tense.past_participle":{"accuracy": 0.72, "n": 240},
    "tense.simple_future":  {"accuracy": 0.78, "n": 240},
    "tense.infinitive":     {"accuracy": 0.88, "n": 240},
    "person.1sg":           {"accuracy": 0.82, "n": 400},
    "person.2sg":           {"accuracy": 0.79, "n": 400},
    "person.3sg":           {"accuracy": 0.71, "n": 400},
    "verb.go":              {"accuracy": 0.74, "n":  60},
    "verb.be":              {"accuracy": 0.69, "n":  60}
  }
}

(Numbers illustrative; Borja measures the real baseline at phase open.)

The comparison rule: for each bucket, candidate accuracy ≥ baseline accuracy − tolerance_pp, where tolerance_pp is stored as a fraction (0.02 means 2 percentage points). Any bucket that regresses by more than the tolerance fails the workflow.

The tolerance is per-bucket, not aggregate, because aggregate accuracy can hide bucket-level catastrophes. Across the five equal-sized tense buckets, a model that gains 2pp on four of them while losing 8pp on past-participle has a flat aggregate but is materially worse for past-participle learners.
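A minimal sketch of the per-bucket rule, using the illustrative tense numbers from the baseline file. The function name regressed_buckets, the small epsilon at the boundary, and the choice to treat a missing bucket as a failure are all assumptions, not the actual compare_baseline script. The hypothetical candidate gains 2pp on four tense buckets and loses 8pp on past participle, which leaves the mean over these five equal-sized buckets unchanged.

```python
from typing import Dict, List

EPS = 1e-9  # absorbs float noise when a bucket sits exactly on the boundary


def regressed_buckets(baseline: Dict[str, float],
                      candidate: Dict[str, float],
                      tolerance: float = 0.02) -> List[str]:
    """Buckets where candidate < baseline - tolerance; missing buckets also fail."""
    failed = []
    for bucket, base_acc in baseline.items():
        cand_acc = candidate.get(bucket)
        if cand_acc is None or cand_acc < base_acc - tolerance - EPS:
            failed.append(bucket)
    return failed


def mean(values: Dict[str, float]) -> float:
    return sum(values.values()) / len(values)


baseline = {
    "tense.present_simple": 0.81,
    "tense.past_simple": 0.77,
    "tense.past_participle": 0.72,
    "tense.simple_future": 0.78,
    "tense.infinitive": 0.88,
}
# Hypothetical candidate: +2pp everywhere except past participle, which drops 8pp.
candidate = {bucket: acc + 0.02 for bucket, acc in baseline.items()}
candidate["tense.past_participle"] = 0.72 - 0.08

aggregate_delta = mean(candidate) - mean(baseline)  # approximately zero
failed = regressed_buckets(baseline, candidate)     # only the past-participle bucket
```

An aggregate-only gate would pass this candidate; the per-bucket gate rejects it.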

Updating the baseline

eval_baseline.json is committed. Updating it is a PR with three rules:

  1. One reviewer minimum beyond the author.
  2. The PR description includes a justification for the change (a link to the experiment that established the new numbers).
  3. The previous baseline's SHA is preserved as previous_baseline_sha in the new file. The history of baselines is git history.

Updating the baseline is normal — you do it whenever a legitimate model improvement establishes a new floor. The discipline is that updating isn't silent.

The non-CI escape hatch (and why it exists)

For local development, Borja can call registry.register_pending(run_id) directly and even registry.promote(canonical_sha, semver="v0.0.0-dev") if LYNX_ENV != prod. This allows iteration without CI cycles.

The escape hatch is bounded:

  • dev semvers are namespaced and never resolved by src/miniserve/ in production.
  • The registry's index.jsonl records the LYNX_ENV at registration time. Audits can filter where env == prod.
  • The Justfile recipe just register-model <run-id> warns loudly that it's not the CI path.

The point: friction is the design. Local promotion is possible but explicit.
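An audit over index.jsonl could be sketched like this. The field names env, semver, and canonical_sha, the sample records, and the "-dev" suffix convention are assumptions for illustration; the real index schema may differ.

```python
import json
from typing import Iterable, Iterator


def prod_entries(lines: Iterable[str]) -> Iterator[dict]:
    """Yield only the registry records that were written with env == prod."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("env") == "prod":
            yield entry


def is_dev_semver(semver: str) -> bool:
    """Hypothetical convention: dev-namespaced versions end in '-dev'."""
    return semver.endswith("-dev")


# Hypothetical index contents, one JSON object per line.
INDEX_LINES = [
    '{"semver": "v0.3.0", "env": "prod", "canonical_sha": "aaa"}',
    '{"semver": "v0.0.0-dev", "env": "dev", "canonical_sha": "bbb"}',
    '{"semver": "v0.3.1", "env": "prod", "canonical_sha": "ccc"}',
]

audited = list(prod_entries(INDEX_LINES))
# A dev semver recorded under prod would mean the namespace boundary leaked.
violations = [e for e in audited if is_dev_semver(e["semver"])]
```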

What the gate does NOT cover

  • Latency regressions. The gate checks correctness, not throughput. Phase 34's load test catches latency drift; a separate latency-gate.yml workflow could be added later but is not part of Phase 38's scope.
  • Resource regressions. Memory footprint, VRAM use — same as latency, separate gate.
  • Smoke tests on the actual serving stack. The gate downloads the candidate and runs eval in isolation. A "does miniserve actually start up with this model" smoke test belongs to the Phase 39 capstone.
  • Multi-model dependencies. If the tutor calls a retriever (Phase 29) that itself has versions, the gate doesn't compose them. Future work.

The gate is the minimal deploy regression check. It is sufficient because the grammar tutor's primary failure mode is correctness regression, and the per-bucket comparison is correctness-focused.

Part 2: Capacity vocabulary (no experiments)

Phase 38's spec lists "autoscaling, GPU-sharing, spot vs on-demand" as concepts. None of them have an experiment in Phase 38 — they would require a cloud-deployed serving stack at non-trivial scale, which the curriculum reaches only at the capstone (Phase 39, minimal scale) and explicitly does not pursue at production scale (anti-goal §10).

This section is the vocabulary tour: enough understanding that Borja can read industry papers, vendor docs, and on-call playbooks without each term being a black box.

Autoscaling: HPA on the right metric

A Horizontal Pod Autoscaler (HPA) is the Kubernetes mechanism that grows or shrinks a deployment based on a metric. For a CPU-bound web service, the metric is CPU utilization. For an inference service, that's the wrong metric.

A modern inference server is GPU-bound (or, on CPU, memory-bandwidth-bound), not CPU-bound. The CPU is doing tokenization, batch packing, and HTTP. Even saturating the GPU might leave CPU at 30%. HPA-on-CPU under-scales the deployment and produces head-of-line blocking.

The right metrics for LLM inference:

  1. Tokens-per-second per replica. If this saturates, you need more replicas. Requires the serving stack (Phase 33) to export it as a Prometheus metric.
  2. Queue depth (pending requests). Cheap, predictive of latency tail. The right metric for an auto-batching server.
  3. p99 latency. Downstream — once latency degrades, you've already been overloaded.

Recommended HPA metric for the grammar tutor (if and when we deploy at scale): queue depth, with tokens/sec as a secondary scaling signal.

Anti-pattern: HPA on a flapping metric. HPA reads the metric every N seconds. If the metric oscillates (which queue depth does naturally), HPA can spawn and kill replicas every N seconds — flapping. Mitigations: HPA stabilization window (stabilizationWindowSeconds), hysteresis bands. None of this matters until you deploy. Mentioned here so the term doesn't surprise Borja later.
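To make the scaling rule and the flapping mitigation concrete, here is a simplified model in Python. The replica formula follows the one in the Kubernetes HPA documentation, desired = ceil(current × currentMetric / targetMetric), with scaling skipped when the ratio is within a tolerance (0.1 by default). The stabilize function is a reduced sketch of scale-down stabilization, not the controller's actual code, and the queue-depth numbers are hypothetical.

```python
import math
from typing import Sequence


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """HPA replica formula on a per-replica average metric (e.g. queue depth)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do nothing
    return max(1, math.ceil(current_replicas * ratio))


def stabilize(current_replicas: int, window: Sequence[int]) -> int:
    """Simplified scale-down stabilization.

    `window` holds the recommendations made inside the stabilization window,
    newest last. Scale-ups apply immediately; a scale-down only goes as low
    as the highest recommendation seen in the window.
    """
    latest = window[-1]
    if latest >= current_replicas:
        return latest
    return min(current_replicas, max(window))


# Queue depth oscillating around the target makes raw recommendations flap.
flapping = [8, 3, 8, 3]
held = stabilize(8, flapping)       # stays at 8 instead of dropping to 3
settled = stabilize(8, [3, 3, 2])   # load really fell: scale down to 3
```

With a long enough window, a single quiet sample cannot trigger a scale-down on its own.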

GPU sharing: MIG and MPS

Modern NVIDIA GPUs can be subdivided so that several processes (or several containers) share the same physical card.

MIG (Multi-Instance GPU). Available on A100, H100, B100. The GPU is partitioned at the hardware level into 1–7 instances, each with its own memory partition, compute slice, and dedicated L2 cache and memory-controller paths. Hard isolation: a process in instance 1 cannot starve instance 2. Use case: multi-tenant inference where you want to guarantee a single tenant's noisy traffic doesn't degrade another tenant's latency. Tradeoff: each partition has fewer SMs and less HBM. In a 7-way split of an A100, each instance gets roughly one-seventh of the compute and about one-eighth of the memory, which is fine for small models and terrible for a 70B model that needs all the HBM.
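The memory side of that tradeoff is simple arithmetic. The sketch below estimates weights-only memory and picks the smallest MIG profile that fits, using the A100 80GB profile sizes listed in NVIDIA's MIG documentation. The 1.2× headroom factor and the model sizes are hypothetical, and the estimate ignores KV cache and activations, so treat the result as a lower bound.

```python
from typing import Optional

# A100 80GB MIG profiles and their memory in GB.
A100_80GB_PROFILES_GB = {
    "1g.10gb": 10,
    "2g.20gb": 20,
    "3g.40gb": 40,
    "4g.40gb": 40,
    "7g.80gb": 80,
}


def weights_gb(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Weights-only footprint in GB (2 bytes per parameter for fp16/bf16)."""
    return n_params * bytes_per_param / 1e9


def smallest_fitting_profile(n_params: float, bytes_per_param: float = 2.0,
                             headroom: float = 1.2) -> Optional[str]:
    """Smallest profile whose memory covers the weights plus headroom, else None."""
    needed = weights_gb(n_params, bytes_per_param) * headroom
    for name, mem_gb in sorted(A100_80GB_PROFILES_GB.items(), key=lambda kv: kv[1]):
        if mem_gb >= needed:
            return name
    return None


small = smallest_fitting_profile(125e6)  # hypothetical 125M-parameter model
large = smallest_fitting_profile(70e9)   # 70B in fp16 does not fit any profile
```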

MPS (Multi-Process Service). Software multiplexing of a single GPU across multiple CUDA processes. No hardware partitioning. All processes share the GPU's full memory and SMs; CUDA contexts are merged so kernels from different processes can run concurrently when there's spare compute. Use case: many small inference processes that individually under-utilize the GPU. Tradeoff: no isolation. A process that allocates all the memory or runs a long kernel can starve the others.

Time-sharing. Each process gets the whole GPU for a quantum, then context-switches. Terrible for latency-sensitive inference, because the context switch is expensive. It is mostly used for batch training.

Practical recommendation for the grammar tutor: single-tenant inference, small model. One process, one GPU (when there's a GPU). Don't share. MIG/MPS become relevant only when multi-tenant or multi-model scenarios appear, neither of which is in Phase 39's scope.

Spot vs on-demand

Cloud GPU instances come in two flavors:

  • On-demand: you pay full price; the provider commits to availability.
  • Spot (AWS) / Preemptible (GCP): you pay 50–90% less; the provider can reclaim the instance with 30 seconds to 2 minutes of notice.

When spot makes sense:

  • Training. Long-running, checkpointable, restartable. A spot reclaim costs you the latest few minutes of training, not the whole run.
  • Batch inference / eval. Same logic.
  • Asynchronous serving with a queue that can tolerate minute-level interruptions.

When spot doesn't:

  • Synchronous online inference. A learner submitted a sentence 200ms ago. The instance gets reclaimed. The request fails. Don't.
  • Workloads with state that's expensive to rebuild (KV cache for a long session, in-memory tokenizer caches).

For the grammar-tutor Phase 23+ training jobs: spot. For Phase 33 online serving: on-demand, possibly with spot as cheap secondary replicas for shadow traffic.
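The case for spot on checkpointable training can be checked with back-of-the-envelope arithmetic. The model below is an assumption, not a measured result: each reclaim is taken to lose half a checkpoint interval on average plus a fixed restart overhead, and every price, discount, and reclaim count is hypothetical.

```python
def spot_training_cost(compute_hours: float, on_demand_price: float,
                       discount: float, reclaims: int,
                       checkpoint_interval_hours: float,
                       restart_overhead_hours: float = 0.0) -> float:
    """Expected spot cost, counting work lost to reclaims.

    Assumed model: a reclaim lands uniformly inside a checkpoint interval,
    so it loses interval / 2 hours on average, plus the restart overhead.
    """
    lost_hours = reclaims * (checkpoint_interval_hours / 2 + restart_overhead_hours)
    return (compute_hours + lost_hours) * on_demand_price * (1 - discount)


# Hypothetical job: 100 GPU-hours at 4.00/hour on-demand, 70% spot discount,
# 6 reclaims, a checkpoint every 30 minutes, 6 minutes to restart.
on_demand_cost = 100 * 4.00
spot_cost = spot_training_cost(
    compute_hours=100, on_demand_price=4.00, discount=0.70,
    reclaims=6, checkpoint_interval_hours=0.5, restart_overhead_hours=0.1,
)
```

Under these assumptions the six reclaims add about 2.1 hours of redone work, which is small next to the discount. The same arithmetic does not transfer to synchronous serving, where the cost of a reclaim is a failed request rather than lost compute.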

Reserved instances. Commit to a year or three of usage in exchange for a 30–60% discount. Worth it only at known steady-state load.

We covered the math in theory/04. The operational note here: every cost decision is a tradeoff. Lower CpQU is desirable; sub-second latency is desirable; high availability is desirable. You cannot maximize all three. The decision belongs to a human reading the registry's CpQU column, the serving stack's latency dashboard, and the SLO. The MLOps spine surfaces the numbers — it does not make the decision.

Glossary additions (added to GLOSSARY.md at phase-open)

  • HPA — Horizontal Pod Autoscaler. Kubernetes mechanism for scaling replicas on a metric.
  • MIG — Multi-Instance GPU. Hardware partitioning of A100/H100.
  • MPS — Multi-Process Service. Software multiplexing of a single CUDA GPU.
  • Spot — Discounted, reclaimable cloud instance.
  • Preemptible — GCP's name for spot.
  • Stabilization window — HPA setting to prevent oscillation.
  • CpQU — Cost per Quality Unit; defined in theory/04.
  • PSI — Population Stability Index; defined in theory/03.
  • Eval baseline — committed eval_baseline.json; the floor the CI deploy gate enforces.
  • Deploy gate — .github/workflows/deploy-grammar-tutor.yml; the only legitimate promotion path.

What this phase does not commit to (capacity edition)

Phase 38 does not produce:

  • A deployed cloud serving stack.
  • An autoscaling demo.
  • An MIG/MPS benchmark.
  • A spot-vs-on-demand cost spreadsheet for real workloads.

Those are Phase 39's optional infra/k8s/ material if and only if Borja chooses to deploy the capstone to a cloud cluster (not required by the spec; see LYNX_CORTEX.md §4 PHASE 39).

One-paragraph recap

CI is the only path to production. The deploy gate (.github/workflows/deploy-grammar-tutor.yml) downloads the candidate via MLflow, pulls the eval set via DVC, runs Phase 20 eval, and compares per-bucket conjugation accuracy against eval_baseline.json with a 2pp tolerance — any bucket regressing by more fails the workflow. Promotion is the last step; failure means no registry change. Local development has an escape hatch (dev semver namespace) but the production path is CI-only. Capacity vocabulary (HPA, MIG, MPS, spot) is covered for reading literature; no Phase 38 experiments — Borja has no GPU on the laptop and the capstone is single-node.

Next: the labs — lab/00-registry-roundtrip.md.