Lab 04: CI deploy gate (regression-block the grammar tutor)
Goal: wire `.github/workflows/deploy-grammar-tutor.yml`. Verify it promotes a clean candidate; verify it refuses a deliberately-regressed candidate.
Estimated time: 3–5 hours.
Prereq: labs 00–03 done. `eval_baseline.json` committed at the repo root with per-bucket accuracy from the active production registry entry. MLflow tracking server reachable from the GitHub Actions runner (for the curriculum: a public read-only MLflow on the laptop, or in-CI MLflow with the artifacts pushed as a workflow artifact).
What you produce
`experiments/38-ci-gate/` containing:
- `degraded_model/`: a copy of an existing registered checkpoint with a deliberate per-bucket regression injected (e.g., weights of the past-participle head perturbed; or a wrapper that randomizes 5% of past-participle outputs).
- `register_degraded.py`: uploads the degraded model to MLflow, captures the run_id.
- `pr_clean.md`: narrative + screenshot of the CI run that passed and promoted a clean candidate.
- `pr_regressed.md`: narrative + screenshot of the CI run that failed on the regressed candidate.
- `manifest.json`.
Plus, outside the experiment directory:
- `.github/workflows/deploy-grammar-tutor.yml`: the CI workflow.
- `scripts/mlops/compare_baseline.py`: the comparison logic invoked by the workflow.
- `eval_baseline.json`: committed baseline (may already exist after lab 00).
TODOs
Block A: write the workflow
- Create `.github/workflows/deploy-grammar-tutor.yml`. Trigger: PR labeled `deploy-candidate` with a comment containing `mlflow_run_id=<id> semver=v0.X.Y`.
- Six stages per `theory/05` Part 1:
  1. `actions/checkout@v4` + `astral-sh/setup-uv@v3` + `uv sync --frozen`.
  2. `dvc pull data/eval/phase-20.jsonl.dvc` (and the corpus if needed). Assert that the DVC hash matches `eval_baseline.json["eval_set_dvc_hash"]`.
  3. `mlflow artifacts download --run-id ${{ inputs.run_id }} --dst-path ./candidate/`.
  4. `python -m minieval --bundle ./candidate/ --eval-set data/eval/phase-20.jsonl --output candidate_eval.json`.
  5. `python -m scripts.mlops.compare_baseline --candidate candidate_eval.json --baseline eval_baseline.json --tolerance 0.02`. Exit 0 = pass, non-zero = fail.
  6. On pass: `python -m scripts.mlops.registry promote --canonical-sha <derived from candidate> --semver <from PR comment>`. Then `gh pr comment` with the promoted SHA + semver. On fail: `gh pr comment` with the failing buckets and exit non-zero.
- Pin every action to a SHA, not a tag (`@v4` → `@<sha>`). Supply-chain hygiene (cross-reference `security/supply-chain.md`).
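The trigger comment has to be turned into a run ID and a semver before stages 3 and 6 can use them. A minimal sketch of that parsing step follows. The function and class names are hypothetical, and the run-ID pattern assumes the 32-character lowercase hex IDs that MLflow normally generates; loosen it if your tracking server uses a different format.

```python
import re
from typing import NamedTuple, Optional

# Assumed comment shape, taken from the trigger description:
#   mlflow_run_id=<id> semver=v0.X.Y
# Assumption: run IDs are 32 lowercase hex characters.
_DEPLOY_RE = re.compile(
    r"mlflow_run_id=(?P<run_id>[0-9a-f]{32})\s+"
    r"semver=(?P<semver>v\d+\.\d+\.\d+)(?!\S)"
)


class DeployRequest(NamedTuple):
    run_id: str
    semver: str


def parse_deploy_comment(body: str) -> Optional[DeployRequest]:
    """Return the requested run ID and semver, or None if nothing matches."""
    match = _DEPLOY_RE.search(body)
    if match is None:
        return None
    return DeployRequest(match["run_id"], match["semver"])
```

Validating against a strict pattern means later stages receive known-safe values rather than raw comment text interpolated into a shell command.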
Block B: write the comparison logic
- `scripts/mlops/compare_baseline.py`:
  - Loads `candidate_eval.json` and `eval_baseline.json`.
  - For each bucket in the baseline, checks `candidate.bucket.accuracy >= baseline.bucket.accuracy - tolerance_pp`.
  - If any bucket fails, prints a markdown-formatted table of failing buckets to stdout and exits with code 2.
  - If all buckets pass, prints a one-line summary and exits 0.
- Unit-test it: synthetic candidate + baseline JSONs, both pass and fail cases. Tests live in `tests/mlops/test_compare_baseline.py`.
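One possible shape for the comparison logic is sketched below. It is not the reference solution. The JSON schema (`{"buckets": {name: {"accuracy": float}}}`) is an assumption, since the lab does not specify the file layout, and treating a bucket that is missing from the candidate as a failure is a design choice made here for illustration.

```python
import argparse
import json
from pathlib import Path
from typing import List, Optional, Tuple

Failure = Tuple[str, float, float]  # (bucket, baseline accuracy, candidate accuracy)


def failing_buckets(candidate: dict, baseline: dict, tolerance: float) -> List[Failure]:
    """Return every baseline bucket where the candidate regressed past tolerance.

    Assumed (hypothetical) schema: {"buckets": {name: {"accuracy": float}}}.
    A bucket absent from the candidate counts as a failure.
    """
    failures = []
    for name, base in baseline["buckets"].items():
        base_acc = base["accuracy"]
        cand_acc = candidate["buckets"].get(name, {}).get("accuracy", float("-inf"))
        if cand_acc < base_acc - tolerance:
            failures.append((name, base_acc, cand_acc))
    return failures


def render_table(failures: List[Failure]) -> str:
    """Format failures as a markdown table suitable for a PR comment."""
    lines = ["| bucket | baseline | candidate | delta |", "|---|---|---|---|"]
    for name, base_acc, cand_acc in failures:
        delta = cand_acc - base_acc
        lines.append(f"| {name} | {base_acc:.4f} | {cand_acc:.4f} | {delta:+.4f} |")
    return "\n".join(lines)


def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--candidate", type=Path, required=True)
    parser.add_argument("--baseline", type=Path, required=True)
    parser.add_argument("--tolerance", type=float, default=0.02)
    args = parser.parse_args(argv)

    candidate = json.loads(args.candidate.read_text())
    baseline = json.loads(args.baseline.read_text())
    failures = failing_buckets(candidate, baseline, args.tolerance)
    if failures:
        print(render_table(failures))
        return 2  # exit code 2 on any failing bucket, as specified above
    print(f"OK: {len(baseline['buckets'])} buckets within {args.tolerance} of baseline")
    return 0
```

To run it as `python -m scripts.mlops.compare_baseline`, the module would end with a `__main__` guard that passes the return value of `main()` to `sys.exit`. Keeping `failing_buckets` free of I/O lets the unit tests feed it synthetic dictionaries directly.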
Block C: clean-candidate dry run
- Open a PR titled `chore: deploy v0.3.1 (rerun of LoRA grammar tutor, no model change)`. Add label `deploy-candidate`. Comment `mlflow_run_id=<existing_lora_run_id> semver=v0.3.1`.
- Watch the workflow run. Expected behavior: stage 5 passes (no regression vs baseline), stage 6 promotes. Confirm the registry's `index.jsonl` has a new line and `tags.json` has the new semver.
- Screenshot the green CI run + the promoted-comment from the bot.
- Save as `pr_clean.md` in the experiment directory.
Block D: regression-injection
- Take an existing registered model (the LoRA grammar tutor is a good candidate). Apply a targeted perturbation:
  - Option 1 (simpler): wrap the model in `degraded_model/wrapper.py` that randomly mangles 10% of past-participle outputs. This is easier than perturbing weights and is still observable as a regression by the eval gate.
  - Option 2 (deeper): add Gaussian noise to the LoRA adapter's `B` matrix for the past-participle decoder slice. More realistic but slower to inject.
- Run the eval locally to confirm: aggregate accuracy may stay close to baseline, but the past-participle bucket drops by > 2pp.
- Upload to MLflow as a new run; capture the run_id.
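A wrapper along the lines of Option 1 might look like the sketch below. The model interface is an assumption: `predict` is taken to be a plain callable from prompt to output string, and `is_target` is a hypothetical predicate that decides whether a prompt belongs to the past-participle bucket. Adapt both to whatever your eval harness actually loads.

```python
import random
from typing import Callable


class DegradedModel:
    """Wraps a predict function and corrupts a fraction of targeted outputs."""

    def __init__(
        self,
        predict: Callable[[str], str],    # assumed interface: prompt -> output
        is_target: Callable[[str], bool], # hypothetical bucket predicate
        rate: float = 0.10,
        seed: int = 0,
    ) -> None:
        self._predict = predict
        self._is_target = is_target
        self._rate = rate
        # Seeded so the injected regression is reproducible across eval runs.
        self._rng = random.Random(seed)

    def __call__(self, prompt: str) -> str:
        output = self._predict(prompt)
        if self._is_target(prompt) and self._rng.random() < self._rate:
            # Appending a marker guarantees the output no longer matches the reference.
            return output + " <mangled>"
        return output
```

Prompts outside the targeted bucket pass through unchanged, which is what keeps aggregate accuracy close to baseline while the one bucket drops.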
Block E: regressed-candidate dry run
- Open a PR titled `chore: deploy v0.3.2 (degraded — should fail CI)`. Add label `deploy-candidate`. Comment `mlflow_run_id=<degraded_run_id> semver=v0.3.2`.
- Watch the workflow run. Expected behavior: stage 5 fails because `tense.past_participle` regresses by more than 2pp. The bot comments on the PR with the failing bucket. Stage 6 does not execute. The registry does not change.
- Screenshot the red CI run + the failing-bucket comment.
- Save as `pr_regressed.md` in the experiment directory.
- Close the PR without merging; the point is that the gate worked, not that we want this PR's changes in main.
Block F: Justfile recipes + manifest
- Add `just register-model <run-id> <semver>` to invoke the local (dev-mode) promotion path with a warning that the production path is CI.
- Add `just compare-baseline <candidate.json>` to dry-run the comparison locally.
- `manifest.json` lists: the clean run_id used for Block C, the degraded run_id used for Block E, the `eval_baseline.json` SHA at the time of each PR.
Constraints
- No `--force` flag in the production code path. If you find yourself adding one, stop. Local dev has `LYNX_ENV=dev` mode; production has CI. There is no third path.
- Pinned GitHub Actions. Every action used in the workflow is pinned by commit SHA, not by tag. Supply-chain rule.
- No secrets committed in the workflow file. MLflow access uses repo-level GitHub Actions secrets (set in the repo settings, not committed). Document the secret names in `pr_clean.md`.
- The baseline is committed. `eval_baseline.json` is a tracked file. Updating it is its own PR.
- No new `src/<module>/`. `compare_baseline.py` and `registry promote` live in `scripts/mlops/`.
Stop conditions
Done when:
- The workflow file exists and is syntactically valid (`gh workflow list` shows it).
- The clean-candidate PR closed with a successful CI run and a new registry entry.
- The regressed-candidate PR closed with a failed CI run and no registry change.
- Both PRs have screenshots saved in the experiment directory.
- `scripts/mlops/compare_baseline.py` has unit tests, all passing.
Pitfalls
- CI runs against `main`, not the PR branch. For `pull_request` events, `actions/checkout` checks out the PR's merge commit by default; for `workflow_dispatch` it checks out the ref the run was dispatched on, which is the default branch unless another is selected. Use `pull_request_target` carefully: it runs in the context of the base branch with access to secrets, so it can grant excessive permissions. Default: trigger on `pull_request: labeled` and run with `permissions: { contents: read, pull-requests: write }`.
- MLflow downloads are slow on cold cache. Cache the artifact download with `actions/cache` keyed by `run_id`. Saves minutes per run.
- DVC needs auth. If the DVC remote is private (e.g., S3), the workflow needs cloud credentials. For the curriculum, the remote is local; CI just runs `dvc fetch --remote local` against a copy bundled with the workflow artifact. Document this in the lab README.
- The baseline can be silently broken. If the eval set DVC hash in `eval_baseline.json` doesn't match the pulled corpus, all comparisons are against the wrong eval. The workflow asserts this hash in stage 2. Verify with a deliberate mismatch: commit a baseline pointing at a stale eval hash and confirm CI fails fast.
- Promotion isn't atomic across the registry + MLflow. `promote()` updates `tags.json` and adds an MLflow tag. If one half fails, you have a half-state. Implement: write `tags.json` first (local, atomic via tmp+rename), then add the MLflow tag; on MLflow failure, roll back the local file and retry. Document the failure mode in the lab README.
- Promotion bot spam on retries. A failed workflow that's re-run shouldn't promote twice. `promote()` is idempotent on (canonical_sha, semver); verify with a re-run.
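The last two pitfalls (half-state promotion and duplicate promotion on re-run) can be handled together. The sketch below shows one way to do it under stated assumptions: `tags.json` is taken to be a flat semver-to-SHA mapping, and `set_remote_tag` is a hypothetical stand-in for whatever call adds the MLflow tag. The real `promote()` signature in `scripts/mlops/registry` may differ, and the retry after rollback is left to the caller.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle, indent=2, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic on POSIX when both paths share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def promote(
    tags_path: Path,
    canonical_sha: str,
    semver: str,
    set_remote_tag: Callable[[str, str], None],  # hypothetical MLflow tagging hook
) -> bool:
    """Return True if a promotion happened, False if it was already in place."""
    # Assumed schema: tags.json maps semver -> canonical SHA.
    tags = json.loads(tags_path.read_text()) if tags_path.exists() else {}
    if tags.get(semver) == canonical_sha:
        return False  # idempotent: a workflow re-run is a no-op

    write_json_atomic(tags_path, {**tags, semver: canonical_sha})
    try:
        set_remote_tag(canonical_sha, semver)
    except Exception:
        write_json_atomic(tags_path, tags)  # roll back the local half
        raise
    return True
```

Writing the temp file into the same directory as the target matters: a rename across filesystems is not atomic.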
When to consult solutions/
After all six blocks. solutions/04-ci-deploy-gate-ref.md (phase open) reviews your workflow design, the comparison-logic edge cases, the action pinning, and the failure-rollback story.
This is the last Phase 38 lab. After completing all five (00–04), you have:
- A registry with stable canonical SHAs over MLflow + DVC.
- Shadow + A/B routing wired into the serving stack (no new src module).
- A drift detector with calibrated thresholds.
- A CpQU table per registered model.
- A CI gate that refuses to promote a regression.
Together: the MLOps spine for the Phase 39 capstone.
Next: Phase 38 report → Phase 39 capstone.