Lab 04: CI deploy gate to regression-block the grammar tutor

Goal: wire .github/workflows/deploy-grammar-tutor.yml. Verify it promotes a clean candidate; verify it refuses a deliberately-regressed candidate.

Estimated time: 3–5 hours.

Prereq: labs 00–03 done; eval_baseline.json committed at the repo root with per-bucket accuracy from the active production registry entry; an MLflow tracking server reachable from the GitHub Actions runner (for the curriculum, either a public read-only MLflow instance on the laptop, or MLflow running inside CI with the artifacts pushed as a workflow artifact).

What you produce

experiments/38-ci-gate/ containing:

degraded_model/: a copy of an existing registered checkpoint with a deliberate per-bucket regression injected (e.g., weights of the past-participle head perturbed, or a wrapper that randomizes 10% of past-participle outputs, matching Block D).
register_degraded.py: uploads the degraded model to MLflow and captures the run_id.
pr_clean.md: narrative + screenshot of the CI run that passed and promoted a clean candidate.
pr_regressed.md: narrative + screenshot of the CI run that failed on the regressed candidate.
manifest.json.

Plus, outside the experiment directory:

.github/workflows/deploy-grammar-tutor.yml — the CI workflow.
scripts/mlops/compare_baseline.py — the comparison logic invoked by the workflow.
eval_baseline.json — committed baseline (may already exist after lab 00).

TODOs

Block A: write the workflow

Create .github/workflows/deploy-grammar-tutor.yml. Trigger: PR labeled deploy-candidate with a comment containing mlflow_run_id=<id> semver=v0.X.Y.
Six stages per theory/05 Part 1:
1. actions/checkout@v4 + astral-sh/setup-uv@v3 + uv sync --frozen.
2. dvc pull data/eval/phase-20.jsonl.dvc (and the corpus if needed). Assert that the DVC hash matches eval_baseline.json["eval_set_dvc_hash"].
3. mlflow artifacts download --run-id <run_id parsed from the PR comment> --dst-path ./candidate/. The inputs context is populated only for workflow_dispatch and workflow_call, so under a pull_request trigger the run ID has to come from a step that parses the comment, not from ${{ inputs.run_id }}.
4. python -m minieval --bundle ./candidate/ --eval-set data/eval/phase-20.jsonl --output candidate_eval.json.
5. python -m scripts.mlops.compare_baseline --candidate candidate_eval.json --baseline eval_baseline.json --tolerance 0.02. Exit 0 = pass, non-zero = fail.
6. On pass: python -m scripts.mlops.registry promote --canonical-sha <derived from candidate> --semver <from PR comment>. Then gh pr comment with the promoted SHA + semver. On fail: gh pr comment with the failing buckets and exit non-zero.
Pin every action to a SHA, not a tag (@v4 → @<sha>). Supply-chain hygiene (cross-reference security/supply-chain.md).
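The trigger comment has to be turned into a run ID and a semver before stages 3 and 6 can use them. The following is a minimal sketch of that parsing step, not the course's reference implementation: the names parse_deploy_comment and DeployRequest are hypothetical, and the 32-hex-character run ID pattern assumes the tracking server uses MLflow's default run ID format.

```python
import re
from typing import NamedTuple, Optional

# Comment shape taken from the trigger description:
#   mlflow_run_id=<id> semver=v0.X.Y
# Assumption: run IDs are 32 lowercase hex characters (MLflow's default).
_COMMENT_RE = re.compile(
    r"mlflow_run_id=(?P<run_id>[0-9a-f]{32})\s+semver=(?P<semver>v\d+\.\d+\.\d+)\b"
)


class DeployRequest(NamedTuple):
    """Hypothetical container for the two values the later stages need."""
    run_id: str
    semver: str


def parse_deploy_comment(body: str) -> Optional[DeployRequest]:
    """Return the run ID and semver found in a PR comment, or None if absent."""
    match = _COMMENT_RE.search(body)
    if match is None:
        return None
    return DeployRequest(match["run_id"], match["semver"])
```

Matching against a strict pattern, instead of splitting on whitespace, also keeps arbitrary comment text from flowing into later shell commands.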

Block B: write the comparison logic

scripts/mlops/compare_baseline.py:
Loads candidate_eval.json and eval_baseline.json.
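For each bucket in the baseline, checks candidate.bucket.accuracy >= baseline.bucket.accuracy - tolerance. Accuracies are fractions in [0, 1], so --tolerance 0.02 allows a drop of up to 2 percentage points (pp).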
If any bucket fails, prints a markdown-formatted table of failing buckets to stdout and exits with code 2.
If all buckets pass, prints a one-line summary and exits 0.
Unit-test it: synthetic candidate + baseline JSONs, both pass and fail cases. Tests live in tests/mlops/test_compare_baseline.py.
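The comparison described above can be sketched as follows. This is one possible shape under stated assumptions, not the reference solution: it assumes both JSON files hold a top-level "buckets" mapping of bucket name to {"accuracy": <fraction>}, which the lab does not specify, and the function names find_regressions and render_table are hypothetical. A bucket that is missing from the candidate is treated as a failure.

```python
from __future__ import annotations

import argparse
import json
from pathlib import Path


def find_regressions(candidate: dict, baseline: dict, tolerance: float) -> list[tuple[str, float, float]]:
    """Return (bucket, baseline_acc, candidate_acc) for every failing bucket."""
    failures = []
    for name, base in sorted(baseline["buckets"].items()):
        base_acc = base["accuracy"]
        cand = candidate.get("buckets", {}).get(name)
        cand_acc = cand["accuracy"] if cand is not None else float("nan")
        # NaN compares False against everything, so a missing bucket fails here.
        if not cand_acc >= base_acc - tolerance:
            failures.append((name, base_acc, cand_acc))
    return failures


def render_table(failures: list[tuple[str, float, float]]) -> str:
    """Format failing buckets as a markdown table for the PR comment."""
    lines = ["| bucket | baseline | candidate | delta (pp) |", "|---|---|---|---|"]
    for name, base_acc, cand_acc in failures:
        delta_pp = (cand_acc - base_acc) * 100
        lines.append(f"| {name} | {base_acc:.4f} | {cand_acc:.4f} | {delta_pp:+.2f} |")
    return "\n".join(lines)


def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--candidate", type=Path, required=True)
    parser.add_argument("--baseline", type=Path, required=True)
    parser.add_argument("--tolerance", type=float, default=0.02)
    args = parser.parse_args(argv)
    candidate = json.loads(args.candidate.read_text())
    baseline = json.loads(args.baseline.read_text())
    failures = find_regressions(candidate, baseline, args.tolerance)
    if failures:
        print(render_table(failures))
        return 2  # exit code required by the lab for a failing gate
    print(f"OK: {len(baseline['buckets'])} buckets within {args.tolerance} of baseline")
    return 0

# In the real module, finish with: raise SystemExit(main())
```

Keeping find_regressions free of I/O makes the pass and fail cases in tests/mlops/test_compare_baseline.py plain dictionary fixtures.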

Block C: clean-candidate dry run

Open a PR titled chore: deploy v0.3.1 (rerun of LoRA grammar tutor, no model change). Add label deploy-candidate. Comment mlflow_run_id=<existing_lora_run_id> semver=v0.3.1.
Watch the workflow run. Expected behavior: stage 5 passes (no regression vs baseline), stage 6 promotes. Confirm the registry's index.jsonl has a new line and tags.json has the new semver.
Screenshot the green CI run + the promoted-comment from the bot.
Save as pr_clean.md in the experiment directory.

Block D: regression injection

Take an existing registered model (the LoRA grammar tutor is a good candidate). Apply a targeted perturbation:
Option 1 (simpler): wrap the model in degraded_model/wrapper.py that randomly mangles 10% of past-participle outputs. This is easier than perturbing weights and still registers as a regression at the eval gate.
Option 2 (deeper): add Gaussian noise to the LoRA adapter's B matrix for the past-participle decoder slice. More realistic but slower to inject.
Run the eval locally to confirm: aggregate accuracy may stay close to baseline, but the past-participle bucket drops by > 2pp.
Upload to MLflow as a new run; capture the run_id.
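Option 1 can be sketched as a thin wrapper around the model's prediction function. The interface here is an assumption made for illustration: the real model may not expose a predict(prompt) -> str callable, and the is_target predicate stands in for however the eval harness identifies past-participle items. The class name DegradedModel is hypothetical.

```python
import random
from typing import Callable


class DegradedModel:
    """Wraps a predict function and corrupts a fixed fraction of outputs in one bucket."""

    def __init__(
        self,
        predict: Callable[[str], str],
        is_target: Callable[[str], bool],
        rate: float = 0.10,
        seed: int = 0,
    ) -> None:
        self._predict = predict
        self._is_target = is_target
        self._rate = rate
        # Seeded so the injected regression is reproducible across eval runs.
        self._rng = random.Random(seed)

    def __call__(self, prompt: str) -> str:
        output = self._predict(prompt)
        if self._is_target(prompt) and self._rng.random() < self._rate:
            # Replace the answer with a sentinel that cannot match a real label.
            return "<mangled>"
        return output
```

Only answers that were previously correct can be broken, so the expected drop in the targeted bucket is roughly rate × baseline bucket accuracy. At a 10% rate this should clear the 2pp tolerance for any bucket whose baseline accuracy is well above 0.2, which is why the local eval run is still needed to confirm the actual drop.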

Block E: regressed-candidate dry run

Open a PR titled chore: deploy v0.3.2 (degraded — should fail CI). Add label deploy-candidate. Comment mlflow_run_id=<degraded_run_id> semver=v0.3.2.
Watch the workflow run. Expected behavior: stage 5 fails because tense.past_participle regresses by more than 2pp. The bot comments on the PR with the failing bucket. Stage 6 does not execute. The registry does not change.
Screenshot the red CI run + the failing-bucket comment.
Save as pr_regressed.md in the experiment directory.
Close the PR without merging — the point is that the gate worked, not that we want this PR's changes in main.

Block F: `Justfile` recipes + manifest

Add just register-model <run-id> <semver> to invoke the local (dev-mode) promotion path with a warning that the production path is CI.
Add just compare-baseline <candidate.json> to dry-run the comparison locally.
manifest.json lists: the clean run_id used for Block C, the degraded run_id used for Block E, the eval_baseline.json SHA at the time of each PR.
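One way to assemble the manifest is sketched below. The field names are hypothetical, and the sketch assumes "SHA" means a SHA-256 digest of the file contents; if the course expects the git blob or commit SHA instead, substitute that value.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hex SHA-256 of a file's bytes (assumed meaning of 'eval_baseline.json SHA')."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(
    clean_run_id: str,
    degraded_run_id: str,
    baseline_sha_clean_pr: str,
    baseline_sha_regressed_pr: str,
) -> dict:
    """Collect the three facts the lab asks for; key names are illustrative only."""
    return {
        "clean": {
            "run_id": clean_run_id,
            "eval_baseline_sha256": baseline_sha_clean_pr,
        },
        "regressed": {
            "run_id": degraded_run_id,
            "eval_baseline_sha256": baseline_sha_regressed_pr,
        },
    }


def write_manifest(manifest: dict, path: Path) -> None:
    """Write the manifest with stable key order so diffs stay readable."""
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
```

Recording the baseline digest separately for each PR makes it visible if eval_baseline.json changed between the clean and regressed runs.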

Constraints

No --force flag in the production code path. If you find yourself adding one, stop. Local dev has LYNX_ENV=dev mode; production has CI. There is no third path.
Pinned GitHub Actions. Every action used in the workflow is pinned by commit SHA, not by tag. Supply-chain rule.
No secrets committed in the workflow file. MLflow credentials come from repo-level GitHub Actions secrets (set in the repo settings, never committed) and are referenced by name. Document the secret names in pr_clean.md.
The baseline is committed. eval_baseline.json is a tracked file. Updating it is its own PR.
No new src/<module>/. compare_baseline.py and registry promote live in scripts/mlops/.

Stop conditions

Done when:

The workflow file exists and is syntactically valid (gh workflow list shows it).
The clean-candidate PR closed with a successful CI run and a new registry entry.
The regressed-candidate PR closed with a failed CI run and no registry change.
Both PRs have screenshots saved in the experiment directory.
scripts/mlops/compare_baseline.py has unit tests, all passing.

Pitfalls

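CI runs against the wrong ref. Which commit a job sees depends on the trigger: for pull_request events, actions/checkout defaults to the PR's merge commit (refs/pull/<n>/merge); workflow_dispatch runs on the ref chosen at dispatch time (the default branch unless you pick another); pull_request_target and issue_comment run in the context of the base or default branch. Use pull_request_target carefully: it runs with the base repository's secrets and token permissions while the PR's code is untrusted. Default: trigger on pull_request with types: [labeled] and run with permissions: { contents: read, pull-requests: write }.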
MLflow downloads are slow on cold cache. Cache the artifact download with actions/cache keyed by run_id. Saves minutes per run.
DVC needs auth. If the DVC remote is private (e.g., S3), the workflow needs cloud credentials. For the curriculum, the remote is local; CI just runs dvc fetch --remote local against a copy bundled with the workflow artifact. Document this in the lab README.
The baseline can be silently broken. If the eval set DVC hash in eval_baseline.json doesn't match the pulled corpus, all comparisons are against the wrong eval. The workflow asserts this hash in stage 2 — verify with a deliberate mismatch (commit a baseline pointing at a stale eval hash; confirm CI fails fast).
Promotion isn't atomic across the registry + MLflow. promote() updates tags.json and adds an MLflow tag. If one half fails, you have a half-state. Implement: write tags.json first (local, atomic via tmp+rename), then add the MLflow tag; on MLflow failure, roll back the local file and retry. Document the failure-mode in the lab README.
Promotion bot spam on retries. A failed workflow that's re-run shouldn't promote twice. promote() is idempotent on (canonical_sha, semver) — verify with a re-run.
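The atomicity and idempotency pitfalls above can be sketched together. This is an illustration under assumptions, not the registry's actual code: it assumes tags.json is a flat mapping of semver to canonical SHA, and tag_remote is a hypothetical stand-in for whatever call adds the MLflow tag. The sketch rolls back and re-raises; retrying is left to the caller.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def atomic_write_json(path: Path, payload: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle, indent=2, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        # Atomic when source and target are on the same filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def promote(
    tags_path: Path,
    semver: str,
    canonical_sha: str,
    tag_remote: Callable[[str, str], None],
) -> bool:
    """Promote once; return False if (canonical_sha, semver) is already recorded."""
    tags = json.loads(tags_path.read_text()) if tags_path.exists() else {}
    if tags.get(semver) == canonical_sha:
        return False  # idempotent re-run: nothing to do, no second bot comment
    if semver in tags:
        raise ValueError(f"{semver} already points at {tags[semver]}")
    atomic_write_json(tags_path, {**tags, semver: canonical_sha})
    try:
        tag_remote(canonical_sha, semver)  # stand-in for the MLflow tagging call
    except Exception:
        atomic_write_json(tags_path, tags)  # roll back the local half
        raise
    return True
```

Because the local file is restored before the exception propagates, a re-run of the failed workflow starts from the same state as the first attempt.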

When to consult `solutions/`

After all six blocks. solutions/04-ci-deploy-gate-ref.md (phase open) reviews your workflow design, the comparison-logic edge cases, the action pinning, and the failure-rollback story.

This is the last Phase 38 lab. After completing all five (00–04), you have:

A registry with stable canonical SHAs over MLflow + DVC.
Shadow + A/B routing wired into the serving stack (no new src module).
A drift detector with calibrated thresholds.
A CpQU table per registered model.
A CI gate that refuses to promote a regression.

Together: the MLOps spine for the Phase 39 capstone.

Next: Phase 38 report → Phase 39 capstone.