Lab 04 — CI deploy gate: regression-block the grammar tutor

Goal: wire .github/workflows/deploy-grammar-tutor.yml. Verify it promotes a clean candidate; verify it refuses a deliberately-regressed candidate.

Estimated time: 3–5 hours.

Prereq: labs 00–03 done. eval_baseline.json committed at the repo root with per-bucket accuracy from the active production registry entry. MLflow tracking server reachable from the GitHub Actions runner (for the curriculum: a public read-only MLflow on the laptop, or in-CI MLflow with the artifacts pushed as a workflow artifact).


What you produce

experiments/38-ci-gate/ containing:

  • degraded_model/ — a copy of an existing registered checkpoint with a deliberate per-bucket regression injected (e.g., weights of the past-participle head perturbed; or a wrapper that randomizes 5% of past-participle outputs).
  • register_degraded.py — uploads the degraded model to MLflow, captures the run_id.
  • pr_clean.md — narrative + screenshot of the CI run that passed and promoted a clean candidate.
  • pr_regressed.md — narrative + screenshot of the CI run that failed on the regressed candidate.
  • manifest.json.

Plus, outside the experiment directory:

  • .github/workflows/deploy-grammar-tutor.yml — the CI workflow.
  • scripts/mlops/compare_baseline.py — the comparison logic invoked by the workflow.
  • eval_baseline.json — committed baseline (may already exist after lab 00).

TODOs

Block A — write the workflow

  • Create .github/workflows/deploy-grammar-tutor.yml. Trigger: PR labeled deploy-candidate with a comment containing mlflow_run_id=<id> semver=v0.X.Y.
  • Six stages per theory/05 Part 1:
      1. actions/checkout@v4 + astral-sh/setup-uv@v3 + uv sync --frozen.
      2. dvc pull data/eval/phase-20.jsonl.dvc (and the corpus if needed). Assert that the DVC hash matches eval_baseline.json["eval_set_dvc_hash"].
      3. mlflow artifacts download --run-id ${{ inputs.run_id }} --dst-path ./candidate/. The inputs context is only populated for workflow_dispatch and workflow_call; with the label-plus-comment trigger, pass in the run ID parsed from the PR comment instead.
      4. python -m minieval --bundle ./candidate/ --eval-set data/eval/phase-20.jsonl --output candidate_eval.json.
      5. python -m scripts.mlops.compare_baseline --candidate candidate_eval.json --baseline eval_baseline.json --tolerance 0.02. Exit 0 = pass, non-zero = fail.
      6. On pass: python -m scripts.mlops.registry promote --canonical-sha <derived from candidate> --semver <from PR comment>. Then gh pr comment with the promoted SHA + semver. On fail: gh pr comment with the failing buckets and exit non-zero.
  • Pin every action to a commit SHA, not a tag: write uses: actions/checkout@<sha> with the tag in a trailing comment (# v4), not @v4. Supply-chain hygiene (cross-reference security/supply-chain.md).
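The trigger comment and the stage 2 hash assertion both come down to a small amount of parsing. A minimal sketch, assuming the comment format given above, that MLflow run IDs are 32-character hex strings, and that the .dvc file records its hash on an md5: line. The helper names parse_deploy_comment and assert_eval_hash are hypothetical, not part of the lab scaffold.

```python
import re

# Assumed comment format, from the trigger description:
#   mlflow_run_id=<id> semver=v0.X.Y
# Assumption: run IDs are 32 lowercase hex characters.
COMMENT_RE = re.compile(
    r"mlflow_run_id=(?P<run_id>[0-9a-f]{32})\s+semver=(?P<semver>v\d+\.\d+\.\d+)\b"
)

# Assumption: the .dvc file lists the eval set hash on a line like "- md5: <hex>".
DVC_MD5_RE = re.compile(
    r"^\s*-?\s*md5:\s*(?P<md5>[0-9a-f]+(?:\.dir)?)\s*$", re.MULTILINE
)


def parse_deploy_comment(body: str) -> tuple[str, str]:
    """Return (run_id, semver) from a PR comment, or raise ValueError."""
    match = COMMENT_RE.search(body)
    if match is None:
        raise ValueError("expected 'mlflow_run_id=<id> semver=v0.X.Y' in the comment")
    return match["run_id"], match["semver"]


def assert_eval_hash(dvc_file_text: str, baseline: dict) -> None:
    """Fail fast if the pulled eval set is not the one the baseline was built on."""
    match = DVC_MD5_RE.search(dvc_file_text)
    if match is None:
        raise ValueError("no md5 entry found in the .dvc file")
    expected = baseline["eval_set_dvc_hash"]
    if match["md5"] != expected:
        # SystemExit with a message exits non-zero, which fails the CI step.
        raise SystemExit(f"eval set hash {match['md5']} != baseline {expected}")
```

Validating the comment against a strict pattern before it reaches a shell command also keeps untrusted PR text out of the later stages.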

Block B — write the comparison logic

  • scripts/mlops/compare_baseline.py:
  • Loads candidate_eval.json and eval_baseline.json.
  • For each bucket in the baseline, checks candidate.bucket.accuracy >= baseline.bucket.accuracy - tolerance. With accuracies on a 0–1 scale, --tolerance 0.02 means 2 percentage points.
  • If any bucket fails, prints a markdown-formatted table of failing buckets to stdout and exits with code 2.
  • If all buckets pass, prints a one-line summary and exits 0.
  • Unit-test it: synthetic candidate + baseline JSONs, both pass and fail cases. Tests live in tests/mlops/test_compare_baseline.py.
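The comparison described above can be sketched as follows. This assumes both JSON files share the shape {"buckets": {"<name>": {"accuracy": <float in 0–1>}}}, which is a hypothetical layout; adapt the key access to whatever minieval actually writes. A bucket that is present in the baseline but missing from the candidate is treated as a failure here, which is a design choice, not something the lab prescribes.

```python
import argparse
import json
from pathlib import Path
from typing import Optional


def find_regressions(
    candidate: dict, baseline: dict, tolerance: float
) -> list[tuple[str, float, Optional[float]]]:
    """Return (bucket, baseline_acc, candidate_acc) for every failing bucket."""
    failures = []
    for name, base in baseline["buckets"].items():
        base_acc = base["accuracy"]
        cand = candidate["buckets"].get(name)
        cand_acc = None if cand is None else cand["accuracy"]
        # Missing buckets fail; present buckets fail if they drop past the tolerance.
        if cand_acc is None or cand_acc < base_acc - tolerance:
            failures.append((name, base_acc, cand_acc))
    return failures


def format_table(failures: list[tuple[str, float, Optional[float]]]) -> str:
    """Render failing buckets as a markdown table for the PR comment."""
    lines = ["| bucket | baseline | candidate |", "| --- | --- | --- |"]
    for name, base_acc, cand_acc in failures:
        cand_text = "missing" if cand_acc is None else f"{cand_acc:.4f}"
        lines.append(f"| {name} | {base_acc:.4f} | {cand_text} |")
    return "\n".join(lines)


def main(argv: Optional[list[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Compare candidate eval to baseline.")
    parser.add_argument("--candidate", type=Path, required=True)
    parser.add_argument("--baseline", type=Path, required=True)
    parser.add_argument("--tolerance", type=float, default=0.02)
    args = parser.parse_args(argv)

    candidate = json.loads(args.candidate.read_text())
    baseline = json.loads(args.baseline.read_text())
    failures = find_regressions(candidate, baseline, args.tolerance)
    if failures:
        print(format_table(failures))
        return 2  # non-zero exit blocks promotion
    print(f"all {len(baseline['buckets'])} buckets within {args.tolerance} of baseline")
    return 0
```

To use it as the module entry point, call sys.exit(main()) under an if __name__ == "__main__" guard. Keeping find_regressions free of I/O makes the unit tests in tests/mlops/test_compare_baseline.py a matter of passing in two dicts.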

Block C — clean-candidate dry run

  • Open a PR titled chore: deploy v0.3.1 (rerun of LoRA grammar tutor, no model change). Add label deploy-candidate. Comment mlflow_run_id=<existing_lora_run_id> semver=v0.3.1.
  • Watch the workflow run. Expected behavior: stage 5 passes (no regression vs baseline), stage 6 promotes. Confirm the registry's index.jsonl has a new line and tags.json has the new semver.
  • Screenshot the green CI run + the promoted-comment from the bot.
  • Save as pr_clean.md in the experiment directory.

Block D — regression-injection

  • Take an existing registered model (the LoRA grammar tutor is a good candidate). Apply a targeted perturbation:
  • Option 1 (simpler): wrap the model in degraded_model/wrapper.py that randomly mangles 10% of past-participle outputs. This is easier than perturbing weights and is still observable as a regression by the eval gate.
  • Option 2 (deeper): add Gaussian noise to the LoRA adapter's B matrix for the past-participle decoder slice. More realistic but slower to inject.
  • Run the eval locally to confirm: aggregate accuracy may stay close to baseline, but the past-participle bucket drops by > 2pp.
  • Upload to MLflow as a new run; capture the run_id.
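Option 1 can be sketched as a thin wrapper. This assumes the model is callable as prompt -> output string and that you can supply a predicate that identifies past-participle items; both interfaces are hypothetical, so adapt them to the real bundle. The RNG is seeded so the injected regression is reproducible across eval runs.

```python
import random
from typing import Callable


class DegradedModel:
    """Wraps a model callable and corrupts a fraction of past-participle outputs."""

    def __init__(
        self,
        model: Callable[[str], str],
        is_past_participle: Callable[[str], bool],  # hypothetical bucket predicate
        rate: float = 0.10,
        seed: int = 0,
    ) -> None:
        self._model = model
        self._is_target = is_past_participle
        self._rate = rate
        self._rng = random.Random(seed)  # private RNG, independent of global state

    def __call__(self, prompt: str) -> str:
        output = self._model(prompt)
        # Only the targeted bucket is touched, so other buckets stay at baseline.
        if self._is_target(prompt) and self._rng.random() < self._rate:
            return "<mangled>"  # any string the eval will score as wrong
        return output
```

Because only one bucket is corrupted, aggregate accuracy moves little while the past-participle bucket drops by roughly the mangling rate times its baseline accuracy.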

Block E — regressed-candidate dry run

  • Open a PR titled chore: deploy v0.3.2 (degraded — should fail CI). Add label deploy-candidate. Comment mlflow_run_id=<degraded_run_id> semver=v0.3.2.
  • Watch the workflow run. Expected behavior: stage 5 fails because tense.past_participle regresses by more than 2pp. The bot comments on the PR with the failing bucket. Stage 6 does not execute. The registry does not change.
  • Screenshot the red CI run + the failing-bucket comment.
  • Save as pr_regressed.md in the experiment directory.
  • Close the PR without merging — the point is that the gate worked, not that we want this PR's changes in main.

Block F — Justfile recipes + manifest

  • Add just register-model <run-id> <semver> to invoke the local (dev-mode) promotion path with a warning that the production path is CI.
  • Add just compare-baseline <candidate.json> to dry-run the comparison locally.
  • manifest.json lists: the clean run_id used for Block C, the degraded run_id used for Block E, the eval_baseline.json SHA at the time of each PR.
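One way to assemble manifest.json is sketched below. It records a SHA-256 of the eval_baseline.json contents as the "SHA"; if you mean the git commit or blob SHA instead, swap that in. The key names and the build_manifest helper are hypothetical.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Content hash of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(
    clean_run_id: str,
    degraded_run_id: str,
    baseline_at_clean_pr: bytes,
    baseline_at_regressed_pr: bytes,
) -> dict:
    """Collect the run IDs and baseline hashes for Blocks C and E."""
    return {
        "clean": {
            "mlflow_run_id": clean_run_id,
            "eval_baseline_sha256": sha256_hex(baseline_at_clean_pr),
        },
        "regressed": {
            "mlflow_run_id": degraded_run_id,
            "eval_baseline_sha256": sha256_hex(baseline_at_regressed_pr),
        },
    }
```

Write the result with json.dumps(manifest, indent=2). If the two hashes differ, the baseline changed between the PRs and the two runs are not directly comparable.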

Constraints

  • No --force flag in the production code path. If you find yourself adding one, stop. Local dev has LYNX_ENV=dev mode; production has CI. There is no third path.
  • Pinned GitHub Actions. Every action used in the workflow is pinned by commit SHA, not by tag. Supply-chain rule.
  • No secrets committed in the workflow file. MLflow access uses repo-level GitHub Actions secrets (set in the repo settings, not committed). Document the secret names in pr_clean.md.
  • The baseline is committed. eval_baseline.json is a tracked file. Updating it is its own PR.
  • No new src/<module>/. compare_baseline.py and registry promote live in scripts/mlops/.

Stop conditions

Done when:

  1. The workflow file exists and is syntactically valid (gh workflow list shows it).
  2. The clean-candidate PR closed with a successful CI run and a new registry entry.
  3. The regressed-candidate PR closed with a failed CI run and no registry change.
  4. Both PRs have screenshots saved in the experiment directory.
  5. scripts/mlops/compare_baseline.py has unit tests, all passing.

Pitfalls

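  • CI may not run the code you think it runs. For pull_request events, actions/checkout checks out the PR's merge commit (refs/pull/<n>/merge) by default, not the head of the PR branch; workflow_dispatch checks out the ref it was dispatched on (the default branch unless you pick another); issue_comment workflows run from the default branch. Use pull_request_target carefully: it runs in the context of the base branch with access to secrets, so checking out PR code under it is risky. Default: trigger on pull_request with types: [labeled] and run with permissions: { contents: read, pull-requests: write }.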
  • MLflow downloads are slow on cold cache. Cache the artifact download with actions/cache keyed by run_id. Saves minutes per run.
  • DVC needs auth. If the DVC remote is private (e.g., S3), the workflow needs cloud credentials. For the curriculum, the remote is local; CI just runs dvc fetch --remote local against a copy bundled with the workflow artifact. Document this in the lab README.
  • The baseline can be silently broken. If the eval set DVC hash in eval_baseline.json doesn't match the pulled corpus, all comparisons are against the wrong eval. The workflow asserts this hash in stage 2 — verify with a deliberate mismatch (commit a baseline pointing at a stale eval hash; confirm CI fails fast).
  • Promotion isn't atomic across the registry + MLflow. promote() updates tags.json and adds an MLflow tag. If one half fails, you have a half-state. Implement: write tags.json first (local, atomic via tmp+rename), then add the MLflow tag; on MLflow failure, roll back the local file and retry. Document the failure-mode in the lab README.
  • Promotion bot spam on retries. A failed workflow that's re-run shouldn't promote twice. promote() is idempotent on (canonical_sha, semver) — verify with a re-run.
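The last two pitfalls (half-state on failure, double promotion on re-run) can be handled together. A minimal sketch, assuming tags.json maps semver to canonical SHA and that the MLflow tagging step is passed in as a callable; both are hypothetical choices made to keep the example self-contained. On MLflow failure this version rolls back and re-raises, leaving the retry to the caller or to a workflow re-run.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle, indent=2, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def promote(
    tags_path: Path,
    canonical_sha: str,
    semver: str,
    tag_in_mlflow: Callable[[str, str], None],  # hypothetical MLflow tagging hook
) -> bool:
    """Return True if promoted, False if this exact promotion already happened."""
    tags = json.loads(tags_path.read_text()) if tags_path.exists() else {}
    if tags.get(semver) == canonical_sha:
        return False  # idempotent on (canonical_sha, semver): a re-run is a no-op
    if semver in tags:
        raise ValueError(f"{semver} already points at {tags[semver]}")

    previous = dict(tags)
    tags[semver] = canonical_sha
    write_json_atomic(tags_path, tags)  # local half first
    try:
        tag_in_mlflow(canonical_sha, semver)  # remote half second
    except Exception:
        write_json_atomic(tags_path, previous)  # roll back the local half
        raise
    return True
```

Returning False on a repeat lets the workflow skip the gh pr comment step, so a re-run does not post a second promotion comment.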

When to consult solutions/

After all six blocks. solutions/04-ci-deploy-gate-ref.md (phase open) reviews your workflow design, the comparison-logic edge cases, the action pinning, and the failure-rollback story.


This is the last Phase 38 lab. After completing all five (00–04), you have:

  • A registry with stable canonical SHAs over MLflow + DVC.
  • Shadow + A/B routing wired into the serving stack (no new src module).
  • A drift detector with calibrated thresholds.
  • A CpQU table per registered model.
  • A CI gate that refuses to promote a regression.

Together: the MLOps spine for the Phase 39 capstone.

Next: Phase 38 report → Phase 39 capstone.