Break 00 — Deploy without a rollback path; simulate a bad release; measure the recovery cost

If you don't preserve the previous image, "rollback" stops being an atomic flip and becomes "rebuild from the previous commit": minutes to hours of degraded service. This /break removes the step that preserves the previous image, deploys a broken version, and measures how much it costs to recover.


What you'll do

Modify the release flow so the previous release's image is garbage-collected from the registry as soon as a new release is tagged (an "aggressive cleanup" anti-pattern). Then push a release that fails an obvious smoke test. Observe that the `just rollback` recipe cannot find the previous image and must rebuild from scratch.

Step 1 — Locate the release flow

Justfile                          # the `release` and `rollback` recipes
scripts/release.sh                # the script that tags + pushes images
scripts/rollback.sh               # the script that flips traffic

(File names approximate; locate the exact ones in the repo.)

Step 2 — Introduce the bug

In scripts/release.sh, the current behavior keeps the last 5 image tags in artifacts/. Change it to delete all but the current tag:

# OLD — preserve last 5 releases
just artifacts-prune --keep 5

# NEW (the broken version)
just artifacts-prune --keep 1     # only the current release survives

The release still looks healthy: the new image is up and CI is green. Nothing signals that the rollback path is gone.
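
To make the difference between the two flags concrete, here is a minimal Python sketch of what a keep-N prune does. It is not the real artifacts-prune recipe: the `prune` and `version_key` helpers are hypothetical, and the `tutor-X.Y.Z.tar` naming is taken from the release log later in this exercise.

```python
import re
from typing import List, Tuple

# Assumed artifact naming, based on the release log: tutor-<version>.tar
_ARTIFACT = re.compile(r"tutor-(\d+(?:\.\d+)*)\.tar")


def version_key(name: str) -> Tuple[int, ...]:
    """Turn 'tutor-1.2.3.tar' into (1, 2, 3) so releases sort numerically."""
    match = _ARTIFACT.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected artifact name: {name}")
    return tuple(int(part) for part in match.group(1).split("."))


def prune(artifacts: List[str], keep: int) -> Tuple[List[str], List[str]]:
    """Return (kept, removed), keeping only the `keep` newest artifacts."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    ordered = sorted(artifacts, key=version_key, reverse=True)
    return ordered[:keep], ordered[keep:]


artifacts = [
    "tutor-1.2.0.tar",
    "tutor-1.2.1.tar",
    "tutor-1.2.2.tar",
    "tutor-1.2.3.tar",
]

kept_5, removed_5 = prune(artifacts, keep=5)  # nothing is removed
kept_1, removed_1 = prune(artifacts, keep=1)  # only tutor-1.2.3.tar survives

print("keep 5 removes:", removed_5)
print("keep 1 removes:", removed_1)
```

With `keep=1` the rollback target (1.2.2) is among the removed files, which is exactly the state the rest of this exercise relies on.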

Step 3 — Push a deliberately broken release

Apply a small intentional regression: in src/minitutor/agent.py, swap the order of correction and spanish in the response JSON. JSON objects are unordered by specification, so a client that reads fields by name is unaffected. The API schema contract (Phase 30) pins the field order, however, so the smoke test flags the change, and any client that deserializes positionally ends up with the two values swapped.

# OLD
return {"correction": fix_en, "spanish": fix_es}

# NEW (the broken release content)
return {"spanish": fix_es, "correction": fix_en}
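
The following sketch illustrates why the swap only hurts order-dependent consumers. The two reader functions and the sample sentences are hypothetical and are not taken from the portal's code; they only contrast reading by key name with reading by position.

```python
import json
from typing import Tuple

# Hypothetical payloads standing in for the agent's response.
old_body = json.dumps({"correction": "She goes to school.", "spanish": "Ella va a la escuela."})
new_body = json.dumps({"spanish": "Ella va a la escuela.", "correction": "She goes to school."})


def read_by_name(body: str) -> Tuple[str, str]:
    """Robust client: looks fields up by key, so order is irrelevant."""
    data = json.loads(body)
    return data["correction"], data["spanish"]


def read_by_position(body: str) -> Tuple[str, str]:
    """Fragile client: assumes the first value is the correction."""
    data = json.loads(body)
    correction, spanish = list(data.values())
    return correction, spanish


# Reading by name gives the same result for both releases.
print(read_by_name(old_body) == read_by_name(new_body))

# Reading by position puts the Spanish gloss in the correction slot.
print(read_by_position(new_body))
```

The positional reader reproduces the second smoke-test failure shown below: the Spanish gloss lands where the English correction belongs.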

Tag and release:

$ just release 1.2.3
[release] image tutor-1.2.3 pushed
[release] artifacts-prune --keep 1
[release] removed tutor-1.2.2.tar
[release] removed tutor-1.2.1.tar
[release] removed tutor-1.2.0.tar
[release] now serving 1.2.3

Run the smoke test against the portal's /quiz/submit:

$ just smoke-portal
[smoke] FAIL: response field order differs from contract;
[smoke] FAIL: portal renderer shows Spanish gloss in the English correction slot

Step 4 — Attempt rollback

$ just rollback
[rollback] looking for previous release in artifacts/...
[rollback] ERROR: no prior release tag found (last available: tutor-1.2.3, current)
[rollback] would need to rebuild from previous commit. ETA: 8-12 minutes.

The rollback that was supposed to be a 30-second traffic flip is now a roughly 10-minute rebuild and redeploy. During those 10 minutes, every student hitting the portal sees the broken response.
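
A rough way to put numbers on that gap is sketched below. The 30-second flip and the 8 to 12 minute ETA come from the text and the rollback log; the request rate is a made-up figure for illustration only, so substitute whatever your portal actually sees.

```python
FLIP_SECONDS = 30                         # expected rollback time (from the text)
REBUILD_SECONDS = (8 * 60, 12 * 60)       # ETA range printed by the rollback log
ASSUMED_REQUESTS_PER_SECOND = 2.0         # hypothetical traffic level, not measured


def affected_requests(outage_seconds: float, requests_per_second: float) -> float:
    """Requests that receive the broken response while recovery is in progress."""
    return outage_seconds * requests_per_second


# How many times longer the rebuild path is than the flip path.
slowdown = [seconds / FLIP_SECONDS for seconds in REBUILD_SECONDS]

flip_cost = affected_requests(FLIP_SECONDS, ASSUMED_REQUESTS_PER_SECOND)
rebuild_cost = [
    affected_requests(seconds, ASSUMED_REQUESTS_PER_SECOND)
    for seconds in REBUILD_SECONDS
]

print(f"rebuild is {slowdown[0]:.0f}x to {slowdown[1]:.0f}x slower than a flip")
print(f"broken responses with flip:    {flip_cost:.0f}")
print(f"broken responses with rebuild: {rebuild_cost[0]:.0f} to {rebuild_cost[1]:.0f}")
```

Whatever the real traffic figure is, the ratio between the two paths stays the same, because it depends only on the recovery times.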

Step 5 — Record the break

learners/borja/phase-38/notes/breaks.md:

- bug-id: 38-01
  concept: rollback path requires the previous artifact to exist
  symptom: just rollback errors out with "no prior release tag found";
           recovery now requires a full rebuild from the previous commit;
           ~10 minutes of broken service vs the expected ~30 seconds.
  hidden_cause: release.sh prunes too aggressively (--keep 1);
                the previous image is gone before rollback can use it.
  hint_1: "What does --keep 1 mean? What does --keep 5 mean?"
  hint_2: "What's the relationship between 'rollback is fast' and 'old artifacts exist'?"
  hint_3: "Diff release.sh against the previous commit. What flag changed?"
  fix_diff: restore --keep 5 in release.sh. Also revert the schema-swap in
            agent.py to make the failing test go green.

Step 6 — Apply the fix

Two things to revert:

  1. release.sh: restore --keep 5 so the last 5 releases are preserved.
  2. agent.py: restore the original key order — this also makes the smoke test green.

After the fix, `just rollback` is back to a 30-second flip. The bad release can still be pruned later, but its predecessor is always available.

What this teaches

Two intertwined lessons:

  1. Rollback is not a separate procedure — it's "deploy, pointing at an older tag". For that to be cheap, the older tag has to exist. Aggressive cleanup destroys the rollback option.
  2. The cost of a bad release is measured in MTTR, not in bug severity. A bug that takes 10 seconds to fix in code but 10 minutes to recover from in production is worse than one that takes a minute to fix but only 30 seconds to recover from.

The fix-line count is small; the operational discipline behind it is large.
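
The first lesson can be sketched in a few lines of Python. This is an illustration, not the contents of scripts/rollback.sh: `previous_release`, `rollback`, and the `deploy` callback are hypothetical names, and tags are assumed to look like `tutor-1.2.3`.

```python
from typing import Callable, List, Optional, Tuple


def version_key(tag: str) -> Tuple[int, ...]:
    """Turn 'tutor-1.2.3' into (1, 2, 3) for numeric comparison."""
    return tuple(int(part) for part in tag.split("-", 1)[1].split("."))


def previous_release(available: List[str], current: str) -> Optional[str]:
    """Newest available tag that is older than `current`, or None."""
    older = [tag for tag in available if version_key(tag) < version_key(current)]
    return max(older, key=version_key) if older else None


def rollback(available: List[str], current: str, deploy: Callable[[str], None]) -> str:
    """Rollback is an ordinary deploy whose target is an older tag."""
    target = previous_release(available, current)
    if target is None:
        # Nothing to flip to: the only way back is a rebuild.
        raise LookupError(f"no prior release found (last available: {current})")
    deploy(target)
    return target


deployed: List[str] = []

# With the predecessor preserved, rollback is a single deploy call.
rollback(["tutor-1.2.2", "tutor-1.2.3"], "tutor-1.2.3", deployed.append)
print(deployed)

# After an aggressive prune, the same call has nothing to deploy.
try:
    rollback(["tutor-1.2.3"], "tutor-1.2.3", deployed.append)
except LookupError as error:
    print(error)
```

The rollback function contains no recovery logic of its own. Its speed depends entirely on whether the prune step left an older tag in place.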

Hard rules respected

  • Two coupled changes — one in release.sh, one in agent.py — but they jointly model a single "no rollback option + bad release" scenario, which is the actual production failure pattern. The bug-id captures both.
  • Reversible in ≤ 5 lines total.
  • Observable: smoke test fails, rollback fails to find prior artifact, MTTR measured by stopwatch.
  • No security CVE introduced.
  • Tests not modified.
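
For the "MTTR measured by stopwatch" rule, a small timing helper is enough. The sketch below is one possible approach, not part of the course tooling: `stopwatch` is a hypothetical helper, and the sleep stands in for the real recovery action so the example runs anywhere.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def stopwatch(action: Callable[[], T]) -> Tuple[T, float]:
    """Run `action` and return (its result, elapsed wall-clock seconds)."""
    start = time.monotonic()  # monotonic clock: immune to system clock changes
    result = action()
    return result, time.monotonic() - start


# Stand-in for the real recovery step. In the exercise you would time the
# actual command instead, for example (assuming `just` is installed):
#   stopwatch(lambda: subprocess.run(["just", "rollback"], check=False))
result, elapsed = stopwatch(lambda: time.sleep(0.05))

print(f"recovery took {elapsed:.2f}s")
```

Record the elapsed time for both the broken and the fixed configuration so the two MTTR figures in breaks.md come from measurement rather than the log's ETA.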

Next: when green, re-read ../theory/06-build-deploy-rollback-and-ci-matrix.md — the "step 5 must exist before step 4" rule is the one violated here.