Break 00: Deploy without a rollback path; simulate a bad release; measure the recovery cost
If you don't preserve the previous image, "rollback" stops being an atomic flip and becomes "rebuild from the previous commit": minutes to hours of degraded service. This /break removes the step that preserves the previous image, deploys a broken version, and measures what recovery costs.
What you'll do
Modify the release flow so the previous release's image is garbage-collected as soon as a new release is tagged (an "aggressive cleanup" anti-pattern). Then push a release that fails an obvious smoke test. Observe that `just rollback` cannot find the previous image and must rebuild from scratch.
Step 1: Locate the release flow
Justfile # the `release` and `rollback` recipes
scripts/release.sh # the script that tags + pushes images
scripts/rollback.sh # the script that flips traffic
(File names approximate; locate the exact ones in the repo.)
Step 2: Introduce the bug
In scripts/release.sh, the current behavior keeps the last 5 image tags in artifacts/. Change it to delete all but the current tag:
# OLD — preserve last 5 releases
just artifacts-prune --keep 5
# NEW (the broken version)
just artifacts-prune --keep 1 # only the current release survives
The release appears to succeed: the new image is up and CI is green.
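To see why `--keep 1` removes the rollback target, here is a minimal Python sketch of a keep-newest-N prune. The `prune` and `version_key` helpers and the `tutor-X.Y.Z.tar` naming are assumptions for illustration; the real `artifacts-prune` recipe may be implemented differently.

```python
def version_key(name: str) -> tuple[int, ...]:
    """Parse 'tutor-1.2.3.tar' into (1, 2, 3) so releases sort numerically."""
    version = name.removeprefix("tutor-").removesuffix(".tar")
    return tuple(int(part) for part in version.split("."))


def prune(artifacts: list[str], keep: int) -> tuple[list[str], list[str]]:
    """Return (kept, removed), keeping the `keep` newest artifacts."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    ordered = sorted(artifacts, key=version_key, reverse=True)
    return ordered[:keep], ordered[keep:]


# Hypothetical contents of artifacts/ right after 1.2.3 is pushed.
artifacts = ["tutor-1.2.0.tar", "tutor-1.2.1.tar", "tutor-1.2.2.tar", "tutor-1.2.3.tar"]

kept_5, removed_5 = prune(artifacts, keep=5)
kept_1, removed_1 = prune(artifacts, keep=1)

# A rollback target exists only if something older than the current release survives.
print("keep 5:", kept_5, "rollback possible:", len(kept_5) > 1)
print("keep 1:", kept_1, "rollback possible:", len(kept_1) > 1)
```

With `keep=1` the only survivor is the release that was just pushed, so nothing is left to roll back to.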
Step 3: Push a deliberately broken release
Apply a small intentional regression: in src/minitutor/agent.py, swap the order of correction and spanish in the response JSON. This breaks the API schema contract (Phase 30), which fixes the field order. Clients that read fields by name still work, but any client expecting {"correction": ..., "spanish": ...} in that order sees the keys reversed, and a client that deserializes positionally gets the values swapped.
# OLD
return {"correction": fix_en, "spanish": fix_es}
# NEW (the broken release content)
return {"spanish": fix_es, "correction": fix_en}
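The difference between a name-based and a positional client can be sketched as follows. Both parser functions and the sample sentences are hypothetical; the actual portal client is not shown in this lab.

```python
import json

# Hypothetical payloads: the old release and the deliberately broken one.
old_body = json.dumps({"correction": "I have eaten", "spanish": "He comido"})
new_body = json.dumps({"spanish": "He comido", "correction": "I have eaten"})


def parse_by_name(body: str) -> tuple[str, str]:
    """Order-insensitive client: looks each field up by key."""
    data = json.loads(body)
    return data["correction"], data["spanish"]


def parse_by_position(body: str) -> tuple[str, str]:
    """Order-sensitive client: assumes the first value is the correction."""
    correction, spanish = json.loads(body).values()
    return correction, spanish


print("by name:    ", parse_by_name(old_body), parse_by_name(new_body))
print("by position:", parse_by_position(old_body), parse_by_position(new_body))
```

The name-based parser returns the same pair for both payloads, while the positional parser puts the Spanish text in the correction slot for the new payload.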
Tag and release:
$ just release 1.2.3
[release] image tutor-1.2.3 pushed
[release] artifacts-prune --keep 1
[release] removed tutor-1.2.2.tar
[release] removed tutor-1.2.1.tar
[release] removed tutor-1.2.0.tar
[release] now serving 1.2.3
Run the smoke test against the portal's /quiz/submit:
$ just smoke-portal
[smoke] FAIL: response field order differs from contract;
[smoke] FAIL: portal renderer shows Spanish gloss in the English correction slot
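A field-order check like the first failure above could look roughly like this. It is a sketch only: `CONTRACT_FIELDS` and `check_field_order` are hypothetical names, and the real `smoke-portal` recipe may check the contract in another way.

```python
import json

CONTRACT_FIELDS = ["correction", "spanish"]  # assumed field order from the contract


def check_field_order(body: str) -> list[str]:
    """Return a list of failure messages; an empty list means the check passed."""
    # object_pairs_hook=list keeps the (key, value) pairs in wire order.
    pairs = json.loads(body, object_pairs_hook=list)
    fields = [key for key, _ in pairs]
    if fields != CONTRACT_FIELDS:
        return [f"response field order {fields} differs from contract {CONTRACT_FIELDS}"]
    return []


good = check_field_order('{"correction": "a", "spanish": "b"}')
bad = check_field_order('{"spanish": "b", "correction": "a"}')
print("old release failures:", good)
print("new release failures:", bad)
```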
Step 4: Attempt rollback
$ just rollback
[rollback] looking for previous release in artifacts/...
[rollback] ERROR: no prior release tag found (last available: tutor-1.2.3, current)
[rollback] would need to rebuild from previous commit. ETA: 8-12 minutes.
The rollback that was supposed to be a 30-second traffic flip is now a 10-minute rebuild and redeploy. During those 10 minutes, every student hitting the portal sees the broken response.
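The lookup that fails here can be sketched as a search for the newest surviving tag older than the current one. `find_rollback_target` is a hypothetical helper, not the contents of scripts/rollback.sh.

```python
from typing import Optional


def version_key(tag: str) -> tuple[int, ...]:
    """Parse 'tutor-1.2.3' into (1, 2, 3) for numeric comparison."""
    return tuple(int(part) for part in tag.removeprefix("tutor-").split("."))


def find_rollback_target(available: list[str], current: str) -> Optional[str]:
    """Return the newest tag older than `current`, or None if none survives."""
    older = [tag for tag in available if version_key(tag) < version_key(current)]
    return max(older, key=version_key) if older else None


# After --keep 1: only the current release is left, so there is no target.
after_keep_1 = find_rollback_target(["tutor-1.2.3"], "tutor-1.2.3")

# With older artifacts preserved: the flip can point at the predecessor.
after_keep_5 = find_rollback_target(
    ["tutor-1.2.0", "tutor-1.2.1", "tutor-1.2.2", "tutor-1.2.3"], "tutor-1.2.3"
)
print(after_keep_1, after_keep_5)
```

A `None` result is the point at which the fast path ends and the rebuild from the previous commit begins.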
Step 5: Record the break
learners/borja/phase-38/notes/breaks.md:
- bug-id: 38-01
  concept: rollback path requires the previous artifact to exist
  symptom: just rollback errors out with "no prior release tag found";
    recovery now requires a full rebuild from the previous commit;
    ~10 minutes of broken service vs the expected ~30 seconds.
  hidden_cause: release.sh prunes too aggressively (--keep 1);
    the previous image is gone before rollback can use it.
  hint_1: "What does --keep 1 mean? What does --keep 5 mean?"
  hint_2: "What's the relationship between 'rollback is fast' and 'old artifacts exist'?"
  hint_3: "Diff release.sh against the previous commit. What flag changed?"
  fix_diff: restore --keep 5 in release.sh. Also revert the schema-swap in
    agent.py to make the failing test go green.
Step 6: Apply the fix
Two things to revert:
- release.sh: restore --keep 5 so the last 5 releases are preserved.
- agent.py: restore the original key order; this also makes the smoke test green.
After the fix, `just rollback` is back to a 30-second flip. The bad release stays prunable, but its predecessor is always there.
What this teaches
Two intertwined lessons:
- Rollback is not a separate procedure — it's "deploy, pointing at an older tag". For that to be cheap, the older tag has to exist. Aggressive cleanup destroys the rollback option.
- The cost of bad releases is measured in MTTR, not in bug severity. A 10-second-to-fix bug with a 10-minute MTTR is worse than a 1-minute-to-fix bug with a 30-second MTTR.
The fix-line count is small; the operational discipline behind it is large.
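The MTTR comparison can be made concrete with a back-of-the-envelope calculation using the two recovery times from this lab. The request rate is a made-up number for illustration only.

```python
def failed_requests(mttr_seconds: float, requests_per_second: float) -> float:
    """Requests that hit the broken release before service is restored."""
    return mttr_seconds * requests_per_second


RATE = 2.0  # hypothetical portal traffic, requests per second

slow = failed_requests(10 * 60, RATE)  # rebuild path: ~10 minutes
fast = failed_requests(30, RATE)       # traffic flip: ~30 seconds

print(f"rebuild: {slow:.0f} broken responses")
print(f"flip:    {fast:.0f} broken responses")
print(f"ratio:   {slow / fast:.0f}x")
```

The ratio depends only on the two recovery times, not on the assumed traffic: the rebuild path exposes 20 times as many requests to the bad release as the flip does.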
Hard rules respected
- Two coupled changes, one in release.sh and one in agent.py, but they jointly model a single "no rollback option + bad release" scenario, which is the actual production failure pattern. The bug-id captures both.
- Reversible in ≤ 5 lines total.
- Observable: smoke test fails, rollback fails to find prior artifact, MTTR measured by stopwatch.
- No security CVE introduced.
- Tests not modified.
Next: when green, re-read ../theory/06-build-deploy-rollback-and-ci-matrix.md — the "step 5 must exist before step 4" rule is the one violated here.