Break 00 — Skip the post-mortem; the same incident recurs three months later

This /break is not a code bug; it is a process bug. After incident INC-2026-08-14 (missing Spanish gloss), nobody wrote a post-mortem; three months later, an equivalent regression breaks the tutor in class again. Cost: two incidents that could have been zero.


What you'll do

Walk through a simulated three-month timeline where the §A13 grammar tutor's INC-2026-08-14 incident (described in theory/05-postmortem-template-and-fictional-incident.md) is fixed operationally (rollback succeeds) but not followed up with a post-mortem. Three months later, a structurally identical incident recurs. Document the meta-cost.

This /break is diagnostic and pedagogical: there is no code to edit. The "bug" is the absence of a process artifact. The "fix" is to write the post-mortem.

Step 1 — The first incident (Aug 14)

INC-2026-08-14-tutor-spanish-gloss-missing happens as described in theory 05:

  • Cause: max_decode_tokens cut from 32 to 16 in v2.4.0; long-tail Spanish glosses truncate to empty strings.
  • Detection: user ticket at +16 min.
  • Mitigation: rollback to v2.3.4 at +41 min.
  • Class disruption: ~6 minutes after rollback (re-attempt items).

Operationally fine: rollback worked, no data loss, students continued. The cohort is small and forgiving. Borja moves on to feature work for v2.5.0. No post-mortem is written.

The daily journal entry for 2026-08-14 mentions: "tutor incident — fixed via rollback, ~30 min." That's it.

Step 2 — Three months pass; the system drifts

Between Aug 14 and Nov 12, four releases change the system:

  • 2026-08-22: v2.4.1 releases. max_decode_tokens restored to 32 (as part of a different PR that touched config). No connection drawn to the prior incident.
  • 2026-09-10: v2.5.0 ships streaming responses. Decode loop refactored; the max_decode_tokens parameter is now plumbed through a new StreamingDecoderConfig object.
  • 2026-10-04: v2.6.0 adds a new "verbose explanation" mode that the portal toggles on for difficult items. The model is prompted to give multi-sentence explanations.
  • 2026-11-12: v2.7.0 — performance pass. Borja, profiling tail latency, sees that the verbose explanation mode pushes p95 to 800 ms. He reduces StreamingDecoderConfig.max_tokens from 64 to 32 to bring p95 down. CI green, eval gate at 0.93, bake clean. Ships.

Step 3 — The recurrence (Nov 13)

2026-11-13 14:01  v2.7.0 deployed. Bake clean.
2026-11-13 14:05  Class of 25 students starts a quiz session.
2026-11-13 14:09  Portal log spikes "explanation field truncated".
2026-11-13 14:14  Borja sees student tickets. Spends 8 minutes diagnosing.
                  "Wait — this looks like the August thing."
2026-11-13 14:22  Realization: max_tokens=32 truncates the new verbose
                  explanation mode (longest cases are 50+ tokens). Same
                  failure pattern as Aug, different field, same root cause.
2026-11-13 14:28  Rollback to v2.6.1. Class continues.

Same incident class. The field differs this time (explanation, not spanish), but the root cause class is the same (decode budget vs distribution mismatch), as are the detection mode (user tickets) and the mitigation (rollback).
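The root cause class, a decode budget set below the tail of the output-length distribution, can be made concrete with a small check. The sketch below is illustrative only: truncation_rate and the sample token lengths are assumptions, not part of the tutor codebase. The source states only that the longest verbose explanations run 50+ tokens and that the budget moved from 64 to 32.

```python
from typing import Sequence


def truncation_rate(output_lengths: Sequence[int], max_tokens: int) -> float:
    """Fraction of outputs whose token length exceeds the decode budget."""
    if not output_lengths:
        return 0.0
    over_budget = sum(1 for n in output_lengths if n > max_tokens)
    return over_budget / len(output_lengths)


# Hypothetical token lengths for verbose explanations (made up for illustration).
verbose_lengths = [18, 22, 27, 31, 35, 44, 52, 58]

print(truncation_rate(verbose_lengths, max_tokens=64))  # budget before v2.7.0
print(truncation_rate(verbose_lengths, max_tokens=32))  # budget in v2.7.0
```

With these made-up lengths, nothing is cut at a budget of 64 and half of the outputs are cut at 32. The same check applied to the Spanish gloss lengths would describe the August incident.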

Step 4 — What the post-mortem would have caught

Had a post-mortem been written on Aug 14 with the action items from theory 05, each item would have intervened in November:

  1. max_decode_tokens manifest-pinned with CI guard → would have flagged the v2.7.0 decrease. ✗ not done.
  2. Long-tail prompts in regression set → the verbose explanation prompts would have been there. ✗ not done.
  3. Empty-field counter + alert → would have fired at 14:06 instead of 14:14 (8 min earlier). ✗ not done.
  4. Empty-field check as blocking eval gate → would have failed v2.7.0 at CI. ✗ not done.
  5. Bake includes sampled portal prompts → the bake would have caught it before user impact. ✗ not done.
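Action item 1 in the list above could be sketched as follows. This is a minimal illustration, assuming the pinned values live in a manifest that CI can load as a dictionary; budget_guard, the manifest shape, and the key names are hypothetical and not taken from theory 05.

```python
from typing import Dict, List


def budget_guard(pinned: Dict[str, int], proposed: Dict[str, int]) -> List[str]:
    """Return violations; an empty list means the guard passes."""
    violations = []
    for name, pinned_value in pinned.items():
        value = proposed.get(name)
        if value is None:
            # A missing key is a violation too: it surfaces renames
            # instead of letting the pin silently stop applying.
            violations.append(f"{name}: missing from config (pinned at {pinned_value})")
        elif value < pinned_value:
            violations.append(f"{name}: {value} is below pinned {pinned_value}")
    return violations


# Hypothetical manifest pin and the v2.7.0 change (64 -> 32).
print(budget_guard({"max_tokens": 64}, {"max_tokens": 32}))
```

A CI job would fail the build whenever the returned list is non-empty, forcing the author to either revert the decrease or update the pin deliberately. Treating a missing key as a violation matters here because v2.5.0 moved max_decode_tokens into StreamingDecoderConfig under a different name.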

Each of these is a few hours of work while the incident is fresh. Had they been done as a batch right after Aug 14, the November recurrence would not have happened.
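Action items 3 and 4 in the list above share one measurement: the fraction of responses with an empty field. A minimal sketch, assuming responses are available as dictionaries keyed by field name. The field names spanish and explanation come from this exercise; the function names, the sample data, and the zero-tolerance threshold are assumptions.

```python
from typing import Dict, List, Sequence


def empty_field_rates(
    responses: List[Dict[str, str]], fields: Sequence[str]
) -> Dict[str, float]:
    """Fraction of responses in which each field is missing or blank."""
    rates = {}
    for field in fields:
        empty = sum(1 for r in responses if not (r.get(field) or "").strip())
        rates[field] = empty / len(responses) if responses else 0.0
    return rates


def gate_passes(rates: Dict[str, float], threshold: float = 0.0) -> bool:
    """Blocking gate: every field's empty rate must be at or below threshold."""
    return all(rate <= threshold for rate in rates.values())


# Hypothetical responses from a regression run.
sample = [
    {"spanish": "la casa", "explanation": "Noun, feminine."},
    {"spanish": "el perro", "explanation": ""},
    {"spanish": "", "explanation": "Verb, irregular."},
    {"spanish": "el libro", "explanation": "Noun, masculine."},
]
rates = empty_field_rates(sample, ["spanish", "explanation"])
print(rates)
print(gate_passes(rates))
```

The same rates can feed both uses: exported as a counter in production for alerting, and checked with gate_passes in CI so that a release with empty fields cannot ship.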

Step 5 — The meta-cost

Cost                                       Aug 14     Nov 13     Total
Class disruption (minutes)                 ~6         ~6         ~12
Engineer time to diagnose                  ~25 min    ~13 min    ~38 min
Engineer time for rollback + smoke         ~10 min    ~6 min     ~16 min
Loss of student trust (qualitative)        low        medium     medium
Action items still pending after Nov 13    0          5          5
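The totals column is plain addition over the two incidents. A short check using only the figures from the table (the dictionary keys are labels chosen here for illustration):

```python
# (Aug 14, Nov 13) in minutes, copied from the table above.
costs = {
    "class_disruption": (6, 6),
    "diagnose": (25, 13),
    "rollback_and_smoke": (10, 6),
}

totals = {name: aug + nov for name, (aug, nov) in costs.items()}
engineer_total = totals["diagnose"] + totals["rollback_and_smoke"]

print(totals)
print(engineer_total)
```

Summing the two engineer rows gives roughly 54 minutes of engineer time across both incidents, on top of about 12 minutes of class disruption.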

The operational cost roughly doubles (two incidents instead of one). The latent cost is harder to quantify: 5 action items are still pending after Nov 13, the same root cause class can recur a third time, and the team learns to treat decode-budget regressions as "just rollback when they happen" rather than fixing the process.

The cost of not writing a post-mortem is the expected number of future recurrences × the cost of each recurrence. This incident class recurred once in three months, so a reasonable working estimate is at least one repeat incident per quarter. In less forgiving contexts (paying customers, regulated industries), the cost of each recurrence is much higher.
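The expected-cost reasoning can be written down as a one-line function. This is a rough sketch, not a forecast: the rate of one recurrence per quarter comes from the single observed repeat, the per-recurrence cost uses only the Nov 13 engineer minutes from the table, and the four-quarter horizon is an arbitrary assumption.

```python
def expected_recurrence_cost(
    recurrences_per_quarter: float,
    cost_per_recurrence: float,
    quarters: int,
) -> float:
    """Expected number of recurrences over the horizon times the cost of each."""
    return recurrences_per_quarter * quarters * cost_per_recurrence


# Nov 13 engineer time: ~13 min to diagnose plus ~6 min for rollback and smoke.
nov_engineer_minutes = 13 + 6

print(expected_recurrence_cost(1.0, nov_engineer_minutes, quarters=4))
```

Under these assumptions the skipped post-mortem costs about 76 engineer minutes a year before counting class disruption or trust, and the figure scales linearly with the per-recurrence cost.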

Step 6 — The "fix": write the post-mortem (retroactively)

The action this /break motivates: open learners/borja/postmortems/INC-2026-08-14-tutor-spanish-gloss-missing.md and fill the four sections from theory/05-postmortem-template-and-fictional-incident.md. The fictional artifact is provided as a worked example; the exercise is to commit to the discipline.

The fix to the code regression in v2.7.0 is mechanical (manifest-pin + counter + bake-smoke + regression-set extension). The fix to the process gap is the post-mortem itself.
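Of the mechanical fixes, the bake-smoke step is the one that exercises real traffic shapes before users do. A minimal sketch, assuming the bake can call the deployed model through some generate function and has access to recent portal prompts; bake_smoke, stub_generate, the "[verbose]" marker, and the sample prompts are all hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence


def bake_smoke(
    portal_prompts: Sequence[str],
    generate: Callable[[str], Dict[str, str]],
    required_fields: Sequence[str],
    sample_size: int = 20,
    seed: int = 0,
) -> List[str]:
    """Run a sample of portal prompts and report empty required fields."""
    rng = random.Random(seed)
    k = min(sample_size, len(portal_prompts))
    failures = []
    for prompt in rng.sample(list(portal_prompts), k):
        response = generate(prompt)
        for field in required_fields:
            if not (response.get(field) or "").strip():
                failures.append(f"{field} empty for prompt: {prompt}")
    return failures


# Stub standing in for the deployed model: it drops the explanation whenever
# the prompt asks for verbose mode, mimicking the v2.7.0 regression.
def stub_generate(prompt: str) -> Dict[str, str]:
    explanation = "" if "[verbose]" in prompt else "short explanation"
    return {"spanish": "glosa", "explanation": explanation}


prompts = ["conjugate 'ser'", "[verbose] explain the subjunctive", "translate 'house'"]
print(bake_smoke(prompts, stub_generate, ["spanish", "explanation"]))
```

A bake that fails on any non-empty result would have stopped v2.7.0 at 14:01 rather than letting the class discover the problem at 14:09. Sampling from real portal prompts is the point: a bake built only from short synthetic prompts stays clean, as it did in November.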

Why this is the right /break for Phase 40

Phase 40 is "Hardening & post-mortem." The single most common failure mode of the post-mortem practice is not doing it. Doing it badly is the rarer problem; skipping it because "the operational fix worked" is the common one. This exercise makes the cost of skipping concrete.

Hard rules respected

  • No code edited.
  • The fictional incidents are clearly labeled; nothing in production is affected.
  • The exercise produces a real artifact (the retroactive post-mortem) that lives in learners/borja/postmortems/.
  • No security implication.
  • No test modified.

Next: when the retroactive post-mortem is written, re-read ../theory/05-postmortem-template-and-fictional-incident.md and the canonical 01-postmortem-structure.md.