05 — A reusable post-mortem template; applied to a fictional §A13 tutor incident¶
The post-mortem template here is not decorative: four sections, each with a clear goal, framed to be blameless. Applying it to a fictional §A13 tutor incident shows how it gets filled in. If you skip the template, the same incident recurs; that is what the accompanying /break demonstrates.
The template (reusable)¶
A blameless post-mortem has four named sections. Each section's constraint (length, ordering, or item count) is given in parentheses.
1. Situation (50-150 words)¶
What happened, what the user-facing impact was, when it started and ended (UTC). One paragraph. Read-out-loud test: if you read just this section to a stranger, they should know the headline.
Required artifacts: dashboard screenshot at peak impact, the alert that fired, the commit/release SHA in effect at incident start.
2. Timeline (chronological, UTC)¶
A list of timestamps and events. Each entry has a signal: a log line, a commit, an alert, a deploy. Avoid narrative-style "we then noticed" — use objective signals.
The point of the timeline is to make the detection delay and the mitigation delay visible. Both are measurable, both are improvable, both have action items.
3. Contributing factors (3-5 items)¶
Multiple. Not "the root cause". Real incidents have multiple contributing factors — process, tooling, design, human. Listing them as a chain ("A caused B caused C") is usually wrong; listing them as a set that combined badly is usually right.
Each item should be specific enough to be fixed (not "lack of testing" — that's a category; specify "the eval gate did not cover JSON key order").
4. Action items (concrete, owned, dated)¶
For each contributing factor, at most one concrete action item. Each item has:
- A title (one line).
- An owner (a person; "Borja" is fine for the curriculum).
- A due date (a specific date, not "next quarter").
- A success criterion (what "done" looks like; binary if possible).
Blameless framing. Action items target the system, not the person. Wrong: "Borja should be more careful when editing release.sh". Right: "Add a unit test that fails if release.sh's --keep flag is < 5".
The blameless lens. People making decisions under uncertainty with imperfect information will sometimes choose badly. The post-mortem documents the system that gave them imperfect information and fixes that. The person did the best they could with what they had — that is the assumption.
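The section constraints above are mechanical enough to lint. The following is a minimal sketch, not part of the curriculum's tooling: `ActionItem` and `lint_postmortem` are hypothetical names, and the choice of an ISO date string for the due date is an assumption made so that "next quarter" fails to parse.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """One row of the action-item table (hypothetical structure)."""
    title: str
    owner: str
    due: str        # assumed to be an ISO date string, e.g. "2026-08-21"
    done_when: str  # the success criterion


def lint_postmortem(situation: str, factors: list[str],
                    actions: list[ActionItem]) -> list[str]:
    """Return template violations; an empty list means the draft passes."""
    problems = []

    # Section 1: 50-150 words.
    words = len(situation.split())
    if not 50 <= words <= 150:
        problems.append(f"situation is {words} words; expected 50-150")

    # Section 3: 3-5 contributing factors.
    if not 3 <= len(factors) <= 5:
        problems.append(f"{len(factors)} contributing factors; expected 3-5")

    # Section 4: at most one action per factor, each fully specified.
    if len(actions) > len(factors):
        problems.append("more action items than contributing factors")
    for item in actions:
        if not item.title.strip() or "\n" in item.title:
            problems.append(f"title must be one non-empty line: {item.title!r}")
        if not item.owner.strip():
            problems.append(f"{item.title!r} has no owner")
        try:
            date.fromisoformat(item.due)
        except ValueError:
            problems.append(f"{item.title!r} due date is not a specific date")
        if not item.done_when.strip():
            problems.append(f"{item.title!r} has no success criterion")
    return problems


# A deliberately thin draft: every check below should trip.
draft_problems = lint_postmortem(
    situation="Tutor was down.",
    factors=["alerting was 5xx-only"],
    actions=[ActionItem("Add alert", "Borja", "next quarter", "")],
)
```

A linter like this can check shape only. Whether an action item targets the system rather than a person still needs a human reviewer.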
A fictional incident — applied to the §A13 grammar tutor¶
Title¶
INC-2026-08-14-tutor-spanish-gloss-missing — Spanish gloss missing from ~30% of tutor responses for 47 minutes.
1. Situation¶
On 2026-08-14 between 09:31 UTC and 10:18 UTC, the §A13 grammar tutor served HTTP 200 responses that lacked the spanish field in ~30% of cases. The portal's quiz UI rendered an empty Spanish gloss for affected items, confusing learners during a scheduled class of 28 students. No data loss. No security impact. Recovered via rollback to the previous release.
2. Timeline (UTC)¶
09:28 Release v2.4.0 deployed (commit a91b...). CI green. Eval gate passed
      at 0.94 (above 0.92 threshold). Bake period 10 minutes, completed
      clean.
09:31 First class begins. ~28 students start submitting quiz items.
09:34 Spike in portal log lines "spanish_gloss=null", but no alert fires
      because the alert rule was on tutor 5xx errors, not on response
      content.
09:42 First student support ticket: "the Spanish is blank on my quiz".
09:47 Borja sees ticket. Checks dashboard. Latency normal; error rate 0%.
09:51 Borja reads the affected response JSON manually. `spanish` field
      is empty string in ~30% of responses. Tutor logs show the field
      is being set to "" on requests where the model's generation hit
      the EOS token before producing the gloss.
10:05 Decision: rollback to v2.3.4 (last known good). Confirmed previous
      image is present in registry (--keep 5 preserved it).
10:12 `just rollback` completes. Smoke test green.
10:18 Class verifies Spanish gloss is back. Class continues with 6 minutes
      of disruption-induced delay.
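The delays this timeline exposes can be computed directly from its timestamps. This is a minimal sketch: the event labels (`impact_start`, `ticket_seen`, and so on) are illustrative names, 09:31 is taken as the start of impact because the Situation section uses it, and all events are assumed to fall on the same UTC day.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both "HH:MM" on the same UTC day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps copied from the timeline above; the keys are hypothetical labels.
events = {
    "impact_start": "09:31",   # first class begins, affected responses served
    "ticket_filed": "09:42",   # first student support ticket
    "ticket_seen": "09:47",    # a human becomes aware
    "rollback_done": "10:12",  # mitigation lands
    "verified": "10:18",       # class confirms the gloss is back
}

start = events["impact_start"]
detection_delay = minutes_between(start, events["ticket_seen"])
mitigation_delay = minutes_between(start, events["rollback_done"])
total_impact = minutes_between(start, events["verified"])

print(f"detection: +{detection_delay} min")
print(f"mitigation: +{mitigation_delay} min")
print(f"total impact: {total_impact} min")
```

No alert appears in `events` because none fired; detection here is entirely human.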
3. Contributing factors¶
- Decode budget too tight. v2.4.0 reduced `max_decode_tokens` from 32 to 16 to improve p50 latency. The change passed CI because the eval set's responses fit in 16 tokens, but the real distribution has a long tail of longer Spanish glosses (e.g., `going to write` → `voy a escribir un email a mi familia` in some prompted variants).
- Eval set did not include the long-tail prompts. The capstone eval (`phase-39-capstone.yaml`) has 5-12 items; the regression set is 300 items. Neither includes the prompts the portal's quiz module generates at runtime.
- Alerting was 5xx-only. A 200-with-empty-field is a content failure that the existing alert taxonomy did not cover.
- No content-shape check at the gate. The eval rubric (theory 06) scores per-axis but does not separately check "the Spanish field is non-empty" as a blocking gate.
- Bake-period smoke did not include a sample of long-tail inputs. The bake's smoke test was 10 fixed prompts; all 10 fit in 16 tokens; the regression was invisible to the smoke.
4. Action items¶
| Action | Owner | Due | Done when |
|---|---|---|---|
| Add max_decode_tokens to a manifest-pinned config; CI fails on drop > 4× | Borja | 2026-08-21 | Test in tests/phase38/ that fails if max_decode_tokens drops below 24 |
| Sample 50 long-tail prompts from portal logs into the regression set | Borja | 2026-08-25 | tests/eval/regression_corpus.jsonl extended; CI re-baselined |
| Add Prometheus alert: tutor_response_empty_field_total counter | Borja | 2026-08-21 | Counter exported; alert rule fires at > 1% for 5 min |
| Add "non-empty Spanish gloss" as a blocking sub-check in eval gate | Borja | 2026-08-22 | Eval gate fails any release where > 2% of responses have empty gloss |
| Expand bake-period smoke to include 5 sampled portal prompts | Borja | 2026-08-23 | Smoke test references real prompts; rerun on every deploy |
Each action targets one specific factor. Each is binary-checkable. None of them target a person.
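As an illustration of what "binary-checkable" means, the blocking sub-check for empty glosses could look like the sketch below. It assumes each response is a dict with a `spanish` key, as in the incident's JSON; the function names and the list-of-dicts input shape are hypothetical, and the 2% threshold is the one stated in the table.

```python
def empty_gloss_rate(responses: list[dict]) -> float:
    """Fraction of responses whose `spanish` field is missing, None, or blank."""
    if not responses:
        raise ValueError("no responses to check")
    empty = sum(
        1 for r in responses
        if not str(r.get("spanish") or "").strip()
    )
    return empty / len(responses)


def gloss_gate_passes(responses: list[dict], max_empty_rate: float = 0.02) -> bool:
    """Blocking check: fail the release when more than 2% of glosses are empty."""
    return empty_gloss_rate(responses) <= max_empty_rate


# Simulate the incident: roughly 30% of responses carry an empty gloss.
incident_like = (
    [{"spanish": "voy a escribir"}] * 70
    + [{"spanish": ""}] * 30
)
healthy = [{"spanish": "voy a escribir"}] * 100

print(gloss_gate_passes(incident_like))  # False
print(gloss_gate_passes(healthy))        # True
```

The check is only as good as its inputs: run against the original eval set, whose responses all fit in 16 tokens, it would still have passed v2.4.0. That is why the long-tail sampling action sits alongside it.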
What the template forces¶
Three things, all valuable:
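- Detection delay surfaces. Measured from the 09:31 start of impact: no alert ever fired, so automated detection delay is unbounded rather than zero; human detection came from a user ticket filed at +11 min and seen at +16 min. Mitigation (rollback complete) landed at +41 min. Closing those two gaps is what the action items are for.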
- Multiple factors stay visible. Listing 5 contributing factors prevents "we'll just add an alert and call it done" — every factor gets its own action.
- Blameless framing prevents recurrence. "Borja should test more carefully" wouldn't have prevented the next equivalent incident; "the eval gate doesn't cover long-tail prompts" does.
When to write a post-mortem¶
For the curriculum: whenever a deploy required a rollback, or whenever a learner-facing failure exceeded 10 minutes. Smaller bumps go in the daily journal.
For a production system: industry convention is "any incident with measurable user impact". Pick a bar and stick to it; "we don't write postmortems for small ones" is how culture decays.
The /break exercise¶
break/00-break-skip-postmortem.md simulates skipping the post-mortem after the incident above and shows the same incident recurring three months later with a slightly different decode-budget regression. The cost is concrete: more minutes of class disruption, plus the meta-cost of "we knew this was a problem but didn't act on it."
The fix to the /break is not a code change. It's a process change: write the post-mortem.
Reference¶
- Google SRE, "Postmortem Culture: Learning from Failure" (chapter 15 of Site Reliability Engineering, 2016). A widely cited write-up of the blameless framing.
- Allspaw, "The Infinite Hows" (2014). Why "5 whys" is a bad framing; "contributing factors" is better.
- Allspaw, "Blameless PostMortems and a Just Culture" (Etsy Code as Craft, 2012). The cultural framework.
Next: ../break/00-break-skip-postmortem.md for the recurrence demonstration.