
05 — A reusable post-mortem template; applied to a fictional §A13 tutor incident¶

The post-mortem template here is not decorative: it has four sections, each with a clear goal, all framed to be blameless. Applying it to a fictional §A13 tutor incident shows how to fill it in. Skip the template and the same incident recurs; that is what the attached /break demonstrates.

The template (reusable)¶

A blameless post-mortem has four named sections. Each heading carries its length cap or format constraint in parentheses.

1. Situation (50-150 words)¶

What happened, what the user-facing impact was, when it started and ended (UTC). One paragraph. Read-out-loud test: if you read just this section to a stranger, they should know the headline.

Required artifacts: dashboard screenshot at peak impact, the alert that fired, the commit/release SHA in effect at incident start.

2. Timeline (chronological, UTC)¶

A list of timestamps and events. Each entry has a signal: a log line, a commit, an alert, a deploy. Avoid narrative-style "we then noticed" — use objective signals.

The point of the timeline is to make the detection delay and the mitigation delay visible. Both are measurable, both are improvable, both have action items.

3. Contributing factors (3-5 items)¶

Multiple. Not "the root cause". Real incidents have multiple contributing factors — process, tooling, design, human. Listing them as a chain ("A caused B caused C") is usually wrong; listing them as a set that combined badly is usually right.

Each item should be specific enough to be fixed (not "lack of testing" — that's a category; specify "the eval gate did not cover JSON key order").

4. Action items (concrete, owned, dated)¶

For each contributing factor, at most one concrete action item. Each item has:

A title (one line).
An owner (a person; "Borja" is fine for the curriculum).
A due date (a specific date, not "next quarter").
A success criterion (what "done" looks like; binary if possible).
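The four fields above can be made machine-checkable. The following is a minimal sketch, not a prescribed schema: the class name `ActionItem`, its field names, and the `problems` helper are all hypothetical, and the due date and success criterion in the usage example are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ActionItem:
    """One action item. Field names are illustrative, not a fixed schema."""
    title: str
    owner: str
    due: date
    done_when: str

    def problems(self) -> list[str]:
        """Return the reasons this item fails the template (empty if it passes)."""
        found = []
        if not self.title.strip() or "\n" in self.title:
            found.append("title must be a single non-empty line")
        if not self.owner.strip():
            found.append("owner must be a named person")
        # "next quarter" arrives as a string, so it is rejected here.
        if not isinstance(self.due, date):
            found.append("due must be a specific calendar date")
        if not self.done_when.strip():
            found.append("success criterion is missing")
        return found


# Hypothetical item; the due date and criterion are made up for the example.
item = ActionItem(
    title="Add a unit test that fails if release.sh's --keep flag is < 5",
    owner="Borja",
    due=date(2026, 8, 21),
    done_when="Test exists and fails when --keep is set below 5",
)
print(item.problems())  # prints []
```

A check like this only enforces the shape of an item. Whether the item targets the system rather than a person still needs a human reviewer.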

Blameless framing. Action items target the system, not the person. Wrong: "Borja should be more careful when editing release.sh". Right: "Add a unit test that fails if release.sh's --keep flag is < 5".

The blameless lens. People making decisions under uncertainty with imperfect information will sometimes choose badly. The post-mortem documents the system that gave them imperfect information and fixes that. The person did the best they could with what they had — that is the assumption.

A fictional incident — applied to the §A13 grammar tutor¶

Title¶

INC-2026-08-14-tutor-spanish-gloss-missing — Spanish gloss missing from ~30% of tutor responses for 47 minutes.

1. Situation¶

On 2026-08-14 between 09:31 UTC and 10:18 UTC, the §A13 grammar tutor served HTTP 200 responses that lacked the spanish field in ~30% of cases. The portal's quiz UI rendered an empty Spanish gloss for affected items, confusing learners during a scheduled class of 28 students. No data loss. No security impact. Recovered via rollback to the previous release.

2. Timeline (UTC)¶

09:28  Release v2.4.0 deployed (commit a91b...). CI green. Eval gate passed
       at 0.94 (above 0.92 threshold). Bake period 10 minutes, completed
       clean.
09:31  First class begins. ~28 students start submitting quiz items.
09:34  Spike in portal log lines "spanish_gloss=null" — but no alert fires
       because the alert rule was on tutor 5xx errors, not on response
       content.
09:42  First student support ticket: "the Spanish is blank on my quiz".
09:47  Borja sees ticket. Checks dashboard. Latency normal; error rate 0%.
09:51  Borja reads the affected response JSON manually. `spanish` field
       is empty string in ~30% of responses. Tutor logs show the field
       is being set to "" on requests where the model's generation hit
       the EOS token before producing the gloss.
10:05  Decision: rollback to v2.3.4 (last known good). Confirmed previous
       image is present in registry (--keep 5 preserved it).
10:12  `just rollback` completes. Smoke test green.
10:18  Class verifies Spanish gloss is back. Class continues with 6 minutes
       of disruption-induced delay.
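Because every entry carries a timestamp, the two delays the timeline is meant to expose can be computed rather than estimated. The sketch below is one way to do that. It assumes the clock starts at the first observable failure (the 09:34 log spike) and that detection means the moment a human saw the ticket (09:47). The event labels are hypothetical names, not part of the template.

```python
from datetime import datetime, timedelta

# Times copied from the timeline above; the keys are illustrative labels.
DAY = "2026-08-14"
EVENTS = {
    "first_signal": "09:34",  # "spanish_gloss=null" spike in portal logs
    "first_ticket": "09:42",  # first student support ticket
    "human_aware": "09:47",   # Borja sees the ticket
    "mitigated": "10:12",     # rollback completes, smoke test green
    "verified": "10:18",      # class confirms the gloss is back
}


def at(label: str) -> datetime:
    """Parse one timeline entry (UTC is implied, not encoded)."""
    return datetime.strptime(f"{DAY} {EVENTS[label]}", "%Y-%m-%d %H:%M")


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed from one labelled event to another."""
    return int((at(end) - at(start)) / timedelta(minutes=1))


detection_delay = minutes_between("first_signal", "human_aware")
mitigation_delay = minutes_between("first_signal", "mitigated")
print(f"detection: {detection_delay} min, mitigation: {mitigation_delay} min")
# prints: detection: 13 min, mitigation: 38 min
```

Choosing a different start event (for example the 09:28 deploy) changes both numbers, so a post-mortem should state which event starts the clock.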

3. Contributing factors¶

Decode budget too tight. v2.4.0 reduced max_decode_tokens from 32 to 16 to improve p50 latency. The change passed CI because the eval set's responses fit in 16 tokens, but the real distribution has a long tail of longer Spanish glosses (e.g., going to write → voy a escribir un email a mi familia in some prompted variants).
Eval set did not include the long-tail prompts. The capstone eval (phase-39-capstone.yaml) has 5-12 items; the regression set is 300 items. Neither includes the prompts the portal's quiz module generates at runtime.
Alerting was 5xx-only. A 200-with-empty-field is a content failure that the existing alert taxonomy did not cover.
No content-shape check at the gate. The eval rubric (theory 06) scores per-axis but does not separately check "the Spanish field is non-empty" as a blocking gate.
Bake-period smoke did not include a sample of long-tail inputs. The bake's smoke test was 10 fixed prompts; all 10 fit in 16 tokens; the regression was invisible to the smoke.
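The missing content-shape check is small enough to sketch. The block below is an assumption-laden illustration, not the curriculum's eval gate: the function names are hypothetical, responses are assumed to be plain dicts with a `spanish` key, and the 2% default mirrors the threshold used in the action items.

```python
def empty_gloss_ratio(responses: list[dict]) -> float:
    """Fraction of responses whose `spanish` field is missing, None, or blank."""
    if not responses:
        raise ValueError("cannot gate on an empty sample")
    empty = sum(
        1
        for r in responses
        if not isinstance(r.get("spanish"), str) or not r["spanish"].strip()
    )
    return empty / len(responses)


def gloss_gate_passes(responses: list[dict], max_empty_ratio: float = 0.02) -> bool:
    """Blocking sub-check: pass only if empty glosses stay at or under the cap."""
    return empty_gloss_ratio(responses) <= max_empty_ratio


# Roughly the incident's failure shape: 3 of every 10 responses have "".
sample = [{"spanish": "voy a escribir"}] * 7 + [{"spanish": ""}] * 3
print(gloss_gate_passes(sample))  # prints False
```

The check is only as good as the sample it runs on. If it is fed the same short prompts as the existing eval set, it would still pass.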

4. Action items¶

Action	Owner	Due	Done when
Add `max_decode_tokens` to a manifest-pinned config; CI fails if it drops below 24	Borja	2026-08-21	Test in `tests/phase38/` fails when `max_decode_tokens` is below 24
Sample 50 long-tail prompts from portal logs into the regression set	Borja	2026-08-25	`tests/eval/regression_corpus.jsonl` extended; CI re-baselined
Add Prometheus alert: tutor_response_empty_field_total counter	Borja	2026-08-21	Counter exported; alert rule fires at > 1% for 5 min
Add "non-empty Spanish gloss" as a blocking sub-check in eval gate	Borja	2026-08-22	Eval gate fails any release where > 2% of responses have empty gloss
Expand bake-period smoke to include 5 sampled portal prompts	Borja	2026-08-23	Smoke test references real prompts; rerun on every deploy

Each action targets one specific factor. Each is binary-checkable. None of them target a person.
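As an illustration of what "binary-checkable" means for the first row, here is a minimal sketch of a decode-budget floor check. It assumes the pinned manifest is JSON with a top-level `max_decode_tokens` key. The real manifest format is not specified here, and the function name is hypothetical.

```python
import json

# Floor taken from the first action item's "done when" column.
MIN_DECODE_TOKENS = 24


def decode_budget_ok(manifest_text: str) -> bool:
    """True if the pinned config keeps max_decode_tokens at or above the floor."""
    config = json.loads(manifest_text)
    return config["max_decode_tokens"] >= MIN_DECODE_TOKENS


print(decode_budget_ok('{"max_decode_tokens": 32}'))  # prints True
print(decode_budget_ok('{"max_decode_tokens": 16}'))  # prints False
```

Wrapped in a test under `tests/phase38/`, a check like this would have rejected the v2.4.0 change from 32 to 16 before release.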

What the template forces¶

Three things, all valuable:

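Detection delay surfaces. From the timeline: no alert ever fired, so alert-based detection never happened. The effective signal was a support ticket (filed 09:42, seen 09:47), which puts detection at +13 min after the first observable failure at 09:34. Mitigation (rollback complete, 10:12) landed at +38 min. Those two gaps are what the action items set out to shrink.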
Multiple factors stay visible. Listing 5 contributing factors prevents "we'll just add an alert and call it done" — every factor gets its own action.
Blameless framing prevents recurrence. "Borja should test more carefully" wouldn't have prevented the next equivalent incident; "the eval gate doesn't cover long-tail prompts" does.

When to write a post-mortem¶

For the curriculum: whenever a deploy required a rollback, or whenever a learner-facing failure exceeded 10 minutes. Smaller bumps go in the daily journal.

For a production system: industry convention is "any incident with measurable user impact". Pick a bar and stick to it; "we don't write post-mortems for small ones" is how culture decays.

The `/break` exercise¶

break/00-break-skip-postmortem.md simulates skipping the post-mortem after the incident above and shows the same incident recurring three months later with a slightly different decode-budget regression. The cost is concrete: more minutes of class disruption, plus the meta-cost of "we knew this was a problem but didn't act on it."

The fix to the /break is not a code change. It's a process change: write the post-mortem.

Reference¶

Google SRE, "Postmortem Culture: Learning from Failure" (chapter 15 of Site Reliability Engineering, 2016). A widely cited statement of the blameless framing.
Allspaw, "The Infinite Hows" (2014). Why "5 whys" is a bad framing; "contributing factors" is better.
Etsy, "Blameless PostMortems and a Just Culture" (2012). The cultural framework.

Next: ../break/00-break-skip-postmortem.md for the recurrence demonstration.