A postmortem that assigns blame fixes nothing. Here is the blameless postmortem template that senior SREs actually use to find root causes and prevent recurrence.
Status: DRAFT
The payment gateway was down for 23 minutes. Orders failed. Revenue was lost. The customer support queue spiked. By the time the incident was resolved, leadership wanted to know who was responsible.
The wrong answer: "Rahul pushed a bad config." The right answer: "Our deployment process allowed an unvalidated config to reach production without detection." The first answer punishes a person. The second answer fixes the system.
This is the difference between a blame postmortem and a blameless one — and it is the difference between a team that keeps making the same mistakes and one that actually improves.
When postmortems assign blame, three things happen reliably:
People stop being honest in future postmortems. They describe their own actions in vague, favorable terms. Details that would reveal systemic problems get left out.
The same incident happens again. Blaming Rahul doesn't fix the deployment process. The next engineer makes the same mistake for the same reason — because the system made it easy to make.
Fear replaces transparency. Engineers start hesitating before making changes, not because they are more careful, but because they are afraid of being the next person blamed when something breaks.
Blameless postmortems operate on a different assumption: given the information they had at the time, any reasonable engineer would have done the same thing. The question isn't "who made the mistake" — it's "what made it possible."
A postmortem document should be completable within 24-48 hours of an incident. Not a week later when details are fuzzy, not a month later when it has been procrastinated into irrelevance.
Here is the template:
Incident Summary
One paragraph: what broke, when, how long, what was the customer impact. No ambiguity about severity.
Example: The checkout-api returned 503 errors for 23 minutes from 14:17 to 14:40 IST on March 14, 2026. Approximately 4,200 transactions failed. Estimated revenue impact: ₹18 lakhs.
Timeline
Chronological record of what happened. Use exact timestamps. Do not summarize — specificity is what makes this useful later.
13:58 Deployment v2.4.1 rolled out to prod via Argo CD
14:09 First alert: checkout-api error rate exceeded 5% threshold
14:12 On-call engineer (Priya) paged
14:17 Error rate crossed 95% (effectively full outage)
14:19 Priya opens Grafana, sees DB connection errors
14:24 Priya identifies connection pool config changed in v2.4.1
14:31 Rollback initiated to v2.4.0
14:40 Error rate returns to baseline, incident resolved
14:41 Incident channel updated, stakeholders notified
Root Cause Analysis
This is the most important section. Use the "Five Whys" technique: start with the symptom and ask "why" until you reach a systemic cause.
The checkout API returned 503 errors.
Why? The database connection pool was exhausted.
Why? maxPoolSize was set to 5 in the new config.
Why? The config value was changed in a PR without review.
Why? The PR template did not flag connection pool config as sensitive.
Why? We have no tagging system for sensitive configuration values.
Root cause: no mechanism exists to identify and require review for changes to sensitive infrastructure configuration.
The fifth "why" is the one that points at a system problem, not a person problem. That is where the action items live.
Contributing Factors
Things that made the impact worse or the detection slower — not the root cause, but amplifying conditions:
What Went Well
This section is not optional and is not a formality. Acknowledging what worked correctly builds accurate understanding of the system and gives engineers credit for their actions during a stressful event.
Action Items
This is where postmortems succeed or fail. Vague action items die in backlogs. Specific action items get done.
| Action | Owner | Due Date | Priority |
|---|---|---|---|
| Add connection pool config to sensitive-fields registry | Vikram | March 21 | P1 |
| Require two-reviewer approval for sensitive config changes | Priya | March 24 | P1 |
| Implement canary deployment for all checkout-api releases | Dev team | April 4 | P2 |
| Lower alert threshold to 1% error rate with 2-min window | Monitoring team | March 21 | P1 |
| Update runbook with connection pool diagnostic steps | On-call rotation | March 28 | P2 |
Every action item has exactly one owner (not "the team"), a specific due date, and a priority. These get tracked in Jira or Linear — not in the postmortem document, which you will not look at again.
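The ownership rule above can be checked mechanically before items are filed. The following is a minimal sketch, not a real tracker integration: the `ActionItem` type, the `problems` function, the list of group-style owner names, and the year on the due date are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

VALID_PRIORITIES = {"P1", "P2", "P3"}
# Owner strings that name a group rather than a person (illustrative list).
GROUP_OWNERS = {"team", "the team", "dev team", "monitoring team", "on-call rotation"}

@dataclass
class ActionItem:
    action: str
    owner: str
    due: Optional[date]
    priority: str

def problems(item: ActionItem) -> List[str]:
    """Return the reasons an action item is too vague to track."""
    issues = []
    owner = item.owner.strip().lower()
    if not owner:
        issues.append("no owner")
    elif owner in GROUP_OWNERS or "," in owner or " and " in owner:
        issues.append("owner is not a single named person")
    if item.due is None:
        issues.append("no due date")
    if item.priority not in VALID_PRIORITIES:
        issues.append("invalid priority")
    return issues

good = ActionItem(
    "Add connection pool config to sensitive-fields registry",
    "Vikram",
    date(2026, 3, 21),  # year assumed from the incident date
    "P1",
)
vague = ActionItem("Implement canary deployment", "Dev team", None, "high")

print(problems(good))
print(problems(vague))
```

Under this check, an owner such as "Dev team" or "On-call rotation" is flagged, so a named individual from that group would need to be assigned before the item is accepted.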
AI tools are now generating first-draft postmortem timelines automatically from alert history, deployment logs, and Slack incident channels. Incident.io, PagerDuty, and Rootly all have this capability.
The AI draft covers timeline construction accurately — it can correlate "deployment happened at 13:58" with "errors started at 14:09" from three different data sources. This saves 30-45 minutes of timeline reconstruction work.
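To make the correlation step concrete, here is a rough sketch of merging events from several sources into one ordered timeline and measuring the gaps between them. It is not how any of the named products work internally. The events are taken from the example timeline, but which source each event came from, and the function names `merge_timeline` and `minutes_between`, are assumptions for illustration.

```python
from datetime import datetime

FMT = "%H:%M"

# Events from the example timeline. The assignment of each event to a
# source is an assumption made for this sketch.
deploy_log = [("13:58", "Deployment v2.4.1 rolled out to prod")]
alert_history = [
    ("14:09", "Error rate exceeded 5% threshold"),
    ("14:17", "Error rate crossed 95%"),
    ("14:40", "Error rate returns to baseline"),
]
incident_channel = [
    ("14:12", "On-call engineer paged"),
    ("14:31", "Rollback initiated to v2.4.0"),
]

def merge_timeline(**sources):
    """Merge per-source event lists into one list sorted by time of day."""
    merged = [
        (datetime.strptime(ts, FMT), name, text)
        for name, events in sources.items()
        for ts, text in events
    ]
    return sorted(merged, key=lambda event: event[0])

def minutes_between(start, end):
    """Whole minutes between two HH:MM timestamps on the same day."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)

timeline = merge_timeline(
    deploy=deploy_log, alerts=alert_history, chat=incident_channel
)
for when, source, text in timeline:
    print(f"{when:%H:%M} [{source}] {text}")

print("deploy to first alert:", minutes_between("13:58", "14:09"), "min")
print("full outage duration:", minutes_between("14:17", "14:40"), "min")
```

The sketch assumes all events fall on the same day and that the sources share a clock and time zone. Real incident data usually needs full timestamps and time zone normalization first.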
What AI does not do well: root cause analysis and action items. The Five Whys requires human judgment about which contributing factors are genuinely systemic versus incidental. Action item quality requires knowing your team's capacity, backlog, and what has been tried before. These sections need a human.
The practical workflow: let AI draft the timeline, write the root cause and action items yourself, review the whole thing as a team.
The postmortem document is written before the meeting, not during it. The meeting is for discussing the analysis, challenging assumptions, and ensuring action item owners understand and accept their items.
Keep it to 45 minutes. If the incident was straightforward, 30 minutes is enough. If it requires more than an hour, schedule a follow-up rather than letting the original meeting drag on and lose focus.
Appoint a facilitator who was not the primary responder. The person who was in the middle of fighting the incident for two hours is not the right person to moderate an objective analysis of what happened.
Explicitly say "no blame" at the start of every meeting until it becomes the default culture. This sounds awkward. Do it anyway — the first two months of establishing blameless culture require active reinforcement.
Writing the postmortem a week later. Details fade fast. Write the timeline within 24 hours while events are fresh, even if the full analysis takes another day.
Action items without owners. "Team will investigate" means nobody will investigate. One name per item.
Focusing on process without fixing tooling. "Engineers need to be more careful when changing connection pool settings" is not a systemic fix. "The deployment system blocks config changes to flagged fields without two approvals" is.
Never reviewing whether action items were completed. Schedule a 15-minute check-in 4 weeks after the postmortem to verify that P1 items are done and P2 items are on track.
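The tooling fix mentioned above, a deployment system that blocks changes to flagged fields without two approvals, might look roughly like the sketch below. The registry contents, the `gate` function, the flat config shape, and the previous `maxPoolSize` value are hypothetical. Only the new value of 5 comes from the example incident.

```python
# Hypothetical registry of config keys that require extra review.
SENSITIVE_FIELDS = {"maxPoolSize"}
REQUIRED_APPROVALS = 2

def changed_keys(old, new):
    """Keys that were added, removed, or given a different value."""
    return {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}

def gate(old, new, approvers):
    """Return (allowed, reason) for a proposed config change."""
    touched = changed_keys(old, new) & SENSITIVE_FIELDS
    approvals = len(set(approvers))  # count each reviewer once
    if touched and approvals < REQUIRED_APPROVALS:
        return False, (
            f"blocked: {sorted(touched)} changed with {approvals} "
            f"approval(s), {REQUIRED_APPROVALS} required"
        )
    return True, "allowed"

# The previous value of 20 is made up for illustration.
old_cfg = {"maxPoolSize": 20, "logLevel": "info"}
new_cfg = {"maxPoolSize": 5, "logLevel": "info"}

print(gate(old_cfg, new_cfg, approvers=[]))
print(gate(old_cfg, new_cfg, approvers=["priya", "vikram"]))
```

The point of the design is that the check runs in the pipeline rather than relying on reviewers to remember which fields are dangerous. Changes that touch no flagged field pass through without extra friction.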
Level 1 Write postmortems after major incidents
Level 2 Blameless framing, Five Whys analysis
Level 3 Action items tracked to completion
Level 4 Postmortems shared across teams (learning culture)
Level 5 AI-assisted timeline, trend analysis across incidents
Most teams are at Level 1. Level 3 is where you actually prevent recurrence. Level 4 is where organizational reliability compounds: teams learn from each other's incidents, not just their own.
Start a postmortem template in your team's wiki now, before the next incident. A blank template waiting is infinitely better than trying to structure your thoughts after a stressful three-hour outage.
Make postmortems searchable. Tag every postmortem with the services involved, the root cause category (config error, deployment, dependency, capacity), and the severity. In eighteen months, when you have a second incident with the same root cause category, you want to find the first one immediately.
Share postmortems with the broader engineering org. Not to broadcast failures, but because a data platform team's postmortem about a config validation gap might prevent a payments team from hitting the same problem. Shared postmortems are your organization's institutional memory about how systems fail.
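The tagging advice above (services, root cause category, severity) can be illustrated with a small in-memory index. This is only a sketch of the idea. In practice the tags would live in your wiki or tracker. The `PostmortemTags` type, the `find` function, the severity labels, and the second entry are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class PostmortemTags:
    title: str
    services: FrozenSet[str]
    root_cause_category: str  # e.g. config error, deployment, dependency, capacity
    severity: str             # labels here are placeholders

def find(index, *, service=None, category=None):
    """Return postmortems matching every filter that was provided."""
    return [
        pm for pm in index
        if (service is None or service in pm.services)
        and (category is None or pm.root_cause_category == category)
    ]

index = [
    PostmortemTags(
        "checkout-api 503s after v2.4.1",
        frozenset({"checkout-api"}),
        "config error",
        "SEV1",  # severity is not stated in the example; placeholder
    ),
    # Entirely hypothetical second entry, for illustration only.
    PostmortemTags(
        "report export timeouts",
        frozenset({"reporting"}),
        "capacity",
        "SEV3",
    ),
]

for pm in find(index, category="config error"):
    print(pm.title)
```

A fixed vocabulary for `root_cause_category` matters more than the storage format: free-text categories drift, and then the search stops finding earlier incidents.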
**References & Further Reading**
* [Google SRE Book — Chapter 15: Postmortem Culture](https://sre.google/sre-book/postmortem-culture/): The foundational reference on blameless postmortems
* [Atlassian Postmortem Guide](https://www.atlassian.com/incident-management/postmortem): Practical templates and examples
* [Incident.io](https://incident.io/): AI-assisted incident management and postmortem tooling
* [PagerDuty Postmortem Templates](https://postmortems.pagerduty.com/): Open-source postmortem templates
Focus the Five Whys on why your system had no resilience against the vendor failure, not on the vendor itself. Questions like 'why did we have no circuit breaker?' or 'why was the vendor SLA not reflected in our own SLO budget?' surface actionable internal improvements. Vendor postmortems are their responsibility — yours is hardening your dependency architecture.
Track two metrics quarterly: repeat incident rate by root cause category (config error, dependency failure, capacity) and mean time between same-category incidents. If the same root cause category appears more than once in six months despite action items being closed, the action items addressed symptoms rather than the systemic cause. Use this signal to reopen the postmortem and re-examine the Five Whys depth.
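Both metrics can be computed from a plain list of incident dates and root cause categories. The sketch below is one possible way to do it. The `repeat_stats` function, the 183-day approximation of six months, and the sample incidents after the first are assumptions for illustration, not real data.

```python
from collections import defaultdict
from datetime import date

SIX_MONTHS_DAYS = 183  # rough approximation of six months

def repeat_stats(incidents):
    """incidents: iterable of (date, root_cause_category) pairs."""
    by_category = defaultdict(list)
    for day, category in incidents:
        by_category[category].append(day)

    stats = {}
    for category, days in by_category.items():
        days.sort()
        # Days between each pair of consecutive same-category incidents.
        gaps = [(later - earlier).days for earlier, later in zip(days, days[1:])]
        stats[category] = {
            "repeats": len(days) - 1,
            "mean_days_between": sum(gaps) / len(gaps) if gaps else None,
            "repeat_within_six_months": any(g <= SIX_MONTHS_DAYS for g in gaps),
        }
    return stats

# Only the first date matches the example incident; the rest are invented.
incidents = [
    (date(2026, 3, 14), "config error"),
    (date(2026, 6, 12), "config error"),
    (date(2026, 9, 10), "config error"),
    (date(2026, 5, 2), "capacity"),
]

for category, s in repeat_stats(incidents).items():
    print(category, s)
```

The sketch does not check whether the earlier action items were closed. That condition would have to come from your tracker before deciding to reopen a postmortem.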