Incident Response Postmortem Template
A structured template for writing blameless incident postmortems. Copy it to your clipboard or download the Markdown file.
postmortem.md
# Incident Postmortem: [Title]
**Date:** YYYY-MM-DD
**Duration:** HH:MM – HH:MM (X hours Y minutes)
**Severity:** SEV-1 / SEV-2 / SEV-3
**Author:** [Your Name]
**Status:** Draft / Final
---
## Summary
A brief (2-3 sentence) summary of what happened, the user impact,
and the resolution. This should be understandable by anyone in the
organization.
---
## Timeline
All times in UTC.
| Time | Event |
|------|-------|
| HH:MM | First alert triggered (e.g., CheckUpstream incident.created) |
| HH:MM | On-call engineer acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service fully recovered |
---
## Root Cause
A detailed technical explanation of what caused the incident.
Include the chain of events that led to the failure.
- What component failed?
- Why did the existing safeguards not prevent this?
- Was this a latent issue or newly introduced?
---
## Impact
- **Users affected:** X% of total users / N users
- **Revenue impact:** $X or N/A
- **Data loss:** Yes / No (describe if yes)
- **SLA breach:** Yes / No
- **Error budget consumed:** X%
### Services affected
| Service | Status during incident | Duration |
|---------|----------------------|----------|
| Example API | Degraded | 45 min |
| Example Dashboard | Unavailable | 30 min |
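For the error-budget line above, one common convention (an assumption here, not part of the template) is to express the incident's downtime as a percentage of the downtime allowed by the SLO over a rolling window. A minimal sketch, assuming a 99.9% availability SLO and a 30-day window:

```python
# Sketch: error budget consumed over a 30-day window (assumed convention).
SLO_TARGET = 0.999             # assumed 99.9% availability SLO
WINDOW_MINUTES = 30 * 24 * 60  # 30-day rolling window

def error_budget_consumed(downtime_minutes: float) -> float:
    """Percent of the window's error budget burned by this incident."""
    budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES  # ~43.2 min for 99.9%
    return 100 * downtime_minutes / budget_minutes

# A 45-minute full outage against a 99.9% SLO burns more than
# the entire month's budget:
print(round(error_budget_consumed(45), 1))  # → 104.2
```

If only part of traffic was affected, scale the downtime by the affected fraction before dividing.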
---
## Resolution
What was done to resolve the incident? Include the specific fix
and any temporary mitigations applied.
1. Step one of the resolution
2. Step two of the resolution
3. Verification steps taken
---
## Action Items
| Priority | Action | Owner | Due Date | Status |
|----------|--------|-------|----------|--------|
| P0 | [Critical fix description] | @engineer | YYYY-MM-DD | Open |
| P1 | [Important improvement] | @engineer | YYYY-MM-DD | Open |
| P2 | [Nice-to-have hardening] | @engineer | YYYY-MM-DD | Open |
---
## Lessons Learned
### What went well
- List things that worked during incident response
- e.g., Alerts fired within 2 minutes of the issue
- e.g., Runbook was up-to-date and helpful
### What went poorly
- List things that did not work or slowed response
- e.g., Took 20 minutes to identify the affected service
- e.g., No rollback procedure was documented
### Where we got lucky
- List things that could have made this worse
- e.g., The incident happened during low-traffic hours
---
## Detection
How was this incident detected? Check all that apply:
- [ ] Automated monitoring / alerting (CheckUpstream, PagerDuty, etc.)
- [ ] Customer report
- [ ] Internal user report
- [ ] Scheduled health check
- [ ] Other: ___
**Time to detect:** X minutes (impact start → first alert)
**Time to mitigate:** X minutes (impact start → impact reduced)
**Time to resolve:** X minutes (impact start → full recovery)
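The three metrics above can be read straight off the timeline. A minimal sketch, using hypothetical HH:MM timestamps (not from any real incident) and assuming all events fall within the same UTC day:

```python
# Sketch: derive detection metrics from timeline entries (hypothetical times).
from datetime import datetime

FMT = "%H:%M"

def minutes_between(start: str, end: str) -> int:
    """Minutes between two HH:MM timestamps on the same UTC day."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)

# Hypothetical timeline values:
impact_start, first_alert = "14:02", "14:05"
mitigated, resolved = "14:35", "14:50"

print("Time to detect:  ", minutes_between(impact_start, first_alert), "min")  # 3
print("Time to mitigate:", minutes_between(impact_start, mitigated), "min")    # 33
print("Time to resolve: ", minutes_between(impact_start, resolved), "min")     # 48
```

For incidents spanning midnight, use full ISO timestamps (date plus time) instead of bare HH:MM values.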
Tips for effective postmortems
Be blameless
Focus on systems and processes, not individuals. The goal is to learn and improve, not to assign blame.
Be specific
Include exact timestamps, error messages, and metrics. Vague postmortems lead to vague action items.
Follow up on action items
A postmortem without completed action items is just documentation. Track items to completion.
Share broadly
Postmortems are most valuable when shared across the organization. Others can learn from your incidents.