Postmortem: [Incident Title]

1. Incident Summary

  • Date/Time: [Start Date and Time] - [End Date and Time] [Timezone]
  • Duration: [Duration of the Incident]
  • Impact: [Brief summary of the impact, e.g., "API requests failed for 90% of users"]
  • Services Affected: [List of affected services or systems]
  • Root Cause: [Brief explanation of the root cause]
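
The Duration field can be derived from the Date/Time field instead of being worked out by hand. The sketch below is one possible way to do that, assuming the start and end are recorded as ISO 8601 timestamps with explicit UTC offsets. The function name `incident_duration` and the sample timestamps are hypothetical and are not part of this template.

```python
from datetime import datetime


def incident_duration(start: str, end: str) -> str:
    """Return the time elapsed between two ISO 8601 timestamps as 'Hh MMm'.

    Both timestamps must carry a UTC offset (e.g. +00:00) so that
    incidents recorded in different timezones are compared correctly.
    """
    started = datetime.fromisoformat(start)
    ended = datetime.fromisoformat(end)
    if started.tzinfo is None or ended.tzinfo is None:
        raise ValueError("timestamps must include a UTC offset")
    if ended < started:
        raise ValueError("end time precedes start time")
    # Whole minutes are precise enough for a postmortem summary.
    minutes = int((ended - started).total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


# Hypothetical incident window, for illustration only.
print(incident_duration("2024-01-01T10:00+00:00", "2024-01-01T11:35+00:00"))
```

Requiring an explicit offset is a deliberate choice here: it keeps the computed duration consistent with the timezone stated in the Date/Time field.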

2. Timeline of Events

Time (Timezone) | Event Description
HH:MM           | [Event 1: Detection of the issue]
HH:MM           | [Event 2: Actions taken]
HH:MM           | [Event 3: Identification of root cause]
HH:MM           | [Event 4: Resolution applied]

3. Root Cause Analysis

  • Primary Cause: [Detailed technical explanation of what caused the issue]
  • Contributing Factors:
      • [Factor 1]
      • [Factor 2]

4. Impact Analysis

  • Users Affected: [Number/percentage of affected users]
  • Business Impact: [e.g., revenue loss, SLA breach, customer complaints]
  • Technical Impact: [e.g., downtime, performance degradation]

5. Mitigation and Recovery

  • Steps Taken: [Describe the actions taken to mitigate and resolve the incident]
  • Short-Term Fixes: [What was done to immediately restore service]

6. Action Items

Task Description                           | Owner     | Priority          | Deadline
[Action item 1: Add monitoring]            | [Owner 1] | [High/Medium/Low] | [YYYY-MM-DD]
[Action item 2: Improve alert thresholds]  | [Owner 2] | [High/Medium/Low] | [YYYY-MM-DD]

7. Lessons Learned

  • What Went Well:
      • [Point 1]
      • [Point 2]
  • What Didn't Go Well:
      • [Point 1]
      • [Point 2]
  • Improvements for the Future:
      • [Point 1]
      • [Point 2]

8. Supporting Data

  • Logs: [Link to relevant logs or attach snippets]
  • Metrics: [Include charts, graphs, or screenshots showing key metrics during the incident]
  • Diagrams: [Include architecture or flow diagrams if necessary]

9. Communication and Review

  • Stakeholders Notified: [List of stakeholders and communication methods used]
  • Postmortem Review Meeting: [Scheduled date and time for review]