Postmortem: [Incident Title]
1. Incident Summary
- Date/Time: [Start Date and Time] - [End Date and Time] (Timezone)
- Duration: [Duration of the Incident]
- Impact: [Brief summary of the impact, e.g., "API requests failed for 90% of users"]
- Services Affected: [List of affected services or systems]
- Root Cause: [Brief explanation of the root cause]
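
The Duration field is easy to get wrong when the start and end times were recorded in different timezones. A minimal sketch of deriving it from the two timestamps, assuming both are written as ISO 8601 strings with an explicit UTC offset (the function name `incident_duration` and the sample timestamps are illustrative, not part of this template):

```python
from datetime import datetime


def incident_duration(start_iso: str, end_iso: str) -> str:
    """Return the elapsed time between two ISO 8601 timestamps as 'Hh MMm'."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)

    # Require explicit offsets so timestamps from different zones compare correctly.
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("timestamps must include a UTC offset, e.g. +00:00")
    if end < start:
        raise ValueError("end time precedes start time")

    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


# Hypothetical incident: starts 09:05 at UTC-5, ends 15:47 UTC.
print(incident_duration("2024-03-01T09:05:00-05:00", "2024-03-01T15:47:00+00:00"))  # 1h 42m
```

Rejecting timestamps without an offset is a deliberate choice here: it forces whoever fills in the summary to state the timezone rather than leave it implied.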
2. Timeline of Events
Time (Timezone) | Event Description |
---|---|
HH:MM | [Event 1: Detection of the issue] |
HH:MM | [Event 2: Actions taken] |
HH:MM | [Event 3: Identification of root cause] |
HH:MM | [Event 4: Resolution applied] |
3. Root Cause Analysis
- Primary Cause: [Detailed technical explanation of what caused the issue]
- Contributing Factors:
  - [Factor 1]
  - [Factor 2]
4. Impact Analysis
- Users Affected: [Number/percentage of affected users]
- Business Impact: [e.g., revenue loss, SLA breach, customer complaints]
- Technical Impact: [e.g., downtime, performance degradation]
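
When the Business Impact entry mentions an SLA breach, it helps to show the arithmetic behind the claim. The sketch below is one way to do that, assuming the SLA is expressed as an availability percentage over a fixed window (30 days by default). The function names, the window length, and the sample figures are hypothetical; substitute the terms of the actual agreement.

```python
def availability_percent(downtime_minutes: float, period_days: int = 30) -> float:
    """Availability over the period, as a percentage, given total downtime."""
    period_minutes = period_days * 24 * 60
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime must be between 0 and the period length")
    return 100.0 * (1 - downtime_minutes / period_minutes)


def breaches_sla(downtime_minutes: float, target_percent: float,
                 period_days: int = 30) -> bool:
    """True if the measured availability falls below the SLA target."""
    return availability_percent(downtime_minutes, period_days) < target_percent


# Hypothetical example: 102 minutes of downtime against a 99.9% monthly target.
print(round(availability_percent(102), 3))
print(breaches_sla(102, target_percent=99.9))
```

This treats all downtime as full unavailability; a partial outage (for example, failures for only a share of users) would need weighting that the SLA itself should define.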
5. Mitigation and Recovery
- Steps Taken: [Describe the actions taken to mitigate and resolve the incident]
- Short-Term Fixes: [What was done to immediately restore service]
6. Action Items
Task Description | Owner | Priority | Deadline |
---|---|---|---|
[Action item 1: Add monitoring] | [Owner 1] | [High/Medium/Low] | [YYYY-MM-DD] |
[Action item 2: Improve alert thresholds] | [Owner 2] | [High/Medium/Low] | [YYYY-MM-DD] |
7. Lessons Learned
- What Went Well:
  - [Point 1]
  - [Point 2]
- What Didn't Go Well:
  - [Point 1]
  - [Point 2]
- Improvements for the Future:
  - [Point 1]
  - [Point 2]
8. Supporting Data
- Logs: [Link to relevant logs or attach snippets]
- Metrics: [Include charts, graphs, or screenshots showing key metrics during the incident]
- Diagrams: [Include architecture or flow diagrams if necessary]
9. Communication and Review
- Stakeholders Notified: [List of stakeholders and communication methods used]
- Postmortem Review Meeting: [Scheduled date and time for review]