Incident Title:
Provide a clear title summarizing the issue.
Date of Incident:
YYYY-MM-DD
Reported By:
Name/Team
Incident Owner:
Name/Team responsible for resolution and RCA
Incident Severity:
Sev 1 / Sev 2 / Sev 3
A brief summary of what happened. Include what was affected, the impact, and high-level timeline.
| Time (UTC) | Event Description |
|---|---|
| 00:00 | Incident started |
| 00:15 | Alert triggered |
| 00:30 | Investigation began |
| 01:00 | Root cause identified |
| 01:30 | Fix applied |
| 02:00 | Systems recovered |
Describe the root cause in detail. Include contributing factors and why the issue wasn’t caught earlier.
List the steps taken to mitigate and resolve the issue.
What long-term fix is implemented or planned to prevent recurrence?
What was learned from this incident? What went well? What could’ve been better?
| Action Item | Owner | Due Date | Status |
|---|---|---|---|
| Add new CloudWatch alarm | DevOps Team | 2025-03-25 | In Progress |
| Review auto-scaling configuration | Cloud Team | 2025-03-27 | Planned |