We managed our first major incident the way most teams do: with chaos. Three people SSHing into production simultaneously, conflicting Slack threads, no one sure who was in charge, and a customer finding out about the outage from Twitter before we told them. It took 4 hours to resolve a problem that should have taken 30 minutes. That's when we built our incident management process.
Severity Classification: Agree Before the Incident
During an incident is the worst time to debate "how bad is this?" Define severity levels upfront so the response is automatic.
| Severity | Impact | Examples | Response Time | Who Gets Paged |
|---|---|---|---|---|
| SEV-1 (Critical) | Total service outage or data loss | Site down, payment processing broken, data breach | Immediate (24/7) | On-call + engineering lead + VP Engineering |
| SEV-2 (Major) | Major feature broken, degraded for many users | Search not working, login failures for 30% of users, API errors spiking | Within 30 minutes | On-call + team lead |
| SEV-3 (Minor) | Feature degraded for some users | Slow loading for one region, email notifications delayed, minor UI bug | Within 4 hours (business hours) | On-call |
| SEV-4 (Low) | Cosmetic or non-customer-impacting issue | Internal tool broken, monitoring alert noise, non-critical cron failure | Next business day | Filed as ticket |
The most common mistake: everything is SEV-1. If every incident is critical, none are. We've seen teams where "the admin dashboard is slow" triggers the same response as "payments are down." That's how alert fatigue starts.
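One way to keep the response automatic is to store the severity table as data that tooling can look up. The following is a minimal Python sketch under stated assumptions: the names `Severity`, `ResponsePolicy`, `POLICIES`, and `who_gets_paged`, and the role keys such as `engineering-lead`, are hypothetical and not tied to any paging product. The values are copied from the table above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Critical: total service outage or data loss
    SEV2 = 2  # Major: major feature broken, degraded for many users
    SEV3 = 3  # Minor: feature degraded for some users
    SEV4 = 4  # Low: cosmetic or non-customer-impacting issue


@dataclass(frozen=True)
class ResponsePolicy:
    response_time: str
    notify: tuple  # empty tuple means "file a ticket, page nobody"


# Hypothetical role keys; the response times mirror the severity table.
POLICIES = {
    Severity.SEV1: ResponsePolicy(
        "immediate (24/7)", ("on-call", "engineering-lead", "vp-engineering")
    ),
    Severity.SEV2: ResponsePolicy("within 30 minutes", ("on-call", "team-lead")),
    Severity.SEV3: ResponsePolicy("within 4 hours (business hours)", ("on-call",)),
    Severity.SEV4: ResponsePolicy("next business day", ()),
}


def who_gets_paged(severity: Severity) -> tuple:
    """Return the paging list for a severity level, with no debate needed."""
    return POLICIES[severity].notify
```

With the policy in one place, changing who gets paged for a SEV-2 is a reviewed one-line change rather than a conversation held mid-incident.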
Incident Roles: Who Does What
Three roles. No more. Assign them in the first 2 minutes.
Incident Commander (IC)
Coordinates the response. Doesn't debug — that's deliberate. The IC's job is to ask questions, track progress, make decisions (escalate? rollback? communicate?), and keep the team focused. Think air traffic controller, not pilot.
Technical Lead
Does the actual debugging and fixing. Communicates findings to the IC. Can delegate to other engineers. Makes technical recommendations; IC makes the final call on actions.
Communications Lead
Updates the status page, writes customer communications, keeps internal stakeholders informed. The IC shouldn't be writing status updates while coordinating the response — that's how things get missed.
The Incident Process (Step by Step)
Phase 1: Detection (0-5 min)
- Alert fires or customer reports issue
- On-call acknowledges the alert
- Quick triage: What's the impact? What severity level?
- Open an incident channel (Slack: #inc-2025-10-13-payment-failure)
- Assign IC, Technical Lead, Communications Lead
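A consistent channel name is easier to get right if a script generates it. This sketch assumes the `inc-<date>-<short-summary>` pattern from the example channel above. The function name `incident_channel_name` and the 40-character slug limit are our own illustrative choices, and the leading `#` is left for the chat client to add.

```python
import re
from datetime import date


def incident_channel_name(day: date, summary: str, max_slug_len: int = 40) -> str:
    """Build a channel name such as inc-2025-10-13-payment-failure."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    # Truncate long summaries without leaving a trailing hyphen.
    slug = slug[:max_slug_len].rstrip("-")
    return f"inc-{day.isoformat()}-{slug}"


print(incident_channel_name(date(2025, 10, 13), "Payment failure"))
# prints: inc-2025-10-13-payment-failure
```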
Phase 2: Mitigation (5-30 min)
- Goal: stop the bleeding. Not root cause — that comes later
- Check: was there a recent deployment? → Rollback
- Check: is it a capacity issue? → Scale up
- Check: is it a dependency? → Circuit breaker / feature flag off
- Communications Lead updates status page with initial message
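The three checks above form an ordered decision list, which can be sketched in a few lines of Python. The `Signals` fields and the action strings are hypothetical names for illustration. Falling through to `"escalate"` when no check matches is our assumption, borrowed from the escalation step in Phase 3.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    recent_deploy: bool = False
    capacity_exhausted: bool = False
    dependency_failing: bool = False


def first_mitigation(signals: Signals) -> str:
    """Pick the first mitigation to try, in the order of the checklist."""
    if signals.recent_deploy:
        return "rollback"
    if signals.capacity_exhausted:
        return "scale-up"
    if signals.dependency_failing:
        return "circuit-breaker-or-feature-flag-off"
    # Assumption: nothing obvious matched, so bring in more people.
    return "escalate"
```

The order matters: a recent deployment is checked first because a rollback is usually the cheapest action to try and to undo.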
Phase 3: Resolution (30 min - hours)
- If mitigation worked, investigate root cause while system is stable
- If mitigation didn't work, escalate: bring in more engineers, engage vendor support
- Communications Lead posts an update every 30 minutes (even if the update is "still investigating")
- Document everything in the incident channel as it happens
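The 30-minute update cadence is easy to lose track of under pressure, so a bot or reminder can do the arithmetic. Here is a small sketch; the function names `next_update_due` and `is_update_overdue` are hypothetical, and timestamps are assumed to be timezone-aware UTC values.

```python
from datetime import datetime, timedelta, timezone

UPDATE_INTERVAL = timedelta(minutes=30)  # cadence from the process above


def next_update_due(last_update: datetime,
                    interval: timedelta = UPDATE_INTERVAL) -> datetime:
    """Return when the next update should be posted."""
    return last_update + interval


def is_update_overdue(last_update: datetime, now: datetime,
                      interval: timedelta = UPDATE_INTERVAL) -> bool:
    """True once a full interval has passed since the last update."""
    return now - last_update >= interval


last = datetime(2025, 10, 13, 14, 30, tzinfo=timezone.utc)
print(next_update_due(last).strftime("%H:%M"))  # prints: 15:00
```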
Phase 4: Recovery (hours - days)
- Confirm system is fully operational
- Monitor for recurrence (set up temporary alert thresholds)
- Update status page: resolved
- Schedule post-mortem within 48 hours
Communication Templates
Writing coherent messages during an incident is hard. Templates remove the cognitive load.
Initial Customer Notification
Subject: [Service Name] — Investigating increased error rates
We're currently investigating an issue affecting [specific feature].
Some users may experience [specific symptom: slow loading, errors, inability to...].
Our team is actively working on a resolution. We'll provide an update within 30 minutes.
Status: Investigating
Impact: [High/Medium/Low]
Started: [time] UTC
Resolution Notification
Subject: [Service Name] — Issue resolved
The issue affecting [feature] has been resolved at [time] UTC.
What happened: [1-2 sentence summary]
Duration: [X hours Y minutes]
Impact: [what users experienced]
We're conducting a thorough review to prevent recurrence.
We apologize for the disruption.
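Templates like these can be filled programmatically so that no bracketed placeholder slips into a customer email. The sketch below uses Python's standard `string.Template` on the body fields of the resolution notification only. The names `RESOLUTION_BODY`, `format_duration`, and `render_resolution` are hypothetical, and the sample values in the usage lines are invented for illustration.

```python
from datetime import datetime, timezone
from string import Template

RESOLUTION_BODY = Template(
    "The issue affecting $feature has been resolved at $resolved_at UTC.\n"
    "What happened: $summary\n"
    "Duration: $duration\n"
    "Impact: $impact\n"
)


def format_duration(start: datetime, end: datetime) -> str:
    """Format elapsed time as 'X hours Y minutes', as in the template."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


def render_resolution(feature: str, summary: str, impact: str,
                      start: datetime, end: datetime) -> str:
    # substitute() raises KeyError if any placeholder is left unfilled.
    return RESOLUTION_BODY.substitute(
        feature=feature,
        resolved_at=end.strftime("%H:%M"),
        summary=summary,
        impact=impact,
        duration=format_duration(start, end),
    )


start = datetime(2025, 10, 13, 14, 23, tzinfo=timezone.utc)
end = datetime(2025, 10, 13, 15, 5, tzinfo=timezone.utc)
print(render_resolution("checkout", "A migration dropped an index.",
                        "Some payments failed.", start, end))
```

Using `substitute` rather than `safe_substitute` is deliberate here: a missing field fails loudly before anything is sent.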
On-Call That Doesn't Burn People Out
| Practice | Why It Matters |
|---|---|
| Rotate weekly | Longer rotations lead to burnout. Shorter ones don't give enough context. One week is the sweet spot |
| Compensate on-call | Extra pay or time off. On-call restricts personal time — that deserves compensation. ₹5,000-15,000/week is common in India |
| Follow-the-sun | If your team spans time zones, nobody should be on-call outside their local business hours. For example, the Mumbai engineer hands off to a US-based engineer when the Mumbai workday ends |
| Reduce alert noise | If on-call gets woken up 3 times a night for non-issues, people will mute their phones. Every false alert erodes trust in alerting |
| On-call handoff document | Outgoing on-call writes 5 lines: what happened this week, what's flaky, what to watch. Takes 5 minutes, saves hours |
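A weekly rotation can be computed from a start date and a list of names, which avoids maintaining the schedule by hand. A minimal sketch, assuming a simple round-robin with no swaps or holidays; the function `on_call_for` and the engineer names are hypothetical.

```python
from datetime import date


def on_call_for(day: date, engineers: list, rotation_start: date) -> str:
    """Return who is on call on `day` in a weekly round-robin rotation.

    rotation_start is the first day of the first engineer's week.
    """
    weeks_elapsed = (day - rotation_start).days // 7
    return engineers[weeks_elapsed % len(engineers)]


team = ["asha", "bruno", "chen"]
start = date(2025, 10, 6)  # a Monday
print(on_call_for(date(2025, 10, 13), team, start))  # prints: bruno
```

Real schedules also need overrides for leave and swaps, which is the part tools like PagerDuty or Opsgenie handle for you.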
Blameless Post-Mortems
"Blameless" doesn't mean "no accountability." It means focusing on system failures rather than human failures. The developer who deployed the bug isn't the root cause — the system that let a buggy deployment reach production is.
Post-Mortem Template
## Incident: [Title]
**Date:** [date] | **Duration:** [X hours Y min] | **Severity:** SEV-[N]
### Summary
[2-3 sentences: what happened, who was impacted, how it was resolved]
### Timeline (UTC)
- 14:23 — Alert fires: error rate exceeds 5%
- 14:25 — On-call acknowledges, opens incident channel
- 14:30 — IC assigned, begins triage
- 14:42 — Root cause identified: DB migration dropped an index
- 14:55 — Fix deployed: index recreated
- 15:05 — Error rate returns to baseline. Incident resolved.
### Root Cause
[Technical explanation. What broke and why]
### What Went Well
- Alert fired within 2 minutes of impact
- IC was assigned quickly
- Rollback plan was available
### What Went Poorly
- No pre-deploy check for missing indexes
- Status page wasn't updated until 25 min in
- Runbook for DB issues was outdated
### Action Items
| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add index validation to migration CI | @alice | Nov 20 | P1 |
| Update DB incident runbook | @bob | Nov 22 | P2 |
| Automate status page updates | @carol | Dec 5 | P2 |
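The timeline in the template is also the raw data for response metrics. As an illustration, this Python sketch derives time to acknowledge, time to root cause, and total duration from the example timestamps above. The dictionary keys and the helper `minutes_between` are hypothetical names, and the helper assumes all events fall on the same day.

```python
from datetime import datetime

# Timestamps copied from the example timeline (UTC, same day).
TIMELINE = {
    "alert_fired": "14:23",
    "acknowledged": "14:25",
    "root_cause_identified": "14:42",
    "fix_deployed": "14:55",
    "resolved": "15:05",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_acknowledge = minutes_between(TIMELINE["alert_fired"], TIMELINE["acknowledged"])
time_to_root_cause = minutes_between(TIMELINE["alert_fired"], TIMELINE["root_cause_identified"])
time_to_resolve = minutes_between(TIMELINE["alert_fired"], TIMELINE["resolved"])

print(time_to_acknowledge, time_to_root_cause, time_to_resolve)  # prints: 2 19 42
```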
Frequently Asked Questions
When should we page someone versus creating a ticket?
Page for customer-impacting issues (SEV-1/2). Create tickets for everything else. If you're unsure, err on the side of paging — a 30-second acknowledgment costs less than an hour of undetected downtime. Review your paging threshold quarterly and adjust based on false positive rates.
How do we prevent alert fatigue?
Every alert that fires must be actionable. If on-call looks at an alert and says "I'll check tomorrow," that alert shouldn't page. Review all alerts monthly: delete noisy ones, tighten thresholds, add context. Target: under 5 pages per on-call week. If you're above that, improving alert quality comes before other reliability work.
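The monthly review is simpler if each page is logged with whether it was actionable. The sketch below checks one on-call week against the target of under 5 pages and lists the alerts that paged without being actionable. The `Page` record and `review_week` function are hypothetical, and how the `actionable` flag gets recorded is left to your tooling.

```python
from dataclasses import dataclass

PAGE_TARGET_PER_WEEK = 5  # target: strictly fewer than this many pages


@dataclass
class Page:
    alert_name: str
    actionable: bool  # did on-call need to act right away?


def review_week(pages: list) -> dict:
    """Summarize one on-call week for the alert review."""
    noisy = sorted({p.alert_name for p in pages if not p.actionable})
    return {
        "total_pages": len(pages),
        "over_target": len(pages) >= PAGE_TARGET_PER_WEEK,
        "non_actionable_alerts": noisy,  # candidates to delete or demote
    }
```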
Should post-mortems be public or internal?
Internal post-mortems always. Public post-mortems for significant incidents (30+ minutes downtime) if your users expect transparency. Companies like Cloudflare and GitLab publish excellent public post-mortems that build trust. Sanitize any sensitive technical details before publishing.
What tools do we need for incident management?
At minimum: PagerDuty or Opsgenie for alerting/on-call, Slack or Teams for incident communication, and a status page (Statuspage.io, Instatus, or a simple self-hosted page). You don't need a dedicated incident management platform until you're handling 10+ incidents per month.