
Incident Management That Actually Reduces Downtime

Your system will go down. The question is: does your team know exactly what to do in the first 5 minutes? Most don't. Here's how to fix that.

November 17, 2025

We managed our first major incident the way most teams do — chaos. Three people SSHing into production simultaneously, conflicting Slack threads, no one sure who was in charge, and a customer finding out about the outage from Twitter before we told them. It took 4 hours to resolve a problem that should have taken 30 minutes. That's when we built our incident management process.

Severity Classification: Agree Before the Incident

During an incident is the worst time to debate "how bad is this?" Define severity levels upfront so the response is automatic.

| Severity | Impact | Examples | Response Time | Who Gets Paged |
|----------|--------|----------|---------------|----------------|
| SEV-1 (Critical) | Total service outage or data loss | Site down, payment processing broken, data breach | Immediate (24/7) | On-call + engineering lead + VP Engineering |
| SEV-2 (Major) | Major feature broken, degraded for many users | Search not working, login failures for 30% of users, API errors spiking | Within 30 minutes | On-call + team lead |
| SEV-3 (Minor) | Feature degraded for some users | Slow loading for one region, email notifications delayed, minor UI bug | Within 4 hours (business hours) | On-call |
| SEV-4 (Low) | Cosmetic or non-customer-impacting issue | Internal tool broken, monitoring alert noise, non-critical cron failure | Next business day | Filed as ticket |

The most common mistake: everything is SEV-1. If every incident is critical, none are. We've seen teams where "the admin dashboard is slow" triggers the same response as "payments are down." That's how alert fatigue starts.
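One way to make the response automatic is to encode the severity table as data, so triage is a lookup rather than a debate. The sketch below is a minimal illustration, not our production tooling: the `classify` helper, its input signals, and the 25% cutoff for "many users" are all assumptions you would replace with your own definitions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # total service outage or data loss
    SEV2 = 2  # major feature broken, degraded for many users
    SEV3 = 3  # feature degraded for some users
    SEV4 = 4  # cosmetic or non-customer-impacting


@dataclass(frozen=True)
class ResponsePolicy:
    response_time: str
    paged: tuple  # roles to page; an empty tuple means "file a ticket"


# Transcribed from the severity table above.
POLICIES = {
    Severity.SEV1: ResponsePolicy(
        "immediate (24/7)", ("on-call", "engineering lead", "VP Engineering")
    ),
    Severity.SEV2: ResponsePolicy("within 30 minutes", ("on-call", "team lead")),
    Severity.SEV3: ResponsePolicy("within 4 hours (business hours)", ("on-call",)),
    Severity.SEV4: ResponsePolicy("next business day", ()),
}

# Assumption: "many users" means a quarter or more. Pick your own cutoff.
MANY_USERS_FRACTION = 0.25


def classify(total_outage: bool, data_loss: bool,
             customer_impacting: bool, fraction_affected: float) -> Severity:
    """Hypothetical triage helper: map coarse impact signals to a severity."""
    if total_outage or data_loss:
        return Severity.SEV1
    if not customer_impacting:
        return Severity.SEV4
    if fraction_affected >= MANY_USERS_FRACTION:
        return Severity.SEV2
    return Severity.SEV3


if __name__ == "__main__":
    # Example from the table: login failures for 30% of users.
    sev = classify(False, False, True, 0.30)
    print(sev.name, POLICIES[sev])
```

Note that a slow admin dashboard is not customer-impacting, so it lands at SEV-4 and pages nobody, which is exactly the distinction the table is meant to enforce.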

Incident Roles: Who Does What

Three roles. No more. Assign them in the first 2 minutes.

Incident Commander (IC)

Coordinates the response. Doesn't debug — that's deliberate. The IC's job is to ask questions, track progress, make decisions (escalate? rollback? communicate?), and keep the team focused. Think air traffic controller, not pilot.

Technical Lead

Does the actual debugging and fixing. Communicates findings to the IC. Can delegate to other engineers. Makes technical recommendations; IC makes the final call on actions.

Communications Lead

Updates the status page, writes customer communications, keeps internal stakeholders informed. The IC shouldn't be writing status updates while coordinating the response — that's how things get missed.

For small teams (under 10): The IC and communications lead can be the same person if the incident is SEV-3 or lower. For SEV-1 and SEV-2, always separate these roles even if it means pulling someone off their regular work.
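The role rules above are simple enough to check mechanically when an incident is opened. The following is a rough sketch under our own assumptions: `validate_roles` and its parameters are hypothetical names, and severity is passed as a plain integer (1 for SEV-1, and so on).

```python
def validate_roles(severity: int, team_size: int,
                   ic: str, tech_lead: str, comms_lead: str) -> list[str]:
    """Return a list of rule violations; an empty list means the assignment is fine."""
    problems = []

    # The IC coordinates and does not debug, so they cannot also be the Technical Lead.
    if ic == tech_lead:
        problems.append("IC and Technical Lead must be different people")

    # IC and Communications Lead may be combined only on small teams
    # (under 10 people) and only for SEV-3 or lower.
    if ic == comms_lead:
        if severity <= 2:
            problems.append("SEV-1/SEV-2: IC and Communications Lead must be separate")
        elif team_size >= 10:
            problems.append("Teams of 10 or more: IC and Communications Lead must be separate")

    return problems


if __name__ == "__main__":
    print(validate_roles(severity=1, team_size=6,
                         ic="priya", tech_lead="arjun", comms_lead="priya"))
```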

The Incident Process (Step by Step)

Phase 1: Detection (0-5 min)

  • Alert fires or customer reports issue
  • On-call acknowledges the alert
  • Quick triage: What's the impact? What severity level?
  • Open an incident channel (Slack: #inc-2025-10-13-payment-failure)
  • Assign IC, Technical Lead, Communications Lead
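A consistent channel name makes incidents easy to find later. Here is a small sketch that builds names in the `#inc-YYYY-MM-DD-short-title` pattern shown above. The function name is hypothetical, and the 80-character cap is an assumption about Slack's channel-name limit that you should verify for your workspace.

```python
import re
from datetime import date

MAX_CHANNEL_LENGTH = 80  # assumed Slack limit; verify against your workspace


def incident_channel_name(day: date, title: str) -> str:
    """Build a channel name such as inc-2025-10-13-payment-failure."""
    # Lowercase the title and collapse anything that is not a-z or 0-9 into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    name = f"inc-{day.isoformat()}-{slug}"
    # Truncate to the limit and drop any hyphen left dangling at the end.
    return name[:MAX_CHANNEL_LENGTH].rstrip("-")


if __name__ == "__main__":
    print("#" + incident_channel_name(date(2025, 10, 13), "Payment failure"))
```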

Phase 2: Mitigation (5-30 min)

  • Goal: stop the bleeding. Not root cause — that comes later
  • Check: was there a recent deployment? → Rollback
  • Check: is it a capacity issue? → Scale up
  • Check: is it a dependency? → Circuit breaker / feature flag off
  • Communications Lead updates status page with initial message
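The three mitigation checks above amount to a short decision list. A minimal sketch follows, assuming someone on the incident has already answered the three questions. The `Signals` type and `suggest_mitigations` function are illustrative names, and the "escalate" fallback is our reading of what to do when none of the checks match.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    recent_deployment: bool   # was there a recent deployment?
    capacity_exhausted: bool  # is it a capacity issue?
    dependency_failing: bool  # is a dependency misbehaving?


def suggest_mitigations(signals: Signals) -> list[str]:
    """Return candidate mitigations in the order the checklist lists them."""
    actions = []
    if signals.recent_deployment:
        actions.append("rollback")
    if signals.capacity_exhausted:
        actions.append("scale up")
    if signals.dependency_failing:
        actions.append("circuit breaker / feature flag off")
    # Assumption: if no quick mitigation applies, escalate rather than guess.
    return actions or ["escalate"]


if __name__ == "__main__":
    print(suggest_mitigations(Signals(True, False, True)))
```

The output is a list of suggestions only. The Technical Lead still recommends and the IC still makes the call.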

Phase 3: Resolution (30 min - hours)

  • If mitigation worked, investigate root cause while system is stable
  • If mitigation didn't work, escalate: bring in more engineers, engage vendor support
  • IC provides updates every 30 minutes (even if update is "still investigating")
  • Document everything in the incident channel as it happens
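The 30-minute update cadence is easy to lose track of mid-incident, so it helps to have a bot or script nag the channel. A rough sketch of the timing logic is below. The function names are our own, and wiring it to a chat reminder is left out.

```python
from datetime import datetime, timedelta, timezone

UPDATE_INTERVAL = timedelta(minutes=30)


def next_update_due(last_update: datetime) -> datetime:
    """When the next update is owed, even if it only says 'still investigating'."""
    return last_update + UPDATE_INTERVAL


def is_update_overdue(last_update: datetime, now: datetime) -> bool:
    return now > next_update_due(last_update)


if __name__ == "__main__":
    last = datetime(2025, 10, 13, 14, 30, tzinfo=timezone.utc)
    print(next_update_due(last).isoformat())
```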

Phase 4: Recovery (hours - days)

  • Confirm system is fully operational
  • Monitor for recurrence (set up temporary alert thresholds)
  • Update status page: resolved
  • Schedule post-mortem within 48 hours

Communication Templates

Writing coherent messages during an incident is hard. Templates remove the cognitive load.

Initial Customer Notification

Subject: [Service Name] — Investigating increased error rates

We're currently investigating an issue affecting [specific feature].
Some users may experience [specific symptom: slow loading, errors, inability to...].

Our team is actively working on a resolution. We'll provide an update within 30 minutes.

Status: Investigating
Impact: [High/Medium/Low]
Started: [time] UTC

Resolution Notification

Subject: [Service Name] — Issue resolved

The issue affecting [feature] has been resolved at [time] UTC.

What happened: [1-2 sentence summary]
Duration: [X hours Y minutes]
Impact: [what users experienced]

We're conducting a thorough review to prevent recurrence.
We apologize for the disruption.
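Templates reduce load further when the fields are filled in by code. The sketch below renders a shortened version of the resolution notification using Python's standard `string.Template` and computes the duration from two timestamps. The placeholder names and helper functions are assumptions for illustration.

```python
from datetime import datetime, timezone
from string import Template

# Shortened version of the resolution template above; placeholder names are ours.
RESOLUTION = Template(
    "Subject: $service - Issue resolved\n"
    "\n"
    "The issue affecting $feature has been resolved at $resolved_at UTC.\n"
    "\n"
    "What happened: $summary\n"
    "Duration: $duration\n"
    "Impact: $impact\n"
)


def format_duration(start: datetime, end: datetime) -> str:
    """Format an interval as 'X hours Y minutes', rounding down to whole minutes."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


def render_resolution(service: str, feature: str, summary: str, impact: str,
                      start: datetime, end: datetime) -> str:
    # substitute() raises KeyError if a placeholder is left unfilled,
    # so a half-completed message cannot be sent by accident.
    return RESOLUTION.substitute(
        service=service,
        feature=feature,
        summary=summary,
        impact=impact,
        resolved_at=end.strftime("%H:%M"),
        duration=format_duration(start, end),
    )


if __name__ == "__main__":
    started = datetime(2025, 11, 17, 14, 23, tzinfo=timezone.utc)
    resolved = datetime(2025, 11, 17, 15, 5, tzinfo=timezone.utc)
    print(render_resolution("Example Service", "checkout",
                            "A database migration dropped an index.",
                            "Some checkout requests failed.",
                            started, resolved))
```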

On-Call That Doesn't Burn People Out

| Practice | Why It Matters |
|----------|----------------|
| Rotate weekly | Longer rotations lead to burnout. Shorter ones don't give enough context. One week is the sweet spot |
| Compensate on-call | Extra pay or time off. On-call restricts personal time, and that deserves compensation. ₹5,000-15,000/week is common in India |
| Follow-the-sun | If your team spans time zones, nobody should be on-call outside business hours. Mumbai hands off to a US-based engineer at night |
| Reduce alert noise | If on-call gets woken up 3 times a night for non-issues, people will mute their phones. Every false alert erodes trust in alerting |
| On-call handoff document | Outgoing on-call writes 5 lines: what happened this week, what's flaky, what to watch. Takes 5 minutes, saves hours |
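A weekly rotation is straightforward to generate ahead of time so people can plan around it. This is a minimal sketch with made-up names. Real schedules in PagerDuty or Opsgenie also handle overrides and holidays, which this ignores.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def weekly_rotation(engineers: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
    """Assign one engineer per week, cycling through the list in order."""
    names = islice(cycle(engineers), weeks)
    return [(start + timedelta(weeks=i), name) for i, name in enumerate(names)]


if __name__ == "__main__":
    # Hypothetical three-person team, rotation starting on a Monday.
    for week_start, engineer in weekly_rotation(["asha", "ben", "chen"], date(2025, 11, 17), 4):
        print(week_start.isoformat(), engineer)
```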

Blameless Post-Mortems

"Blameless" doesn't mean "no accountability." It means focusing on system failures rather than human failures. The developer who deployed the bug isn't the root cause — the system that let a buggy deployment reach production is.

Post-Mortem Template

## Incident: [Title]
**Date:** [date] | **Duration:** [X hours Y min] | **Severity:** SEV-[N]

### Summary
[2-3 sentences: what happened, who was impacted, how it was resolved]

### Timeline (UTC)
- 14:23 — Alert fires: error rate exceeds 5%
- 14:25 — On-call acknowledges, opens incident channel
- 14:30 — IC assigned, begins triage
- 14:42 — Root cause identified: DB migration dropped an index
- 14:55 — Fix deployed: index recreated
- 15:05 — Error rate returns to baseline. Incident resolved.

### Root Cause
[Technical explanation. What broke and why]

### What Went Well
- Alert fired within 2 minutes of impact
- IC was assigned quickly
- Rollback plan was available

### What Went Poorly
- No pre-deploy check for missing indexes
- Status page wasn't updated until 25 min in
- Runbook for DB issues was outdated

### Action Items
| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add index validation to migration CI | @alice | Nov 20 | P1 |
| Update DB incident runbook | @bob | Nov 22 | P2 |
| Automate status page updates | @carol | Dec 5 | P2 |

The most important part is the action items. A post-mortem without action items is just a story. Every action item needs an owner and a due date. Review open action items weekly until they're done.
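The weekly review can be backed by a small check that flags action items with no owner, no due date, or a due date that has passed. The sketch below is illustrative only: the `ActionItem` type and `audit` function are hypothetical, and the example dates assume the year 2025.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    action: str
    owner: Optional[str]
    due: Optional[date]
    done: bool = False


def audit(items: list[ActionItem], today: date) -> list[str]:
    """List problems with open action items: missing owner, missing or past due date."""
    problems = []
    for item in items:
        if item.done:
            continue
        if not item.owner:
            problems.append(f"{item.action}: no owner")
        if item.due is None:
            problems.append(f"{item.action}: no due date")
        elif item.due < today:
            problems.append(f"{item.action}: overdue since {item.due.isoformat()}")
    return problems


if __name__ == "__main__":
    items = [
        ActionItem("Add index validation to migration CI", "@alice", date(2025, 11, 20)),
        ActionItem("Update DB incident runbook", "@bob", date(2025, 11, 22)),
    ]
    for problem in audit(items, today=date(2025, 11, 21)):
        print(problem)
```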

Frequently Asked Questions

When should we page someone versus creating a ticket?

Page for customer-impacting issues (SEV-1/2). Create tickets for everything else. If you're unsure, err on the side of paging — a 30-second acknowledgment costs less than an hour of undetected downtime. Review your paging threshold quarterly and adjust based on false positive rates.

How do we prevent alert fatigue?

Every alert that fires must be actionable. If on-call looks at an alert and says "I'll check tomorrow," that alert shouldn't page. Review all alerts monthly: delete noisy ones, tighten thresholds, add context. Target: under 5 pages per on-call week. If you're above that, fix alert quality first, before any other reliability work.
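To track the "under 5 pages per on-call week" target, count pages per week from your paging tool's export. The sketch below assumes you already have page timestamps as `datetime` objects and that on-call weeks line up with ISO calendar weeks. Both are assumptions, and the function names are ours.

```python
from collections import Counter
from datetime import datetime


def pages_per_week(page_times: list[datetime]) -> Counter:
    """Count pages per (ISO year, ISO week)."""
    counts = Counter()
    for t in page_times:
        iso = t.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return counts


def noisy_weeks(page_times: list[datetime], target: int = 5) -> list[tuple[int, int]]:
    """Weeks that miss the target, i.e. weeks with `target` or more pages."""
    counts = pages_per_week(page_times)
    return sorted(week for week, n in counts.items() if n >= target)


if __name__ == "__main__":
    pages = [datetime(2025, 11, day, 3, 0) for day in (17, 18, 19, 20, 21, 24, 25)]
    print(noisy_weeks(pages))
```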

Should post-mortems be public or internal?

Internal post-mortems always. Public post-mortems for significant incidents (30+ minutes downtime) if your users expect transparency. Companies like Cloudflare and GitLab publish excellent public post-mortems that build trust. Sanitize any sensitive technical details before publishing.

What tools do we need for incident management?

At minimum: PagerDuty or Opsgenie for alerting/on-call, Slack or Teams for incident communication, and a status page (Statuspage.io, Instatus, or a simple self-hosted page). You don't need a dedicated incident management platform until you're handling 10+ incidents per month.

Pillai Infotech Engineering Team

We've managed incidents for payment systems, healthcare platforms, and high-traffic SaaS products. Our process was forged in real incidents — every improvement came from something that went wrong.
