Incident Management Best Practices

We managed our first major incident the way most teams do — chaos. Three people SSHing into production simultaneously, conflicting Slack threads, no one sure who was in charge, and a customer finding out about the outage from Twitter before we told them. It took 4 hours to resolve a problem that should have taken 30 minutes. That's when we built our incident management process.

Severity Classification: Agree Before the Incident

During an incident is the worst time to debate "how bad is this?" Define severity levels upfront so the response is automatic.

| Severity | Impact | Examples | Response Time | Who Gets Paged |
|----------|--------|----------|---------------|----------------|
| SEV-1 (Critical) | Total service outage or data loss | Site down, payment processing broken, data breach | Immediate (24/7) | On-call + engineering lead + VP Engineering |
| SEV-2 (Major) | Major feature broken, degraded for many users | Search not working, login failures for 30% of users, API errors spiking | Within 30 minutes | On-call + team lead |
| SEV-3 (Minor) | Feature degraded for some users | Slow loading for one region, email notifications delayed, minor UI bug | Within 4 hours (business hours) | On-call |
| SEV-4 (Low) | Cosmetic or non-customer-impacting issue | Internal tool broken, monitoring alert noise, non-critical cron failure | Next business day | Filed as ticket |

The most common mistake: everything is SEV-1. If every incident is critical, none are. We've seen teams where "the admin dashboard is slow" triggers the same response as "payments are down." That's how alert fatigue starts.
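One way to make the response automatic is to keep the severity table as data that tooling can read. The sketch below is only an illustration of that idea: the names `Severity`, `ResponsePolicy`, `POLICIES`, and `policy_for` are hypothetical, and the values are copied from the table above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Critical
    SEV2 = 2  # Major
    SEV3 = 3  # Minor
    SEV4 = 4  # Low


@dataclass(frozen=True)
class ResponsePolicy:
    response_time: str
    notify: tuple  # roles to notify; empty means "file a ticket only"


# Values mirror the severity table; adjust them to your own agreement.
POLICIES = {
    Severity.SEV1: ResponsePolicy(
        "immediate (24/7)", ("on-call", "engineering lead", "VP Engineering")
    ),
    Severity.SEV2: ResponsePolicy("within 30 minutes", ("on-call", "team lead")),
    Severity.SEV3: ResponsePolicy("within 4 hours (business hours)", ("on-call",)),
    Severity.SEV4: ResponsePolicy("next business day", ()),
}


def policy_for(severity: Severity) -> ResponsePolicy:
    """Look up the agreed response for a severity level."""
    return POLICIES[severity]


if __name__ == "__main__":
    for sev in Severity:
        policy = policy_for(sev)
        who = ", ".join(policy.notify) or "nobody (ticket only)"
        print(f"{sev.name}: respond {policy.response_time}; notify {who}")
```

Keeping the policy in one place means the debate happens in a code review, not in the middle of an outage.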

Incident Roles: Who Does What

Three roles. No more. Assign them in the first 2 minutes.

Incident Commander (IC)

Coordinates the response. Doesn't debug — that's deliberate. The IC's job is to ask questions, track progress, make decisions (escalate? rollback? communicate?), and keep the team focused. Think air traffic controller, not pilot.

Technical Lead

Does the actual debugging and fixing. Communicates findings to the IC. Can delegate to other engineers. Makes technical recommendations; IC makes the final call on actions.

Communications Lead

Updates the status page, writes customer communications, keeps internal stakeholders informed. The IC shouldn't be writing status updates while coordinating the response — that's how things get missed.

For small teams (under 10): The IC and communications lead can be the same person if the incident is SEV-3 or lower. For SEV-1 and SEV-2, always separate these roles even if it means pulling someone off their regular work.

The Incident Process (Step by Step)

Phase 1: Detection (0-5 min)

Alert fires or customer reports issue
On-call acknowledges the alert
Quick triage: What's the impact? What severity level?
Open an incident channel (Slack: #inc-2025-10-13-payment-failure)
Assign IC, Technical Lead, Communications Lead
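Channel naming is easy to get inconsistent under pressure, so it can help to generate the name. This is a minimal sketch that follows the `#inc-YYYY-MM-DD-summary` pattern from the example above; the function name `incident_channel_name` is our own invention, and creating the channel itself is left to whatever chat tooling you use.

```python
import re
from datetime import date


def incident_channel_name(day: date, summary: str) -> str:
    """Build a channel name like #inc-2025-10-13-payment-failure."""
    # Lowercase the summary and collapse anything that is not a letter
    # or digit into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    if not slug:
        raise ValueError("summary must contain at least one letter or digit")
    return f"#inc-{day.isoformat()}-{slug}"


if __name__ == "__main__":
    print(incident_channel_name(date(2025, 10, 13), "Payment failure"))
    # prints: #inc-2025-10-13-payment-failure
```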

Phase 2: Mitigation (5-30 min)

Goal: stop the bleeding. Not root cause — that comes later
Check: was there a recent deployment? → Rollback
Check: is it a capacity issue? → Scale up
Check: is it a dependency? → Circuit breaker / feature flag off
Communications Lead updates the status page with an initial message
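The three checks above are ordered, and that order can be written down as a small decision function. This is a sketch under our own assumptions: the `Signals` fields and the `first_mitigation` name are hypothetical, and falling back to "escalate" when no check matches is our choice, not a rule from the process.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    recent_deploy: bool       # was there a deployment shortly before the alert?
    capacity_saturated: bool  # are CPU, memory, or connections maxed out?
    dependency_failing: bool  # is an upstream or third-party service failing?


def first_mitigation(signals: Signals) -> str:
    """Return the first mitigation to try, in the order of the checklist."""
    if signals.recent_deploy:
        return "rollback"
    if signals.capacity_saturated:
        return "scale up"
    if signals.dependency_failing:
        return "circuit breaker / feature flag off"
    # Assumption: nothing matched, so bring in more people early.
    return "escalate"


if __name__ == "__main__":
    print(first_mitigation(Signals(True, True, False)))  # prints: rollback
```

The point is not automation; it is that the on-call engineer tries the cheapest, most reversible action first.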

Phase 3: Resolution (30 min - hours)

If mitigation worked, investigate root cause while system is stable
If mitigation didn't work, escalate: bring in more engineers, engage vendor support
IC provides updates every 30 minutes (even if update is "still investigating")
Document everything in the incident channel as it happens
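The 30-minute update cadence is the rule that slips first when debugging gets intense. A small reminder check like the sketch below can help; `update_overdue` and `UPDATE_INTERVAL` are hypothetical names, and wiring this into a bot or timer is left out.

```python
from datetime import datetime, timedelta, timezone

UPDATE_INTERVAL = timedelta(minutes=30)


def update_overdue(last_update: datetime, now: datetime) -> bool:
    """True once 30 minutes have passed since the last update."""
    return now - last_update >= UPDATE_INTERVAL


if __name__ == "__main__":
    last = datetime(2025, 10, 13, 14, 30, tzinfo=timezone.utc)
    now = datetime(2025, 10, 13, 15, 5, tzinfo=timezone.utc)
    if update_overdue(last, now):
        print("Update due: post one, even if it is 'still investigating'.")
```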

Phase 4: Recovery (hours - days)

Confirm system is fully operational
Monitor for recurrence (set up temporary alert thresholds)
Update status page: resolved
Schedule post-mortem within 48 hours

Communication Templates

Writing coherent messages during an incident is hard. Templates remove the cognitive load.

Initial Customer Notification

Subject: [Service Name] — Investigating increased error rates

We're currently investigating an issue affecting [specific feature].
Some users may experience [specific symptom: slow loading, errors, inability to...].

Our team is actively working on a resolution. We'll provide an update within 30 minutes.

Status: Investigating
Impact: [High/Medium/Low]
Started: [time] UTC

Resolution Notification

Subject: [Service Name] — Issue resolved

The issue affecting [feature] has been resolved at [time] UTC.

What happened: [1-2 sentence summary]
Duration: [X hours Y minutes]
Impact: [what users experienced]

We're conducting a thorough review to prevent recurrence.
We apologize for the disruption.
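If you keep these templates in code, filling them in becomes a function call and a missing field fails loudly. The sketch below renders the body of the initial notification with Python's standard `string.Template`; the names `INITIAL_NOTICE` and `render_initial_notice` are hypothetical, and the subject line is left out for brevity.

```python
from string import Template

INITIAL_NOTICE = Template(
    "We're currently investigating an issue affecting $feature.\n"
    "Some users may experience $symptom.\n"
    "\n"
    "Our team is actively working on a resolution. "
    "We'll provide an update within 30 minutes.\n"
    "\n"
    "Status: Investigating\n"
    "Impact: $impact\n"
    "Started: $started UTC\n"
)

ALLOWED_IMPACT = {"High", "Medium", "Low"}


def render_initial_notice(feature: str, symptom: str, impact: str, started: str) -> str:
    """Fill in the initial customer notification; reject unknown impact levels."""
    if impact not in ALLOWED_IMPACT:
        raise ValueError(f"impact must be one of {sorted(ALLOWED_IMPACT)}")
    return INITIAL_NOTICE.substitute(
        feature=feature, symptom=symptom, impact=impact, started=started
    )


if __name__ == "__main__":
    print(render_initial_notice("checkout", "errors when paying", "High", "14:23"))
```

`substitute` raises `KeyError` if a placeholder has no value, which is what you want here: a half-filled message should never reach customers.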

On-Call That Doesn't Burn People Out

| Practice | Why It Matters |
|----------|----------------|
| Rotate weekly | Longer rotations lead to burnout. Shorter ones don't give enough context. One week is the sweet spot. |
| Compensate on-call | Extra pay or time off. On-call restricts personal time, and that deserves compensation. ₹5,000-15,000/week is common in India. |
| Follow-the-sun | If your team spans time zones, nobody should be on-call outside business hours. Mumbai hands off to a US-based engineer at night. |
| Reduce alert noise | If on-call gets woken up 3 times a night for non-issues, people will mute their phones. Every false alert erodes trust in alerting. |
| On-call handoff document | Outgoing on-call writes 5 lines: what happened this week, what's flaky, what to watch. Takes 5 minutes, saves hours. |
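A weekly rotation is simple enough to generate instead of maintaining by hand. The following is a sketch only: `weekly_rotation` is a hypothetical helper, the engineer names are placeholders, and real schedules also need swaps, holidays, and follow-the-sun handoffs that this does not model.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def weekly_rotation(engineers, first_day: date, weeks: int):
    """Return (start, end, engineer) tuples, one per on-call week."""
    if not engineers:
        raise ValueError("need at least one engineer")
    schedule = []
    # cycle() repeats the list so the rotation wraps around fairly.
    for i, name in enumerate(islice(cycle(engineers), weeks)):
        start = first_day + timedelta(weeks=i)
        schedule.append((start, start + timedelta(days=6), name))
    return schedule


if __name__ == "__main__":
    for start, end, name in weekly_rotation(["asha", "ben", "chen"], date(2025, 10, 13), 4):
        print(f"{start} to {end}: {name}")
```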

Blameless Post-Mortems

"Blameless" doesn't mean "no accountability." It means focusing on system failures rather than human failures. The developer who deployed the bug isn't the root cause — the system that let a buggy deployment reach production is.

Post-Mortem Template

## Incident: [Title]
**Date:** [date] | **Duration:** [X hours Y min] | **Severity:** SEV-[N]

### Summary
[2-3 sentences: what happened, who was impacted, how it was resolved]

### Timeline (UTC)
- 14:23 — Alert fires: error rate exceeds 5%
- 14:25 — On-call acknowledges, opens incident channel
- 14:30 — IC assigned, begins triage
- 14:42 — Root cause identified: DB migration dropped an index
- 14:55 — Fix deployed: index recreated
- 15:05 — Error rate returns to baseline. Incident resolved.

### Root Cause
[Technical explanation. What broke and why]

### What Went Well
- Alert fired within 2 minutes of impact
- IC was assigned quickly
- Rollback plan was available

### What Went Poorly
- No pre-deploy check for missing indexes
- Status page wasn't updated until 25 min in
- Runbook for DB issues was outdated

### Action Items
| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add index validation to migration CI | @alice | Nov 20 | P1 |
| Update DB incident runbook | @bob | Nov 22 | P2 |
| Automate status page updates | @carol | Dec 5 | P2 |

The most important part is the action items. A post-mortem without action items is just a story. Every action item needs an owner and a due date. Review open action items weekly until they're done.
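The weekly review can start from a simple check that flags items with no owner, no due date, or a due date in the past. This is a minimal sketch assuming action items are tracked as plain records; the `ActionItem` class, the `review` function, and the dates in the example are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    action: str
    owner: Optional[str]
    due: Optional[date]
    done: bool = False


def review(items: List[ActionItem], today: date) -> List[str]:
    """List problems with open action items: no owner, no due date, overdue."""
    problems = []
    for item in items:
        if item.done:
            continue
        if not item.owner:
            problems.append(f"{item.action}: no owner")
        if item.due is None:
            problems.append(f"{item.action}: no due date")
        elif item.due < today:
            problems.append(f"{item.action}: overdue since {item.due.isoformat()}")
    return problems


if __name__ == "__main__":
    items = [
        ActionItem("Add index validation to migration CI", "@alice", date(2025, 11, 20)),
        ActionItem("Update DB incident runbook", None, None),
    ]
    for problem in review(items, today=date(2025, 11, 21)):
        print(problem)
```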

Frequently Asked Questions

When should we page someone versus creating a ticket?

Page for customer-impacting issues (SEV-1/2). Create tickets for everything else. If you're unsure, err on the side of paging — a 30-second acknowledgment costs less than an hour of undetected downtime. Review your paging threshold quarterly and adjust based on false positive rates.

How do we prevent alert fatigue?

Every alert that fires must be actionable. If on-call looks at an alert and says "I'll check tomorrow," that alert shouldn't page. Review all alerts monthly: delete noisy ones, tighten thresholds, add context. Target: under 5 pages per on-call week. Above that, fix the alert quality before the system.
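To check the "under 5 pages per on-call week" target during the monthly review, count pages per week from your paging history. The sketch below assumes you can export the date of each page; `weeks_at_or_over_target` is a hypothetical helper, and it groups by ISO calendar week, which may not match your rotation boundaries exactly.

```python
from collections import Counter
from datetime import date

PAGE_TARGET_PER_WEEK = 5  # target from this article: stay under 5 pages


def weeks_at_or_over_target(page_dates, target: int = PAGE_TARGET_PER_WEEK):
    """Map (ISO year, ISO week) to page count for weeks that miss the target."""
    per_week = Counter(tuple(d.isocalendar())[:2] for d in page_dates)
    return {week: count for week, count in per_week.items() if count >= target}


if __name__ == "__main__":
    pages = [date(2025, 10, 13)] * 3 + [date(2025, 10, 15)] * 2 + [date(2025, 10, 21)]
    for (year, week), count in weeks_at_or_over_target(pages).items():
        print(f"{year} week {week}: {count} pages, review alert quality")
```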

Should post-mortems be public or internal?

Internal post-mortems always. Public post-mortems for significant incidents (30+ minutes downtime) if your users expect transparency. Companies like Cloudflare and GitLab publish excellent public post-mortems that build trust. Sanitize any sensitive technical details before publishing.

What tools do we need for incident management?

At minimum: PagerDuty or Opsgenie for alerting/on-call, Slack or Teams for incident communication, and a status page (Statuspage.io, Instatus, or a simple self-hosted page). You don't need a dedicated incident management platform until you're handling 10+ incidents per month.

Pillai Infotech Engineering Team

We've managed incidents for payment systems, healthcare platforms, and high-traffic SaaS products. Our process was forged in real incidents — every improvement came from something that went wrong.
