Debugging Production Issues Guide

Every developer has a production debugging horror story. Ours involved a payment service that silently dropped transactions every Tuesday between 2 and 3 AM. It took us two weeks to find the root cause: the combination of a cron job that compacted the database at 2 AM and a connection pool that didn't handle temporary connection resets. The symptoms pointed everywhere except the actual problem.

We learned a lot from that incident — mostly that our debugging process was ad hoc. Since then, we've developed a systematic approach that's cut our mean time to resolution (MTTR) by 60%.

What We'll Cover

Triage: The First 5 Minutes
Observation: The Four Golden Signals
Log Analysis That Actually Works
Distributed Tracing
Reproducing Production Issues
Root Cause Analysis
Prevention: Making the Same Bug Impossible
FAQ

Triage: The First 5 Minutes

When a production issue hits, the natural instinct is to start reading code. Resist that urge. The first 5 minutes should be pure observation and triage.

The Triage Checklist

What's the impact? How many users are affected? Is it total downtime, degraded performance, or a specific feature? This determines urgency.
When did it start? Check monitoring dashboards. The exact timestamp narrows the search enormously.
What changed? Was there a deployment? A config change? A traffic spike? A dependency update? A database migration? Check the deployment log first: it's the cause 40% of the time.
Is it getting worse? A memory leak gets worse over time. A bad config change is constant. A traffic spike might resolve itself. This determines whether to act immediately or investigate carefully.
Can we mitigate immediately? Rollback, feature flag off, scale up, redirect traffic. Mitigate first, investigate second. Your users don't care about root cause while they're down.
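
The "What changed?" step can be partly automated. Here is a minimal sketch, assuming a hypothetical change log held in memory. The `ChangeEvent` record, the `recent_changes` helper, and the sample entries are all made up for illustration and are not part of any real tool. It lists every change that landed in the window leading up to the incident, newest first:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """One entry in a hypothetical change log (deploy, config change, migration)."""
    kind: str
    description: str
    at: datetime


def recent_changes(events, incident_start, window=timedelta(hours=2)):
    """Return changes that landed within `window` before the incident, newest first."""
    earliest = incident_start - window
    candidates = [e for e in events if earliest <= e.at <= incident_start]
    return sorted(candidates, key=lambda e: e.at, reverse=True)


# Made-up sample data for illustration only.
events = [
    ChangeEvent("deploy", "payment-service release", datetime(2025, 10, 13, 13, 50)),
    ChangeEvent("config", "connection pool size change", datetime(2025, 10, 13, 9, 15)),
    ChangeEvent("migration", "add index on payments", datetime(2025, 10, 13, 14, 5)),
]

suspects = recent_changes(events, incident_start=datetime(2025, 10, 13, 14, 23))
for event in suspects:
    print(event.at.isoformat(), event.kind, event.description)
```

The two-hour window is an arbitrary default. Slow-burning problems such as memory leaks may need a much wider one.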

The #1 triage mistake: Jumping to conclusions. "It's probably the database" → spend 2 hours investigating the database → it's actually a DNS issue. Follow the data, not your gut.

Observation: The Four Golden Signals

Google's SRE book defines four golden signals that cover most of what you need to know about the health of a user-facing system. Check them in this order:

Signal	What to Check	Common Findings
Latency	p50, p95, p99 response times. Separate by endpoint	Spike in p99 but p50 is fine = one slow dependency. All percentiles high = systemic issue
Traffic	Request rate (RPS). Compare to normal patterns	Sudden spike = possible DDoS or viral content. Sudden drop = users can't reach you
Errors	Error rate by status code. 5xx = server, 4xx = client/input	New 500s after deploy = code bug. Intermittent 503s = resource exhaustion
Saturation	CPU, memory, disk, connection pool usage	CPU at 95% = need to scale or optimize. Memory climbing = leak. DB connections maxed = pool exhaustion
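
The latency row can be turned into a quick check. This is a rough sketch using only Python's standard library. The `classify` heuristic, its `factor` threshold, and the sample numbers are our own assumptions for illustration, not a standard rule:

```python
import statistics


def latency_percentiles(samples_ms):
    """Return (p50, p95, p99) for a list of response times in milliseconds."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1 through 99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94], cuts[98]


def classify(baseline, current, factor=2.0):
    """Compare current percentiles to a baseline, following the table's rule of thumb."""
    p50_up = current[0] > baseline[0] * factor
    p99_up = current[2] > baseline[2] * factor
    if p50_up and p99_up:
        return "systemic"   # all percentiles high
    if p99_up:
        return "tail-only"  # p99 spiked while p50 is fine: suspect one slow dependency
    return "normal"


# Made-up samples: a healthy baseline, a tail spike, and an across-the-board slowdown.
baseline = latency_percentiles([100] * 100)
tail_spike = latency_percentiles([100] * 95 + [3000] * 5)
all_slow = latency_percentiles([1000] * 100)

print(classify(baseline, tail_spike))  # tail-only
print(classify(baseline, all_slow))    # systemic
```

In practice the percentiles would come from your metrics backend, split by endpoint, rather than from raw samples in memory.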

Log Analysis That Actually Works

Most production debugging starts and ends with logs. But reading logs effectively is a skill. Here's our approach.

Structured Logging Is Non-Negotiable

// BAD: Unstructured log
"Payment failed for user john@example.com amount 299.99 error timeout"

// GOOD: Structured log (JSON)
{
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-service",
  "user_id": "usr_abc123",
  "amount": 299.99,
  "currency": "INR",
  "payment_provider": "razorpay",
  "error_type": "timeout",
  "error_message": "Connection timed out after 30s",
  "request_id": "req_xyz789",
  "trace_id": "trace_def456",
  "duration_ms": 30042,
  "timestamp": "2025-10-13T14:23:45.123Z"
}

The structured version lets you query: "Show me all payment failures for Razorpay in the last hour grouped by error_type." The unstructured version requires regex and prayer.
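
As an illustration, here is a minimal sketch of both halves using only Python's standard library: emitting JSON log lines and running the "group by error_type" query over them. The `JsonFormatter` class, the `fields` convention for extra attributes, and `failures_by_error_type` are hypothetical names chosen for this example. In production you would more likely use an existing JSON logging library and run the query in your log platform:

```python
import io
import json
import logging
from collections import Counter


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": record.name,
        }
        # Structured fields arrive via extra={"fields": {...}} (our own convention).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def failures_by_error_type(lines, provider):
    """Count error-level entries for one payment provider, grouped by error_type."""
    counts = Counter()
    for line in lines:
        entry = json.loads(line)
        if entry.get("level") == "error" and entry.get("payment_provider") == provider:
            counts[entry.get("error_type", "unknown")] += 1
    return counts


# Capture log output in memory so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.propagate = False  # keep the demo records away from the root logger

for error_type in ("timeout", "timeout", "card_declined"):
    logger.error(
        "Payment processing failed",
        extra={"fields": {"payment_provider": "razorpay", "error_type": error_type}},
    )

counts = failures_by_error_type(stream.getvalue().splitlines(), "razorpay")
print(counts)
```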

Log Analysis Workflow

Start with the error. Filter for ERROR/FATAL level in the timeframe. What errors are new or increased?
Trace the request. Using the request_id or trace_id, follow one failing request through every service it touched. Where did it break?
Look upstream. The error in Service C might be caused by Service B returning garbage because Service A timed out. Follow the chain.
Compare to baseline. What do the same logs look like for a successful request? The diff between success and failure logs reveals the issue.
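
The "trace the request" and "look upstream" steps can be sketched as follows, assuming JSON log lines that carry `request_id`, `timestamp`, `service`, and `level` fields as in the structured example earlier. The helper names and the sample entries are made up for illustration:

```python
import json


def trace_request(log_lines, request_id):
    """Return one request's log entries across all services, in time order."""
    entries = [json.loads(line) for line in log_lines]
    matching = [e for e in entries if e.get("request_id") == request_id]
    # ISO 8601 UTC timestamps in the same format sort correctly as plain strings.
    return sorted(matching, key=lambda e: e["timestamp"])


def first_failure(entries):
    """Return the earliest error-level entry, or None if the request never failed."""
    return next((e for e in entries if e["level"] == "error"), None)


# Made-up entries, deliberately out of order, as they might arrive from three services.
keys = ("timestamp", "service", "level", "request_id", "message")
raw = [
    ("2025-10-13T14:23:45.900Z", "service-c", "error", "req_xyz789", "invalid response from service-b"),
    ("2025-10-13T14:23:45.100Z", "service-a", "error", "req_xyz789", "upstream call timed out"),
    ("2025-10-13T14:23:45.500Z", "service-b", "warn", "req_xyz789", "returning partial payload"),
    ("2025-10-13T14:23:45.300Z", "service-a", "info", "req_other", "ok"),
]
log_lines = [json.dumps(dict(zip(keys, row))) for row in raw]

timeline = trace_request(log_lines, "req_xyz789")
origin = first_failure(timeline)
print(origin["service"], origin["message"])
```

The loudest error here is in service-c, but sorting by time shows the chain actually started in service-a.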

Distributed Tracing: Following Requests Across Services

In a microservices architecture, a single user request might touch 5-10 services. Without distributed tracing, debugging is like solving a murder mystery where the witnesses are in different countries and speak different languages.

Tools We Use

Tool	Best For	Cost
Jaeger	Open source, self-hosted. Good UI for trace visualization	Free (infra costs only)
Grafana Tempo	If you already use Grafana stack. Integrates with Loki and Prometheus	Free (OSS) or Grafana Cloud pricing
Datadog APM	Best overall UX. Correlates traces with logs and metrics automatically	$31/host/month (adds up fast)
OpenTelemetry	Vendor-neutral instrumentation. Send data to any backend	Free (library). Backend costs vary

Our recommendation: instrument with OpenTelemetry (vendor-neutral), send to Grafana Tempo (free) or Jaeger. Switch backends later without re-instrumenting.
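
To show what actually ties spans together across services, here is a small standard-library sketch of the W3C Trace Context `traceparent` header, which OpenTelemetry SDKs use for propagation by default. This is not the OpenTelemetry API, and the two helper names are hypothetical. In a real service the SDK generates and forwards this header for you:

```python
import secrets


def new_traceparent():
    """Start a new trace: version 00, 16-byte trace ID, 8-byte span ID, sampled flag."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent):
    """Build the header sent downstream: same trace ID, fresh span ID, same flags."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()            # created at the edge, e.g. an API gateway
outgoing = child_traceparent(incoming)  # attached to the call to the next service

print(incoming)
print(outgoing)
```

Every service logs and reports spans under the shared trace ID, which is what lets the backend stitch a single request back together.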

Reproducing Production Issues Locally

The hardest part of debugging is often reproducing the issue. Here are patterns we use.

Techniques by Bug Type

Data-dependent bugs: Pull a sanitized snapshot of the production data that triggers the issue. Docker + pg_dump makes this straightforward for database-driven bugs.
Timing-dependent bugs: Use chaos engineering tools (Toxiproxy) to inject latency and simulate the conditions. If the bug happens under load, use k6 to generate traffic locally.
Configuration-dependent bugs: Diff production config against local. Environment variables, feature flags, and connection pool settings are often the culprit when "it works on my machine."
Concurrency bugs: Write a test that spawns multiple goroutines or threads hitting the same endpoint simultaneously. Race conditions often need 100+ concurrent requests to manifest. Go's race detector (-race flag) catches many of these.
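
For the configuration case, the diff itself is simple to script. A minimal sketch, assuming both configs have already been loaded into flat dictionaries. The `diff_config` helper and the setting names are hypothetical:

```python
def diff_config(prod, local):
    """Return {key: (prod_value, local_value)} for every setting that differs."""
    missing = object()  # sentinel, so a key set to None differs from an absent key
    diff = {}
    for key in sorted(prod.keys() | local.keys()):
        prod_value = prod.get(key, missing)
        local_value = local.get(key, missing)
        if prod_value != local_value:
            diff[key] = (
                "<unset>" if prod_value is missing else prod_value,
                "<unset>" if local_value is missing else local_value,
            )
    return diff


# Made-up settings for illustration.
prod = {"DB_POOL_SIZE": "20", "DB_POOL_TIMEOUT_S": "5", "FEATURE_NEW_CHECKOUT": "on"}
local = {"DB_POOL_SIZE": "5", "DB_POOL_TIMEOUT_S": "5"}

for key, (prod_value, local_value) in diff_config(prod, local).items():
    print(f"{key}: prod={prod_value} local={local_value}")
```

Redact secrets before printing or sharing the output.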

Root Cause Analysis: The 5 Whys

After fixing the immediate issue, do a proper root cause analysis. We use the "5 Whys" framework — not because you always need exactly 5, but because people tend to stop at the first "why" and miss the systemic issue.

Example from our payment service incident:

Why did payments fail? → Database connections timed out
Why did connections time out? → Connection pool was exhausted
Why was the pool exhausted? → The database was slow during the cron job window
Why didn't the pool recover? → The pool's recovery logic didn't handle temporary connection resets
Why wasn't this caught? → No monitoring on connection pool utilization, and no integration test for the cron + traffic combination

The fix wasn't just "add connection retry logic." It was: fix the pool recovery (immediate), add connection pool monitoring (short-term), run the cron job during a dedicated maintenance window (medium-term), and add chaos tests for database connection failures (long-term).
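
To make "fix the pool recovery" concrete, here is one common approach, sketched under assumptions: validate each connection on checkout and replace it if it has been reset. This is an illustration, not the exact change we shipped. `FakeConnection` and `RecoveringPool` are hypothetical stand-ins, and a real pool would also need locking, timeouts, and size limits:

```python
from collections import deque


class FakeConnection:
    """Stand-in for a database connection; `alive` flips to False after a reset."""

    def __init__(self):
        self.alive = True

    def ping(self):
        if not self.alive:
            raise ConnectionResetError("connection reset by peer")


class RecoveringPool:
    """Tiny pool that replaces dead connections on checkout instead of handing them out."""

    def __init__(self, connect, size):
        self._connect = connect
        self._idle = deque(connect() for _ in range(size))
        self.replaced = 0  # worth exporting as a metric

    def acquire(self):
        # Raises IndexError when empty; a real pool would block or time out instead.
        conn = self._idle.popleft()
        try:
            conn.ping()
        except ConnectionResetError:
            conn = self._connect()  # drop the dead connection and open a fresh one
            self.replaced += 1
        return conn

    def release(self, conn):
        self._idle.append(conn)


pool = RecoveringPool(FakeConnection, size=2)

# Simulate the database resetting every idle connection during the cron window.
for idle_conn in pool._idle:
    idle_conn.alive = False

conn = pool.acquire()
print(conn.alive, pool.replaced)
pool.release(conn)
```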

Prevention: Making the Same Bug Impossible

After Every Incident	Action	Example
Write a regression test	Automate the exact scenario that broke	Integration test: payment request during simulated DB slowness
Add monitoring	Alert on the signal that would have caught this earlier	Alert when connection pool utilization exceeds 80%
Improve logging	Add the log line that would have made debugging faster	Log connection pool stats every 30 seconds
Update runbooks	Document the debugging steps for future on-call	"If payment errors spike, check connection pool dashboard first"
Share the post-mortem	Blameless write-up shared with the team	Timeline, root cause, action items, what we learned
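
The monitoring row's example can be sketched as a simple check. The 80% threshold comes from the table. Requiring several consecutive samples above it is our own assumption, added to avoid alerting on a single brief spike, and the function names are hypothetical:

```python
def pool_utilization(in_use, max_size):
    """Fraction of the connection pool currently checked out."""
    return in_use / max_size


def should_alert(samples, threshold=0.80, sustained=3):
    """Alert only when the last `sustained` samples all exceed the threshold."""
    recent = samples[-sustained:]
    return len(recent) == sustained and all(u > threshold for u in recent)


# Made-up readings from a pool of 20 connections, sampled every 30 seconds.
readings = [pool_utilization(n, 20) for n in (10, 17, 18, 19)]
print(should_alert(readings))
```

Most monitoring systems express the same rule declaratively, for example as a threshold alert with a "for" duration.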

Frequently Asked Questions

Should we always roll back when a production issue occurs?

If the issue started with a recent deployment, roll back first and investigate second. Speed of mitigation matters more than understanding the root cause in the moment. If the issue isn't deployment-related, a rollback won't help; focus on identifying the actual trigger.

How do we debug issues that only happen under high load?

Use a staging environment with production-like data and run load tests (k6, Locust). Profile under load — connection pools, thread counts, memory allocation, garbage collection pauses all behave differently under pressure. If the issue only reproduces in production, add targeted logging and wait for it to recur.

What's a good MTTR target?

For critical issues (total outage): under 30 minutes for mitigation, under 2 hours for root cause. For degraded performance: under 1 hour for mitigation. DORA elite performers have MTTR under 1 hour. More important than the number is the trend — MTTR should decrease over time as your observability and runbooks improve.
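
Tracking the trend only takes a little arithmetic. A minimal sketch, assuming you record time-to-resolution in minutes for each incident. The figures below are invented for illustration:

```python
from statistics import mean


def mttr_minutes(durations_min):
    """Mean time to resolution over a list of per-incident durations in minutes."""
    return mean(durations_min)


def percent_change(before, after):
    """Relative change from `before` to `after`; negative means MTTR improved."""
    return (after - before) / before * 100


# Made-up incident durations for two consecutive quarters.
last_quarter = [95, 40, 105]
this_quarter = [30, 75, 75]

before = mttr_minutes(last_quarter)  # 80
after = mttr_minutes(this_quarter)   # 60
print(f"MTTR changed by {percent_change(before, after):.0f}%")  # MTTR changed by -25%
```

With only a handful of incidents per period, one long outage can dominate the mean, so it helps to look at the individual durations as well.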

How do we do blameless post-mortems?

Focus on systems, not people. Instead of "Alice deployed broken code," write "the deployment pipeline didn't catch the regression because integration tests were skipped for performance." The question is always "how did our system allow this to happen?" not "who made the mistake?" People make mistakes; systems should prevent them from reaching users.

Pillai Infotech Engineering Team

We've debugged production issues across payment systems, healthcare platforms, and high-traffic SaaS applications. The systematic approach described here is what we actually use — born from incidents, not textbooks.
