Every developer has a production debugging horror story. Ours involved a payment service that silently dropped transactions every Tuesday between 2 and 3 AM. It took us two weeks to find the root cause: a cron job that compacted the database ran at 2 AM, and our connection pool didn't handle the temporary connection resets that followed. The symptoms pointed everywhere except the actual problem.
We learned a lot from that incident — mostly that our debugging process was ad hoc. Since then, we've developed a systematic approach that's cut our mean time to resolution (MTTR) by 60%.
Triage: The First 5 Minutes
When a production issue hits, the natural instinct is to start reading code. Resist that urge. The first 5 minutes should be pure observation and triage.
The Triage Checklist
- What's the impact? How many users are affected? Is it total downtime, degraded performance, or a specific feature? This determines urgency
- When did it start? Check monitoring dashboards. The exact timestamp narrows the search enormously
- What changed? Was there a deployment? A config change? A traffic spike? A dependency update? A database migration? Check the deployment log first — it's the cause 40% of the time
- Is it getting worse? A memory leak gets worse over time. A bad config change is constant. A traffic spike might resolve itself. This determines whether to act immediately or investigate carefully
- Can we mitigate immediately? Rollback, feature flag off, scale up, redirect traffic. Mitigate first, investigate second. Your users don't care about root cause while they're down
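The "What changed?" step is easy to automate. Here is a minimal sketch, assuming you can export deployments and config changes as timestamped records. The record shape, the two-hour lookback, and the sample entries are all hypothetical, not something our tooling prescribes.

```python
from datetime import datetime, timedelta

def recent_changes(changes, incident_start, lookback=timedelta(hours=2)):
    """Return changes that landed in the lookback window before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [c for c in changes if window_start <= c["at"] <= incident_start]
    return sorted(hits, key=lambda c: c["at"], reverse=True)

# Hypothetical change log; in practice this would come from your deploy tool.
CHANGES = [
    {"at": datetime(2025, 10, 13, 9, 0), "kind": "deploy", "what": "payment-service v1.4.2"},
    {"at": datetime(2025, 10, 13, 13, 50), "kind": "config", "what": "pool_size 20 -> 10"},
    {"at": datetime(2025, 10, 13, 14, 10), "kind": "deploy", "what": "payment-service v1.4.3"},
]

suspects = recent_changes(CHANGES, incident_start=datetime(2025, 10, 13, 14, 23))
for change in suspects:
    print(change["at"].isoformat(), change["kind"], change["what"])
```

The newest change is listed first because it is usually the cheapest one to roll back and rule out.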
Observation: The Four Golden Signals
Google's SRE book defines four golden signals (latency, traffic, errors, and saturation) that together give a fast read on system health. Check these in order:
| Signal | What to Check | Common Findings |
|---|---|---|
| Latency | p50, p95, p99 response times. Separate by endpoint | Spike in p99 but p50 is fine = one slow dependency. All percentiles high = systemic issue |
| Traffic | Request rate (RPS). Compare to normal patterns | Sudden spike = possible DDoS or viral content. Sudden drop = users can't reach you |
| Errors | Error rate by status code. 5xx = server, 4xx = client/input | New 500s after deploy = code bug. Intermittent 503s = resource exhaustion |
| Saturation | CPU, memory, disk, connection pool usage | CPU at 95% = need to scale or optimize. Memory climbing = leak. DB connections maxed = pool exhaustion |
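The latency row can be turned into a quick check. This is a sketch under assumptions: raw latency samples are available as a list, and "high" means more than twice the baseline. That 2x factor is an arbitrary choice for illustration, not a standard.

```python
import statistics

def percentiles(latencies_ms):
    """Return p50, p95, and p99 for a list of latency samples."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def classify(current, baseline, factor=2.0):
    """Apply the rule of thumb from the table above to two percentile snapshots."""
    p50_high = current["p50"] > factor * baseline["p50"]
    p99_high = current["p99"] > factor * baseline["p99"]
    if p99_high and not p50_high:
        return "tail latency: suspect one slow dependency"
    if p99_high and p50_high:
        return "systemic: all percentiles elevated"
    return "within baseline"

baseline = percentiles([100] * 100)
slow_tail = percentiles([100] * 97 + [3000] * 3)  # a few very slow requests
everything_slow = percentiles([500] * 100)

print(classify(slow_tail, baseline))
print(classify(everything_slow, baseline))
```

In practice your metrics backend computes percentiles for you; the point is to compare the median and the tail against a baseline instead of eyeballing a single number.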
Log Analysis That Actually Works
Most production debugging starts and ends with logs. But reading logs effectively is a skill. Here's our approach.
Structured Logging Is Non-Negotiable
// BAD: Unstructured log
"Payment failed for user john@example.com amount 299.99 error timeout"
// GOOD: Structured log (JSON)
{
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-service",
  "user_id": "usr_abc123",
  "amount": 299.99,
  "currency": "INR",
  "payment_provider": "razorpay",
  "error_type": "timeout",
  "error_message": "Connection timed out after 30s",
  "request_id": "req_xyz789",
  "trace_id": "trace_def456",
  "duration_ms": 30042,
  "timestamp": "2025-10-13T14:23:45.123Z"
}
The structured version lets you query: "Show me all payment failures for Razorpay in the last hour grouped by error_type." The unstructured version requires regex and prayer.
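To make that query concrete, here is a rough sketch that runs it over JSON-lines logs with only the standard library. The sample lines are made up (including the "stripe" entry), and the time filter is left out for brevity. A log platform would do this with its own query language.

```python
import json
from collections import Counter

# Hypothetical log lines; the last one is an unstructured straggler.
LOG_LINES = [
    '{"level": "error", "payment_provider": "razorpay", "error_type": "timeout"}',
    '{"level": "error", "payment_provider": "razorpay", "error_type": "declined"}',
    '{"level": "info", "payment_provider": "razorpay", "message": "Payment succeeded"}',
    '{"level": "error", "payment_provider": "stripe", "error_type": "timeout"}',
    '{"level": "error", "payment_provider": "razorpay", "error_type": "timeout"}',
    'Payment failed for user john@example.com amount 299.99 error timeout',
]

def failures_by_error_type(lines, provider):
    """Count error-level events for one provider, grouped by error_type."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured line: cannot be queried by field
        if not isinstance(event, dict):
            continue
        if event.get("level") == "error" and event.get("payment_provider") == provider:
            counts[event.get("error_type", "unknown")] += 1
    return counts

print(failures_by_error_type(LOG_LINES, "razorpay"))
```

Note that the unstructured line is silently skipped: it carries the same information, but nothing can filter or group on it.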
Log Analysis Workflow
- Start with the error. Filter for ERROR/FATAL level in the timeframe. What errors are new or increased?
- Trace the request. Using the request_id or trace_id, follow one failing request through every service it touched. Where did it break?
- Look upstream. The error in Service C might be caused by Service B returning garbage because Service A timed out. Follow the chain
- Compare to baseline. What do the same logs look like for a successful request? The diff between success and failure logs reveals the issue
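Steps 2 and 4 of the workflow above can be sketched in a few lines. This assumes every event carries request_id, service, message, and an ISO-8601 UTC timestamp, as in the structured example earlier. The events and message texts below are invented for illustration.

```python
from itertools import zip_longest

def trace_request(events, request_id):
    """All events for one request, in time order."""
    matching = (e for e in events if e["request_id"] == request_id)
    return sorted(matching, key=lambda e: e["timestamp"])

def steps(events):
    """Reduce events to comparable (service, message) pairs."""
    return [(e["service"], e["message"]) for e in events]

def first_divergence(ok_steps, bad_steps):
    """Return (index, ok_step, bad_step) where the two requests first differ, else None."""
    for i, (ok, bad) in enumerate(zip_longest(ok_steps, bad_steps)):
        if ok != bad:
            return i, ok, bad
    return None

# Hypothetical events for one successful and one failing request.
EVENTS = [
    {"request_id": "req_ok", "timestamp": "2025-10-13T14:23:01.000Z", "service": "api-gateway", "message": "request received"},
    {"request_id": "req_bad", "timestamp": "2025-10-13T14:23:02.000Z", "service": "api-gateway", "message": "request received"},
    {"request_id": "req_ok", "timestamp": "2025-10-13T14:23:01.100Z", "service": "payment-service", "message": "charge started"},
    {"request_id": "req_bad", "timestamp": "2025-10-13T14:23:02.100Z", "service": "payment-service", "message": "charge started"},
    {"request_id": "req_ok", "timestamp": "2025-10-13T14:23:01.200Z", "service": "payment-service", "message": "db connection acquired"},
    {"request_id": "req_bad", "timestamp": "2025-10-13T14:23:32.100Z", "service": "payment-service", "message": "db connection timeout"},
    {"request_id": "req_ok", "timestamp": "2025-10-13T14:23:01.400Z", "service": "payment-service", "message": "charge completed"},
]

divergence = first_divergence(
    steps(trace_request(EVENTS, "req_ok")),
    steps(trace_request(EVENTS, "req_bad")),
)
print(divergence)
```

The first step where the two requests differ is usually the right place to start reading code.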
Distributed Tracing: Following Requests Across Services
In a microservices architecture, a single user request might touch 5-10 services. Without distributed tracing, debugging is like solving a murder mystery where the witnesses are in different countries and speak different languages.
Tools We Use
| Tool | Best For | Cost |
|---|---|---|
| Jaeger | Open source, self-hosted. Good UI for trace visualization | Free (infra costs only) |
| Grafana Tempo | If you already use Grafana stack. Integrates with Loki and Prometheus | Free (OSS) or Grafana Cloud pricing |
| Datadog APM | Best overall UX. Correlates traces with logs and metrics automatically | $31/host/month (adds up fast) |
| OpenTelemetry | Vendor-neutral instrumentation. Send data to any backend | Free (library). Backend costs vary |
Our recommendation: instrument with OpenTelemetry (vendor-neutral), send to Grafana Tempo (free) or Jaeger. Switch backends later without re-instrumenting.
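Whichever backend you pick, the mechanism underneath is the same: every service forwards a shared trace ID and adds its own span ID. OpenTelemetry's default propagation uses the W3C Trace Context traceparent header for this. The sketch below imitates that header with the standard library only. It is not OpenTelemetry SDK code, and the function names are ours.

```python
import secrets

def new_traceparent():
    """Start a trace: version 00, 16-byte trace ID, 8-byte span ID, sampled flag 01."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"

def child_traceparent(incoming):
    """Continue a trace: keep the trace ID, mint a new span ID for this hop."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

# Simulate one request passing through three services.
at_gateway = new_traceparent()
at_payments = child_traceparent(at_gateway)
at_ledger = child_traceparent(at_payments)

for name, header in [("gateway", at_gateway), ("payments", at_payments), ("ledger", at_ledger)]:
    print(f"{name:9s} traceparent: {header}")
```

All three headers share the same middle trace ID, which is what lets a tracing backend stitch the hops back into one timeline.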
Reproducing Production Issues Locally
The hardest part of debugging is often reproducing the issue. Here are patterns we use.
Techniques by Bug Type
- Data-dependent bugs: Pull a sanitized snapshot of the production data that triggers the issue. Docker + pg_dump makes this straightforward for database-driven bugs
- Timing-dependent bugs: Use chaos engineering tools (Toxiproxy) to inject latency and simulate the conditions. If the bug happens under load, use k6 to generate traffic locally
- Configuration-dependent bugs: Diff production config against local. Environment variables, feature flags, connection pool settings — these are often the culprit when "it works on my machine"
- Concurrency bugs: Write a test that spawns multiple goroutines/threads hitting the same endpoint simultaneously. Race conditions often need 100+ concurrent requests to manifest. Go's race detector (the -race flag) catches many of these
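For the concurrency case, a harness along these lines is often enough to make a check-then-act race show up. It is a sketch using Python threads, not Go's race detector. The Inventory class is a hypothetical stand-in for your endpoint, and the sleep exists only to widen the race window.

```python
import threading
import time

class Inventory:
    """Hypothetical shared resource with a check-then-act race."""

    def __init__(self, stock, lock=None):
        self.stock = stock
        self.lock = lock

    def _reserve(self):
        if self.stock > 0:       # check
            time.sleep(0.001)    # widen the window between check and write
            self.stock -= 1      # act
            return True
        return False

    def reserve(self):
        if self.lock is None:
            return self._reserve()
        with self.lock:          # serialize check and write together
            return self._reserve()

def hammer(fn, workers=100):
    """Call fn from `workers` threads released at the same moment; return the results."""
    barrier = threading.Barrier(workers)
    results = [None] * workers

    def run(i):
        barrier.wait()           # hold every thread until all are ready
        results[i] = fn()

    threads = [threading.Thread(target=run, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

unsafe = Inventory(stock=10)
safe = Inventory(stock=10, lock=threading.Lock())
unsafe_granted = sum(hammer(unsafe.reserve))
safe_granted = sum(hammer(safe.reserve))
print("unsafe granted:", unsafe_granted, "(only 10 were in stock)")
print("safe granted:  ", safe_granted)
```

The barrier is what matters here: without it, threads start one after another and the overlap you are trying to provoke may never happen.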
Root Cause Analysis: The 5 Whys
After fixing the immediate issue, do a proper root cause analysis. We use the "5 Whys" framework — not because you always need exactly 5, but because people tend to stop at the first "why" and miss the systemic issue.
Example from our payment service incident:
- Why did payments fail? → Database connections timed out
- Why did connections time out? → Connection pool was exhausted
- Why was the pool exhausted? → The database was slow during the cron job window
- Why didn't the pool recover? → The pool's recovery logic didn't handle temporary connection resets
- Why wasn't this caught? → No monitoring on connection pool utilization, and no integration test for the cron + traffic combination
The fix wasn't just "add connection retry logic." It was: fix the pool recovery (immediate), add connection pool monitoring (short-term), run the cron job during a dedicated maintenance window (medium-term), and add chaos tests for database connection failures (long-term).
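To give a feel for the "fix the pool recovery" item, here is a simplified sketch of retrying on a fresh connection after a reset. It is not our actual pool code. The with_reconnect helper, the FlakyDB stand-in, and the retry and backoff values are all assumptions for illustration.

```python
import time

def with_reconnect(operation, connect, retries=3, base_delay=0.05, sleep=time.sleep):
    """Run operation(conn); on a connection reset, retry on a fresh connection."""
    conn = connect()
    for attempt in range(retries + 1):
        try:
            return operation(conn)
        except ConnectionResetError:
            if attempt == retries:
                raise                         # out of retries: surface the error
            sleep(base_delay * 2 ** attempt)  # exponential backoff
            conn = connect()                  # never reuse the reset connection

class FlakyDB:
    """Hypothetical database that resets the first few operations."""

    def __init__(self, resets):
        self.resets = resets
        self.connects = 0

    def connect(self):
        self.connects += 1
        return self.connects  # stand-in for a connection handle

    def charge(self, conn):
        if self.resets > 0:
            self.resets -= 1
            raise ConnectionResetError("connection reset by peer")
        return f"charged on connection {conn}"

db = FlakyDB(resets=2)
result = with_reconnect(db.charge, db.connect, sleep=lambda seconds: None)
print(result, "after", db.connects, "connects")
```

Only retry operations that are safe to repeat. For payments, that means pairing a retry like this with an idempotency key so a reset cannot turn into a double charge.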
Prevention: Making the Same Bug Impossible
| After Every Incident | Action | Example |
|---|---|---|
| Write a regression test | Automate the exact scenario that broke | Integration test: payment request during simulated DB slowness |
| Add monitoring | Alert on the signal that would have caught this earlier | Alert when connection pool utilization exceeds 80% |
| Improve logging | Add the log line that would have made debugging faster | Log connection pool stats every 30 seconds |
| Update runbooks | Document the debugging steps for future on-call | "If payment errors spike, check connection pool dashboard first" |
| Share the post-mortem | Blameless write-up shared with the team | Timeline, root cause, action items, what we learned |
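The monitoring and logging rows from the table can share one small check. This sketch assumes your pool exposes in-use, maximum, and waiting counts. The PoolStats fields are hypothetical, and the 80% threshold is the one from the example column.

```python
from dataclasses import dataclass

@dataclass
class PoolStats:
    """Hypothetical snapshot of a connection pool."""
    in_use: int
    max_size: int
    waiting: int = 0

    @property
    def utilization(self):
        return self.in_use / self.max_size

def check_pool(stats, threshold=0.80):
    """Return (log_line, should_alert) for one pool snapshot."""
    line = (
        f"pool in_use={stats.in_use} max={stats.max_size} "
        f"waiting={stats.waiting} utilization={stats.utilization:.0%}"
    )
    return line, stats.utilization > threshold

line, should_alert = check_pool(PoolStats(in_use=17, max_size=20, waiting=3))
print(line)
print("alert:", should_alert)
```

Call something like this from a timer every 30 seconds, send the line to your logs, and route the boolean to your alerting system.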
Frequently Asked Questions
Should we always rollback when a production issue occurs?
If the issue started with a recent deployment, rollback first and investigate second. Speed of mitigation matters more than understanding the root cause in the moment. If the issue isn't deployment-related, rollback won't help — focus on identifying the actual trigger.
How do we debug issues that only happen under high load?
Use a staging environment with production-like data and run load tests (k6, Locust). Profile under load — connection pools, thread counts, memory allocation, garbage collection pauses all behave differently under pressure. If the issue only reproduces in production, add targeted logging and wait for it to recur.
What's a good MTTR target?
For critical issues (total outage): under 30 minutes for mitigation, under 2 hours for root cause. For degraded performance: under 1 hour for mitigation. DORA elite performers have MTTR under 1 hour. More important than the number is the trend — MTTR should decrease over time as your observability and runbooks improve.
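Tracking the trend only needs detection and mitigation timestamps per incident. A minimal sketch follows; the incident data is invented and does not reflect our real numbers.

```python
from datetime import datetime
from statistics import mean

def mttr_minutes(incidents):
    """Mean time from detection to mitigation, in minutes."""
    return mean((end - start).total_seconds() / 60 for start, end in incidents)

# Hypothetical (detected, mitigated) pairs for two quarters.
Q1 = [
    (datetime(2025, 1, 14, 2, 0), datetime(2025, 1, 14, 3, 30)),
    (datetime(2025, 2, 3, 10, 0), datetime(2025, 2, 3, 11, 0)),
    (datetime(2025, 3, 21, 16, 0), datetime(2025, 3, 21, 18, 0)),
]
Q2 = [
    (datetime(2025, 4, 8, 9, 0), datetime(2025, 4, 8, 9, 40)),
    (datetime(2025, 5, 19, 13, 0), datetime(2025, 5, 19, 13, 30)),
    (datetime(2025, 6, 2, 22, 0), datetime(2025, 6, 2, 22, 50)),
]

q1, q2 = mttr_minutes(Q1), mttr_minutes(Q2)
change = (q2 - q1) / q1
print(f"Q1 MTTR: {q1:.0f} min, Q2 MTTR: {q2:.0f} min, change: {change:+.0%}")
```

A mean hides outliers, so it is worth looking at the longest incident in each period alongside the average.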
How do we do blameless post-mortems?
Focus on systems, not people. Instead of "Alice deployed broken code," write "the deployment pipeline didn't catch the regression because integration tests were skipped for performance." The question is always "how did our system allow this to happen?" not "who made the mistake?" People make mistakes; systems should prevent them from reaching users.