Monitoring and Observability for Modern Applications

Monitoring tells you when something is wrong. Observability tells you why. Here's how to build both — from metrics and logs to distributed traces and intelligent alerting.

March 11, 2026 17 min read

At 2:47 AM, the on-call engineer gets paged: "High error rate on checkout service." They open the dashboard. Errors are spiking, but from where? They check logs: 10,000 lines per minute, no obvious pattern. They check the database: looks fine. They check the cache: fine. They SSH into servers and start grepping. Forty minutes later, they find it: a downstream payment provider changed its API response format. Time to resolution: 52 minutes. Customer impact: $180K in failed checkouts.

Now imagine the same incident with proper observability: The alert fires with a link to the exact trace showing the failing request. The trace shows the payment-service span returning a 422 error. The structured log attached to that span shows the exact payload mismatch. Time to diagnosis: 3 minutes. Time to mitigation (circuit breaker to fallback provider): 5 minutes. Customer impact: $8K.

That's the difference between monitoring and observability. At Pillai Infotech, we've built observability stacks for applications handling millions of requests daily. This guide covers what actually works.

Monitoring vs Observability: What's the Difference?

Monitoring answers: "Is the system working?" It checks predefined conditions — CPU above 90%, error rate above 1%, disk above 80% — and alerts when thresholds are crossed. Monitoring works for known failure modes.

Observability answers: "Why is the system broken?" It gives you the ability to ask arbitrary questions about your system's behavior without deploying new code. Observability works for unknown failure modes — the failures you didn't predict.

In a monolith, monitoring is often sufficient. In distributed systems (microservices, serverless, event-driven architectures), monitoring alone fails because failures emerge from interactions between services that no single metric captures. You need observability.

Put simply: Monitoring is for dashboards. Observability is for debugging. You need both — dashboards to know something is wrong, and the ability to drill into why.

The Three Pillars (and Why They're Not Enough)

The traditional "three pillars of observability" are metrics, logs, and traces. They're necessary but not sufficient on their own — the real power comes from correlating all three.

| Pillar | What It Tells You | Strength | Weakness |
| --- | --- | --- | --- |
| Metrics | System state over time (numeric) | Cheap, fast, good for alerting | Low cardinality; can't drill into specifics |
| Logs | What happened (events) | Rich context, human-readable | Expensive at scale, hard to correlate |
| Traces | Request flow across services | Shows causality and latency breakdown | Sampling required at scale, complex setup |

The emerging "fourth pillar" is profiles: continuous profiling that shows where CPU and memory are spent at the function level. Tools like Grafana Pyroscope (which absorbed the earlier Grafana Phlare project) make this practical. But start with the three pillars; profiling is an optimization, not a foundation.

Metrics Done Right

The RED Method (for Services)

For every service, track these three metrics:

  • Rate: Requests per second. How much traffic is the service handling?
  • Errors: Errors per second (or error rate). What percentage of requests are failing?
  • Duration: Latency distribution (p50, p95, p99). How long do requests take?

RED gives you immediate insight into service health. A spike in errors or latency, or a drop in rate, tells you something is wrong — fast.
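
As a rough illustration of how the three RED numbers fall out of raw request data, the sketch below computes them over a single time window. The record shape (`status`, `durationMs`), the function names, and the sample data are assumptions made for this example, and nearest-rank is just one simple way to compute a percentile. In production these values would normally come from your metrics backend rather than from application code.

```javascript
// Nearest-rank percentile over an already sorted array of numbers.
function percentile(sorted, p) {
  if (sorted.length === 0) return null;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Compute RED metrics for the requests observed in one time window.
function computeRed(requests, windowSeconds) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const errors = requests.filter((r) => r.status >= 500).length;
  return {
    ratePerSecond: requests.length / windowSeconds,
    errorRatio: requests.length === 0 ? 0 : errors / requests.length,
    p50Ms: percentile(durations, 50),
    p95Ms: percentile(durations, 95),
    p99Ms: percentile(durations, 99),
  };
}

// Hypothetical sample: 10 requests seen in a 5-second window.
const sample = [
  { status: 200, durationMs: 10 },
  { status: 200, durationMs: 20 },
  { status: 200, durationMs: 30 },
  { status: 200, durationMs: 40 },
  { status: 200, durationMs: 50 },
  { status: 200, durationMs: 60 },
  { status: 200, durationMs: 70 },
  { status: 200, durationMs: 80 },
  { status: 500, durationMs: 90 },
  { status: 503, durationMs: 1000 },
];

const red = computeRed(sample, 5);
console.log(red);
```

Note how a single slow request dominates p95 and p99 while leaving p50 untouched, which is why the text recommends tracking the distribution rather than an average.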

The USE Method (for Infrastructure)

For every infrastructure resource (CPU, memory, disk, network):

  • Utilization: How busy is the resource? (CPU: 75%)
  • Saturation: Is work queuing up? (CPU run queue: 15)
  • Errors: Is the resource producing errors? (disk: I/O errors)

What NOT to Measure

Over-instrumentation is as harmful as under-instrumentation. Every metric costs storage, query time, and cognitive load. Avoid:

  • Vanity metrics: Total requests all-time. Useless for diagnosing issues.
  • High-cardinality labels: Don't label metrics by user ID or request ID. Every distinct label value creates a new time series, so unbounded labels can produce millions of series and overwhelm your monitoring system.
  • Internal implementation details: Don't expose metrics about internal queues or cache implementations that change frequently. Focus on external-facing SLIs.
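
To see why per-user labels are a problem, it helps to estimate the worst-case number of time series one metric can produce, which is roughly the product of the distinct values of each label. The label names and counts below are made-up illustrations, not measurements from a real system.

```javascript
// Worst-case series count for one metric = product of distinct values per label.
function estimateSeries(labelCardinalities) {
  return Object.values(labelCardinalities).reduce((total, n) => total * n, 1);
}

// Hypothetical label sets for an HTTP request counter.
const boundedLabels = { method: 5, route: 40, status_class: 5 };
const withUserId = { ...boundedLabels, user_id: 100000 };

console.log(estimateSeries(boundedLabels)); // 1000
console.log(estimateSeries(withUserId));    // 100000000
```

Real systems rarely hit the full product because not every combination occurs, but the estimate shows how one unbounded label multiplies everything else.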

Histogram vs. Summary

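For latency, prefer histograms over summaries. Histogram buckets can be summed across instances, so you can estimate percentiles for the entire service, not just individual pods; summary quantiles are computed per instance and cannot be meaningfully averaged. The estimates are only as precise as your bucket boundaries, so choose buckets around the latencies you care about. In Prometheus: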

# Good — histogram with meaningful buckets
http_request_duration_seconds_bucket{le="0.01"} 2834
http_request_duration_seconds_bucket{le="0.05"} 8923
http_request_duration_seconds_bucket{le="0.1"}  12847
http_request_duration_seconds_bucket{le="0.5"}  14231
http_request_duration_seconds_bucket{le="1.0"}  14298
http_request_duration_seconds_bucket{le="+Inf"} 14305

# Query p99 latency for the whole service (sum bucket rates across instances, keeping le):
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
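
To make the "aggregatable" point concrete, here is a simplified sketch of how a quantile can be estimated from cumulative bucket counts by linear interpolation, and how buckets from several instances can be summed first. It is written in the spirit of `histogram_quantile` but is not Prometheus's actual implementation: it works on raw cumulative counts rather than rates, assumes `q` is between 0 (exclusive) and 1, and the function names are invented for this example.

```javascript
// Estimate a quantile from cumulative histogram buckets by linear
// interpolation inside the bucket that contains the target rank.
// `buckets` is sorted by upper bound `le`; the last one is +Infinity.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // A rank inside the +Inf bucket cannot be interpolated; fall back to the
  // highest finite upper bound.
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const lowerCount = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - lowerCount;
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - lowerCount) / inBucket);
}

// Sum bucket counts from several instances that share the same bounds.
function mergeBuckets(instances) {
  return instances[0].map((b, i) => ({
    le: b.le,
    count: instances.reduce((sum, inst) => sum + inst[i].count, 0),
  }));
}

// The cumulative counts from the example above, treated as one pod.
const podA = [
  { le: 0.01, count: 2834 },
  { le: 0.05, count: 8923 },
  { le: 0.1, count: 12847 },
  { le: 0.5, count: 14231 },
  { le: 1.0, count: 14298 },
  { le: Infinity, count: 14305 },
];

console.log(estimateQuantile(0.99, podA).toFixed(3)); // "0.480"
console.log(estimateQuantile(0.99, mergeBuckets([podA, podA])).toFixed(3)); // "0.480"
```

The p99 lands in the wide 0.1s to 0.5s bucket, so the 0.480s figure is an interpolation rather than a measurement. Adding a bucket boundary at 0.25s would tighten the estimate.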

Structured Logging: The Foundation

Stop Writing Unstructured Logs

This log line is nearly useless for automated analysis:

2026-03-11 14:23:45 ERROR - Failed to process order #12345 for user john@example.com: payment declined

This structured log is queryable, filterable, and correlatable:

{
  "timestamp": "2026-03-11T14:23:45.123Z",
  "level": "error",
  "service": "order-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "message": "Payment declined",
  "order_id": "12345",
  "user_id": "u_9876",
  "payment_provider": "stripe",
  "error_code": "card_declined",
  "amount_cents": 4999,
  "currency": "USD"
}

With structured logs, you can query: "Show me all payment failures from Stripe in the last hour where amount > $100". With plain-text logs, the same question means writing brittle regular expressions for every message format.
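
That kind of query amounts to a filter over parsed fields. A log backend would run it for you in its own query language; the sketch below only illustrates the idea on newline-delimited JSON held in memory, reusing field names from the example entry. The function names and sample entries are invented for illustration.

```javascript
// Parse newline-delimited JSON logs, skipping lines that are not valid JSON.
function parseLogs(ndjson) {
  return ndjson.split("\n").flatMap((line) => {
    try {
      return line.trim() ? [JSON.parse(line)] : [];
    } catch {
      return [];
    }
  });
}

// "Payment failures from Stripe in the last hour where amount > minCents"
function stripeFailuresOver(entries, minCents, now, windowMs = 60 * 60 * 1000) {
  return entries.filter(
    (e) =>
      e.level === "error" &&
      e.payment_provider === "stripe" &&
      e.amount_cents > minCents &&
      now - Date.parse(e.timestamp) <= windowMs
  );
}

// Hypothetical log stream: three structured entries and one plain-text line.
const rawLogs = [
  '{"timestamp":"2026-03-11T14:23:45.123Z","level":"error","payment_provider":"stripe","order_id":"12345","amount_cents":4999}',
  '{"timestamp":"2026-03-11T14:40:10.000Z","level":"error","payment_provider":"stripe","order_id":"12346","amount_cents":25000}',
  '{"timestamp":"2026-03-11T12:00:00.000Z","level":"error","payment_provider":"stripe","order_id":"12000","amount_cents":50000}',
  "2026-03-11 14:41:00 ERROR - this unstructured line cannot be parsed",
].join("\n");

const now = Date.parse("2026-03-11T15:00:00.000Z");
const matches = stripeFailuresOver(parseLogs(rawLogs), 10000, now);
console.log(matches.map((e) => e.order_id)); // [ '12346' ]
```

Storing the amount as integer cents, as the example entry does, keeps the comparison exact and avoids floating-point currency issues.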

Logging Best Practices

  • Always include trace_id: Every log line should carry the distributed trace ID. This is the link between logs and traces.
  • Log at the right level: ERROR = needs human attention. WARN = unexpected but handled. INFO = significant state changes. DEBUG = development only (never in production).
  • Don't log PII: User emails, IP addresses, and payment details should be redacted or hashed. Compliance (GDPR, HIPAA) requires it. Use structured logging libraries that support field-level redaction.
  • Log actionable context: "Database connection failed" is unhelpful. "Database connection failed: host=db-primary.internal, port=5432, pool_size=20, active_connections=20, wait_queue=47" tells you exactly what happened.
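
The practices above can be combined in a small helper. This is a minimal sketch, not a replacement for a real logging library: the list of PII field names, the `buildLogEntry` function, and the choice to hash rather than drop values are all assumptions for illustration. It uses only Node's built-in `crypto` module.

```javascript
const { createHash } = require("node:crypto");

// Fields treated as PII in this sketch (an assumed list; adjust to your data).
const PII_FIELDS = new Set(["email", "ip_address", "card_number"]);

// Replace a PII value with a short, stable hash so entries stay correlatable.
function redact(value) {
  const digest = createHash("sha256").update(String(value)).digest("hex");
  return "sha256:" + digest.slice(0, 12);
}

// Build one structured log entry that always carries the trace context.
function buildLogEntry(level, message, fields, traceContext) {
  const safeFields = {};
  for (const [key, value] of Object.entries(fields)) {
    safeFields[key] = PII_FIELDS.has(key) ? redact(value) : value;
  }
  return {
    timestamp: new Date().toISOString(),
    level,
    service: "order-service",
    trace_id: traceContext.traceId,
    span_id: traceContext.spanId,
    message,
    ...safeFields,
  };
}

const logEntry = buildLogEntry(
  "error",
  "Database connection failed",
  {
    host: "db-primary.internal",
    port: 5432,
    pool_size: 20,
    active_connections: 20,
    wait_queue: 47,
    email: "john@example.com",
  },
  { traceId: "abc123def456", spanId: "789ghi" }
);
console.log(JSON.stringify(logEntry));
```

A plain hash of a guessable value such as an email address can be reversed by brute force, so for stricter requirements a keyed hash (HMAC) or removing the field entirely may be more appropriate.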

Log Aggregation Architecture

For most teams, we recommend:

  • Small scale (<100GB/day): Grafana Loki. It indexes only labels, not the full log text, which keeps it cheap. Pairs well with Grafana. Store log chunks in object storage such as S3.
  • Medium scale (100GB-1TB/day): Elasticsearch (self-hosted) or Grafana Cloud. Balance of query speed and cost.
  • Large scale (>1TB/day): ClickHouse for logs, or cloud-native (CloudWatch Logs Insights, BigQuery). Cost-optimize with tiered storage.

Distributed Tracing: Following Requests Across Services

In a microservices architecture, a single user request might touch 10-15 services. When that request is slow, which service is the bottleneck? Without tracing, you're guessing. With tracing, you see the exact waterfall of service calls and where time is spent.

How Tracing Works

Every incoming request gets a trace ID — a unique identifier that follows the request through every service it touches. Each service creates a span — a timed operation within the trace. Spans form a tree:

Trace: abc123
├── [200ms] API Gateway → /checkout
│   ├── [15ms] Auth Service → validate_token
│   ├── [45ms] Cart Service → get_cart
│   ├── [120ms] Payment Service → charge
│   │   ├── [5ms] Fraud Check → evaluate
│   │   └── [100ms] Stripe API → create_charge  ← bottleneck
│   └── [10ms] Order Service → create_order

At a glance: the checkout request takes 200ms total, and 100ms of that is waiting for the Stripe API. Optimization target: identified in seconds.
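
A tracing UI does this analysis visually, but the underlying idea is simple: subtract each span's child time from its own duration to get its "self time", then rank. The sketch below applies that to the example trace. The span IDs and the flat parent-reference layout are assumptions for illustration, and the calculation assumes child spans run sequentially rather than in parallel.

```javascript
// Spans from the example trace, as a flat list with parent references.
const spans = [
  { id: "gateway", parent: null, name: "API Gateway /checkout", durationMs: 200 },
  { id: "auth", parent: "gateway", name: "Auth Service validate_token", durationMs: 15 },
  { id: "cart", parent: "gateway", name: "Cart Service get_cart", durationMs: 45 },
  { id: "payment", parent: "gateway", name: "Payment Service charge", durationMs: 120 },
  { id: "fraud", parent: "payment", name: "Fraud Check evaluate", durationMs: 5 },
  { id: "stripe", parent: "payment", name: "Stripe API create_charge", durationMs: 100 },
  { id: "order", parent: "gateway", name: "Order Service create_order", durationMs: 10 },
];

// Self time = a span's duration minus the time spent in its direct children.
function selfTimes(allSpans) {
  const childTotal = new Map();
  for (const s of allSpans) {
    if (s.parent !== null) {
      childTotal.set(s.parent, (childTotal.get(s.parent) || 0) + s.durationMs);
    }
  }
  return allSpans.map((s) => ({
    name: s.name,
    selfMs: s.durationMs - (childTotal.get(s.id) || 0),
  }));
}

const ranked = selfTimes(spans).sort((a, b) => b.selfMs - a.selfMs);
console.log(ranked[0]); // { name: 'Stripe API create_charge', selfMs: 100 }
```

Ranking by self time rather than total duration keeps the API Gateway and Payment Service spans from looking like the culprit just because they wrap slower children.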

Sampling Strategies

At high volume, tracing every request is cost-prohibitive. Sampling is required:

  • Head-based sampling: Decide at the start whether to trace (e.g., 10% of requests). Simple but misses rare errors.
  • Tail-based sampling: Trace everything, but only export traces that are interesting (errors, high latency, specific conditions). Better coverage, but higher resource cost in the collector, which must buffer spans until each trace completes.
  • Always sample errors: Regardless of your sampling rate, always export traces for failed requests. These are the ones you'll need to debug.

Our recommendation: tail-based sampling at 100% for errors and high latency, 5-10% for normal requests. This gives you all the traces you need for debugging while controlling costs.
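
The decision logic behind that recommendation can be sketched in a few lines. In practice it usually lives in the collector (for example, the OpenTelemetry Collector's tail sampling processor) rather than in application code, so treat this as an illustration of the policy only. The trace object shape, the 1000ms latency threshold, and the function names are assumptions.

```javascript
// Map a trace ID to a stable number in [0, 1) using a 32-bit FNV-1a hash, so
// every collector instance makes the same decision for the same trace.
function traceIdToUnitInterval(traceId) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < traceId.length; i++) {
    hash ^= traceId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash / 0x100000000;
}

// Decide whether to export a completed trace.
function shouldExport(trace, { latencyThresholdMs = 1000, baselineRatio = 0.1 } = {}) {
  if (trace.hasError) return { keep: true, reason: "error" };
  if (trace.durationMs >= latencyThresholdMs) return { keep: true, reason: "slow" };
  const keep = traceIdToUnitInterval(trace.traceId) < baselineRatio;
  return { keep, reason: keep ? "baseline-sample" : "dropped" };
}

console.log(shouldExport({ traceId: "abc123", hasError: true, durationMs: 40 }));
console.log(shouldExport({ traceId: "abc124", hasError: false, durationMs: 2500 }));
console.log(shouldExport({ traceId: "abc125", hasError: false, durationMs: 40 }));
```

Hashing the trace ID instead of calling a random number generator keeps the decision consistent if the same trace is evaluated more than once.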

Alerting That Doesn't Cause Alert Fatigue

Bad alerting is worse than no alerting. If your on-call engineer gets 50 alerts per shift and 45 are noise, they'll start ignoring all of them — including the 5 that matter.

Alert Design Principles

  • Alert on symptoms, not causes: Alert on "error rate above 2%" not "CPU above 90%." High CPU might not cause user impact. High error rate always does.
  • Every alert must be actionable: If the on-call person can't do anything about it at 3 AM, it's not an alert — it's a notification. Route it to a dashboard or a ticket, not a page.
  • Include runbook links: Every alert should link to a runbook that says: what this alert means, how to diagnose, and how to mitigate.
  • Use multiple burn rate windows: SLO-based alerting uses fast burn (2% of budget in 1 hour) for urgent pages and slow burn (5% in 6 hours) for tickets.
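
The budget percentages in the last point translate into burn rates with a little arithmetic. The sketch below assumes a 30-day SLO window, which the text does not state, and the function names and the 99.9% example are illustrative.

```javascript
// Burn rate = how many times faster than "exactly on budget" the error budget
// is being consumed. A burn rate of 1 uses the whole budget in one SLO window.
function burnRateFor(budgetFraction, alertWindowHours, sloWindowHours = 30 * 24) {
  return (budgetFraction * sloWindowHours) / alertWindowHours;
}

// Observed burn rate from a measured error ratio and the SLO target.
function observedBurnRate(errorRatio, sloTarget) {
  return errorRatio / (1 - sloTarget);
}

console.log(burnRateFor(0.02, 1).toFixed(1)); // "14.4"
console.log(burnRateFor(0.05, 6).toFixed(1)); // "6.0"

// Hypothetical: 99.9% SLO, 1.5% of requests failing over the last hour.
const currentBurn = observedBurnRate(0.015, 0.999);
console.log(currentBurn >= burnRateFor(0.02, 1) ? "page" : "no page"); // "page"
```

With a 99.9% target, the error budget is 0.1% of requests, so a sustained 1.5% error ratio burns it roughly 15 times faster than allowed and crosses the fast-burn threshold.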

The Alerting Hierarchy

| Severity | Channel | Response | Example |
| --- | --- | --- | --- |
| Page | PagerDuty / phone call | Immediate (wake up) | Error budget burning at 14x rate |
| Ticket | Slack + auto-create JIRA | Business hours | Error budget burning at 6x rate |
| Warning | Dashboard highlight | Next review | Disk at 75%, trending upward |
| Info | Log / dashboard only | No action needed | Deploy completed, traffic shift |

Read more about incident response workflows in our SRE guide.

Observability Tool Comparison (2026)

| Tool | Type | Best For | Cost Model | Our Take |
| --- | --- | --- | --- | --- |
| Grafana Stack | Full stack (OSS) | Cost-conscious teams | Free (self-host) or per-metric | Best value. Prometheus + Loki + Tempo + Grafana. Our default recommendation. |
| Datadog | Full stack (SaaS) | Enterprise, all-in-one | Per host + per GB ingested | Best UX, most features. Expensive at scale. Watch the bill. |
| New Relic | Full stack (SaaS) | APM-focused teams | Per user + per GB | Great APM, generous free tier (100GB/mo). Good for startups. |
| Honeycomb | Tracing + events | Debugging complex systems | Per event | Best for high-cardinality exploration. Purpose-built for observability (not just monitoring). |
| Elastic (ELK) | Logs + search | Log-heavy workloads | Free (self-host) or per GB | Powerful search. Resource-hungry to self-host. Being replaced by Loki for many teams. |
| Cloud Native | Per-cloud metrics/logs | Single-cloud shops | Pay-as-you-go | CloudWatch, Cloud Monitoring, Azure Monitor. Good for basics, limited for cross-service analysis. |

Our recommendation by team size:

  • 1-10 engineers: New Relic free tier or Grafana Cloud free tier. Get started fast, don't self-host yet.
  • 10-50 engineers: Grafana Stack (self-hosted on Kubernetes) or Grafana Cloud. Best cost-to-value ratio.
  • 50+ engineers: Datadog or Grafana Enterprise. The UX and integrations justify the cost at this scale.

OpenTelemetry: The Future of Instrumentation

OpenTelemetry (OTel) is the standard for generating telemetry data. It's vendor-neutral — instrument once, send to any backend. In 2026, if you're starting from scratch, use OTel. If you're using vendor-specific SDKs, plan your migration.

Why OpenTelemetry Matters

  • Vendor independence: Switch from Datadog to Grafana without re-instrumenting your code.
  • Unified API: One SDK for metrics, logs, and traces. Consistent instrumentation across all services.
  • Auto-instrumentation: For many frameworks (Express, Spring, Django, Flask), OTel auto-instruments HTTP, database, and gRPC calls with little or no change to application code.
  • Industry standard: Backed by every major observability vendor. The CNCF's second most active project after Kubernetes.

Getting Started with OTel

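# Node.js auto-instrumentation (no changes to application code)
# The metrics exporter is installed here but not wired up in this trace-only example.
npm install @opentelemetry/sdk-node \
            @opentelemetry/auto-instrumentations-node \
            @opentelemetry/exporter-trace-otlp-http \
            @opentelemetry/exporter-metrics-otlp-http

// tracing.js: load this before your app
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require(
  '@opentelemetry/auto-instrumentations-node'
);
const { OTLPTraceExporter } = require(
  '@opentelemetry/exporter-trace-otlp-http'
);

const sdk = new NodeSDK({
  // Send traces to the OTel Collector's OTLP/HTTP endpoint
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces'
  }),
  // Instrument supported libraries (HTTP, databases, gRPC) automatically
  instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start();

// Flush buffered spans before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});

// Start your app: node -r ./tracing.js app.js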

The OTel Collector sits between your application and your backend. It receives telemetry, processes it (sampling, filtering, enrichment), and exports to one or more destinations. This decouples your instrumentation from your backend choice.

Pillai Infotech approach: We use the OTel Collector as a central pipeline for all telemetry. Application → OTel Collector → (Prometheus for metrics, Loki for logs, Tempo for traces). This architecture lets us change backends without touching application code. For AI agent systems, we add custom spans for model inference calls, token counts, and agent decision points.

Building Your Observability Stack: A Practical Guide

Phase 1: Start with Metrics (Week 1-2)

  • Deploy Prometheus (or use Grafana Cloud with Grafana Alloy, the successor to Grafana Agent)
  • Instrument RED metrics for your top 3 services
  • Build a single Grafana dashboard showing service health
  • Set up basic alerts: error rate > 1%, latency p99 > 2s

Phase 2: Add Structured Logging (Week 3-4)

  • Migrate from unstructured to structured JSON logging
  • Deploy Loki (or your chosen log aggregator)
  • Include trace_id in every log line
  • Build log-based dashboards for error investigation

Phase 3: Add Distributed Tracing (Week 5-8)

  • Deploy OTel Collector + Tempo (or Jaeger)
  • Add OTel auto-instrumentation to all services
  • Configure sampling (100% errors, 10% normal)
  • Link traces to logs in Grafana (trace_id correlation)

Phase 4: SLO-Based Alerting (Week 9-12)

  • Define SLOs based on your metrics data
  • Implement multi-burn-rate alerting
  • Write runbooks for every alert
  • Reduce on-call noise to under 5 actionable alerts per week

For AI-powered approaches to monitoring, see our AIOps guide — anomaly detection and intelligent alert correlation can reduce noise by 80-90%.

Frequently Asked Questions

How much does observability cost?

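Self-hosted Grafana stack (Prometheus + Loki + Tempo): infrastructure costs only, typically $200-500/month for a mid-size application. Datadog: roughly $15-23 per host/month for infrastructure monitoring, plus about $0.10 per GB of logs ingested and about $1.70 per million log events indexed. New Relic: free up to 100GB/month, then about $0.35/GB. These are list prices at the time of writing and change often, so check each vendor's pricing page. The biggest cost driver is usually log volume; control it by logging at the right level and sampling verbose services.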

Should I use structured logging from day one?

Yes. Migrating from unstructured to structured logs later is painful — you have to update every log statement. Start with JSON logging from the beginning. Every modern logging library supports it (winston, pino, loguru, slog, zerolog).

Do I need tracing for a monolith?

Not distributed tracing, but request tracing within the monolith is valuable. A simple request_id that follows through all function calls and log statements gives you most of the debugging benefit. Add distributed tracing when you have 3+ services communicating over the network.
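
In Node.js, one way to carry a request_id through a monolith without passing it to every function is the built-in `AsyncLocalStorage`. The sketch below is a minimal illustration under that assumption; the helper names (`withRequestId`, `logInfo`, `chargeCustomer`) are hypothetical, and a real application would call `withRequestId` from its HTTP middleware.

```javascript
const { AsyncLocalStorage } = require("node:async_hooks");
const { randomUUID } = require("node:crypto");

const requestContext = new AsyncLocalStorage();

// Run a request handler with a request_id bound to its whole call chain,
// including any asynchronous work it starts.
function withRequestId(handler, requestId = randomUUID()) {
  return requestContext.run({ requestId }, handler);
}

// Any function called during the request can log without being passed the ID.
function logInfo(message, fields = {}) {
  const store = requestContext.getStore();
  const entry = {
    level: "info",
    request_id: store ? store.requestId : null,
    message,
    ...fields,
  };
  console.log(JSON.stringify(entry));
  return entry;
}

function chargeCustomer(orderId) {
  return logInfo("charging customer", { order_id: orderId });
}

withRequestId(() => {
  logInfo("checkout started");
  chargeCustomer("12345");
});
```

Both log lines carry the same request_id, so filtering logs by that one value reconstructs the path of a single request through the codebase.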

How do I reduce log costs?

Log at INFO level in production, not DEBUG. Sample verbose services (for example, log 10% of health check requests). Consider Loki instead of Elasticsearch: it indexes only labels and keeps compressed log chunks in object storage such as S3, which drastically reduces storage and indexing costs. Set retention policies (for example, 30 days for detailed logs, 90 days for aggregated data).

What's the best monitoring tool in 2026?

There is no single best tool — it depends on your team size, budget, and needs. For best value: Grafana Stack (open source). For best UX: Datadog. For best free tier: New Relic. For best debugging: Honeycomb. We recommend starting with Grafana Cloud's free tier and evaluating from there.

How do I monitor AI/LLM applications?

Add custom spans for model inference calls (track latency, tokens, model used, cost). Log prompts and responses (redacted as needed for privacy). Track business metrics: response quality scores, user satisfaction, hallucination rates. Tools like LangSmith, Langfuse, or custom OTel instrumentation work well. See our AI application guide for details.

Pillai Infotech Engineering Team

We build production software across AI, cloud, web, and mobile — sharing real-world insights from projects delivered for startups and enterprises across India and globally.
