Monitoring and Observability for Modern Applications

In this article

Monitoring vs Observability
The Three Pillars
Metrics Done Right
Structured Logging
Distributed Tracing
Alerting That Works
Tool Comparison
OpenTelemetry
FAQ

At 2:47 AM, the on-call engineer gets paged: "High error rate on checkout service." They open the dashboard. Errors are spiking, but from where? They check the logs: 10,000 lines per minute, no obvious pattern. They check the database, which looks fine. They check the cache, also fine. They SSH into servers and start grepping. Forty minutes later, they find it: a downstream payment provider changed their API response format. Time to resolution: 52 minutes. Customer impact: $180K in failed checkouts.

Now imagine the same incident with proper observability: The alert fires with a link to the exact trace showing the failing request. The trace shows the payment-service span returning a 422 error. The structured log attached to that span shows the exact payload mismatch. Time to diagnosis: 3 minutes. Time to mitigation (circuit breaker to fallback provider): 5 minutes. Customer impact: $8K.

That's the difference between monitoring and observability. At Pillai Infotech, we've built observability stacks for applications handling millions of requests daily. This guide covers what actually works.

Monitoring vs Observability: What's the Difference?

Monitoring answers: "Is the system working?" It checks predefined conditions — CPU above 90%, error rate above 1%, disk above 80% — and alerts when thresholds are crossed. Monitoring works for known failure modes.

Observability answers: "Why is the system broken?" It gives you the ability to ask arbitrary questions about your system's behavior without deploying new code. Observability works for unknown failure modes — the failures you didn't predict.

In a monolith, monitoring is often sufficient. In distributed systems (microservices, serverless, event-driven architectures), monitoring alone fails because failures emerge from interactions between services that no single metric captures. You need observability.

Put simply: Monitoring is for dashboards. Observability is for debugging. You need both — dashboards to know something is wrong, and the ability to drill into why.

The Three Pillars (and Why They're Not Enough)

The traditional "three pillars of observability" are metrics, logs, and traces. They're necessary but not sufficient on their own — the real power comes from correlating all three.

Pillar	What It Tells You	Strength	Weakness
Metrics	System state over time (numeric)	Cheap, fast, good for alerting	Low cardinality — can't drill into specifics
Logs	What happened (events)	Rich context, human-readable	Expensive at scale, hard to correlate
Traces	Request flow across services	Shows causality and latency breakdown	Sampling required at scale, complex setup

The emerging "fourth pillar" is profiles: continuous profiling that shows where CPU and memory are spent at the function level. Tools like Grafana Pyroscope (which absorbed the earlier Grafana Phlare project) make this practical. But start with the three pillars; profiling is an optimization, not a foundation.

Metrics Done Right

The RED Method (for Services)

For every service, track these three metrics:

Rate: Requests per second. How much traffic is the service handling?
Errors: Errors per second (or error rate). What percentage of requests are failing?
Duration: Latency distribution (p50, p95, p99). How long do requests take?

RED gives you immediate insight into service health. A spike in errors or latency, or a drop in rate, tells you something is wrong — fast.
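To make the three RED numbers concrete, here is a minimal sketch that derives them from raw request records. The record shape (`status`, `durationMs`), the choice to count only 5xx responses as errors, and the nearest-rank percentile are assumptions for this example; in practice a metrics library would compute these for you.

```javascript
// Sketch: derive RED metrics from request records observed in one time window.
// The record fields (status, durationMs) are hypothetical names for this example.

// Nearest-rank percentile over an ascending-sorted array of numbers.
function percentile(sorted, p) {
  if (sorted.length === 0) return null;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function redMetrics(requests, windowSeconds) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  // Assumption: only 5xx responses count as errors.
  const errors = requests.filter((r) => r.status >= 500).length;
  return {
    ratePerSecond: requests.length / windowSeconds,
    errorRatio: requests.length === 0 ? 0 : errors / requests.length,
    p50Ms: percentile(durations, 50),
    p95Ms: percentile(durations, 95),
    p99Ms: percentile(durations, 99),
  };
}

// Four requests observed over a 2-second window.
const sample = [
  { status: 200, durationMs: 40 },
  { status: 200, durationMs: 55 },
  { status: 500, durationMs: 900 },
  { status: 200, durationMs: 60 },
];
console.log(redMetrics(sample, 2));
```

Note how a single slow failure dominates p95 and p99 while p50 stays low, which is why the percentiles are tracked separately rather than as one average.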

The USE Method (for Infrastructure)

For every infrastructure resource (CPU, memory, disk, network):

Utilization: How busy is the resource? (CPU: 75%)
Saturation: Is work queuing up? (CPU run queue: 15)
Errors: Is the resource producing errors? (disk: I/O errors)
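As a rough illustration of utilization versus saturation, the sketch below takes a coarse snapshot of the local host with Node's built-in `os` module. It uses the 1-minute load average per core as a stand-in for run-queue length, which is an approximation, and it does not cover the errors dimension at all; a real setup would rely on an exporter or agent instead.

```javascript
// Sketch: a coarse USE-style snapshot of the local host using Node's os module.
const os = require('os');

// Saturation proxy: runnable work per core. Above 1.0 suggests work is queuing.
function cpuSaturation(loadAverage, coreCount) {
  return loadAverage / coreCount;
}

// Utilization: fraction of memory currently in use.
function memoryUtilization(totalBytes, freeBytes) {
  return (totalBytes - freeBytes) / totalBytes;
}

function hostSnapshot() {
  const [load1m] = os.loadavg(); // always 0 on Windows
  return {
    cpuSaturation: cpuSaturation(load1m, os.cpus().length),
    memoryUtilization: memoryUtilization(os.totalmem(), os.freemem()),
  };
}

console.log(hostSnapshot());
```

With the figures from the list above, a run queue of 15 on a hypothetical 4-core machine gives `cpuSaturation(15, 4)`, or 3.75 runnable tasks per core: clearly saturated even if utilization alone looked acceptable.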

What NOT to Measure

Over-instrumentation is as harmful as under-instrumentation. Every metric costs storage, query time, and cognitive load. Avoid:

Vanity metrics: Total requests all-time. Useless for diagnosing issues.
High-cardinality labels: Don't label metrics by user ID or request ID. Each distinct label value creates a new time series, so this produces millions of series and can overwhelm your metrics backend.
Internal implementation details: Don't expose metrics about internal queues or cache implementations that change frequently. Focus on external-facing SLIs.
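The cardinality problem is easiest to see as arithmetic: the worst-case number of time series for one metric is the product of the distinct values of each label. A small sketch, with made-up label counts:

```javascript
// Sketch: worst-case time series count for a single metric.
// Each label multiplies the total by its number of distinct values.
function seriesCount(labelCardinalities) {
  return Object.values(labelCardinalities).reduce((total, n) => total * n, 1);
}

// Hypothetical label sets for a request counter.
const bounded = { method: 5, route: 40, status: 8 };
const withUserId = { ...bounded, user_id: 100000 };

console.log(seriesCount(bounded));    // 1600
console.log(seriesCount(withUserId)); // 160000000
```

Adding one unbounded label turns a metric with 1,600 series into one with up to 160 million. Real counts are lower because not every combination occurs, but the growth is still multiplicative.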

Histogram vs. Summary

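For latency, use histograms (not summaries). Histogram buckets can be summed across instances, so you can estimate percentiles for the entire service rather than for individual pods; quantiles precomputed by a summary cannot be meaningfully averaged. The estimate is only as precise as your bucket boundaries, so choose buckets around the latencies you care about. In Prometheus: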

# Good — histogram with meaningful buckets
http_request_duration_seconds_bucket{le="0.01"} 2834
http_request_duration_seconds_bucket{le="0.05"} 8923
http_request_duration_seconds_bucket{le="0.1"}  12847
http_request_duration_seconds_bucket{le="0.5"}  14231
http_request_duration_seconds_bucket{le="1.0"}  14298
http_request_duration_seconds_bucket{le="+Inf"} 14305

# Query p99 latency for the whole service (sum bucket rates across instances first):
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
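To show why buckets aggregate cleanly, here is a minimal sketch of the idea behind quantile estimation from cumulative buckets: sum the counts per `le` across instances, then interpolate linearly inside the bucket that contains the target rank. It is an illustration of the technique, not Prometheus's implementation, and it works on raw counts where the query above works on rates.

```javascript
// Sketch: estimate a quantile from cumulative histogram buckets using linear
// interpolation inside the bucket that contains the target rank.

// Sum cumulative bucket counts from several instances, matching on `le`.
function mergeBuckets(instances) {
  const totals = new Map();
  for (const buckets of instances) {
    for (const { le, count } of buckets) {
      totals.set(le, (totals.get(le) || 0) + count);
    }
  }
  return [...totals.entries()]
    .map(([le, count]) => ({ le, count }))
    .sort((a, b) => a.le - b.le);
}

function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count; // the +Inf bucket
  const rank = q * total;
  let lowerBound = 0;
  let lowerCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      // The +Inf bucket has no upper bound, so fall back to the last finite one.
      if (le === Infinity) return lowerBound;
      const fraction = (rank - lowerCount) / (count - lowerCount);
      return lowerBound + (le - lowerBound) * fraction;
    }
    lowerBound = le;
    lowerCount = count;
  }
  return lowerBound;
}

// Bucket counts from the example above.
const buckets = [
  { le: 0.01, count: 2834 },
  { le: 0.05, count: 8923 },
  { le: 0.1, count: 12847 },
  { le: 0.5, count: 14231 },
  { le: 1.0, count: 14298 },
  { le: Infinity, count: 14305 },
];
console.log(estimateQuantile(0.99, buckets).toFixed(3)); // "0.480"
```

The p99 lands inside the 0.1s to 0.5s bucket, so the estimate can only say "somewhere in that range, probably near 0.48s". A bucket boundary at 0.25s would tighten it considerably.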

Structured Logging: The Foundation

Stop Writing Unstructured Logs

This log line is nearly useless for automated analysis:

2026-03-11 14:23:45 ERROR - Failed to process order #12345 for user john@example.com: payment declined

This structured log is queryable, filterable, and correlatable:

{
  "timestamp": "2026-03-11T14:23:45.123Z",
  "level": "error",
  "service": "order-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "message": "Payment declined",
  "order_id": "12345",
  "user_id": "u_9876",
  "payment_provider": "stripe",
  "error_code": "card_declined",
  "amount_cents": 4999,
  "currency": "USD"
}

With structured logs, you can query: "Show me all payment failures from Stripe in the last hour where amount > $100" — impossible with plain text logs.
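A log backend would normally run that query for you, but the principle fits in a few lines. The sketch below filters newline-delimited JSON entries in memory; the field names follow the example log above, while the sample lines and the `paymentFailures` helper are invented for illustration.

```javascript
// Sketch: filter newline-delimited JSON logs in memory.
// Field names follow the example log above; the sample lines are made up.
const lines = [
  '{"timestamp":"2026-03-11T14:23:45.123Z","level":"error","payment_provider":"stripe","error_code":"card_declined","amount_cents":4999}',
  '{"timestamp":"2026-03-11T14:40:02.000Z","level":"error","payment_provider":"stripe","error_code":"card_declined","amount_cents":25000}',
  '{"timestamp":"2026-03-11T14:41:10.000Z","level":"error","payment_provider":"other-provider","error_code":"timeout","amount_cents":30000}',
];

function paymentFailures(rawLines, { provider, minAmountCents, since }) {
  return rawLines
    .map((line) => JSON.parse(line))
    .filter((entry) =>
      entry.level === 'error' &&
      entry.payment_provider === provider &&
      entry.amount_cents > minAmountCents &&
      new Date(entry.timestamp) >= since);
}

// "Stripe failures over $100 since 14:00 UTC"
const matches = paymentFailures(lines, {
  provider: 'stripe',
  minAmountCents: 10000,
  since: new Date('2026-03-11T14:00:00Z'),
});
console.log(matches.length); // 1
```

Every condition is a comparison on a typed field. With the plain-text line, each one would need a regular expression that breaks the moment someone rewords the message.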

Logging Best Practices

Always include trace_id: Every log line should carry the distributed trace ID. This is the link between logs and traces.
Log at the right level: ERROR = needs human attention. WARN = unexpected but handled. INFO = significant state changes. DEBUG = development only (never in production).
Don't log PII: User emails, IP addresses, and payment details should be redacted or hashed. Regulations such as GDPR and HIPAA restrict how this data may be stored and processed, and logs are an easy place to leak it. Use structured logging libraries that support field-level redaction.
Log actionable context: "Database connection failed" is unhelpful. "Database connection failed: host=db-primary.internal, port=5432, pool_size=20, active_connections=20, wait_queue=47" tells you exactly what happened.
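The practices above can be combined in a small wrapper. This is a minimal sketch using only Node built-ins, assuming a hypothetical list of PII field names and a caller-supplied trace context; established libraries offer the same features with far more care. Note that an unsalted hash is pseudonymization at best, since low-entropy values such as emails can be guessed.

```javascript
// Sketch: a tiny JSON logger that attaches trace context and hashes PII fields.
const crypto = require('crypto');

// Hypothetical list of field names treated as PII in this example.
const PII_FIELDS = new Set(['email', 'ip_address', 'card_number']);

function hashValue(value) {
  return crypto.createHash('sha256').update(String(value)).digest('hex').slice(0, 16);
}

function buildLogEntry(level, message, context, fields = {}) {
  const safeFields = {};
  for (const [key, value] of Object.entries(fields)) {
    safeFields[key] = PII_FIELDS.has(key) ? `sha256:${hashValue(value)}` : value;
  }
  return {
    timestamp: new Date().toISOString(),
    level,
    service: context.service,
    trace_id: context.traceId, // the link between this log line and its trace
    span_id: context.spanId,
    message,
    ...safeFields,
  };
}

function log(level, message, context, fields) {
  console.log(JSON.stringify(buildLogEntry(level, message, context, fields)));
}

log(
  'error',
  'Database connection failed',
  { service: 'order-service', traceId: 'abc123def456', spanId: '789ghi' },
  { host: 'db-primary.internal', port: 5432, pool_size: 20, active_connections: 20, wait_queue: 47, email: 'john@example.com' }
);
```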

Log Aggregation Architecture

For most teams, we recommend:

Small scale (<100GB/day): Grafana Loki. It indexes only labels, not the full log text, which keeps it cheap. Pairs well with Grafana. Store chunks in object storage such as S3.
Medium scale (100GB-1TB/day): Elasticsearch (self-hosted) or Grafana Cloud. Balance of query speed and cost.
Large scale (>1TB/day): ClickHouse for logs, or cloud-native (CloudWatch Logs Insights, BigQuery). Cost-optimize with tiered storage.

Distributed Tracing: Following Requests Across Services

In a microservices architecture, a single user request might touch 10-15 services. When that request is slow, which service is the bottleneck? Without tracing, you're guessing. With tracing, you see the exact waterfall of service calls and where time is spent.

How Tracing Works

Every incoming request gets a trace ID — a unique identifier that follows the request through every service it touches. Each service creates a span — a timed operation within the trace. Spans form a tree:

Trace: abc123
├── [200ms] API Gateway → /checkout
│   ├── [15ms] Auth Service → validate_token
│   ├── [45ms] Cart Service → get_cart
│   ├── [120ms] Payment Service → charge
│   │   ├── [5ms] Fraud Check → evaluate
│   │   └── [100ms] Stripe API → create_charge  ← bottleneck
│   └── [10ms] Order Service → create_order

At a glance: the checkout request takes 200ms total, and 100ms of that is waiting for the Stripe API. Optimization target: identified in seconds.
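What the eye does with the waterfall can also be done programmatically. The sketch below computes each span's self time (its duration minus the time spent in its direct children) and picks the largest. The nested object shape is an assumption for this example, and the subtraction only holds when child spans run sequentially, as they do in the tree above.

```javascript
// Sketch: find the span with the largest self time in a trace tree.
// Assumes child spans run one after another, not in parallel.
const trace = {
  name: 'API Gateway /checkout', durationMs: 200, children: [
    { name: 'Auth Service validate_token', durationMs: 15, children: [] },
    { name: 'Cart Service get_cart', durationMs: 45, children: [] },
    { name: 'Payment Service charge', durationMs: 120, children: [
      { name: 'Fraud Check evaluate', durationMs: 5, children: [] },
      { name: 'Stripe API create_charge', durationMs: 100, children: [] },
    ] },
    { name: 'Order Service create_order', durationMs: 10, children: [] },
  ],
};

// Walk the tree and record duration minus direct children for every span.
function selfTimes(span, out = []) {
  const childTotal = span.children.reduce((sum, c) => sum + c.durationMs, 0);
  out.push({ name: span.name, selfMs: span.durationMs - childTotal });
  span.children.forEach((child) => selfTimes(child, out));
  return out;
}

function bottleneck(root) {
  return selfTimes(root).reduce((worst, s) => (s.selfMs > worst.selfMs ? s : worst));
}

console.log(bottleneck(trace)); // { name: 'Stripe API create_charge', selfMs: 100 }
```

Self time matters because the Payment Service span is 120ms long but only 15ms of that is its own work; ranking by raw duration would point at the wrong service.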

Sampling Strategies

At high volume, tracing every request is cost-prohibitive. Sampling is required:

Head-based sampling: Decide at the start whether to trace (e.g., 10% of requests). Simple but misses rare errors.
Tail-based sampling: Record every span, buffer the complete trace (usually in a collector), and export only the traces that are interesting (errors, high latency, specific conditions). Better coverage, but the buffering costs memory and compute.
Always sample errors: Regardless of your sampling rate, always export traces for failed requests. These are the ones you'll need to debug.

Our recommendation: tail-based sampling at 100% for errors and high latency, 5-10% for normal requests. This gives you all the traces you need for debugging while controlling costs.
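In code, that policy is a short decision function applied once a trace is complete. This is a sketch under stated assumptions: the trace shape (`traceId`, `hasError`, `durationMs`), the 1000ms latency threshold, and the 10% baseline are illustrative defaults, and a production setup would configure this in the collector rather than hand-roll it.

```javascript
// Sketch: a tail-sampling decision made after a trace has completed.
const crypto = require('crypto');

// Map a trace ID to a stable number in [0, 1) so that every collector
// instance makes the same keep/drop decision for the same trace.
function traceIdToUnitInterval(traceId) {
  const digest = crypto.createHash('sha256').update(traceId).digest();
  return digest.readUInt32BE(0) / 2 ** 32;
}

function shouldExport(trace, { latencyThresholdMs = 1000, baselineRate = 0.1 } = {}) {
  if (trace.hasError) return true;                         // always keep errors
  if (trace.durationMs >= latencyThresholdMs) return true; // always keep slow traces
  return traceIdToUnitInterval(trace.traceId) < baselineRate;
}

console.log(shouldExport({ traceId: 'abc123', hasError: true, durationMs: 20 }));    // true
console.log(shouldExport({ traceId: 'abc123', hasError: false, durationMs: 2500 })); // true
```

Hashing the trace ID instead of calling `Math.random()` keeps the baseline decision deterministic, which matters when several collector replicas see parts of the same traffic.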

Alerting That Doesn't Cause Alert Fatigue

Bad alerting is worse than no alerting. If your on-call engineer gets 50 alerts per shift and 45 are noise, they'll start ignoring all of them — including the 5 that matter.

Alert Design Principles

Alert on symptoms, not causes: Alert on "error rate above 2%" not "CPU above 90%." High CPU might not cause user impact. High error rate always does.
Every alert must be actionable: If the on-call person can't do anything about it at 3 AM, it's not an alert — it's a notification. Route it to a dashboard or a ticket, not a page.
Include runbook links: Every alert should link to a runbook that says: what this alert means, how to diagnose, and how to mitigate.
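Use multiple burn rate windows: SLO-based alerting uses a fast burn (2% of the error budget consumed in 1 hour) for urgent pages and a slow burn (5% in 6 hours) for tickets. For a 30-day SLO window, those correspond to burn rates of about 14.4x and 6x.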
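The burn-rate figures are simple arithmetic, sketched below. A burn rate of 1 means the error budget lasts exactly as long as the SLO window. The 30-day window and the 99.9% target in the usage line are assumptions for the example.

```javascript
// Sketch: burn-rate arithmetic for SLO-based alerting.

// Burn rate that would consume `budgetFraction` of the error budget
// within `alertWindowHours`, given the length of the SLO window.
function burnRateThreshold(budgetFraction, alertWindowHours, sloWindowDays = 30) {
  return (budgetFraction * sloWindowDays * 24) / alertWindowHours;
}

// Observed burn rate: the current error ratio relative to what the SLO allows.
function observedBurnRate(errorRatio, sloTarget) {
  return errorRatio / (1 - sloTarget);
}

const fastBurn = burnRateThreshold(0.02, 1); // 2% of budget in 1 hour
const slowBurn = burnRateThreshold(0.05, 6); // 5% of budget in 6 hours
console.log(fastBurn.toFixed(1), slowBurn.toFixed(1)); // 14.4 6.0
```

For a 99.9% SLO, `observedBurnRate(0.0144, 0.999)` is roughly 14.4, so a sustained 1.44% error ratio is what trips the fast-burn page.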

The Alerting Hierarchy

Severity	Channel	Response	Example
Page	PagerDuty / phone call	Immediate (wake up)	Error budget burning at 14x rate
Ticket	Slack + auto-create JIRA	Business hours	Error budget burning at 6x rate
Warning	Dashboard highlight	Next review	Disk at 75%, trending upward
Info	Log / dashboard only	No action needed	Deploy completed, traffic shift

Read more about incident response workflows in our SRE guide.

Observability Tool Comparison (2026)

Tool	Type	Best For	Cost Model	Our Take
Grafana Stack	Full stack (OSS)	Cost-conscious teams	Free (self-host) or per-metric	Best value. Prometheus + Loki + Tempo + Grafana. Our default recommendation.
Datadog	Full stack (SaaS)	Enterprise, all-in-one	Per host + per GB ingested	Best UX, most features. Expensive at scale. Watch the bill.
New Relic	Full stack (SaaS)	APM-focused teams	Per user + per GB	Great APM, generous free tier (100GB/mo). Good for startups.
Honeycomb	Tracing + events	Debugging complex systems	Per event	Best for high-cardinality exploration. Purpose-built for observability (not just monitoring).
Elastic (ELK)	Logs + search	Log-heavy workloads	Free (self-host) or per GB	Powerful search. Resource-hungry to self-host. Being replaced by Loki for many teams.
Cloud Native	Per-cloud metrics/logs	Single-cloud shops	Pay-as-you-go	CloudWatch, Cloud Monitoring, Azure Monitor. Good for basics, limited for cross-service analysis.

Our recommendation by team size:

1-10 engineers: New Relic free tier or Grafana Cloud free tier. Get started fast, don't self-host yet.
10-50 engineers: Grafana Stack (self-hosted on Kubernetes) or Grafana Cloud. Best cost-to-value ratio.
50+ engineers: Datadog or Grafana Enterprise. The UX and integrations justify the cost at this scale.

OpenTelemetry: The Future of Instrumentation

OpenTelemetry (OTel) is the standard for generating telemetry data. It's vendor-neutral — instrument once, send to any backend. In 2026, if you're starting from scratch, use OTel. If you're using vendor-specific SDKs, plan your migration.

Why OpenTelemetry Matters

Vendor independence: Switch from Datadog to Grafana without re-instrumenting your code.
Unified API: One SDK for metrics, logs, and traces. Consistent instrumentation across all services.
Auto-instrumentation: For many frameworks (Express, Spring, Django, Flask), OTel auto-instruments HTTP, database, and gRPC calls with zero code changes.
Industry standard: Backed by every major observability vendor. The CNCF's second most active project after Kubernetes.

Getting Started with OTel

# Node.js auto-instrumentation (no changes to application code)
npm install @opentelemetry/sdk-node \
            @opentelemetry/auto-instrumentations-node \
            @opentelemetry/exporter-trace-otlp-http

// tracing.js: preload this file so instrumentation is registered before the app loads
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require(
  '@opentelemetry/auto-instrumentations-node'
);
const { OTLPTraceExporter } = require(
  '@opentelemetry/exporter-trace-otlp-http'
);

const sdk = new NodeSDK({
  // OTLP over HTTP; 4318 is the collector's default OTLP/HTTP port
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces'
  }),
  // Patches supported libraries (HTTP, database clients, gRPC) automatically
  instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start();

// Flush buffered spans before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});

// Start your app: node -r ./tracing.js app.js

The OTel Collector sits between your application and your backend. It receives telemetry, processes it (sampling, filtering, enrichment), and exports to one or more destinations. This decouples your instrumentation from your backend choice.

Pillai Infotech approach: We use the OTel Collector as a central pipeline for all telemetry. Application → OTel Collector → (Prometheus for metrics, Loki for logs, Tempo for traces). This architecture lets us change backends without touching application code. For AI agent systems, we add custom spans for model inference calls, token counts, and agent decision points.

Building Your Observability Stack: A Practical Guide

Phase 1: Start with Metrics (Week 1-2)

Deploy Prometheus (or use Grafana Alloy, the successor to Grafana Agent, to ship metrics to Grafana Cloud)
Instrument RED metrics for your top 3 services
Build a single Grafana dashboard showing service health
Set up basic alerts: error rate > 1%, latency p99 > 2s

Phase 2: Add Structured Logging (Week 3-4)

Migrate from unstructured to structured JSON logging
Deploy Loki (or your chosen log aggregator)
Include trace_id in every log line
Build log-based dashboards for error investigation

Phase 3: Add Distributed Tracing (Week 5-8)

Deploy OTel Collector + Tempo (or Jaeger)
Add OTel auto-instrumentation to all services
Configure sampling (100% errors, 10% normal)
Link traces to logs in Grafana (trace_id correlation)

Phase 4: SLO-Based Alerting (Week 9-12)

Define SLOs based on your metrics data
Implement multi-burn-rate alerting
Write runbooks for every alert
Reduce on-call noise to under 5 actionable alerts per week

For AI-powered approaches to monitoring, see our AIOps guide — anomaly detection and intelligent alert correlation can reduce noise by 80-90%.

Frequently Asked Questions

How much does observability cost?

Self-hosted Grafana stack (Prometheus + Loki + Tempo): infrastructure costs only — typically $200-500/month for a mid-size application. Datadog: $15-23 per host/month plus $0.10-1.70 per GB of logs. New Relic: free up to 100GB/month, then $0.35/GB. The biggest cost driver is log volume — control it by logging at the right level and sampling verbose services.
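For a back-of-envelope comparison, the list prices quoted above can be plugged into a few lines of arithmetic. Treat every number as a placeholder: vendor prices change, the workload is hypothetical, and this sketch ignores per-user fees, retention tiers, and APM add-ons.

```javascript
// Sketch: rough monthly cost from the list prices quoted above.
// All prices are placeholders; check current vendor pricing before relying on them.
function datadogMonthlyCost({ hosts, perHost, logGb, perLogGb }) {
  return hosts * perHost + logGb * perLogGb;
}

function newRelicMonthlyCost({ ingestGb, freeGb = 100, perGb = 0.35 }) {
  // Only ingest beyond the free allowance is billed.
  return Math.max(0, ingestGb - freeGb) * perGb;
}

// Hypothetical workload: 20 hosts, 500 GB of logs per month.
const datadog = datadogMonthlyCost({ hosts: 20, perHost: 15, logGb: 500, perLogGb: 0.10 });
const newRelic = newRelicMonthlyCost({ ingestGb: 500 });
console.log(datadog.toFixed(2));  // "350.00"
console.log(newRelic.toFixed(2)); // "140.00"
```

Doubling the log volume in this model adds more to either bill than adding several hosts, which is the sense in which log volume is the biggest cost driver.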

Should I use structured logging from day one?

Yes. Migrating from unstructured to structured logs later is painful — you have to update every log statement. Start with JSON logging from the beginning. Every modern logging library supports it (winston, pino, loguru, slog, zerolog).

Do I need tracing for a monolith?

Not distributed tracing, but request tracing within the monolith is valuable. A simple request_id that follows through all function calls and log statements gives you most of the debugging benefit. Add distributed tracing when you have 3+ services communicating over the network.
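One way to get that request_id everywhere without threading it through every function signature is Node's built-in `AsyncLocalStorage`. The sketch below is a minimal illustration; `withRequestId`, `logInfo`, and the business functions are hypothetical names, and in a web framework you would call the wrapper from middleware.

```javascript
// Sketch: propagate a request_id through a monolith without passing it as an
// argument, using Node's built-in AsyncLocalStorage.
const { AsyncLocalStorage } = require('async_hooks');
const { randomUUID } = require('crypto');

const requestContext = new AsyncLocalStorage();

// Wrap each incoming request so everything it calls shares one request_id.
function withRequestId(handler) {
  return requestContext.run({ requestId: randomUUID() }, handler);
}

function logInfo(message, fields = {}) {
  const store = requestContext.getStore();
  const entry = {
    level: 'info',
    request_id: store ? store.requestId : null,
    message,
    ...fields,
  };
  console.log(JSON.stringify(entry));
  return entry;
}

// Hypothetical business functions deep in the call stack.
async function chargeCard(orderId) {
  await new Promise((resolve) => setTimeout(resolve, 5));
  return logInfo('card charged', { order_id: orderId });
}

async function checkout(orderId) {
  const first = logInfo('checkout started', { order_id: orderId });
  const second = await chargeCard(orderId);
  return [first, second];
}

withRequestId(() => checkout('12345'));
```

Both log lines carry the same request_id even though `chargeCard` never receives it and runs after an asynchronous pause, so filtering logs by that ID reconstructs the request's path.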

How do I reduce log costs?

Log at INFO level in production, not DEBUG. Sample verbose services (log 10% of health check requests). Use Loki instead of Elasticsearch: it indexes only labels and keeps log chunks in object storage such as S3, which drastically reduces costs. Set retention policies (30 days for detailed logs, 90 days for aggregated metrics).
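The first two tactics can be sketched as a filter that sits in front of your log shipper. The `/healthz` route, the one-in-ten rate, and the entry shape (`level`, `route`) are assumptions for the example; warnings and errors are never sampled away.

```javascript
// Sketch: decide which log entries to ship. Drops entries below the minimum
// level and keeps one in N entries for routes marked as noisy.
const LEVELS = { debug: 10, info: 20, warn: 30, error: 40 };

function createLogFilter({ minLevel = 'info', sampledRoutes = { '/healthz': 10 } } = {}) {
  const counters = new Map();
  return function shouldShip(entry) {
    if (LEVELS[entry.level] < LEVELS[minLevel]) return false;
    // Never sample away warnings or errors.
    if (LEVELS[entry.level] >= LEVELS.warn) return true;
    const keepEvery = sampledRoutes[entry.route];
    if (!keepEvery) return true;
    const seen = (counters.get(entry.route) || 0) + 1;
    counters.set(entry.route, seen);
    return (seen - 1) % keepEvery === 0; // ship the 1st, 11th, 21st, ...
  };
}

const shouldShip = createLogFilter();
console.log(shouldShip({ level: 'debug', route: '/checkout' })); // false
console.log(shouldShip({ level: 'info', route: '/checkout' }));  // true
```

A counter is used instead of random sampling so the kept fraction is exact and the behavior is easy to test; either approach works.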

What's the best monitoring tool in 2026?

There is no single best tool — it depends on your team size, budget, and needs. For best value: Grafana Stack (open source). For best UX: Datadog. For best free tier: New Relic. For best debugging: Honeycomb. We recommend starting with Grafana Cloud's free tier and evaluating from there.

How do I monitor AI/LLM applications?

Add custom spans for model inference calls (track latency, tokens, model used, cost). Log prompts and responses (redacted as needed for privacy). Track business metrics: response quality scores, user satisfaction, hallucination rates. Tools like LangSmith, Langfuse, or custom OTel instrumentation work well. See our AI application guide for details.
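As a starting point, the attributes worth capturing per inference call can be sketched without any tracing library. Everything named here is hypothetical: the price table, the `example-model` name, and the `callModel` function stand in for your provider's client, and with OTel you would set the same values as span attributes.

```javascript
// Sketch: record one model inference call as a span-like record with
// latency, token counts, and estimated cost.
// The price table and model name are made up for illustration.
const PRICE_PER_1K_TOKENS = { 'example-model': { input: 0.001, output: 0.002 } };

function estimateCostUsd(model, inputTokens, outputTokens) {
  const price = PRICE_PER_1K_TOKENS[model];
  return (inputTokens / 1000) * price.input + (outputTokens / 1000) * price.output;
}

// callModel is a caller-supplied function, not a real provider API.
async function tracedInference(model, prompt, callModel) {
  const startedAt = Date.now();
  const result = await callModel(model, prompt);
  const span = {
    name: 'llm.inference',
    model,
    latencyMs: Date.now() - startedAt,
    inputTokens: result.inputTokens,
    outputTokens: result.outputTokens,
    costUsd: estimateCostUsd(model, result.inputTokens, result.outputTokens),
  };
  return { result, span };
}

// Fake model client used only to make the example runnable.
const fakeModel = async () => ({ text: 'ok', inputTokens: 2000, outputTokens: 500 });

tracedInference('example-model', 'Summarize this order', fakeModel)
  .then(({ span }) => console.log(span));
```

The prompt and response are deliberately left out of the span record; if you log them, route them through the same redaction step as any other field that may contain PII.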

Pillai Infotech Engineering Team

Related Articles

SRE
Site Reliability Engineering Guide

SLOs, error budgets, incident management, and the practices that keep systems running at 99.9%+.

AIOps
AIOps: AI Transforming IT Operations

Automate anomaly detection, alert correlation, and root cause analysis with AI.

DevOps
DevOps Best Practices

The full DevOps playbook — CI/CD, IaC, monitoring, and incident response.
