Ideas Engineered for Tomorrow
We Engineer Services & Solutions for Your Business Needs
Home About
Products
Services
Hire
Industries
Consulting
Partners
Articles Careers Contact
AI & Automation

AI Observability: Monitoring LLMs in Production

Your LLM works great in the demo. In production, it's hallucinating, costs are spiking, and latency is 8 seconds. Here's how to monitor AI systems that don't have traditional error logs.

September 24, 2025 12 min read AI & Machine Learning

Traditional software either works or crashes. LLMs always "work" — they always generate output. The question is whether that output is correct, relevant, safe, and cost-effective. You can't monitor an LLM the way you monitor a REST API. There's no HTTP 500 for a hallucinated answer. No stack trace for a response that's technically correct but unhelpful. AI observability requires entirely new metrics, tools, and workflows. This guide covers what to monitor, how to monitor it, and which tools are worth the investment.

Why AI Observability Is Different from Traditional APM

Traditional observability (Datadog, New Relic, Grafana) monitors: uptime, latency, error rates, throughput. These still matter for AI systems, but they're not enough. An LLM can have 100% uptime, 200ms latency, 0% error rate, and still be giving customers terrible answers.

Metric Category Traditional Software AI / LLM Systems
Correctness Binary — correct output or error Spectrum — from perfect to subtly wrong to hallucinated
Performance Latency in ms Latency + token throughput + time-to-first-token
Cost Infrastructure cost (predictable) Per-request variable cost (tokens × price). Can spike unpredictably.
Quality Functional tests pass/fail Relevance, helpfulness, safety, tone — all subjective and context-dependent
Degradation Obvious — errors, crashes, timeouts Silent — model quality drifts, prompt injection succeeds, cost creeps up

What to Monitor: The AI Observability Stack

Layer 1: Infrastructure (Same as Traditional)

  • API latency (p50, p95, p99)
  • Error rates (API errors, timeout rates, rate limit hits)
  • Throughput (requests/second)
  • Provider uptime and status (OpenAI, Anthropic, Google have had outages)

Layer 2: Cost and Tokens (AI-Specific)

Metric Why It Matters Alert Threshold
Cost per request Identifies expensive prompts, runaway costs Alert when average exceeds 2x baseline
Daily/weekly spend Budget tracking, anomaly detection Alert at 80% of daily budget
Input tokens per request Detects prompt bloat (context growing over time) Alert when > 150% of design target
Output tokens per request Detects verbose responses, runaway generation Alert when > 2x expected length
Cache hit rate If you cache responses, track effectiveness Target > 30% for conversational apps

Layer 3: Quality (The Hard Part)

  • Relevance score: Is the response relevant to the user's question? Measured by a separate LLM judge or human evaluation.
  • Factual accuracy: Does the response contain verifiable claims? Are they correct? Requires ground-truth comparison or RAG source citation checking.
  • Safety violations: Did the response contain harmful content, expose PII, or bypass safety guardrails?
  • Tone appropriateness: For customer-facing AI, is the tone matching the expected register? (Professional for enterprise, casual for consumer.)

Hallucination Detection

Hallucination — the model generating confident-sounding but incorrect information — is the #1 production risk for LLM applications. Detection approaches:

1. Source Attribution (For RAG Systems)

If your system uses retrieval-augmented generation, require the model to cite sources for every claim. Then programmatically verify: does the cited source actually support the claim? This catches the most common hallucination in RAG systems: the model generating an answer that's plausible but not actually in the retrieved documents.

// Hallucination check for RAG responses
function checkHallucination(response, retrievedDocs) {
  // Extract claims from response
  const claims = extractClaims(response);

  for (const claim of claims) {
    // Check if any retrieved document supports this claim
    const supported = retrievedDocs.some(doc =>
      semanticSimilarity(claim, doc.content) > 0.85
    );

    if (!supported) {
      flagForReview({
        claim,
        confidence: 'unsupported',
        response_id: response.id
      });
    }
  }
}

2. LLM-as-Judge

Use a second LLM to evaluate the first LLM's output. "Given this context [retrieved docs], is this response [model output] factually accurate and well-supported?" This adds cost (you're running two LLM calls per request) but catches a meaningful percentage of hallucinations. Use it on a sample of responses (10-20%), not every request.

3. Confidence Calibration

Some models can be prompted to express confidence levels. "On a scale of 1-5, how confident are you in this answer?" Responses with low self-assessed confidence can be flagged for human review or given a disclaimer to the user.

Cost Monitoring and Optimization

AI costs are variable and can spike unexpectedly. We track costs for our CMD Center's 17 AI agents daily:

// Our cost tracking approach (simplified)
// Every API call logs: model, input_tokens, output_tokens, cost

Daily cost report:
  ├── By model:     Claude Opus: $2.30 | Haiku: $0.45 | DeepSeek: $0.00
  ├── By agent:     CEO: $1.20 | CTO: $0.80 | Dev: $0.00 | Analytics: $0.15
  ├── By task type:  Planning: $1.50 | Execution: $0.80 | Review: $0.45
  └── Total:         $2.75/day (target: < $3.00)

Alerts:
  - Daily spend > $5.00 → Telegram notification
  - Single request > $0.50 → Log for review
  - Weekly trend > 20% increase → Investigate

Cost optimization techniques that work:

  • Model tiering: Use expensive models (Opus) only for complex tasks. Route simple tasks to cheaper models (Haiku, Gemini Flash). Our CMD Center uses 6 different model tiers based on task complexity.
  • Prompt optimization: Shorter, more precise prompts = fewer input tokens = lower cost. A 50% reduction in prompt length can save 30-40% on costs.
  • Response caching: Cache identical or semantically similar queries. For FAQ-style applications, cache hit rates of 30-50% are achievable.
  • Batch processing: Some providers offer batch pricing at 50% discount for non-real-time workloads.

Building Feedback Loops

The most important observability practice: collecting user feedback and using it to improve the system.

  1. Thumbs up/down on responses. Simple, low-friction. Gives you a quality signal at scale. Target: < 5% negative feedback rate.
  2. Follow-up behavior. If a user immediately rephrases the same question, the first response probably wasn't helpful. Track "rephrase rate" as a proxy for quality.
  3. Escalation to human. If the user asks to speak to a human after AI interaction, the AI failed. Track escalation rate and analyze the failed conversations.
  4. Weekly review of low-rated responses. Have a human review the worst 20 responses each week. Identify patterns: is it a specific topic? A specific prompt template? A data gap in the knowledge base?

Tools Comparison (2026)

Tool Best For Key Features Pricing
LangSmith LangChain-based applications Trace visualization, evaluation datasets, prompt playground Free tier / $39+/mo
Langfuse Open-source LLM observability Self-hostable, traces, cost tracking, evaluation Free (self-hosted) / $59+/mo (cloud)
Arize Phoenix ML + LLM observability Embedding visualization, drift detection, A/B evaluation Free (open-source) / Commercial
Helicone LLM proxy with observability Drop-in proxy, cost tracking, rate limiting, caching Free tier / $20+/mo
Weights & Biases (W&B) ML experiment tracking + LLM eval Experiment comparison, model registry, Weave for LLM traces Free tier / $50+/mo
Custom (OpenTelemetry) Full control, existing observability stack Custom spans for LLM calls, integrates with Grafana/Datadog Infrastructure cost only

For teams getting started: Langfuse (self-hosted) or Helicone (cloud proxy). Both are lightweight, provide the essential metrics (traces, cost, latency), and don't require significant integration effort. For enterprise: LangSmith if you're using LangChain, Arize if you need drift detection for ML models alongside LLM monitoring.

Frequently Asked Questions

How do we evaluate LLM quality at scale?

Three-tier approach: (1) Automated metrics (BLEU, ROUGE for summarization; exact match for structured outputs) catch obvious failures. (2) LLM-as-judge evaluates a sample of responses for relevance, accuracy, and helpfulness. (3) Human review of flagged responses and periodic random samples. Budget for 5-10% of responses to get human evaluation in the first 3 months.

What's a reasonable cost per LLM request?

Depends on the use case. Customer-facing chatbot: $0.01-0.05 per conversation (using Haiku/Flash for simple queries, Sonnet for complex). Internal tool: $0.05-0.20 per request is acceptable if it saves human time. Code generation: $0.10-0.50 per suggestion. Cost optimization should target 30-50% reduction from initial deployment through model tiering and prompt optimization.

How do we detect prompt injection in production?

Monitor for: (1) Outputs that reference system prompts or internal instructions. (2) Responses dramatically different in format or content from normal patterns. (3) Requests with suspicious patterns (encoded instructions, role-play prompts). Use input classifiers to flag potential injection attempts before they reach the LLM. Log and review all flagged inputs weekly.

Can we use traditional APM tools for LLM monitoring?

Partially. Datadog, New Relic, and Grafana handle infrastructure metrics well. But they lack LLM-specific features: token tracking, prompt/response logging, hallucination detection, and evaluation frameworks. The best approach: use your existing APM for infrastructure + a specialized LLM tool (Langfuse, Helicone) for AI-specific metrics. They complement, not replace, each other.

PI
Pillai Infotech Team

AI Operations & Observability

We monitor 17 AI agents in production daily — tracking cost, quality, and performance across 6 model tiers. Our token_usage table logs every API call. Build your AI observability stack.