How to evaluate LLM quality at scale?

Three tiers: automated metrics for obvious failures, LLM-as-judge on samples, human review of flagged responses. Budget 5-10% human evaluation in first 3 months.

How to detect prompt injection?

Monitor for system prompt references, abnormal output patterns, and suspicious input patterns. Use input classifiers to flag attempts. Log and review weekly.

Can we use traditional APM for LLM monitoring?

Partially — for infrastructure. Add specialized LLM tools (Langfuse, Helicone) for token tracking, prompt logging, hallucination detection. They complement, not replace, APM.

Ai Observability Monitoring Llms | Pillai Infotech LLP

Q: What's a reasonable cost per LLM request?

Chatbot: $0.01-0.05/conversation. Internal tool: $0.05-0.20. Code gen: $0.10-0.50. Target 30-50% reduction through model tiering and prompt optimization.

AI Observability: Monitoring LLMs in Production

Your LLM works great in the demo. In production, it's hallucinating, costs are spiking, and latency is 8 seconds. Here's how to monitor AI systems that don't have traditional error logs.

September 24, 2025 12 min read AI & Machine Learning

Traditional software either works or crashes. LLMs always "work" — they always generate output. The question is whether that output is correct, relevant, safe, and cost-effective. You can't monitor an LLM the way you monitor a REST API. There's no HTTP 500 for a hallucinated answer. No stack trace for a response that's technically correct but unhelpful. AI observability requires entirely new metrics, tools, and workflows. This guide covers what to monitor, how to monitor it, and which tools are worth the investment.

Why AI Observability Is Different
What to Monitor
Hallucination Detection
Cost Monitoring
Feedback Loops
Tools Comparison
FAQ

Why AI Observability Is Different from Traditional APM

Traditional observability (Datadog, New Relic, Grafana) monitors: uptime, latency, error rates, throughput. These still matter for AI systems, but they're not enough. An LLM can have 100% uptime, 200ms latency, 0% error rate, and still be giving customers terrible answers.

Metric Category	Traditional Software	AI / LLM Systems
Correctness	Binary — correct output or error	Spectrum — from perfect to subtly wrong to hallucinated
Performance	Latency in ms	Latency + token throughput + time-to-first-token
Cost	Infrastructure cost (predictable)	Per-request variable cost (tokens × price). Can spike unpredictably.
Quality	Functional tests pass/fail	Relevance, helpfulness, safety, tone — all subjective and context-dependent
Degradation	Obvious — errors, crashes, timeouts	Silent — model quality drifts, prompt injection succeeds, cost creeps up

What to Monitor: The AI Observability Stack

Layer 1: Infrastructure (Same as Traditional)

API latency (p50, p95, p99)
Error rates (API errors, timeout rates, rate limit hits)
Throughput (requests/second)
Provider uptime and status (OpenAI, Anthropic, Google have had outages)

Layer 2: Cost and Tokens (AI-Specific)

Metric	Why It Matters	Alert Threshold
Cost per request	Identifies expensive prompts, runaway costs	Alert when average exceeds 2x baseline
Daily/weekly spend	Budget tracking, anomaly detection	Alert at 80% of daily budget
Input tokens per request	Detects prompt bloat (context growing over time)	Alert when > 150% of design target
Output tokens per request	Detects verbose responses, runaway generation	Alert when > 2x expected length
Cache hit rate	If you cache responses, track effectiveness	Target > 30% for conversational apps

Layer 3: Quality (The Hard Part)

Relevance score: Is the response relevant to the user's question? Measured by a separate LLM judge or human evaluation.
Factual accuracy: Does the response contain verifiable claims? Are they correct? Requires ground-truth comparison or RAG source citation checking.
Safety violations: Did the response contain harmful content, expose PII, or bypass safety guardrails?
Tone appropriateness: For customer-facing AI, is the tone matching the expected register? (Professional for enterprise, casual for consumer.)

Hallucination Detection

Hallucination — the model generating confident-sounding but incorrect information — is the #1 production risk for LLM applications. Detection approaches:

1. Source Attribution (For RAG Systems)

If your system uses retrieval-augmented generation, require the model to cite sources for every claim. Then programmatically verify: does the cited source actually support the claim? This catches the most common hallucination in RAG systems: the model generating an answer that's plausible but not actually in the retrieved documents.

// Hallucination check for RAG responses
function checkHallucination(response, retrievedDocs) {
  // Extract claims from response
  const claims = extractClaims(response);

  for (const claim of claims) {
    // Check if any retrieved document supports this claim
    const supported = retrievedDocs.some(doc =>
      semanticSimilarity(claim, doc.content) > 0.85
    );

    if (!supported) {
      flagForReview({
        claim,
        confidence: 'unsupported',
        response_id: response.id
      });
    }
  }
}

2. LLM-as-Judge

Use a second LLM to evaluate the first LLM's output. "Given this context [retrieved docs], is this response [model output] factually accurate and well-supported?" This adds cost (you're running two LLM calls per request) but catches a meaningful percentage of hallucinations. Use it on a sample of responses (10-20%), not every request.

3. Confidence Calibration

Some models can be prompted to express confidence levels. "On a scale of 1-5, how confident are you in this answer?" Responses with low self-assessed confidence can be flagged for human review or given a disclaimer to the user.

Cost Monitoring and Optimization

AI costs are variable and can spike unexpectedly. We track costs for our CMD Center's 17 AI agents daily:

// Our cost tracking approach (simplified)
// Every API call logs: model, input_tokens, output_tokens, cost

Daily cost report:
  ├── By model:     Claude Opus: $2.30 | Haiku: $0.45 | DeepSeek: $0.00
  ├── By agent:     CEO: $1.20 | CTO: $0.80 | Dev: $0.00 | Analytics: $0.15
  ├── By task type:  Planning: $1.50 | Execution: $0.80 | Review: $0.45
  └── Total:         $2.75/day (target: < $3.00)

Alerts:
  - Daily spend > $5.00 → Telegram notification
  - Single request > $0.50 → Log for review
  - Weekly trend > 20% increase → Investigate

Cost optimization techniques that work:

Model tiering: Use expensive models (Opus) only for complex tasks. Route simple tasks to cheaper models (Haiku, Gemini Flash). Our CMD Center uses 6 different model tiers based on task complexity.
Prompt optimization: Shorter, more precise prompts = fewer input tokens = lower cost. A 50% reduction in prompt length can save 30-40% on costs.
Response caching: Cache identical or semantically similar queries. For FAQ-style applications, cache hit rates of 30-50% are achievable.
Batch processing: Some providers offer batch pricing at 50% discount for non-real-time workloads.

Building Feedback Loops

The most important observability practice: collecting user feedback and using it to improve the system.

Thumbs up/down on responses. Simple, low-friction. Gives you a quality signal at scale. Target: < 5% negative feedback rate.
Follow-up behavior. If a user immediately rephrases the same question, the first response probably wasn't helpful. Track "rephrase rate" as a proxy for quality.
Escalation to human. If the user asks to speak to a human after AI interaction, the AI failed. Track escalation rate and analyze the failed conversations.
Weekly review of low-rated responses. Have a human review the worst 20 responses each week. Identify patterns: is it a specific topic? A specific prompt template? A data gap in the knowledge base?

Tools Comparison (2026)

Tool	Best For	Key Features	Pricing
LangSmith	LangChain-based applications	Trace visualization, evaluation datasets, prompt playground	Free tier / $39+/mo
Langfuse	Open-source LLM observability	Self-hostable, traces, cost tracking, evaluation	Free (self-hosted) / $59+/mo (cloud)
Arize Phoenix	ML + LLM observability	Embedding visualization, drift detection, A/B evaluation	Free (open-source) / Commercial
Helicone	LLM proxy with observability	Drop-in proxy, cost tracking, rate limiting, caching	Free tier / $20+/mo
Weights & Biases (W&B)	ML experiment tracking + LLM eval	Experiment comparison, model registry, Weave for LLM traces	Free tier / $50+/mo
Custom (OpenTelemetry)	Full control, existing observability stack	Custom spans for LLM calls, integrates with Grafana/Datadog	Infrastructure cost only

For teams getting started: Langfuse (self-hosted) or Helicone (cloud proxy). Both are lightweight, provide the essential metrics (traces, cost, latency), and don't require significant integration effort. For enterprise: LangSmith if you're using LangChain, Arize if you need drift detection for ML models alongside LLM monitoring.

Frequently Asked Questions

How do we evaluate LLM quality at scale?

Three-tier approach: (1) Automated metrics (BLEU, ROUGE for summarization; exact match for structured outputs) catch obvious failures. (2) LLM-as-judge evaluates a sample of responses for relevance, accuracy, and helpfulness. (3) Human review of flagged responses and periodic random samples. Budget for 5-10% of responses to get human evaluation in the first 3 months.

What's a reasonable cost per LLM request?

Depends on the use case. Customer-facing chatbot: $0.01-0.05 per conversation (using Haiku/Flash for simple queries, Sonnet for complex). Internal tool: $0.05-0.20 per request is acceptable if it saves human time. Code generation: $0.10-0.50 per suggestion. Cost optimization should target 30-50% reduction from initial deployment through model tiering and prompt optimization.

How do we detect prompt injection in production?

Monitor for: (1) Outputs that reference system prompts or internal instructions. (2) Responses dramatically different in format or content from normal patterns. (3) Requests with suspicious patterns (encoded instructions, role-play prompts). Use input classifiers to flag potential injection attempts before they reach the LLM. Log and review all flagged inputs weekly.

Can we use traditional APM tools for LLM monitoring?

Partially. Datadog, New Relic, and Grafana handle infrastructure metrics well. But they lack LLM-specific features: token tracking, prompt/response logging, hallucination detection, and evaluation frameworks. The best approach: use your existing APM for infrastructure + a specialized LLM tool (Langfuse, Helicone) for AI-specific metrics. They complement, not replace, each other.

Pillai Infotech Team

AI Operations & Observability

We monitor 17 AI agents in production daily — tracking cost, quality, and performance across 6 model tiers. Our token_usage table logs every API call. Build your AI observability stack.

AI Observability: Monitoring LLMs in Production

Table of Contents

Why AI Observability Is Different from Traditional APM

What to Monitor: The AI Observability Stack

Layer 1: Infrastructure (Same as Traditional)

Layer 2: Cost and Tokens (AI-Specific)

Layer 3: Quality (The Hard Part)

Hallucination Detection

1. Source Attribution (For RAG Systems)

2. LLM-as-Judge

3. Confidence Calibration

Cost Monitoring and Optimization

Building Feedback Loops

Tools Comparison (2026)

Frequently Asked Questions

How do we evaluate LLM quality at scale?

What's a reasonable cost per LLM request?

How do we detect prompt injection in production?

Can we use traditional APM tools for LLM monitoring?

Pillai Infotech Team

Related Articles

AI Cost Optimization

MLOps Guide

RAG Guide

Book a Free Consultation

Your Details

Pick a 30-min Slot

Thank You!