Traditional software either works or crashes. LLMs always "work" — they always generate output. The question is whether that output is correct, relevant, safe, and cost-effective. You can't monitor an LLM the way you monitor a REST API. There's no HTTP 500 for a hallucinated answer. No stack trace for a response that's technically correct but unhelpful. AI observability requires entirely new metrics, tools, and workflows. This guide covers what to monitor, how to monitor it, and which tools are worth the investment.
Why AI Observability Is Different from Traditional APM
Traditional observability (Datadog, New Relic, Grafana) monitors: uptime, latency, error rates, throughput. These still matter for AI systems, but they're not enough. An LLM can have 100% uptime, 200ms latency, 0% error rate, and still be giving customers terrible answers.
| Metric Category | Traditional Software | AI / LLM Systems |
|---|---|---|
| Correctness | Binary — correct output or error | Spectrum — from perfect to subtly wrong to hallucinated |
| Performance | Latency in ms | Latency + token throughput + time-to-first-token |
| Cost | Infrastructure cost (predictable) | Per-request variable cost (tokens × price). Can spike unpredictably. |
| Quality | Functional tests pass/fail | Relevance, helpfulness, safety, tone — all subjective and context-dependent |
| Degradation | Obvious — errors, crashes, timeouts | Silent — model quality drifts, prompt injection succeeds, cost creeps up |
What to Monitor: The AI Observability Stack
Layer 1: Infrastructure (Same as Traditional)
- API latency (p50, p95, p99)
- Error rates (API errors, timeout rates, rate limit hits)
- Throughput (requests/second)
- Provider uptime and status (OpenAI, Anthropic, Google have had outages)
Layer 2: Cost and Tokens (AI-Specific)
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Cost per request | Identifies expensive prompts, runaway costs | Alert when average exceeds 2x baseline |
| Daily/weekly spend | Budget tracking, anomaly detection | Alert at 80% of daily budget |
| Input tokens per request | Detects prompt bloat (context growing over time) | Alert when > 150% of design target |
| Output tokens per request | Detects verbose responses, runaway generation | Alert when > 2x expected length |
| Cache hit rate | If you cache responses, track effectiveness | Target > 30% for conversational apps |
Layer 3: Quality (The Hard Part)
- Relevance score: Is the response relevant to the user's question? Measured by a separate LLM judge or human evaluation.
- Factual accuracy: Does the response contain verifiable claims? Are they correct? Requires ground-truth comparison or RAG source citation checking.
- Safety violations: Did the response contain harmful content, expose PII, or bypass safety guardrails?
- Tone appropriateness: For customer-facing AI, is the tone matching the expected register? (Professional for enterprise, casual for consumer.)
Hallucination Detection
Hallucination — the model generating confident-sounding but incorrect information — is the #1 production risk for LLM applications. Detection approaches:
1. Source Attribution (For RAG Systems)
If your system uses retrieval-augmented generation, require the model to cite sources for every claim. Then programmatically verify: does the cited source actually support the claim? This catches the most common hallucination in RAG systems: the model generating an answer that's plausible but not actually in the retrieved documents.
// Hallucination check for RAG responses
function checkHallucination(response, retrievedDocs) {
// Extract claims from response
const claims = extractClaims(response);
for (const claim of claims) {
// Check if any retrieved document supports this claim
const supported = retrievedDocs.some(doc =>
semanticSimilarity(claim, doc.content) > 0.85
);
if (!supported) {
flagForReview({
claim,
confidence: 'unsupported',
response_id: response.id
});
}
}
}
2. LLM-as-Judge
Use a second LLM to evaluate the first LLM's output. "Given this context [retrieved docs], is this response [model output] factually accurate and well-supported?" This adds cost (you're running two LLM calls per request) but catches a meaningful percentage of hallucinations. Use it on a sample of responses (10-20%), not every request.
3. Confidence Calibration
Some models can be prompted to express confidence levels. "On a scale of 1-5, how confident are you in this answer?" Responses with low self-assessed confidence can be flagged for human review or given a disclaimer to the user.
Cost Monitoring and Optimization
AI costs are variable and can spike unexpectedly. We track costs for our CMD Center's 17 AI agents daily:
// Our cost tracking approach (simplified)
// Every API call logs: model, input_tokens, output_tokens, cost
Daily cost report:
├── By model: Claude Opus: $2.30 | Haiku: $0.45 | DeepSeek: $0.00
├── By agent: CEO: $1.20 | CTO: $0.80 | Dev: $0.00 | Analytics: $0.15
├── By task type: Planning: $1.50 | Execution: $0.80 | Review: $0.45
└── Total: $2.75/day (target: < $3.00)
Alerts:
- Daily spend > $5.00 → Telegram notification
- Single request > $0.50 → Log for review
- Weekly trend > 20% increase → Investigate
Cost optimization techniques that work:
- Model tiering: Use expensive models (Opus) only for complex tasks. Route simple tasks to cheaper models (Haiku, Gemini Flash). Our CMD Center uses 6 different model tiers based on task complexity.
- Prompt optimization: Shorter, more precise prompts = fewer input tokens = lower cost. A 50% reduction in prompt length can save 30-40% on costs.
- Response caching: Cache identical or semantically similar queries. For FAQ-style applications, cache hit rates of 30-50% are achievable.
- Batch processing: Some providers offer batch pricing at 50% discount for non-real-time workloads.
Building Feedback Loops
The most important observability practice: collecting user feedback and using it to improve the system.
- Thumbs up/down on responses. Simple, low-friction. Gives you a quality signal at scale. Target: < 5% negative feedback rate.
- Follow-up behavior. If a user immediately rephrases the same question, the first response probably wasn't helpful. Track "rephrase rate" as a proxy for quality.
- Escalation to human. If the user asks to speak to a human after AI interaction, the AI failed. Track escalation rate and analyze the failed conversations.
- Weekly review of low-rated responses. Have a human review the worst 20 responses each week. Identify patterns: is it a specific topic? A specific prompt template? A data gap in the knowledge base?
Tools Comparison (2026)
| Tool | Best For | Key Features | Pricing |
|---|---|---|---|
| LangSmith | LangChain-based applications | Trace visualization, evaluation datasets, prompt playground | Free tier / $39+/mo |
| Langfuse | Open-source LLM observability | Self-hostable, traces, cost tracking, evaluation | Free (self-hosted) / $59+/mo (cloud) |
| Arize Phoenix | ML + LLM observability | Embedding visualization, drift detection, A/B evaluation | Free (open-source) / Commercial |
| Helicone | LLM proxy with observability | Drop-in proxy, cost tracking, rate limiting, caching | Free tier / $20+/mo |
| Weights & Biases (W&B) | ML experiment tracking + LLM eval | Experiment comparison, model registry, Weave for LLM traces | Free tier / $50+/mo |
| Custom (OpenTelemetry) | Full control, existing observability stack | Custom spans for LLM calls, integrates with Grafana/Datadog | Infrastructure cost only |
For teams getting started: Langfuse (self-hosted) or Helicone (cloud proxy). Both are lightweight, provide the essential metrics (traces, cost, latency), and don't require significant integration effort. For enterprise: LangSmith if you're using LangChain, Arize if you need drift detection for ML models alongside LLM monitoring.
Frequently Asked Questions
How do we evaluate LLM quality at scale?
Three-tier approach: (1) Automated metrics (BLEU, ROUGE for summarization; exact match for structured outputs) catch obvious failures. (2) LLM-as-judge evaluates a sample of responses for relevance, accuracy, and helpfulness. (3) Human review of flagged responses and periodic random samples. Budget for 5-10% of responses to get human evaluation in the first 3 months.
What's a reasonable cost per LLM request?
Depends on the use case. Customer-facing chatbot: $0.01-0.05 per conversation (using Haiku/Flash for simple queries, Sonnet for complex). Internal tool: $0.05-0.20 per request is acceptable if it saves human time. Code generation: $0.10-0.50 per suggestion. Cost optimization should target 30-50% reduction from initial deployment through model tiering and prompt optimization.
How do we detect prompt injection in production?
Monitor for: (1) Outputs that reference system prompts or internal instructions. (2) Responses dramatically different in format or content from normal patterns. (3) Requests with suspicious patterns (encoded instructions, role-play prompts). Use input classifiers to flag potential injection attempts before they reach the LLM. Log and review all flagged inputs weekly.
Can we use traditional APM tools for LLM monitoring?
Partially. Datadog, New Relic, and Grafana handle infrastructure metrics well. But they lack LLM-specific features: token tracking, prompt/response logging, hallucination detection, and evaluation frameworks. The best approach: use your existing APM for infrastructure + a specialized LLM tool (Langfuse, Helicone) for AI-specific metrics. They complement, not replace, each other.