Observability for AI Apps: Traces, Costs, Hallucinations, and Feedback Loops
Traditional monitoring falls short for AI applications. Learn to build observability systems that track LLM traces, API costs, hallucination rates, and user feedback.
Traditional APM tools fall short for AI applications. LLMs require specialized observability: trace every LLM call with full prompts and completions, monitor costs in real time per user and per feature (not just total spend), detect hallucinations through confidence scores and citation validation, collect user feedback (thumbs up/down plus implicit signals), and build dashboards that show the quality metrics stakeholders actually care about. Without this, you're flying blind.
Why Traditional Monitoring Isn't Enough
You can't monitor an AI application the same way you monitor a web app.

Traditional APM tools (Datadog, New Relic, etc.) track:
• Request latency
• Error rates
• Resource utilization
• API response codes

This is necessary but not sufficient for LLM applications.

What Traditional Monitoring Misses:

Prompt Quality: Your app returns 200 OK, but the LLM output is garbage. Traditional monitoring says everything is fine.

Model Drift: OpenAI updates GPT-4. Suddenly your responses change. No errors, no alerts, but quality degraded.

Cost Spikes: A bug causes infinite loops of LLM calls. You don't notice until the $5,000 OpenAI bill arrives.

Hallucinations: The LLM confidently states incorrect information. Status code: 200. Error rate: 0%. User satisfaction: tanking.

Context Issues: RAG retrieves wrong documents. LLM answers based on bad context. Looks successful to traditional monitoring.

The Gap:
Traditional tools answer: "Is the system up?"
LLM tools must answer: "Is the system producing correct, high-quality outputs at reasonable cost?"

What AI Observability Looks Like:
✅ Every LLM call traced - Inputs, outputs, tokens, cost
✅ Real-time cost tracking - Per user, per feature, per endpoint
✅ Quality metrics - Hallucination detection, confidence scores
✅ User feedback - Thumbs up/down, retry rates, abandonment
✅ Prompt versioning - Track changes and their impact
✅ Stakeholder dashboards - Non-technical metrics leadership cares about

This post walks through building all of this.
LLM Tracing: See Every Call
You need visibility into every LLM interaction: distributed tracing for AI chains.

What to Capture:

For every LLM call:
• Request timestamp
• Full prompt (system + user messages)
• Model used (gpt-4, gpt-3.5-turbo, etc.)
• Temperature & parameters
• Full completion (LLM response)
• Tokens used (input + output)
• Latency (time to first token, total time)
• Cost (calculated from tokens + model pricing)
• Any errors

For RAG chains, also capture:
• Query embedding (the vector)
• Retrieved documents (what context was used)
• Retrieval scores (relevance metrics)
• Reranking steps (if applicable)

Implementation:
import time
import uuid
from datetime import datetime
from typing import Any, Optional

from openai import AsyncOpenAI


class LLMTracer:
    def __init__(self, trace_storage):
        self.storage = trace_storage

    async def trace_llm_call(
        self,
        trace_id: str,
        span_name: str,
        model: str,
        messages: list,
        response: Any,
        metadata: Optional[dict] = None,
    ):
        # Use None rather than a mutable {} default, which would be shared across calls
        metadata = metadata or {}
        span = {
            "span_id": str(uuid.uuid4()),
            "trace_id": trace_id,
            "name": span_name,
            # Recorded when the span is logged, i.e. after the LLM call has returned
            "start_time": datetime.utcnow().isoformat(),
            "model": model,
            "messages": messages,  # Full prompt
            "response": response.choices[0].message.content,
            "usage": {
                "prompt_tokens": response.usage.prompt_tokens,
                "completion_tokens": response.usage.completion_tokens,
                "total_tokens": response.usage.total_tokens,
            },
            "cost": self.calculate_cost(model, response.usage),
            "latency_ms": metadata.get("latency_ms"),
            "metadata": metadata,
        }

        await self.storage.store(span)
        return span

    def calculate_cost(self, model: str, usage):
        # Pricing as of Dec 2024, in USD per token
        rates = {
            "gpt-4": {"input": 0.03 / 1000, "output": 0.06 / 1000},
            "gpt-3.5-turbo": {"input": 0.001 / 1000, "output": 0.002 / 1000},
        }

        # Unknown models fall back to gpt-4 rates, so cost is overestimated rather than missed
        rate = rates.get(model, rates["gpt-4"])
        cost = (usage.prompt_tokens * rate["input"] +
                usage.completion_tokens * rate["output"])
        return round(cost, 6)


# Usage: trace a single chat completion.
# trace_storage is any object with an async store(span) method.
client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
tracer = LLMTracer(trace_storage)


async def handle_chat(user_query: str, user_id: str, trace_id: Optional[str] = None):
    if not trace_id:
        trace_id = str(uuid.uuid4())

    messages = [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": user_query},
    ]

    # Time the LLM call
    start = time.time()
    response = await client.chat.completions.create(
        model="gpt-4",
        messages=messages,
    )
    latency = (time.time() - start) * 1000

    # Trace the LLM call with the same messages that were sent
    await tracer.trace_llm_call(
        trace_id=trace_id,
        span_name="chat_completion",
        model="gpt-4",
        messages=messages,
        response=response,
        metadata={"latency_ms": latency, "user_id": user_id},
    )

    return response.choices[0].message.content


# Tracing a RAG chain: every step shares one trace_id.
# Each trace_* helper wraps its step and logs a span.
async def rag_query(query: str):
    trace_id = str(uuid.uuid4())

    # Step 1: Embed query
    embedding = await trace_embedding(query, trace_id)

    # Step 2: Retrieve docs
    docs = await trace_retrieval(embedding, trace_id)

    # Step 3: Rerank
    ranked = await trace_reranking(docs, query, trace_id)

    # Step 4: Generate
    response = await trace_generation(query, ranked, trace_id)

    return response
Each step logs a span. You can visualize the entire chain.

Visualization:
Use tools like LangSmith, Weights & Biases, or build a custom UI:
Trace: abc123
├─ embedding (50ms, $0.0001)
├─ vector_search (120ms, $0.002)
├─ reranking (80ms, $0.01)
└─ llm_generation (2000ms, $0.06)
Total: 2250ms, $0.0721
Key Insight:
Seeing the full trace lets you debug issues that would be impossible to diagnose from traditional logs.
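If you build a custom view, the trace tree above can be rendered from stored spans with a few lines of code. The following is a minimal sketch, not part of any tracing library: the `Span` dataclass and `summarize_trace` function are hypothetical names, and it assumes the steps run sequentially, so total latency is the sum of span latencies.

```python
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical minimal span record; a real span carries more fields
    name: str
    latency_ms: float
    cost_usd: float


def summarize_trace(trace_id: str, spans: list[Span]) -> str:
    """Render a trace as a text tree with total latency and cost."""
    lines = [f"Trace: {trace_id}"]
    for i, span in enumerate(spans):
        # The last span gets the closing branch character
        branch = "└─" if i == len(spans) - 1 else "├─"
        lines.append(
            f"{branch} {span.name} ({span.latency_ms:.0f}ms, ${span.cost_usd:g})"
        )
    # Assumes sequential steps: total latency is the sum of the span latencies
    total_ms = sum(s.latency_ms for s in spans)
    # Round to avoid floating-point noise in the summed cost
    total_cost = round(sum(s.cost_usd for s in spans), 6)
    lines.append(f"Total: {total_ms:.0f}ms, ${total_cost:g}")
    return "\n".join(lines)
```

Feeding it the four spans from the example reproduces the same totals, which makes it a handy sanity check that per-span costs add up to what you are billed.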
Cost Monitoring That Actually Works
LLM costs can spiral out of control fast. You need real-time monitoring with alerts.

The Problem:
Most teams monitor costs monthly via the OpenAI billing dashboard. By the time you notice the spike, you've already spent $10K.

What You Actually Need:

Real-Time Cost Tracking:
• Per user (who's expensive?)
• Per feature (which features cost most?)
• Per endpoint (is RAG costlier than chat?)
• Per day/hour (when do costs spike?)

Implementation:
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta
from typing import Optional

import redis.asyncio as redis  # async client, since every call below is awaited


@dataclass
class CostMetrics:
    user_id: Optional[str] = None
    feature: Optional[str] = None
    endpoint: Optional[str] = None
    cost: float = 0.0
    timestamp: Optional[datetime] = None


class CostMonitor:
    # alert_user_budget_exceeded and alert_hourly_budget_exceeded are
    # notification hooks you supply (Slack, PagerDuty, etc.)

    def __init__(self, redis_client):
        self.redis = redis_client
        self.budget_per_user_daily = 5.0   # USD
        self.budget_total_hourly = 100.0   # USD

    async def track_cost(self, metrics: CostMetrics):
        # Store in a sorted set scored by timestamp (a simple time series).
        # default=str lets json serialize the datetime field.
        key = f"cost:{metrics.feature}:{metrics.endpoint}"
        await self.redis.zadd(
            key,
            {json.dumps(asdict(metrics), default=str): metrics.timestamp.timestamp()},
        )

        # Update user spend; INCRBYFLOAT returns the new running total
        if metrics.user_id:
            user_key = f"user_cost:{metrics.user_id}:{datetime.utcnow().date()}"
            user_spend = await self.redis.incrbyfloat(user_key, metrics.cost)
            await self.redis.expire(user_key, 86400 * 7)  # 7 day retention

            # Check user budget
            if user_spend > self.budget_per_user_daily:
                await self.alert_user_budget_exceeded(metrics.user_id, user_spend)

        # Check total hourly budget
        hour_key = f"total_cost:{datetime.utcnow().strftime('%Y-%m-%d-%H')}"
        hour_spend = await self.redis.incrbyfloat(hour_key, metrics.cost)
        await self.redis.expire(hour_key, 86400)

        if hour_spend > self.budget_total_hourly:
            await self.alert_hourly_budget_exceeded(hour_spend)

    async def get_cost_breakdown(self, timeframe: timedelta):
        cutoff = datetime.utcnow() - timeframe

        # Aggregate by feature
        features = {}
        # SCAN iterates incrementally; KEYS would block Redis on a large keyspace
        async for key in self.redis.scan_iter(match="cost:*"):
            entries = await self.redis.zrangebyscore(
                key,
                cutoff.timestamp(),
                datetime.utcnow().timestamp(),
            )

            for entry in entries:
                data = json.loads(entry)
                feature = data["feature"]
                features[feature] = features.get(feature, 0) + data["cost"]

        return features


# Usage: record the cost of every request
cost_monitor = CostMonitor(redis_client)


async def handle_request(user_id, feature, endpoint):
    response = await llm_call(...)

    # Reuse the tracer's pricing table from the tracing section
    cost = tracer.calculate_cost("gpt-4", response.usage)

    await cost_monitor.track_cost(CostMetrics(
        user_id=user_id,
        feature=feature,
        endpoint=endpoint,
        cost=cost,
        timestamp=datetime.utcnow(),
    ))

# Alert thresholds and the action each one triggers
alerts = {
    "user_daily_budget": {
        "threshold": 5.0,
        "action": "rate_limit_user"
    },
    "total_hourly_budget": {
        "threshold": 100.0,
        "action": "page_oncall"
    },
    "cost_spike": {
        "threshold": "3x_baseline",
        "action": "investigate"
    }
}

Cost Dashboard:

Display:
• Cost per user (top 10 spenders)
• Cost per feature (which features are expensive?)
• Cost trend (are we growing or shrinking?)
• Cost by model (gpt-4 vs gpt-3.5 usage)
• Projected monthly cost (extrapolate from daily)

Real Example:
One client had a user who accidentally triggered an infinite loop. Cost monitor caught it within 10 minutes:
• Alert: User XYZ spent $47 in last hour (normal: $0.50/day)
• Action: Rate limited user, investigated logs
• Saved: ~$1,100 in potential overnight spend

Cost Optimization Strategies:
1. Cache aggressively - Identical queries = cached responses
2. Use cheaper models - GPT-3.5 for simple queries
3. Prompt compression - Trim unnecessary tokens
4. Batch requests - Multiple queries in one call
5. User limits - Hard caps per user per day

Monitoring enables optimization. You can't optimize what you don't measure.
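The first optimization strategy, caching identical queries, can be sketched in a few lines. This is an illustrative in-memory version, not a production design: `ResponseCache` and `get_or_call` are hypothetical names, the LLM is passed in as a plain callable, and it assumes responses are safe to reuse (for example, deterministic prompts at temperature 0). A real deployment would likely use a shared store such as Redis with an expiry.

```python
import hashlib
import json
from typing import Callable


class ResponseCache:
    """In-memory cache keyed by a hash of (model, messages)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, messages: list[dict]) -> str:
        # sort_keys makes the hash independent of dict key order
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(
        self,
        model: str,
        messages: list[dict],
        call_llm: Callable[[str, list[dict]], str],
    ) -> str:
        """Return the cached response, calling the LLM only on a miss."""
        key = self._key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = call_llm(model, messages)
        self._store[key] = result
        return result
```

Tracking `hits` and `misses` gives you the cache hit rate, which is also one of the metrics worth putting on the technical dashboard.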
Detecting Hallucinations
Hallucinations occur when an LLM confidently states false information. They're hard to catch but critical to detect.

Types of Hallucinations:

1. Factual Hallucinations:
LLM makes up facts that sound plausible.
Example: "Python was invented in 1995" (actually 1991)

2. Contradictory Hallucinations:
LLM contradicts itself within the same response.
Example: "The capital is London... Paris is the capital"

3. Contextual Hallucinations:
LLM makes up information not in the provided context.
Example: RAG provides document A, LLM cites non-existent document B

Detection Strategies:

Strategy 1: Confidence Scoring
Ask the LLM to rate its own confidence:
import openai


def get_response_with_confidence(query):
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                # Ask for the exact marker format that the parser below looks for
                "content": (
                    "You are a helpful assistant. Always end your answer with a line "
                    "of the form 'Confidence: low', 'Confidence: medium', or "
                    "'Confidence: high'."
                ),
            },
            {"role": "user", "content": query},
        ],
    )

    text = response.choices[0].message.content

    # Parse confidence; default to "medium" when no marker is found
    lowered = text.lower()
    if "confidence: low" in lowered:
        confidence = "low"
    elif "confidence: high" in lowered:
        confidence = "high"
    else:
        confidence = "medium"

    return text, confidence

Low confidence = potential hallucination. Flag for review.

Strategy 2: Citation Validation
For RAG systems, verify citations actually exist:
def validate_citations(response, retrieved_docs):
    # Extract citations from response
    citations = extract_citations(response)  # e.g., [doc_id]

    doc_ids = {doc["id"] for doc in retrieved_docs}

    # Any cited ID that was never retrieved is a fabricated source
    invalid_citations = [c for c in citations if c not in doc_ids]

    if invalid_citations:
        return {
            "hallucination_detected": True,
            "invalid_citations": invalid_citations,
            "confidence": "low"
        }

    return {"hallucination_detected": False, "confidence": "high"}

Strategy 3: Consistency Checks
Ask the same question multiple times:
async def consistency_check(query, n=3):
    # At least two responses are needed to form a pair to compare
    if n < 2:
        raise ValueError("n must be at least 2")

    responses = []
    for _ in range(n):
        response = await llm_call(query)
        responses.append(response)

    # Calculate similarity between every pair of responses
    similarities = []
    for i in range(len(responses)):
        for j in range(i + 1, len(responses)):
            sim = calculate_similarity(responses[i], responses[j])
            similarities.append(sim)

    avg_similarity = sum(similarities) / len(similarities)

    if avg_similarity < 0.7:
        # Responses are inconsistent
        return {
            "hallucination_risk": "high",
            "consistency_score": avg_similarity
        }

    return {"hallucination_risk": "low", "consistency_score": avg_similarity}

Strategy 4: External Validation
For factual claims, verify against trusted sources:
async def validate_factual_claim(claim):
    # Search Wikipedia, fact-check APIs, etc.
    search_results = await search_external_sources(claim)

    if not search_results:
        return {"validated": False, "confidence": "low"}

    # Use LLM to compare claim vs search results
    validation = await llm_call(f"""
    Claim: {claim}
    Evidence: {search_results}

    Does the evidence support the claim? Start your answer with yes or no, then explain.
    """)

    # Check only the leading verdict; a substring test would also match "yes"
    # inside the explanation
    supported = validation.strip().lower().startswith("yes")
    return {"validated": supported, "explanation": validation}

Strategy 5: User Feedback
Users are your best hallucination detectors:
from datetime import datetime


def collect_hallucination_feedback(response_id, user_clicked_no, issue=None):
    # UI shows:
    #   "Was this response accurate?"
    #   [ Yes ] [ No - Incorrect Information ]
    # After a "No", the UI asks "What was incorrect?" and passes the answer as `issue`.

    if user_clicked_no:
        log_potential_hallucination(response_id)

        store_feedback(response_id, {
            "hallucination": True,
            "issue": issue,
            "timestamp": datetime.utcnow()
        })

Aggregated Detection:
Combine multiple signals:
def hallucination_score(response, context):
    # Every signal is scaled so that 1.0 means trustworthy and 0.0 means suspect
    scores = {
        "confidence": get_confidence_score(response),  # 0-1
        "citations": validate_citations_score(response, context),  # 0-1
        # Inverted: a higher report rate on similar responses must lower the score
        "user_reports": 1 - get_user_report_rate(response),  # 0-1
    }

    # Weighted average
    weights = {"confidence": 0.3, "citations": 0.4, "user_reports": 0.3}
    final_score = sum(scores[k] * weights[k] for k in weights)

    # A low combined score means a high hallucination risk
    if final_score < 0.5:
        return {"risk": "high", "score": final_score}
    elif final_score < 0.7:
        return {"risk": "medium", "score": final_score}
    else:
        return {"risk": "low", "score": final_score}

What to Do When Detected:
1. Flag for human review
2. Add disclaimer ("This answer may not be accurate")
3. Offer alternative ("Would you like me to search for verified information?")
4. Log for training (use as negative example in evals)

Hallucination detection is imperfect, but catching 80% is better than catching 0%.
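The citation and consistency strategies above call two helpers, `extract_citations` and `calculate_similarity`, without defining them. One possible sketch of each follows, using only the standard library. It assumes citations appear as bracketed IDs such as `[doc_1]`, and it measures similarity lexically with `difflib`, which will under-rate paraphrases; embedding-based similarity is likely a better fit in production.

```python
import re
from difflib import SequenceMatcher

# Assumed citation format: an ID of letters, digits, "_" or "-" in square brackets
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_\-]+)\]")


def extract_citations(response: str) -> list[str]:
    """Return cited document IDs in order of first appearance, without duplicates."""
    seen: dict[str, None] = {}
    for doc_id in CITATION_PATTERN.findall(response):
        seen.setdefault(doc_id, None)
    return list(seen)


def calculate_similarity(a: str, b: str) -> float:
    """Lexical similarity in [0, 1]; 1.0 means identical after normalization."""
    # Normalize case and whitespace so trivial differences do not lower the score
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()
```

With a lexical measure, the 0.7 consistency threshold will probably need tuning against real responses, since two correct answers can be worded very differently.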
User Feedback Collection
User feedback is your most valuable signal. Collect both explicit and implicit feedback.

Explicit Feedback:

Thumbs Up/Down:

from datetime import datetime


def track_feedback(response_id, feedback_type, user_id):
    feedback = {
        "response_id": response_id,
        "user_id": user_id,
        "type": feedback_type,  # "thumbs_up" | "thumbs_down"
        "timestamp": datetime.utcnow()
    }

    store_feedback(feedback)

    # Update response quality score
    update_quality_score(response_id, feedback_type)

Detailed Feedback:
When a user clicks thumbs down, ask why:

feedback_options = [
    "Incorrect information",
    "Not helpful",
    "Too verbose",
    "Misunderstood question",
    "Other (please specify)"
]

Rating Scale:
For critical applications, use 1-5 stars:

def track_rating(response_id, rating, comment=None):
    store_rating({
        "response_id": response_id,
        "rating": rating,  # 1-5
        "comment": comment,
        "timestamp": datetime.utcnow()
    })

Implicit Feedback:
Often more valuable because users don't have to take action.

Retry Rate:
User regenerates response = implicit dissatisfaction
import time


def track_regeneration(response_id, user_id):
    # User clicked "regenerate"
    log_event({
        "event": "response_regenerated",
        "response_id": response_id,
        "user_id": user_id,
        "implicit_feedback": "negative"
    })

Edit Rate:
User edits the response = it was almost right but not quite

def track_edit(response_id, original, edited):
    # Analyze what changed
    diff = calculate_diff(original, edited)

    log_event({
        "event": "response_edited",
        "response_id": response_id,
        "diff_size": len(diff),
        "implicit_feedback": "neutral"
    })

Abandonment:
User closes chat quickly = bad response

def track_session(session_id, user_id):
    start_time = time.time()

    # ... interaction happens ...

    end_time = time.time()
    duration = end_time - start_time

    if duration < 30:  # User left quickly
        log_event({
            "event": "session_abandoned",
            "session_id": session_id,
            "duration_seconds": duration,
            "implicit_feedback": "negative"
        })

Copy to Clipboard:
User copies response = likely satisfied

def track_copy(response_id):
    log_event({
        "event": "response_copied",
        "response_id": response_id,
        "implicit_feedback": "positive"
    })

Aggregated Satisfaction Score:
def calculate_satisfaction_score(response_id):
    events = get_events_for_response(response_id)

    score = 0

    # Explicit feedback
    if events.get("thumbs_up"):
        score += 1.0
    if events.get("thumbs_down"):
        score -= 1.0

    # Implicit feedback
    if events.get("copied"):
        score += 0.5
    if events.get("regenerated"):
        score -= 0.5
    if events.get("edited"):
        score -= 0.2
    if events.get("abandoned"):
        score -= 0.7

    # Normalize to 0-1: map [-2, 2] onto [0, 1], so no signal at all scores 0.5.
    # The raw score can reach -2.4, so clamp the result.
    normalized = (score + 2) / 4
    return max(0, min(1, normalized))

Feedback Loop:
Use feedback to improve:
1. Add to eval set - Bad responses become test cases
2. Retrain - Use feedback as training signal
3. Prompt tuning - Adjust prompts based on patterns
4. Feature flags - Roll back changes that degrade satisfaction

Dashboard Metrics:
Track:
• Thumbs up rate (target: >70%)
• Thumbs down rate (target: <15%)
• Regeneration rate (target: <10%)
• Average satisfaction score (target: >0.7)

Alert when any metric worsens by more than 10% week-over-week.
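The week-over-week rule needs a direction per metric: a falling thumbs-up rate is bad, but so is a rising regeneration rate. The sketch below is one way to encode that, assuming "10%" means relative change and that each week's metrics arrive as a plain dict. The metric keys and the `degraded_metrics` function are illustrative names, not an existing API.

```python
# True = higher is better, False = lower is better
HIGHER_IS_BETTER = {
    "thumbs_up_rate": True,
    "thumbs_down_rate": False,
    "regeneration_rate": False,
    "satisfaction_score": True,
}


def degraded_metrics(
    this_week: dict[str, float],
    last_week: dict[str, float],
    threshold: float = 0.10,
) -> list[str]:
    """Return the metrics that worsened by more than `threshold` (relative)."""
    degraded = []
    for name, higher_is_better in HIGHER_IS_BETTER.items():
        previous = last_week.get(name)
        current = this_week.get(name)
        # Skip metrics with missing data or no usable baseline
        if previous is None or current is None or previous == 0:
            continue
        change = (current - previous) / previous
        # Flip the sign so that a positive value always means "got worse"
        worsening = -change if higher_is_better else change
        if worsening > threshold:
            degraded.append(name)
    return degraded
```

For example, a thumbs-up rate falling from 80% to 70% is a 12.5% relative drop and would be flagged, while a satisfaction score moving from 0.75 to 0.74 would not.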
Quality Dashboards for Stakeholders
Technical metrics are important, but stakeholders care about business metrics.

What Leadership Cares About:
❌ "Prompt tokens decreased by 12%"
✅ "Cost per user decreased by 20%"
❌ "P95 latency improved to 1.2s"
✅ "User satisfaction increased from 72% to 85%"

Build Two Dashboards:

1. Technical Dashboard (for engineers):
• LLM call latency (p50, p95, p99)
• Token usage
• Error rates
• Cache hit rates
• Prompt version performance

2. Business Dashboard (for stakeholders):
• User satisfaction score
• Cost per user
• Cost trend
• Feature usage
• Quality score
• Support ticket reduction

Business Dashboard Implementation:
from datetime import timedelta


class BusinessMetricsDashboard:
    # get_trend and get_target_status are placeholder helper names;
    # the data-access methods are left to your metrics store.

    def get_metrics(self, timeframe: timedelta):
        # "previous" covers the window of equal length just before the current one
        sat_now = self.get_satisfaction_rate(timeframe)
        sat_prev = self.get_satisfaction_rate(timeframe * 2, timeframe)
        cost_now = self.get_avg_cost_per_user(timeframe)
        cost_prev = self.get_avg_cost_per_user(timeframe * 2, timeframe)
        quality_now = self.get_avg_quality_score(timeframe)
        quality_target = 0.85

        return {
            "user_satisfaction": {
                "current": sat_now,
                "previous": sat_prev,
                "change": self.calculate_change(sat_now, sat_prev),
                "trend": self.get_trend(sat_now, sat_prev),  # "up" | "down" | "flat"
            },
            "cost_per_user": {
                "current": cost_now,
                "previous": cost_prev,
                "change": self.calculate_change(cost_now, cost_prev),
                "trend": self.get_trend(cost_now, cost_prev),  # "up" | "down" | "flat"
            },
            "quality_score": {
                "current": quality_now,
                "target": quality_target,
                # "meeting" | "below" | "exceeding"
                "status": self.get_target_status(quality_now, quality_target),
            },
            "monthly_cost_projection": self.project_monthly_cost(),
            "top_features": self.get_feature_usage(timeframe),
            "support_ticket_impact": {
                "tickets_deflected": self.estimate_tickets_deflected(timeframe),
                "estimated_savings": self.calculate_support_savings(timeframe),
            },
        }

Example dashboard:

╔════════════════════════════════════════╗
║ AI Quality & Business Impact           ║
╠════════════════════════════════════════╣
║ User Satisfaction: 85% ↑ 13%           ║
║ Cost per User: $0.42 ↓ 20%             ║
║ Quality Score: 87% ✓ Target: 85%       ║
║ Monthly Cost Projection: $12,400       ║
╠════════════════════════════════════════╣
║ Support Impact                         ║
║ • 1,240 tickets deflected this month   ║
║ • Est. savings: $31,000                ║
╚════════════════════════════════════════╝
Stakeholder-Friendly Language:
❌ Technical: "LLM hallucination rate: 2.3%"
✅ Business: "Answer accuracy: 97.7%"
❌ Technical: "Token utilization increased 40%"
✅ Business: "Costs increased 40% due to higher usage"
❌ Technical: "P95 latency: 1.2s"
✅ Business: "95% of responses in under 1.2 seconds"

Monthly Report Template:
# AI System Performance - December 2024

## Executive Summary
Our AI assistant served 45,000 users this month with 85% satisfaction.
Cost per user decreased 20% through optimizations.

## Key Metrics
- User Satisfaction: 85% (↑13% vs last month)
- Cost Efficiency: $0.42 per user (↓20% vs last month)
- Quality Score: 87% (target: 85%)
- Support Tickets Deflected: 1,240
- Estimated Savings: $31,000

## What Improved
- Reduced hallucinations through citation validation
- Optimized prompts to use 30% fewer tokens
- Improved response relevance with better RAG

## Areas to Watch
- Complex technical queries still require human escalation (8% of queries)
- Cost per user increasing for power users (top 5%)

## Next Month Focus
- Improve complex query handling
- Implement tiered user limits for cost control
This language resonates with non-technical stakeholders.
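The change and trend figures on the business dashboard come down to simple arithmetic, but it is worth being explicit about which kind of change you report. In the example numbers, satisfaction moving from 72% to 85% is shown as 13, which is a difference in percentage points, while a cost-per-user drop of 20% reads as a relative change. The sketch below supports both; the function names, the `relative` flag, and the 1% "flat" band are assumptions for illustration.

```python
def calculate_change(current: float, previous: float, relative: bool = True) -> float:
    """Change from previous to current.

    relative=True  -> fractional change, e.g. -0.2 for a 20% decrease
    relative=False -> plain difference, i.e. percentage points when inputs are rates
    """
    if not relative:
        return current - previous
    if previous == 0:
        return 0.0  # no baseline to compare against
    return (current - previous) / previous


def classify_trend(current: float, previous: float, flat_band: float = 0.01) -> str:
    """Label the relative change as "up", "down", or "flat" (within flat_band)."""
    change = calculate_change(current, previous)
    if change > flat_band:
        return "up"
    if change < -flat_band:
        return "down"
    return "flat"
```

Whichever convention you choose, label it on the dashboard: "up 13 points" and "up 18%" describe the same move from 72% to 85%.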
Alerting Strategy
Not all issues are equal. Alert on what matters, ignore the noise.

Critical Alerts (Page Oncall):
1. Error rate > 5%
   - System is broken
   - Users can't get responses
2. Cost spike > 3x baseline
   - Potential infinite loop
   - User abuse
   - Could burn budget in hours
3. Quality score drops > 20%
   - Major degradation
   - Model changed?
   - Prompt broken?

Warning Alerts (Slack/Email):
1. Quality score drops 10-20% - Investigate within 24h
2. Cost increase 50-100% - Normal growth or issue?
3. Latency p95 > 5s - User experience degrading
4. User satisfaction < 70% - Below acceptable threshold

Info Alerts (Dashboard Only):
1. Cache hit rate changes
2. Feature usage shifts
3. Model distribution changes

Implementation:
from dataclasses import dataclass
from enum import Enum


class AlertSeverity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


@dataclass
class Alert:
    severity: AlertSeverity
    title: str
    message: str
    action: str


class AlertManager:
    # load_baselines, page_oncall, post_to_slack and log_to_dashboard are
    # integration points you implement for your own stack.

    def __init__(self):
        self.baselines = self.load_baselines()

    async def check_metrics(self, current_metrics):
        alerts = []

        # Error rate
        if current_metrics["error_rate"] > 0.05:
            error_rate = current_metrics["error_rate"]
            alerts.append(Alert(
                severity=AlertSeverity.CRITICAL,
                title="High Error Rate",
                message=f"Error rate at {error_rate:.1%}",
                action="page_oncall"
            ))

        # Cost spike
        baseline_cost = self.baselines["hourly_cost"]
        current_cost = current_metrics["hourly_cost"]
        if current_cost > baseline_cost * 3:
            msg = f"Hourly cost: USD {current_cost:.2f} (baseline: USD {baseline_cost:.2f})"
            alerts.append(Alert(
                severity=AlertSeverity.CRITICAL,
                title="Cost Spike Detected",
                message=msg,
                action="investigate_immediately"
            ))

        # Quality drop, as a fraction of the baseline
        baseline_quality = self.baselines["quality_score"]
        current_quality = current_metrics["quality_score"]
        quality_drop = (baseline_quality - current_quality) / baseline_quality

        if quality_drop > 0.2:
            alerts.append(Alert(
                severity=AlertSeverity.CRITICAL,
                title="Quality Degradation",
                message=f"Quality score dropped {quality_drop:.1%}",
                action="investigate_immediately"
            ))
        elif quality_drop > 0.1:
            alerts.append(Alert(
                severity=AlertSeverity.WARNING,
                title="Quality Drop",
                message=f"Quality score dropped {quality_drop:.1%}",
                action="investigate_within_24h"
            ))

        return alerts

    async def handle_alert(self, alert: Alert):
        if alert.severity == AlertSeverity.CRITICAL:
            await self.page_oncall(alert)
            await self.post_to_slack("#incidents", alert)
        elif alert.severity == AlertSeverity.WARNING:
            await self.post_to_slack("#ai-monitoring", alert)
        else:
            await self.log_to_dashboard(alert)

Alert Fatigue Prevention:
• Aggregate similar alerts - Don't send 100 alerts for the same issue
• Time windows - Alert once per hour max
• Thresholds - Tune to reduce false positives
• Auto-resolution - Clear alerts when metrics recover

Runbooks:
Each alert should have a runbook:
# Alert: High Cost Spike

## What This Means
Hourly costs exceeded 3x baseline. Potential infinite loop or abuse.

## Immediate Actions
1. Check cost dashboard for user/feature breakdown
2. Identify source of spike
3. If abuse: rate limit user
4. If bug: deploy fix or rollback
5. If legitimate: increase budget alert threshold

## Investigation
- Check recent deployments
- Review user activity logs
- Analyze LLM call traces

## Escalation
If you can't resolve it within 30 minutes, page the engineering lead.
Good alerting = Right alert, right person, right time.
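The "alert once per hour max" and auto-resolution rules can be implemented as a small throttle placed in front of the alert handler. This is a minimal in-process sketch with hypothetical names (`AlertThrottle`, `should_send`, `resolve`); it assumes each alert has a stable key such as its title, and a multi-instance deployment would need to keep this state in a shared store instead.

```python
import time
from typing import Callable


class AlertThrottle:
    """Suppress repeats of the same alert within a time window."""

    def __init__(
        self,
        window_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,  # injectable for testing
    ) -> None:
        self.window_seconds = window_seconds
        self._clock = clock
        self._last_sent: dict[str, float] = {}

    def should_send(self, alert_key: str) -> bool:
        """Return True if this alert has not fired within the window."""
        now = self._clock()
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.window_seconds:
            return False
        self._last_sent[alert_key] = now
        return True

    def resolve(self, alert_key: str) -> None:
        """Clear state when the metric recovers, so a new incident alerts at once."""
        self._last_sent.pop(alert_key, None)
```

A caller would check `should_send(alert.title)` before paging or posting, and call `resolve` once the metric is back within its threshold.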
Tools and Stack
Don't build everything from scratch. Use purpose-built tools.

LLM Observability Platforms:

LangSmith (our top pick)
• Built by LangChain team
• Excellent LLM tracing
• Dataset management
• Human evaluation UI
• Cost: $50-200/mo

Helicone
• Lightweight proxy
• Cost tracking
• Caching layer
• Open source
• Cost: Free tier available

Weights & Biases (Prompts)
• ML experiment tracking adapted for LLMs
• Great visualizations
• Prompt versioning
• Cost: Free for individuals

Braintrust
• AI evaluation platform
• Advanced metrics
• CI/CD integration
• Cost: Usage-based

Arize AI
• ML observability platform
• Drift detection
• Explainability
• Cost: Enterprise

DIY Stack:
If you want to build your own:
# Core components
observability_stack = {
    "tracing": "OpenTelemetry + custom LLM spans",
    "metrics": "Prometheus + Grafana",
    "logs": "Elasticsearch + Kibana",
    "cost_tracking": "Redis + custom dashboard",
    "feedback": "PostgreSQL + custom UI"
}

Our Recommended Stack:

For Startups (<$10K/mo LLM spend):
• LangSmith for tracing and evals
• Grafana Cloud for metrics
• Custom cost tracking in Redis
• Built-in feedback collection

For Scale-ups ($10-100K/mo LLM spend):
• LangSmith or Helicone for tracing
• Datadog for metrics and alerting
• Custom cost dashboard
• Dedicated feedback system

For Enterprise (>$100K/mo LLM spend):
• Arize AI or custom observability platform
• Full distributed tracing
• Real-time anomaly detection
• Dedicated observability team

Integration Example:
import time
import uuid
from datetime import datetime

import prometheus_client as prom
from langsmith import Client

# Initialize
langsmith = Client()
cost_counter = prom.Counter('llm_cost_total', 'Total LLM cost', ['model', 'feature'])
latency_histogram = prom.Histogram('llm_latency_seconds', 'LLM latency', ['model'])


async def traced_llm_call(query, feature):
    # Start LangSmith trace. create_run does not return an ID,
    # so generate one up front and pass it in.
    run_id = uuid.uuid4()
    langsmith.create_run(
        id=run_id,
        name="llm_call",
        run_type="llm",
        inputs={"query": query},
    )

    start = time.time()
    try:
        # openai_call and calculate_cost are your own wrappers
        response = await openai_call(query)
        latency = time.time() - start
        cost = calculate_cost(response)

        # Update Prometheus metrics
        cost_counter.labels(model="gpt-4", feature=feature).inc(cost)
        latency_histogram.labels(model="gpt-4").observe(latency)

        # Complete LangSmith trace
        langsmith.update_run(
            run_id,
            outputs={"response": response},
            end_time=datetime.utcnow(),
        )

        return response
    except Exception as e:
        # Record the failure on the run, then re-raise for the caller
        langsmith.update_run(run_id, error=str(e))
        raise

Cost Estimate:
DIY: $200-500/mo (infrastructure + dev time)
LangSmith + Grafana: $100-300/mo
Full Enterprise Stack: $1,000-5,000/mo

Observability is an investment that pays for itself by catching issues early.