LLM Evaluation Playbook: How We Measure Quality Beyond 'Vibes'
Evaluating LLM outputs is hard. Move beyond manual spot-checks with our framework for systematic quality measurement using eval sets, metrics, and continuous monitoring.
Evaluating LLM outputs is hard because there is rarely a single ground truth and responses are probabilistic. 'Vibes-based' spot-checking doesn't scale. Build a comprehensive eval set with 100+ examples covering happy paths, edge cases, and adversarial inputs. Combine automated metrics (exact match, semantic similarity, LLM-as-judge) with human evaluation. Monitor production quality continuously using user feedback and drift detection. Version your prompts like code: track performance over time and A/B test changes.
The Problem: 'Vibes-Based' Evaluation
"The AI looks good to me!" is not a quality bar.

We see this pattern constantly: teams build an LLM feature, manually test a few examples, decide it "feels right," and ship to production. Three weeks later, users are complaining and nobody knows why.

Why Manual Spot-Checking Fails:
• Non-deterministic outputs: Run the same prompt twice, get different answers. Which one is "correct"? Both? Neither?
• Sampling bias: You test happy paths. Users hit edge cases. The first support ticket is something you never imagined.
• Scale: You test 10 examples. Production sees 10,000 queries per day. Your spot check is the size of 0.1% of a single day's traffic.
• Prompt changes: You tweak the prompt to fix one issue. Three other things break. You don't notice until production.

The "vibes" approach doesn't work because:
1. You can't measure improvement without metrics
2. You can't catch regressions without tests
3. You can't debug failures without data
4. You can't optimize what you don't measure

What Good Evaluation Looks Like:
A systematic process that combines:
• Eval sets - Comprehensive test cases with expected behaviors
• Automated metrics - Quantitative measurements of quality
• Human evaluation - Subjective quality checks
• Production monitoring - Real-time quality tracking
• Versioning - Track prompt changes and their impact

This post walks through the evaluation framework we use on every LLM project.
Building Your Eval Set
Your eval set is the foundation. Garbage in, garbage out.

Start with 100 Examples Minimum:
Fewer than 100 and you won't catch edge cases. We aim for 200-500 for production systems.

The Four Quadrants:

1. Happy Path (40%):
Typical queries users will ask. These should work reliably.
Example for customer support:
• "How do I reset my password?"
• "What's your refund policy?"
• "Can I change my plan?"

2. Edge Cases (30%):
Valid queries that are tricky.
• "Can I refund a plan I upgraded yesterday?"
• "I deleted my account but want to recover it"
• Typos, grammar issues, non-English input

3. Negative Examples (20%):
Invalid or out-of-scope queries.
• "What's the weather in Tokyo?"
• "Write my term paper"
• Harmful or inappropriate requests
The LLM should gracefully decline these.

4. Adversarial Examples (10%):
Inputs designed to break the system.
• Prompt injection attempts
• Extremely long inputs
• Nonsense or gibberish
• Edge cases that stress parsing logic

How to Build Your Eval Set:
Start with production logs (if available):
-- Sample real queries from the last 30 days
SELECT query, response, user_rating
FROM llm_interactions
WHERE created_at > NOW() - INTERVAL '30 days'
ORDER BY RANDOM()
LIMIT 500;
# Generate variations
base_query = "How do I reset my password?"

variations = [
    "password reset",
    "forgot my password",
    "can't login",
    "locked out of account",
    "pwreset",  # typo
    "RESET PASSWORD",  # caps
]
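Surface-level variations like the casing and typo entries above can be produced mechanically. The sketch below is one possible way to do that; `surface_variations` and `swap_adjacent` are hypothetical helper names, not part of any library, and the choice of perturbations is an assumption.

```python
def swap_adjacent(text: str, index: int) -> str:
    """Swap the characters at index and index + 1 to simulate a typing slip."""
    if index < 0 or index + 1 >= len(text):
        return text
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


def surface_variations(query: str) -> list[str]:
    """Return casing, punctuation, and typo variants of a query, without duplicates."""
    base = query.strip()
    no_punct = base.rstrip("?!.")
    candidates = [
        base.upper(),                                 # all caps
        base.lower(),                                 # no capitalization
        no_punct,                                     # trailing punctuation dropped
        swap_adjacent(no_punct, len(no_punct) // 2),  # one transposed pair of characters
    ]
    seen = {base}  # never return the original query itself
    variants = []
    for candidate in candidates:
        if candidate and candidate not in seen:
            seen.add(candidate)
            variants.append(candidate)
    return variants


print(surface_variations("How do I reset my password?"))
```

This only covers mechanical noise. Paraphrases such as "forgot my password" or "locked out of account" still have to be written by a person or sampled from real traffic.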
Include failure cases:
Review support tickets, error logs, and user complaints. These are goldmines.

Format:
eval_set = [
    {
        "id": "001",
        "category": "happy_path",
        "input": "How do I reset my password?",
        "expected_behavior": "Provides clear password reset instructions",
        "should_include": ["password reset link", "email"],
        "should_not_include": ["call support", "not possible"],
        "metadata": {
            "difficulty": "easy",
            "tags": ["account", "authentication"]
        }
    },
    # ... more examples
]

Key Principles:
• Representative - Covers real user distribution
• Specific - Clear expected behavior
• Versioned - Track changes over time
• Maintained - Update as product evolves

Your eval set is living documentation. Commit it to Git. Review it in PRs.
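Since the eval set is reviewed in PRs, a small audit can flag structural problems before a human looks at it. The sketch below checks required fields, duplicate ids, and the 40/30/20/10 category mix described above. The category names other than `happy_path`, the required-field list, and the tolerance are assumptions made for illustration.

```python
from collections import Counter

# Target mix from the four quadrants; names other than "happy_path" are assumed.
TARGET_MIX = {
    "happy_path": 0.40,
    "edge_case": 0.30,
    "negative": 0.20,
    "adversarial": 0.10,
}
REQUIRED_FIELDS = {"id", "category", "input", "expected_behavior"}


def audit_eval_set(eval_set, tolerance=0.10):
    """Return a list of human-readable problems; an empty list means the set looks sane."""
    problems = []

    # Every case needs the core fields
    for case in eval_set:
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            problems.append(f"{case.get('id', '?')}: missing {sorted(missing)}")

    # Ids must be unique so results can be compared across runs
    id_counts = Counter(case.get("id") for case in eval_set)
    for case_id, count in id_counts.items():
        if count > 1:
            problems.append(f"duplicate id {case_id}")

    # Category shares should stay close to the target mix
    total = len(eval_set)
    category_counts = Counter(case.get("category") for case in eval_set)
    for category, target in TARGET_MIX.items():
        share = category_counts.get(category, 0) / total if total else 0.0
        if abs(share - target) > tolerance:
            problems.append(f"{category}: {share:.0%} of set, target {target:.0%}")

    return problems
```

Running this as a pre-commit hook or an early CI step keeps the category mix from drifting toward happy paths as the set grows.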
Automated Metrics That Matter
Automated metrics let you run evals in CI/CD and catch regressions before production.

Metric 1: Exact Match
Does the output match the expected answer exactly?
def exact_match(predicted: str, expected: str) -> float:
    # Normalize whitespace and case
    pred = predicted.strip().lower()
    exp = expected.strip().lower()
    return 1.0 if pred == exp else 0.0
When to use: Structured outputs (JSON, codes, categories)
When to avoid: Open-ended generation

Metric 2: Semantic Similarity
How similar is the meaning?
from sentence_transformers import SentenceTransformer, util

# Load the embedding model once at import time, not per call
model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(predicted: str, expected: str) -> float:
    pred_emb = model.encode(predicted)
    exp_emb = model.encode(expected)
    # cos_sim returns a 1x1 matrix for a single pair of embeddings
    similarity = util.cos_sim(pred_emb, exp_emb)
    return float(similarity[0][0])

Threshold: Similarity > 0.8 usually indicates good alignment. Treat this as a starting point and calibrate it against human labels, because the score scale depends on the embedding model.
When to use: Answers that can be phrased differently
When to avoid: When exact details matter (instructions, codes)

Metric 3: Keyword Presence
Does the output include the required information?
def keyword_presence(predicted: str, required_keywords: list[str]) -> dict:
    # No required keywords means there is nothing to miss
    if not required_keywords:
        return {"score": 1.0, "missing": []}

    pred_lower = predicted.lower()
    results = {kw: kw.lower() in pred_lower for kw in required_keywords}

    return {
        "score": sum(results.values()) / len(results),
        "missing": [kw for kw, present in results.items() if not present]
    }

keyword_presence(
    predicted="Click the reset link in your email",
    required_keywords=["email", "reset", "link"]
)
# Returns: {"score": 1.0, "missing": []}

Metric 4: LLM-as-Judge
Use a strong LLM to evaluate a weaker one.
import json

import openai

# JSON mode (response_format) needs a model that supports it;
# the original "gpt-4" snapshot does not, so a turbo-class model is used here
JUDGE_MODEL = "gpt-4-turbo"

def llm_as_judge(query: str, predicted: str, rubric: str) -> dict:
    evaluation_prompt = f"""
    Evaluate this AI assistant response.

    Query: {query}
    Response: {predicted}

    Rubric:
    {rubric}

    Rate 1-5 and explain:
    1 = Completely wrong
    2 = Partially correct
    3 = Acceptable
    4 = Good
    5 = Excellent

    Return JSON: {{"rating": int, "explanation": str}}
    """

    response = openai.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": evaluation_prompt}],
        response_format={"type": "json_object"}
    )

    return json.loads(response.choices[0].message.content)

The Power of LLM-as-Judge:
• Understands nuance and context
• Can evaluate open-ended generation
• Cheaper than human evaluation
• Fast enough for CI/CD

The Risk:
• Judge can be wrong
• Consistent but not always correct
• Validate judge against human evals periodically

Metric 5: Task-Specific Metrics
Custom metrics for your domain.

For code generation:
• Does it execute without errors?
• Does it pass unit tests?
• Cyclomatic complexity

For summarization:
• ROUGE score
• Compression ratio
• Key facts preserved

Combining Metrics:
def evaluate_response(test_case, response):
    keywords = keyword_presence(response, test_case["required_keywords"])
    judge = llm_as_judge(test_case["query"], response, test_case["rubric"])

    # Every metric must be a number in the 0-1 range before weighting
    scores = {
        "exact_match": exact_match(response, test_case["expected"]),
        "semantic_sim": semantic_similarity(response, test_case["expected"]),
        "keywords": keywords["score"],
        "llm_judge": (judge["rating"] - 1) / 4,  # map the 1-5 rating onto 0-1
    }

    # Weighted average (weights sum to 1.0)
    weights = {"exact_match": 0.2, "semantic_sim": 0.3, "keywords": 0.2, "llm_judge": 0.3}
    final_score = sum(scores[k] * weights[k] for k in weights)

    return {
        "score": final_score,
        "details": {**scores, "missing_keywords": keywords["missing"], "judge_explanation": judge["explanation"]},
        "passed": final_score >= 0.7  # Threshold
    }

Run this on your entire eval set. Track pass rate over time.
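A single overall pass rate can hide a regression in one category, so it helps to break the number down. The sketch below is a minimal, assumption-laden version: `generate` and `evaluate` are injected callables (hypothetical stand-ins for your model call and for a scoring function that returns a dict with a `passed` key), and cases are assumed to carry `input` and `category` fields as in the eval set format shown earlier.

```python
from collections import defaultdict


def pass_rate_report(eval_set, generate, evaluate):
    """Run every case and return pass rates overall and per category."""
    totals = defaultdict(lambda: {"passed": 0, "total": 0})

    for case in eval_set:
        response = generate(case["input"])
        result = evaluate(case, response)
        # Count each case once overall and once in its own category
        for bucket in ("overall", case.get("category", "uncategorized")):
            totals[bucket]["total"] += 1
            totals[bucket]["passed"] += int(result["passed"])

    return {bucket: c["passed"] / c["total"] for bucket, c in totals.items()}
```

Storing this report alongside the prompt version and commit hash on each run gives you the time series needed to track pass rate over time.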
Human Evaluation Workflow
Automated metrics are necessary but not sufficient. You need humans in the loop.

When Human Eval is Required:
• Subjective quality (tone, style, professionalism)
• Nuanced correctness
• Validating automated metrics
• Exploring new failure modes

Don't Evaluate Everything:
Human time is expensive. Sample strategically.

Sampling Strategy:
import random

def sample_for_human_eval(results, sample_size=50):
    # Split the budget 20% failures, 40% borderline, 40% successes
    # (10 / 20 / 20 at the default sample_size of 50)
    n_fail = sample_size // 5
    n_other = (sample_size - n_fail) // 2
    samples = []

    # Always include failures
    failures = [r for r in results if r["score"] < 0.5]
    samples.extend(failures[:n_fail])

    # Include borderline cases
    borderline = [r for r in results if 0.5 <= r["score"] < 0.7]
    samples.extend(random.sample(borderline, min(n_other, len(borderline))))

    # Random sample of successes
    successes = [r for r in results if r["score"] >= 0.7]
    samples.extend(random.sample(successes, min(n_other, len(successes))))

    return samples
Rating Scale:
Keep it simple. We use:
• 5 - Perfect response
• 4 - Good, minor issues
• 3 - Acceptable, some problems
• 2 - Poor, misses key points
• 1 - Completely wrong

Rating UI:
# Simple web interface
"""
Query: {test_case["query"]}

Expected: {test_case["expected"]}

Actual Response:
{response}

Rate 1-5: [1] [2] [3] [4] [5]

Issues (optional):
[ ] Factually incorrect
[ ] Tone inappropriate
[ ] Missing key information
[ ] Verbose/unclear
[ ] Other: ___________

Comments: ___________
"""

Inter-Rater Reliability:
Multiple raters should agree. Calculate:
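from sklearn.metrics import cohen_kappa_score

rater_a = [5, 4, 3, 5, 2, 4]
rater_b = [5, 4, 4, 5, 2, 3]

kappa = cohen_kappa_score(rater_a, rater_b)
# Kappa > 0.7 = Good agreement
# For ordinal 1-5 ratings, passing weights="quadratic" penalizes
# a 4-vs-5 disagreement less than a 1-vs-5 disagreement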
If agreement is low, refine your rubric.

Calibration Sessions:
Weekly, review edge cases as a team. Align on quality standards.

Documentation:
Save all human evaluations. They become golden labels for future automated metrics.
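Saved human ratings can also be used to check the LLM judge itself. The sketch below compares judge ratings with human ratings on the same responses, assuming both use the 1-5 scale. `judge_agreement` is a hypothetical helper and the one-point tolerance is an assumption.

```python
def judge_agreement(human_ratings, judge_ratings, tolerance=1):
    """Compare LLM-judge ratings with human ratings on the same items."""
    if not human_ratings or len(human_ratings) != len(judge_ratings):
        raise ValueError("need two non-empty rating lists of equal length")

    pairs = list(zip(human_ratings, judge_ratings))
    n = len(pairs)
    return {
        # Share of items where judge and human gave the same rating
        "exact": sum(h == j for h, j in pairs) / n,
        # Share of items where they differ by at most `tolerance` points
        "within_tolerance": sum(abs(h - j) <= tolerance for h, j in pairs) / n,
        # Positive means the judge rates higher than humans on average
        "mean_bias": sum(j - h for h, j in pairs) / n,
    }


print(judge_agreement([5, 4, 3, 5, 2, 4], [5, 4, 4, 5, 2, 3]))
```

A judge with high within-tolerance agreement but a consistent positive bias is still usable if you shift the pass threshold. Low agreement is a signal to rewrite the rubric in the judge prompt.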
Production Monitoring
Evaluation doesn't stop at deployment. Monitor quality continuously.

Leading Indicators of Quality Issues:

1. User Feedback Signals:
• Thumbs down rate
• "Regenerate" button clicks
• Conversation abandonment
• Time to user correction
# Track implicit feedback
def track_interaction(session_id, user_action, time_in_session):
    metrics = {
        "thumbs_down": user_action == "downvote",
        "regenerated": user_action == "regenerate",
        "edited_response": user_action == "edit",
        # Closing the session within 30 seconds counts as abandonment
        "abandoned": user_action == "close" and time_in_session < 30
    }

    log_metric(session_id, metrics)

2. Response Characteristics:
• Average response length (too long = verbose, too short = incomplete)
• Latency (slow responses = user frustration)
• Fallback rate (how often does it say "I don't know"?)

3. Automated Evals on Live Traffic:
Run your eval suite on a sample of production queries:
import random

async def production_eval_sample(query, response):
    # Sample 1% of production traffic
    if random.random() < 0.01:
        # Run automated evals. Live queries have no expected answer, so
        # evaluate_response here must be an async, reference-free variant
        # (for example, LLM-as-judge only)
        eval_result = await evaluate_response(query, response)

        # Log to monitoring
        log_production_eval(eval_result)

        # Alert if quality drops
        if eval_result["score"] < 0.6:
            alert_quality_issue(eval_result)
4. Drift Detection:
LLM outputs can drift over time due to:
• Model updates
• Prompt changes
• Data distribution shifts
def detect_drift(current_metrics, baseline_metrics, threshold=0.1):
    drift = {}
    for metric, current_val in current_metrics.items():
        baseline_val = baseline_metrics.get(metric, current_val)
        if baseline_val == 0:
            # Relative change is undefined against a zero baseline
            continue
        change = abs(current_val - baseline_val) / abs(baseline_val)

        if change > threshold:
            drift[metric] = {
                "current": current_val,
                "baseline": baseline_val,
                "change_pct": change * 100
            }

    return drift

Dashboard:
Track these metrics:
• Pass rate (% scoring above threshold)
• Average score
• Latency (p50, p95, p99)
• Cost per query
• User satisfaction (thumbs up / thumbs down ratio)

Alert when:
• Pass rate drops > 5%
• Latency increases > 20%
• User satisfaction drops > 10%
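The alert rules above can be encoded as data and checked against a baseline. The sketch below is one reading of them: it assumes the percentages are relative changes against the baseline (not percentage points), and the metric names `pass_rate`, `latency_p95`, and `user_satisfaction` are hypothetical keys for whatever your monitoring system exports.

```python
# (direction, relative limit) per metric; thresholds taken from the alert list,
# interpreted as relative change, which is an assumption
ALERT_RULES = {
    "pass_rate": ("drop", 0.05),
    "latency_p95": ("rise", 0.20),
    "user_satisfaction": ("drop", 0.10),
}


def check_alerts(current, baseline, rules=ALERT_RULES):
    """Return one message per metric that moved past its limit in the bad direction."""
    alerts = []
    for metric, (direction, limit) in rules.items():
        # Skip metrics that are missing or have no usable baseline
        if metric not in current or not baseline.get(metric):
            continue
        change = (current[metric] - baseline[metric]) / baseline[metric]
        breached = change < -limit if direction == "drop" else change > limit
        if breached:
            alerts.append(f"{metric}: {direction} of {abs(change):.0%} exceeds {limit:.0%}")
    return alerts
```

Unlike a symmetric drift check, these rules are directional: a latency improvement or a pass-rate increase does not trigger an alert.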
Prompt Versioning and A/B Testing
Treat prompts like code: version them, test changes, and roll out gradually.

Version Control:
# prompts/v1.py
SYSTEM_PROMPT_V1 = """
You are a helpful customer support assistant.
Answer questions concisely and professionally.
"""

# prompts/v2.py
SYSTEM_PROMPT_V2 = """
You are a helpful customer support assistant for Acme Corp.
Answer questions concisely and professionally.
If you don't know the answer, say so and offer to connect them with a human agent.
"""
Track versions in Git. Link to pull requests.

Measuring Impact:
Before promoting a new prompt, measure:
def compare_prompt_versions(eval_set, prompt_v1, prompt_v2):
    results_v1 = run_eval(eval_set, prompt_v1)
    results_v2 = run_eval(eval_set, prompt_v2)

    comparison = {
        "v1_pass_rate": results_v1["pass_rate"],
        "v2_pass_rate": results_v2["pass_rate"],
        "improvement": results_v2["pass_rate"] - results_v1["pass_rate"],
        "v1_avg_score": results_v1["avg_score"],
        "v2_avg_score": results_v2["avg_score"],
        "better": results_v2["pass_rate"] > results_v1["pass_rate"]
    }

    return comparison

Only promote if v2 is better on your eval set.

A/B Testing in Production:
import hashlib

def get_prompt_version(user_id):
    # Stable bucketing for user assignment. Python's built-in hash() is salted
    # per process for strings, so a digest keeps each user in the same bucket
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    bucket = int(digest, 16) % 100

    if bucket < 10:  # 10% on v2
        return "v2"
    else:  # 90% on v1
        return "v1"

async def handle_query(user_id, query):
    version = get_prompt_version(user_id)
    prompt = PROMPTS[version]

    response = await llm_call(prompt, query)

    # Tag response with version
    log_interaction(user_id, query, response, version)

    return response
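With only 10% of traffic on v2, a small difference in pass rate or thumbs-up rate between the two arms can easily be noise. One common way to check is a two-proportion z-test; the sketch below is a plain-Python version under the usual large-sample assumptions, and is offered as an illustration, not as the method used in this playbook.

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Two-sided z-test for a difference in success rates between arms A and B."""
    p_a = success_a / total_a
    p_b = success_b / total_b

    # Pooled rate under the null hypothesis that both arms perform the same
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))

    if std_err == 0:
        # Both arms are all-success or all-failure; no evidence of a difference
        return {"diff": p_b - p_a, "z": 0.0, "p_value": 1.0}

    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {"diff": p_b - p_a, "z": z, "p_value": p_value}


# Hypothetical counts: thumbs-up responses out of rated responses per arm
print(two_proportion_z_test(success_a=800, total_a=1000, success_b=900, total_b=1000))
```

A low p-value with a positive `diff` supports expanding the rollout. With few rated responses in the v2 arm the test has little power, which is one reason to keep monitoring for a day or two before deciding.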
Gradual Rollout:
1. Test on eval set
2. Deploy to 10% of users
3. Monitor for 24-48 hours
4. If metrics look good, increase to 50%
5. Monitor for another 24-48 hours
6. If still good, promote to 100%

Rollback:
Keep the ability to roll back instantly:
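import os

# Feature flag, read on every request so a change takes effect immediately
def current_prompt_version():
    return os.getenv("PROMPT_VERSION", "v1")

if quality_issue_detected():
    # Instant rollback for this process. With several instances, flip the
    # flag in your shared config or feature-flag service instead
    os.environ["PROMPT_VERSION"] = "v1"
Cost vs Quality Trade-offs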
Every LLM decision involves trade-offs between cost, quality, and latency.

Model Selection:

| Model | Cost per 1M tokens | Quality | Latency | Use Case |
|-------|-------------------|---------|---------|----------|
| GPT-4 | $30-60 | Excellent | Slow | Critical decisions |
| GPT-3.5 Turbo | $1-2 | Good | Fast | Most queries |
| Claude Haiku | $0.25-0.50 | Good | Very fast | High volume |

Decision Framework:
def select_model(query, context):
    # Use cheaper model for simple queries
    if is_simple_query(query):
        return "gpt-3.5-turbo"

    # Use expensive model for complex/high-value queries
    if is_complex(query) or is_high_value_user(context.user_id):
        return "gpt-4"

    # Default to mid-tier
    return "gpt-3.5-turbo"
Quality-Cost Curve:
Test your use case at different model tiers:
models = ["gpt-4", "gpt-3.5-turbo", "claude-haiku"]

for model in models:
    results = run_eval(eval_set, model)
    cost = estimate_monthly_cost(query_volume, model)

    # :.0% and :,.0f produce the rounded figures shown below
    print(f"{model}: {results['pass_rate']:.0%} pass rate, ${cost:,.0f}/mo")

# Output:
# gpt-4: 95% pass rate, $3,000/mo
# gpt-3.5-turbo: 87% pass rate, $150/mo
# claude-haiku: 82% pass rate, $40/mo

Find the sweet spot for your budget and quality requirements.

Optimization Strategies:
1. Prompt caching - Cache identical queries
2. Smaller context - Trim unnecessary tokens
3. Streaming - Start rendering before complete
4. Model routing - Different models for different query types

Example:
One client cut costs 70% by routing:
• Simple FAQs → GPT-3.5 Turbo
• Technical queries → GPT-4
• Saved $2,100/month with no quality degradation
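The arithmetic behind a routing decision is a weighted average of the per-tier costs. The sketch below uses the illustrative monthly figures from the sample output above ($3,000 for all traffic on the premium model, $150 for all traffic on the cheaper one) and assumes cost scales linearly with the share of traffic each model receives. Both function names are hypothetical.

```python
def blended_monthly_cost(cheap_cost, premium_cost, premium_share):
    """Monthly cost when `premium_share` of traffic goes to the premium model.

    cheap_cost and premium_cost are the monthly costs of sending ALL traffic
    to that model; linear scaling with traffic share is an assumption.
    """
    return premium_share * premium_cost + (1 - premium_share) * cheap_cost


def max_premium_share(cheap_cost, premium_cost, budget):
    """Largest share of traffic the premium model can take within the budget."""
    if premium_cost <= cheap_cost:
        return 1.0
    share = (budget - cheap_cost) / (premium_cost - cheap_cost)
    return min(1.0, max(0.0, share))


# A 70% saving on a $3,000 all-premium bill leaves a $900 budget
share = max_premium_share(cheap_cost=150, premium_cost=3000, budget=900)
print(f"premium share: {share:.1%}")
print(f"blended cost: ${blended_monthly_cost(150, 3000, share):,.0f}/mo")
```

Under these assumed figures, hitting a 70% saving means routing roughly a quarter of traffic to the premium model. Whether that preserves quality is what the eval set is for: rerun it with the router in place and compare pass rates.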
Tools and Framework
You don't need to build everything from scratch. Use these tools:

Evaluation Platforms:

LangSmith (our recommendation)
• Built for LLM evaluation
• Integrated tracing
• Dataset management
• Human labeling UI
• Pricing: Free tier, then $50/mo

Weights & Biases
• ML experiment tracking
• Works for LLM evals
• Great visualizations
• Pricing: Free for individuals

Braintrust
• Purpose-built for LLM evals
• Advanced eval functions
• CI/CD integration
• Pricing: Usage-based

Patronus AI
• Enterprise eval platform
• Advanced metrics
• Compliance focus
• Pricing: Enterprise

Open Source:
• PromptTools - CLI for prompt testing
• Inspect - the UK AI Security Institute's eval framework
• OpenAI Evals - OpenAI's eval repository

Our Stack:
# eval_framework.py
import uuid

from langsmith import Client
from openai import OpenAI

class EvalFramework:
    def __init__(self):
        self.langsmith = Client()
        self.openai = OpenAI()
        self.eval_set = load_eval_set()

    async def run_eval(self, prompt_version):
        results = []

        for test_case in self.eval_set:
            response = await self.generate_response(
                test_case["query"],
                prompt_version
            )

            score = self.evaluate(test_case, response)

            # Log to LangSmith. A run takes dict inputs/outputs and a run_type;
            # the score is attached separately as feedback on that run
            # (check both calls against your langsmith SDK version)
            run_id = uuid.uuid4()
            self.langsmith.create_run(
                id=run_id,
                name=f"eval-{test_case['id']}",
                run_type="chain",
                inputs={"query": test_case["query"]},
                outputs={"response": response}
            )
            self.langsmith.create_feedback(run_id, key="eval_score", score=score)

            results.append({
                "test_case": test_case["id"],
                "score": score,
                "response": response
            })

        return self.aggregate_results(results)

# .github/workflows/eval.yml
name: LLM Evaluation

on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Evaluation
        run: |
          python run_eval.py
      - name: Check Pass Rate
        run: |
          # bash [ ] cannot compare floats; jq -e exits non-zero when the test is false
          if ! jq -e '.pass_rate >= 0.85' results.json > /dev/null; then
            echo "Eval pass rate below threshold"
            exit 1
          fi
Cost Estimate:
• LangSmith: $50-200/mo
• Eval compute: $20-50/mo
• Human labeling: $500-1000/mo (contract work)

Total: ~$600-1,250/month for production-grade eval infrastructure.

Worth it to catch issues before production.
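The CI check reads a `pass_rate` field from `results.json`, and the prompt comparison reads `pass_rate` and `avg_score`, so the aggregation step has to produce both. The original does not show `aggregate_results`; the sketch below is one hypothetical shape for it, with the 0.7 pass threshold borrowed from the combined metric and `write_results` added as an assumed helper.

```python
import json
from pathlib import Path


def aggregate_results(results, pass_threshold=0.7):
    """Summarize per-case results into the fields the CI gate and comparisons read."""
    scores = [r["score"] for r in results]
    if not scores:
        # An empty run should fail the gate, not divide by zero
        return {"pass_rate": 0.0, "avg_score": 0.0, "total": 0}

    passed = sum(score >= pass_threshold for score in scores)
    return {
        "pass_rate": passed / len(scores),
        "avg_score": sum(scores) / len(scores),
        "total": len(scores),
    }


def write_results(summary, path="results.json"):
    """Write the summary where the CI step expects to find it."""
    Path(path).write_text(json.dumps(summary, indent=2))
```

Reporting an empty run as a 0.0 pass rate is a deliberate choice here: a misconfigured eval that loads no cases should block the merge.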