AI

LLM Evaluation Playbook: How We Measure Quality Beyond 'Vibes'

Stop shipping LLM features on gut feel. Here's the eval stack we run on every project, from the first 200-example test set to the dashboards that page us at 3am.

NevkaSystems TeamEngineering

June 18, 2026  ·  12 min read

TL;DR

Replace vibes-based spot-checking with a real eval stack: a representative test set, layered automated metrics, targeted human review, live production monitoring, and versioned prompts rolled out gradually.

Key takeaways

1Build a 200-500 example eval set split across happy paths, edge cases, negatives, and adversarial inputs, and commit it to Git like any other source.

2Stack metrics instead of trusting one: exact match for structured output, semantic similarity for phrasing, and LLM-as-judge to cut grading costs about 80% versus humans.

3Reserve human review for tone, nuance, and new failure modes only, then save those ratings as golden labels for your automated checks.

4Monitor live traffic continuously; thumbs-down rate, regenerate clicks, and fallback rate move before the complaints do.

5Version prompts like code and roll out 10% to 50% to 100% with an instant rollback, and route cheap queries to cheap models to cut cost 70% with no quality loss.

"Looks good to me" is not a quality bar

Here's the pattern we see on nearly every LLM project before we get involved. A team builds a feature, runs a handful of prompts by hand, decides it feels right, and ships. Three weeks later support is on fire and nobody can say what changed. The feature didn't get worse overnight. It was never measured in the first place.

Manual spot-checking breaks for reasons that compound. Outputs are non-deterministic, so the same prompt gives you two different answers and you can't say which is correct. You test the happy path; users find the edge the moment you ship, and the first support ticket is something you never imagined. You check ten examples while production handles ten thousand queries a day, which means you've eyeballed roughly 0.1% of the input space. And every prompt tweak that fixes one thing quietly breaks three others you won't notice until a customer does.

Vibes don't let you measure improvement, catch regressions, debug failures, or optimize anything. The fix isn't more discipline while spot-checking. It's an eval set, automated metrics, humans sampling the hard cases, production monitoring, and versioned prompts. We run this on every LLM build, and the rest of this is how.

The eval set is the whole game

Garbage in, garbage out. If your eval set doesn't reflect real traffic, every number downstream is a lie you'll tell yourself with a straight face. Under 100 examples and you won't catch edge cases. We aim for 200 to 500 on production systems, and we mix them deliberately rather than grabbing whatever's convenient.

· Happy path, about 40%: the queries users actually send. "How do I reset my password?" "What's your refund policy?" These have to work, every time.

· Edge cases, about 30%: valid but awkward. "Can I refund a plan I upgraded yesterday?" Typos, bad grammar, non-English input.

· Negatives, about 20%: out-of-scope or off-limits. "What's the weather in Tokyo?" "Write my term paper." The model should decline cleanly.

· Adversarial, about 10%: prompt injection, absurdly long inputs, gibberish, anything built to break your parsing.

Build it from production logs if you have them, and mine your support tickets and error logs for failure cases. Those are the examples you'd never invent on your own. Treat the eval set as living documentation: commit it to Git, review it in PRs, update it as the product moves. It should be representative, specific about expected behavior, versioned, and maintained. Skip any of those and it rots.

Automated metrics that actually carry weight

Automated metrics are what let you run evals in CI and block a regression before it reaches a user. No single metric is enough, so we stack a few and match each to the kind of output it's good at.

· Exact match: use for structured outputs like JSON, codes, and categories. Useless for open-ended text.

· Semantic similarity: scores meaning, not wording. Above roughly 0.8 usually means good alignment. Use when an answer can be phrased many ways; avoid when exact details like instructions or codes have to be right.

· Keyword presence: cheap check that required facts made it into the output.

· LLM-as-judge: a strong model grading a weaker one. It reads nuance, handles open-ended generation, costs far less than human review, and is fast enough for CI. Using a stronger model to judge a cheaper one's outputs has saved clients around 80% versus human grading. The catch: the judge is consistent but not always correct, so re-validate it against human labels on a schedule.

· Task-specific checks: for code, does it run and pass tests? For summaries, ROUGE, compression ratio, and whether the key facts survived.

Run the full stack across your whole eval set and track pass rate over time. The trend line is what tells you whether last week's prompt change helped or quietly hurt.

Humans, but only where humans are worth it

Automated metrics are necessary and not sufficient. Tone, professionalism, nuanced correctness, and brand-new failure modes need a person. But human time is expensive, so we never review everything; we sample the cases most likely to be wrong or most expensive to get wrong.

Keep the rubric blunt: 5 is perfect, 4 is good with minor issues, 3 is acceptable, 2 is poor, 1 is flat wrong. When multiple raters disagree, inter-rater reliability is low and that's a signal the rubric is vague, not that the raters are bad, so refine it. We run a short weekly calibration on edge cases to keep the team aligned. Every human rating gets saved; today's judgment calls become tomorrow's golden labels for your automated metrics.

Monitoring doesn't stop at deploy

The eval set tells you about yesterday's traffic. Production tells you about today's. The leading indicators of quality trouble show up in behavior before they show up in complaints: thumbs-down rate, regenerate clicks, abandoned conversations, how fast users correct the model. Response shape matters too. Replies drifting long means verbose, drifting short means incomplete, and a climbing fallback rate means the model is punting more often than it should.

Run your eval suite against a sample of live queries, and watch for drift from model updates, prompt edits, or shifts in what users are asking. On the dashboard we keep pass rate, average score, latency at p50/p95/p99, cost per query, and the thumbs up/down ratio. We page someone when pass rate falls more than 5%, latency climbs more than 20%, or satisfaction drops more than 10%.

Version prompts like code, roll out like you mean it

A prompt is code that happens to be in English. Version it in Git, link changes to pull requests, and never promote a new version until it beats the current one on your eval set. "It seems better" is how regressions ship.

Production rollout is gradual on purpose: eval set first, then 10% of users, watch for 24 to 48 hours, move to 50%, watch again, then 100% only if the numbers hold. Keep an instant rollback path the whole time. The point of staging isn't caution for its own sake; it's catching the failure on 10% of traffic instead of all of it.

Cost, quality, latency: pick where you spend

Every model choice trades cost against quality and speed, and there's no setting that wins all three. As a rough map: a top-tier model runs $30 to $60 per million tokens, is excellent, and is slow, so save it for decisions that matter. A mid-tier model at $1 to $2 is good and fast, which covers most queries. A small fast model around $0.25 to $0.50 handles high volume where good-enough is genuinely enough.

The cheapest real lever is routing, not picking one model for everything. One client cut cost 70% by sending simple FAQs to the mid-tier model and reserving the top-tier one for technical questions: $2,100 a month saved, no measurable quality drop. Stack the usual optimizations on top, prompt caching for repeat queries, trimming dead context, streaming so rendering starts early, and you find the sweet spot for your budget instead of guessing at it.

What we actually run

You don't need to build the harness yourself. We lean on LangSmith for tracing, dataset management, and a human-labeling UI; it has a free tier and runs about $50/month after. Weights & Biases works well if your team already lives in it, Braintrust is purpose-built for evals with solid CI integration, and Patronus AI fits enterprise and compliance needs. On the open-source side, PromptTools handles CLI prompt testing, Inspect is Anthropic's eval framework, and OpenAI Evals is worth raiding for examples.

Budget-wise, production-grade eval infrastructure runs roughly $600 to $1,250 a month: $50 to $200 for the platform, $20 to $50 for eval compute, and $500 to $1,000 for human labeling, usually contract work. That sounds like a lot until you price out one bad regression sitting in front of users for three weeks. The eval setup is cheaper, every time.

Want help implementing this?

We help teams design and ship production-grade software in eLearning, fintech, and AI. Let's talk about your project.

Book a call

Related articles

AI

Choosing the Right RAG Architecture: Vector Search vs Hybrid vs Graph

June 18, 2026 · 9 min read

AI

Multi-Agent Systems in Production: What Breaks First

June 18, 2026 · 12 min read

AI

Observability for AI Apps: Traces, Costs, Hallucinations, and Feedback Loops

June 18, 2026 · 12 min read

← All insights

AI