RAG in Production: 5 Pitfalls We Learned the Hard Way

Five problems that don't show up in the demo and always show up in production, plus the fixes that actually moved our numbers.

NevkaSystems Team · AI Engineering

June 18, 2026 · 8 min read

TL;DR

RAG breaks in production in five predictable ways, and the fixes are about chunking, hybrid search, context discipline, latency, and guardrails, not a better model.

Key takeaways

1. Chunk by document structure, not token count: it lifted our retrieval precision 35% with no model change

2. Hybrid search (vector + BM25 via RRF) took exact-match accuracy from 60% to 95%

3. Aggressive reranking and compression cut context size 60% while improving answers

4. Streaming plus caching dropped time-to-first-token from 2.4s to under 700ms

5. RAG reduces hallucinations but never eliminates them: enforce citations, confidence scoring, and out-of-scope detection

A RAG demo takes an afternoon. A RAG system that survives real users takes months, and most of that time goes into problems the tutorials never mention. We've shipped retrieval-augmented systems for fintech and eLearning clients, and the same five things bite every time. Here's what they are and what we actually did about them.

Pitfall 1: Naive chunking wrecks retrieval before you start

The first mistake happens before the vector database is even in the picture. Most guides tell you to slice documents into fixed 500-token chunks and move on. That works in a demo and falls apart in production, because fixed-size chunks ignore where meaning actually lives.

The usual ways it breaks:

· Fixed 500-token chunks with no regard for what's inside them

· Zero overlap, so a chunk starts cold with no preceding context

· Cuts landing mid-sentence or mid-paragraph

· Document structure (headings, lists, code blocks) thrown away entirely

A chunk that begins halfway through a paragraph has no anchor. A chunk that splits a code example is dead weight. Feed those to the model and it has to guess what it's looking at. So we stopped chunking by token count and started chunking by structure: parse the document first to find headings, paragraphs, and code blocks; keep semantic units whole instead of splitting them; add 50 to 100 tokens of overlap so each chunk carries some lead-in; and store the section heading and hierarchy as metadata on every chunk. That one change lifted retrieval precision 35% on our internal benchmarks. No model swap, no reranker, just better boundaries.
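To make the structure-first approach concrete, here is a minimal sketch, not a production implementation. It assumes Markdown-style input (`#` headings, blank-line-separated paragraphs, triple-backtick code fences) and uses whitespace-separated words as a rough stand-in for tokens. The names `Chunk`, `split_blocks`, and `chunk_by_structure` are hypothetical, as are the `max_words` and `overlap_words` defaults.

```python
import re
from dataclasses import dataclass

FENCE = "`" * 3  # Markdown code-fence marker
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

@dataclass
class Chunk:
    text: str
    headings: list[str]  # heading hierarchy, outermost first

def split_blocks(markdown: str) -> list[str]:
    """Split into headings, paragraphs, and code blocks; code blocks stay whole."""
    blocks, current, in_code = [], [], False

    def close():
        if current:
            blocks.append("\n".join(current))
            current.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            in_code = not in_code
            current.append(line)
        elif in_code:
            current.append(line)      # blank lines inside code are preserved
        elif not stripped:
            close()                   # blank line ends a paragraph
        elif HEADING.match(line):
            close()
            blocks.append(line)       # a heading is always its own block
        else:
            current.append(line)
    close()
    return blocks

def chunk_by_structure(markdown: str, max_words: int = 120,
                       overlap_words: int = 20) -> list[Chunk]:
    chunks, headings, buffer, tail = [], [], [], ""

    def flush():
        nonlocal buffer, tail
        if not buffer:
            return
        body = "\n\n".join(buffer)
        text = f"{tail}\n\n{body}" if tail else body
        chunks.append(Chunk(text=text, headings=[t for _, t in headings]))
        tail = " ".join(body.split()[-overlap_words:])  # lead-in for the next chunk
        buffer = []

    for block in split_blocks(markdown):
        match = HEADING.match(block)
        if match:
            flush()
            tail = ""  # do not carry overlap across a section boundary
            level = len(match.group(1))
            while headings and headings[-1][0] >= level:
                headings.pop()
            headings.append((level, match.group(2).strip()))
            continue
        size = sum(len(b.split()) for b in buffer)
        if buffer and size + len(block.split()) > max_words:
            flush()
        buffer.append(block)  # an oversized block is kept whole, never split
    flush()
    return chunks
```

Each chunk ends on a paragraph or code-block boundary, and its `headings` list can be stored as metadata next to the embedding. A real pipeline would likely swap the word count for the tokenizer of its embedding model.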

Pitfall 2: Pure vector search misses the obvious

Embeddings capture meaning in a way keyword search never could. They also quietly fail on the things users care about most. Embedding models are tuned for semantic similarity, not exact matching, so a query for "error code E-1042" pulls back chunks about errors in general. Close in vector space, useless to the person asking.

Where vector-only search let us down:

· Exact strings like product names and error codes slipping through

· Niche technical terms returning vaguely related noise

· Specific queries getting general answers

The fix is unglamorous and it works: run BM25 keyword search alongside vector search and merge the two with reciprocal rank fusion. RRF is about as simple as result merging gets, and it covers both semantic relevance and literal matches. Our exact-match query accuracy went from 60% to 95% once hybrid search was in place. If you only change one thing after chunking, change this.
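For readers who have not seen RRF before, the merge step can be sketched in a few lines. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The function name and the example document IDs are hypothetical; k = 60 is the commonly used default constant, not a value tuned for any particular system.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one list.

    A document's fused score is the sum of 1 / (k + rank) over every
    list that contains it, with ranks starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Illustrative inputs: the vector index ranks a general page first,
# while BM25 ranks the exact error-code page first.
vector_hits = ["errors-overview", "e-1042", "retry-guide"]
bm25_hits = ["e-1042", "changelog"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Because RRF only looks at ranks, it needs no score normalization between the two retrievers, which is a large part of why it is so easy to bolt on.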

Pitfall 3: Good retrieval blows up your context window

Once retrieval works, you get a new problem: too much relevant content. Ten solid chunks at 500 tokens each is 5,000 tokens before the system prompt, the user query, and room to answer. An 8K window sounds roomy until you actually fill it, and metadata and formatting eat into it too.

Stuffing the window doesn't just cost more, it makes answers worse. Details get buried, and the lost-in-the-middle effect is real: models genuinely skim past content parked in the center of a long context. What worked for us:

· Rerank hard. A cross-encoder or Cohere Rerank to pick the top 3 to 5 chunks, not the top 10

· Compress what survives. Summarize or pull key sentences instead of pasting whole chunks

· Size context to the question. Simple queries get less, complex ones get more

· Retrieve hierarchically. Find the right documents first, then the right chunks inside them

Smarter selection cut our average context size 60% and improved answer quality at the same time. Less really was more here.
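The selection step can be sketched as a token budget applied to reranked chunks. This is an illustration under assumptions: the rerank scores are taken as given (from whatever reranker is in use), `approx_tokens` is a rough four-characters-per-token heuristic for English rather than a real tokenizer, and the function names and default limits are hypothetical.

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token for English text.
    return max(1, len(text) // 4)

def select_context(scored_chunks: list[tuple[str, float]],
                   max_chunks: int = 5,
                   token_budget: int = 1500,
                   min_score: float = 0.0) -> tuple[list[str], int]:
    """Pick the best reranked chunks that fit within a token budget.

    scored_chunks holds (chunk_text, rerank_score) pairs.
    Returns the selected texts, best first, and the tokens they use.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    selected, used = [], 0
    for text, score in ranked:
        if len(selected) == max_chunks or score < min_score:
            break  # enough chunks, or everything left is too weak
        cost = approx_tokens(text)
        if used + cost > token_budget:
            continue  # too big for the remaining budget; try a smaller chunk
        selected.append(text)
        used += cost
    return selected, used
```

Sizing context to the question then becomes a matter of passing a smaller `token_budget` or `max_chunks` for simple queries and a larger one for complex ones.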

Pitfall 4: Latency dies by a thousand cuts

Every stage of a RAG pipeline looks fast on its own. Stacked end to end, they don't. Embed the query, 50ms. Vector search, 100ms. Rerank, 200ms. Generation, 2,000ms. That's 2.35 seconds on the happy path, and the happy path is fiction once you add retries, error handling, and a flaky network. Real P95 lands around 3 to 5 seconds, which feels broken to anyone expecting an instant answer.

What pulled the numbers down:

· Cache embeddings and search results. The same queries hit you over and over

· Parallelize independent work, like multi-index or document-then-chunk retrieval

· Stream tokens so the user sees an answer forming instead of a spinner

· Use a lighter reranker. Cross-encoders are accurate and slow; a distilled model or DB-side reranking often pays off

· Batch embedding calls when you're processing more than one

Together these took our time-to-first-token from 2.4 seconds to under 700ms. Streaming did most of the perceived-speed work; the rest closed the gap on the real numbers.
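Two of these, caching and parallel retrieval, fit in a short sketch. The two search functions below are hypothetical stand-ins that sleep instead of calling a real index, the cache is a plain in-process dict with no eviction or expiry, and results are merged by simple de-duplication to keep the example short.

```python
import asyncio

# Call counters, only here so the effect of the cache is visible.
calls = {"vector": 0, "keyword": 0}

async def vector_search(query: str) -> list[str]:
    calls["vector"] += 1
    await asyncio.sleep(0.2)  # stand-in for embedding + vector lookup
    return ["doc-a", "doc-b"]

async def keyword_search(query: str) -> list[str]:
    calls["keyword"] += 1
    await asyncio.sleep(0.2)  # stand-in for a BM25 query
    return ["doc-b", "doc-c"]

_cache: dict[str, list[str]] = {}

def normalize(query: str) -> str:
    # Collapse case and whitespace so trivially different queries share an entry.
    return " ".join(query.lower().split())

async def retrieve(query: str) -> list[str]:
    key = normalize(query)
    if key in _cache:
        return _cache[key]  # repeated query: no search calls at all
    # The two searches are independent, so run them concurrently;
    # the wait is roughly the slower of the two, not their sum.
    vector_hits, keyword_hits = await asyncio.gather(
        vector_search(key), keyword_search(key)
    )
    merged = list(dict.fromkeys(vector_hits + keyword_hits))  # de-dupe, keep order
    _cache[key] = merged
    return merged

first = asyncio.run(retrieve("Error code E-1042"))
second = asyncio.run(retrieve("  error CODE e-1042 "))  # served from the cache
```

In a real deployment the cache would need a size limit and invalidation when the index changes, and it would usually live in a shared store rather than process memory.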

Pitfall 5: RAG does not kill hallucinations

The most expensive assumption people make is that grounding the model in retrieved text stops it from making things up. It doesn't. RAG lowers the rate. It does not zero it out, and the model will still invent answers when the context is thin, contradictory, or off-topic.

Where the made-up answers came from:

· Retrieved chunks that contradict each other

· The model ignoring the context it was given, especially in the middle of a long one

· Questions that fall outside the knowledge base entirely

· Citations attached to the wrong source

So we stopped trusting the model and started constraining it. Force a citation for every claim; if it can't cite, it doesn't get to assert. Turn retrieval scores into a confidence signal, since low scores mean a shaky answer. Check relevance before generating, and if the chunks don't fit the question, say "I don't have information about that" instead of guessing. For high-stakes answers, route to a human before it reaches the user. None of this is airtight, but it cuts hallucinations sharply and makes the ones that slip through far easier to catch.
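Two of these checks are simple enough to sketch. The code below assumes the model is prompted to cite retrieved chunks as bracketed numbers such as `[2]`, placed before the sentence's closing punctuation, and that retrieval scores are normalized so higher means more relevant. The function names and the 0.5 threshold are hypothetical; a real threshold would have to be calibrated on labeled queries.

```python
import re

REFUSAL = "I don't have information about that."

def should_answer(retrieval_scores: list[float], min_top_score: float = 0.5) -> bool:
    """Pre-generation gate: only answer if the best chunk is relevant enough."""
    return bool(retrieval_scores) and max(retrieval_scores) >= min_top_score

def uncited_sentences(answer: str, valid_ids: set[int]) -> list[str]:
    """Return sentences with no citation, or with one that points at a
    chunk that was never retrieved."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    flagged = []
    for sentence in sentences:
        cited = {int(n) for n in re.findall(r"\[(\d+)\]", sentence)}
        if not cited or not cited <= valid_ids:
            flagged.append(sentence)
    return flagged

def guard(answer: str, retrieval_scores: list[float], valid_ids: set[int]) -> str:
    """Refuse on weak retrieval or on any claim without a valid citation."""
    if not should_answer(retrieval_scores):
        return REFUSAL
    if uncited_sentences(answer, valid_ids):
        return REFUSAL  # alternatives: regenerate, or route to human review
    return answer
```

This only verifies that a citation exists and refers to a retrieved chunk. It does not verify that the chunk actually supports the claim, which is why wrongly attributed citations still need a separate check or a human reviewer.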

The production checklist

Before any RAG system goes live, we run it against this checklist:

· Testing: real user queries, not happy paths; precision and recall on a labeled set; edge cases like typos, multiple languages, and domain jargon; a load test at expected concurrency

· Observability: log every query with its retrieval results; track latency per stage; monitor cost across embeddings and LLM calls; alert on quality drift

· Quality: a thumbs up/down feedback path; A/B infrastructure for prompt and retrieval changes; a review loop for low-confidence answers; scheduled audits

· Operations: document the chunking strategy and why; write runbooks for slow retrieval and hallucination spikes; plan for index updates as source documents change; test your rollback before you need it
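For the "precision and recall on a labeled set" item, the per-query metric is small enough to show in full. This is a generic sketch of precision@k and recall@k; the function name and the example IDs are hypothetical, and the labeled set is assumed to map each query to the set of chunk IDs a human judged relevant.

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str],
                          k: int) -> tuple[float, float]:
    """Precision@k and recall@k for one query.

    retrieved: ranked chunk IDs returned by the retriever, best first.
    relevant:  chunk IDs labeled relevant for this query.
    """
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative query: two of the three relevant chunks appear in the top 4.
precision, recall = precision_recall_at_k(
    retrieved=["a", "b", "c", "d"], relevant={"a", "d", "z"}, k=4
)
```

Averaging these over every query in the labeled set gives a single number to track across chunking and retrieval changes.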

The real lesson from running these in production: RAG is not set-and-forget. It's a product that needs monitoring, feedback, and steady iteration. The teams that win treat it that way. The teams that ship it once and walk away are the ones whose users stop trusting the answers.
