RAG in Production: 5 Pitfalls We Learned the Hard Way
Building AI-powered knowledge bases sounds simple until you hit production. Here are the real challenges we faced with retrieval-augmented generation and how we solved them.
RAG systems look simple in demos but break in production. We learned 5 hard lessons: chunk size matters more than you think, retrieval quality beats LLM quality, context window limits are real, latency compounds fast, and hallucinations still happen.
Pitfall 1: Naive Chunking Strategies
The most common RAG mistake happens before you even touch a vector database: chunking. Most tutorials suggest splitting documents into fixed-size chunks (say, 500 tokens) and calling it a day. In demos, this works fine. In production, it fails spectacularly.

Here's what goes wrong with naive chunking.

Common mistakes:

- Fixed 500-token chunks without considering context
- No overlap between chunks
- Splitting mid-sentence or mid-paragraph
- Ignoring document structure (headings, lists, code blocks)

The fundamental problem is that fixed-size chunks don't respect semantic boundaries. A chunk that starts mid-paragraph lacks context. A chunk that splits a code example becomes useless. When your retrieval returns these broken chunks, the LLM struggles to make sense of them.

Our solution: Semantic-aware chunking

We now chunk based on document structure, not arbitrary token limits. The approach:

1. Parse document structure first: identify headings, paragraphs, code blocks, lists
2. Keep semantic units together: don't split paragraphs or code blocks
3. Add overlap: include 50-100 tokens of overlap between chunks for context
4. Preserve metadata: store section headings and document hierarchy with each chunk
```python
def semantic_chunk(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    # split_by_structure, count_tokens, and get_overlap are helpers not shown here.
    # Split by semantic boundaries (paragraphs, headers)
    sections = split_by_structure(text)
    chunks = []
    current_chunk = ""

    for section in sections:
        if count_tokens(current_chunk + section) <= max_tokens:
            # The section still fits, so keep growing the current chunk
            current_chunk += section
        else:
            if current_chunk:
                chunks.append(current_chunk)
            # Start the next chunk with the tail of the previous one for context.
            # Note: a single section larger than max_tokens is kept whole, not split.
            current_chunk = get_overlap(chunks[-1], overlap) + section if chunks else section

    if current_chunk:
        chunks.append(current_chunk)

    return chunks
```
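The chunker above calls three helpers that the post does not show. The following is a minimal sketch of what they might look like, under loud assumptions: `count_tokens` counts whitespace-separated words as a stand-in for a real tokenizer, `split_by_structure` treats blank lines as section boundaries and keeps fenced code blocks whole, and `get_overlap` returns the last few words of the previous chunk. These bodies are illustrative guesses, not the original implementation.

```python
FENCE = "`" * 3  # marker that opens and closes a fenced code block


def count_tokens(text: str) -> int:
    """Approximate token count: whitespace-separated words (assumption)."""
    return len(text.split())


def split_by_structure(text: str) -> list[str]:
    """Split on blank lines, but never inside a fenced code block."""
    sections: list[str] = []
    current: list[str] = []
    in_fence = False

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        if not line.strip() and not in_fence:
            # A blank line outside a code fence closes the current section
            if current:
                sections.append("\n".join(current) + "\n\n")
                current = []
        else:
            current.append(line)

    if current:
        sections.append("\n".join(current) + "\n\n")
    return sections


def get_overlap(chunk: str, overlap: int) -> str:
    """Return the last `overlap` words of a chunk, or "" if none are requested."""
    words = chunk.split()
    if overlap <= 0 or not words:
        return ""
    return " ".join(words[-overlap:]) + "\n\n"
```

Word-level overlap loses the original line breaks of the tail, which is usually acceptable for context but would mangle code. A production version would likely overlap on whole sentences or sections instead.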
The result: retrieval precision improved by 35% on our internal benchmarks just from better chunking.
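The post does not say how retrieval precision was measured. One common way to put a number on it is precision@k and recall@k over a labeled set of queries. The sketch below assumes each chunk has a string ID and that relevant IDs have been labeled by hand; the function names and the sample IDs are hypothetical.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    # Dividing by k (not by the number returned) penalizes short result lists
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunk IDs that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant)


# Hypothetical labeled example: 2 of the top 5 results are relevant
retrieved_ids = ["c1", "c7", "c3", "c9", "c4"]
relevant_ids = {"c1", "c3", "c5"}
p5 = precision_at_k(retrieved_ids, relevant_ids, k=5)
r5 = recall_at_k(retrieved_ids, relevant_ids, k=5)
```

Averaging these scores across a fixed query set before and after a chunking change gives a comparable number for each configuration.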
Pitfall 2: Pure Vector Search Fails
Vector search is powerful: it captures semantic similarity in ways keyword search never could. But relying on vector search alone will hurt your RAG system's precision.

Problems we encountered:

- Missing exact keyword matches (product names, error codes)
- Poor performance on niche technical terms
- Irrelevant results for specific queries

The issue is that embedding models optimize for semantic similarity, not exact matching. If a user searches for "error code E-1042", vector search might return chunks about errors in general: semantically similar, but not what the user needs.

Hybrid search wins:

We combine vector search with BM25 keyword search, then merge the results. This approach captures both semantic relevance AND exact matches.
```python
def hybrid_search(query: str, k: int = 10):
    # Get vector results
    vector_results = vector_db.search(embed(query), k=k * 2)

    # Get keyword results
    bm25_results = keyword_index.search(query, k=k * 2)

    # Reciprocal rank fusion to merge results
    return reciprocal_rank_fusion(vector_results, bm25_results, k=k)
```
We use reciprocal rank fusion (RRF) to merge results—it's simple and works well. After implementing hybrid search, our exact-match queries went from 60% accuracy to 95%.
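For readers who have not seen RRF, here is a minimal sketch of the idea: each document scores the sum of `1 / (c + rank)` across every result list it appears in. It assumes each result list is a ranked list of document IDs, that `k` is the number of fused results to return (as in the call above), and that the smoothing constant is 60, a commonly used default. The parameter name `rrf_constant` is hypothetical, and the post's actual implementation may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*ranked_lists: list[str], k: int = 10,
                           rrf_constant: int = 60) -> list[str]:
    """Merge ranked lists of document IDs by summed reciprocal rank."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            # Higher-ranked documents contribute more; the constant damps
            # the advantage of the very top positions
            scores[doc_id] += 1.0 / (rrf_constant + rank)

    fused = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return fused[:k]


# Hypothetical result lists: "a" and "c" appear in both, so they rise to the top
vector_hits = ["a", "b", "c"]
bm25_hits = ["c", "a", "d"]
merged = reciprocal_rank_fusion(vector_hits, bm25_hits, k=3)
```

Because RRF uses only ranks, it sidesteps the problem that cosine similarities and BM25 scores live on different scales.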
Pitfall 3: Context Window Explosion
When RAG works too well, you hit a new problem: too much relevant content. Your retrieval returns 10 highly relevant chunks, each 500 tokens. That's 5,000 tokens before you even add the system prompt, user query, and output space.

Reality check:

- 8K tokens sounds like a lot, but fills up fast
- Top 10 chunks often exceed context limits
- Metadata and formatting consume tokens too
- You need space for the model to generate a response

Stuffing too much context hurts quality. The LLM gets overwhelmed, important details get lost in the middle (the "lost in the middle" problem is real), and costs skyrocket.

Strategies that work:

1. Rerank aggressively: use a reranker (like Cohere Rerank or a cross-encoder) to select the top 3-5 most relevant chunks, not 10
2. Compress context: summarize or extract key sentences from retrieved chunks
3. Dynamic context sizing: simple queries need less context; complex queries need more
4. Hierarchical retrieval: first retrieve at the document level, then at the chunk level within the top documents
```python
def smart_context_selection(query: str, chunks: list, max_tokens: int = 3000):
    # Rerank to get best chunks
    reranked = reranker.rerank(query, chunks)

    # Select chunks until token limit
    selected = []
    token_count = 0
    for chunk in reranked:
        chunk_tokens = count_tokens(chunk.text)
        if token_count + chunk_tokens > max_tokens:
            break
        selected.append(chunk)
        token_count += chunk_tokens

    return selected
```
After implementing smart context selection, we reduced average context size by 60% while improving answer quality.
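The token arithmetic behind a `max_tokens` limit can be made explicit. The sketch below works out how much of the window is left for retrieved chunks once everything else is reserved. Apart from the 8K window and 500-token chunks mentioned above, every number is a hypothetical placeholder, and both function names are illustrative.

```python
def context_budget(window_tokens: int, system_tokens: int,
                   query_tokens: int, output_tokens: int) -> int:
    """Tokens left for retrieved chunks after reserving everything else."""
    budget = window_tokens - system_tokens - query_tokens - output_tokens
    return max(budget, 0)


def chunks_that_fit(budget: int, chunk_tokens: int,
                    overhead_per_chunk: int = 0) -> int:
    """How many equally sized chunks fit, counting per-chunk formatting overhead."""
    per_chunk = chunk_tokens + overhead_per_chunk
    if per_chunk <= 0:
        raise ValueError("chunk size plus overhead must be positive")
    return budget // per_chunk


# Hypothetical budget for an 8K window
budget = context_budget(window_tokens=8192, system_tokens=800,
                        query_tokens=200, output_tokens=2000)
# 500-token chunks plus an assumed 30 tokens of metadata and formatting each
fit = chunks_that_fit(budget, chunk_tokens=500, overhead_per_chunk=30)
```

With these assumed reservations, 5,192 tokens remain and only 9 of the 10 retrieved chunks fit, which is why the top 10 can overflow a window that looks comfortable on paper.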
Pitfall 4: Death by a Thousand Cuts (Latency)
RAG pipelines have many stages, and latency compounds at each step. What seems fast in isolation becomes painfully slow end-to-end.

Typical RAG pipeline latency breakdown:

| Stage | Time |
|-------|------|
| Embedding query | 50ms |
| Vector search | 100ms |
| Reranking | 200ms |
| LLM generation | 2000ms |
| Total | 2350ms (2.35s) |

2+ seconds feels slow for users expecting instant answers. And this is the happy path: add error handling, retries, and network hiccups, and you're looking at 3-5 second P95 latency.

Optimizations that work:

1. Cache embeddings: the same queries hit your system repeatedly. Cache the embedding and search results.
2. Parallelize where possible: if you're searching multiple indexes or doing document-level then chunk-level retrieval, parallelize.
3. Stream LLM responses: don't wait for the full response. Stream tokens to the user as they're generated.
4. Use faster models for reranking: cross-encoders are accurate but slow. Consider distilled models or moving reranking to the vector DB.
5. Batch embedding requests: if you're embedding multiple queries or chunks, batch them.
```python
# Before: sequential processing, the user waits for the full response
async def answer(query: str) -> str:
    embedding = await embed(query)        # 50ms
    chunks = await search(embedding)      # 100ms
    reranked = await rerank(chunks)       # 200ms
    return await generate(reranked)       # 2000ms
    # Total: 2350ms


# After: cache + streaming
async def answer_stream(query: str):
    embedding = cache.get(query)          # 5ms (cache hit)
    if embedding is None:
        embedding = await embed(query)    # 50ms (cache miss)
        cache.set(query, embedding)       # store it so the next identical query hits
    chunks = await search(embedding)      # 100ms
    reranked = await rerank(chunks)       # 200ms
    async for token in stream_generate(reranked):  # First token: 300ms
        yield token
    # Time to first token: 605ms
```
With these optimizations, we reduced time-to-first-token from 2.4s to under 700ms.
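The parallelization advice in optimization 2 can be sketched with `asyncio.gather`, which runs independent retrieval calls concurrently so the wall time is roughly that of the slowest call rather than the sum. The two search coroutines below are hypothetical stand-ins that simulate latency with `asyncio.sleep`; a real system would call its vector database and keyword index instead.

```python
import asyncio


async def vector_search(query: str) -> list[str]:
    await asyncio.sleep(0.05)  # simulated vector DB latency (assumption)
    return ["doc-a", "doc-b"]


async def keyword_search(query: str) -> list[str]:
    await asyncio.sleep(0.05)  # simulated keyword index latency (assumption)
    return ["doc-b", "doc-c"]


async def retrieve_parallel(query: str) -> tuple[list[str], list[str]]:
    # Both searches start immediately and are awaited together;
    # results come back in the order the coroutines were passed in
    vector_hits, keyword_hits = await asyncio.gather(
        vector_search(query),
        keyword_search(query),
    )
    return vector_hits, keyword_hits


vector_hits, keyword_hits = asyncio.run(retrieve_parallel("error code E-1042"))
```

This only helps for stages that do not depend on each other. Reranking still has to wait for retrieval, and generation for reranking.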
Pitfall 5: Hallucinations Still Happen
The biggest misconception about RAG is that it eliminates hallucinations. It doesn't. RAG reduces hallucinations by grounding the LLM in retrieved context, but the LLM can still make things up.

Common causes of RAG hallucinations:

- Retrieved context contradicts itself
- LLM ignores provided context (especially in the middle of long contexts)
- User asks questions outside the knowledge base
- Citation misattribution (LLM cites the wrong source for a claim)

Guardrails we implemented:

1. Citation tracking: force the model to cite sources for each claim. If it can't cite, it shouldn't state.
2. Confidence scoring: use the retrieval scores to estimate confidence. Low retrieval scores = uncertain answer.
3. Out-of-scope detection: before generating, check if retrieved chunks are relevant. If not, say "I don't have information about that."
4. Human-in-the-loop: for high-stakes answers, flag for human review before sending to users.
```python
def generate_with_guardrails(query: str, chunks: list):
    # Check retrieval quality
    if all(chunk.score < 0.5 for chunk in chunks):
        return "I don't have specific information about that in my knowledge base."

    # Generate with citation requirement
    response = llm.generate(
        system="Answer based ONLY on the provided context. Cite sources using [1], [2], etc.",
        context=format_chunks(chunks),
        query=query,
    )

    # Validate citations exist
    if not has_valid_citations(response, chunks):
        return flag_for_review(response)

    return response
```
These guardrails won't eliminate all errors, but they reduce hallucinations significantly and make errors easier to catch.
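The `has_valid_citations` check is not shown in the post. A minimal sketch of one possible version follows, assuming the response is a plain string, citations use the `[1]`, `[2]` format from the system prompt, and numbering starts at 1 in the order the chunks were passed in. It only verifies that citations exist and point at real chunks, not that the cited chunk actually supports the claim.

```python
import re


def has_valid_citations(response: str, chunks: list) -> bool:
    """True if the response cites at least one source and every
    cited number refers to a chunk that was actually provided."""
    cited = {int(number) for number in re.findall(r"\[(\d+)\]", response)}
    if not cited:
        # No citations at all: treat as unverifiable
        return False
    return all(1 <= number <= len(chunks) for number in cited)


# Hypothetical examples with two retrieved chunks
sources = ["chunk one text", "chunk two text"]
ok = has_valid_citations("Resets take 5 minutes [1]. Logs are kept [2].", sources)
uncited = has_valid_citations("Resets take 5 minutes.", sources)
out_of_range = has_valid_citations("Resets take 5 minutes [3].", sources)
```

Catching misattribution, where a real source is cited for a claim it does not contain, needs a stronger check such as comparing each cited sentence against the chunk text.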
Production Checklist
Before deploying RAG to production, run through this checklist:

Testing:

- [ ] Test with real user queries (not just happy paths)
- [ ] Measure retrieval precision and recall on a labeled dataset
- [ ] Test edge cases: queries with typos, multi-language, domain jargon
- [ ] Load test at expected scale (concurrent users)

Observability:

- [ ] Implement query logging with retrieval results
- [ ] Track latency at each pipeline stage
- [ ] Add cost monitoring (embeddings + LLM calls)
- [ ] Set up alerts for quality degradation

Quality:

- [ ] Implement user feedback mechanism (thumbs up/down)
- [ ] Build A/B testing infrastructure for prompt/retrieval changes
- [ ] Create human-in-the-loop review process for low-confidence answers
- [ ] Set up regular quality audits

Operations:

- [ ] Document your chunking strategy and rationale
- [ ] Create runbooks for common issues (slow retrieval, high hallucination rate)
- [ ] Plan for index updates as source documents change
- [ ] Test rollback procedures

The key lesson from production: RAG is not a set-and-forget system. It requires continuous monitoring, feedback loops, and iteration. The companies that succeed with RAG treat it as a product to be improved, not a feature to be shipped.