Choosing the Right RAG Architecture: Vector Search vs Hybrid vs Graph

Vector, hybrid, or graph: the choice decides whether your retriever answers the hard questions or quietly fails on them. Here's how all three behave in production.

NevkaSystems Team · Engineering

June 18, 2026  ·  9 min read

TL;DR

Default to hybrid search for production RAG: it delivers ~80% of graph RAG's accuracy gain at ~20% of the complexity. Keep vector-only for MVPs and reserve graph for genuinely relationship-heavy queries.

Key takeaways

1. Vector-only is fast and cheap (50-150ms, $0.10-0.30 per 1k queries) but misses exact matches like SKUs and error codes. Keep it for MVPs and conversational content.

2. Hybrid (vector + BM25) buys 30-50% better accuracy for ~50-100ms more latency; it's our default for production. One fintech client saw a 43% drop in "no answer found" responses.

3. Graph-augmented is the only honest answer for multi-hop relationship queries, but it costs 3-5x more and needs graph expertise. Reserve it for citation networks, org graphs, and entity-heavy domains.

4. Start at 70% vector / 30% keyword for hybrid, then tune the weights against your actual failed queries.

5. Migrate by data, not vibes: vector-only to validate, hybrid when accuracy blocks you, graph only once relationship queries show up. Instrument everything from day one.

"RAG" gets thrown around like it means one thing. It doesn't. It's a family of architectures, and the one you pick decides whether your system answers questions or quietly fails on the ones that matter. We've shipped all three of the variants below to production: customer support, legal document analysis, fintech, pharma research. Pick wrong and you'll burn months wondering why the retriever keeps whiffing on obvious answers.

Three architectures, three sets of tradeoffs: vector-only (pure semantic similarity), hybrid (semantic plus keyword), and graph-augmented (knowledge graphs for traversing relationships). Here's how each behaves under load, what it costs, and when we reach for it.

Vector-only: the default everyone starts with

Chunk documents into 500-1000 token pieces, embed them (OpenAI, Cohere, or an open-source model), store the vectors, and at query time embed the query and grab the nearest neighbors. Pass the top K chunks to the model. You can stand this up in a weekend, and retrieval is sub-100ms on most vector DBs. It handles paraphrasing and synonyms naturally, which is exactly what you want for conversational content.
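The pipeline above can be sketched in a few lines. This is a minimal illustration, not a production retriever: `toy_embed` is a hypothetical stand-in that counts words, so unlike a real embedding model it will not capture paraphrases, and `chunk` splits on whitespace rather than on model tokens. The function names and the default sizes are assumptions made for the example.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_tokens: int = 500) -> list[str]:
    """Split text into pieces of at most max_tokens whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def toy_embed(text: str) -> Counter:
    """Placeholder for an embedding model: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Embed the query and return the k most similar chunks, best first."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]
```

In a real system you would swap `toy_embed` for calls to your embedding provider and replace the `sorted` scan with the vector DB's nearest-neighbor query; the shape of the flow stays the same.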

Where it falls apart: exact matches. Ask for product SKU "E4201" or a specific error code and semantic similarity has nothing useful to grab onto. Acronyms and dense technical terms suffer the same way. It also has no sense of document structure or relationships, so any question that needs two separate facts stitched together tends to fail. We built a help-center knowledge base on vector-only that nailed "How do I export data?" and completely missed "What's the difference between plan A and plan B?" because the answer lived in two documents and nothing pulled them together.

Reach for vector-only when you're shipping an MVP, the content is narrative, and semantic similarity is genuinely enough. Past that, it gets thin fast.

Hybrid: our default for production

Run a vector index and a keyword index (BM25, usually Elasticsearch or OpenSearch) side by side. At query time, search both in parallel, then merge with weighted scores or reciprocal rank fusion and dedupe before handing chunks to the model. The two cover each other's blind spots: vector search finds semantically related content, BM25 catches the literal terms. Take "error code E4201": vector misses, BM25 nails it. For "payment not working," BM25 is useless and vector finds the related issues. You want both.
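Reciprocal rank fusion, one of the two merge options mentioned above, is short enough to show in full. The sketch below assumes each retriever returns an ordered list of document IDs; the function name is ours, and `k = 60` is the smoothing constant commonly used with RRF rather than a value taken from our deployments.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists into one, best first.

    Each document scores 1 / (k + rank) per list it appears in, so a
    document found by both retrievers is counted once with a summed score.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Because RRF only looks at ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales, and the summing step handles the dedupe for free.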

On our benchmarks this buys 30-50% better accuracy than vector-only, and it's especially strong on technical docs, product catalogs with model numbers, and compliance text with exact citations. For one fintech client, hybrid cut "no answer found" responses by 43%: "What's the fee for wire transfers?" (semantic) and "Regulation E disclosures" (exact) both landed.

The cost is real but small: two indexes to maintain, ~50-100ms more latency, and some experimentation to tune the score blend. We start at 70% vector / 30% keyword, then watch which searches fail and shift the weights from there. Don't guess the ratio; instrument it.
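A weighted blend like the 70/30 starting point might look like the sketch below. It assumes each retriever hands back a mapping of document ID to raw score, and it uses min-max normalization to put the two score sets on a common 0-1 scale before mixing; that normalization choice and the function names are assumptions for illustration, not a description of any particular engine's built-in blending.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to the 0-1 range; a flat score set maps to all 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def blend(
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
    vector_weight: float = 0.7,
) -> list[tuple[str, float]]:
    """Combine normalized scores; a doc missing from one side contributes 0 there."""
    v, kw = min_max(vector_scores), min_max(keyword_scores)
    combined = {
        doc: vector_weight * v.get(doc, 0.0) + (1 - vector_weight) * kw.get(doc, 0.0)
        for doc in v.keys() | kw.keys()
    }
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))
```

Keeping `vector_weight` as a parameter is the point: replay your logged failed queries at a few different weights and keep the setting that recovers the most of them.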

Graph-augmented: when the question is about relationships

Extract entities and relationships from your documents, store them in a graph DB (Neo4j, Amazon Neptune), and at query time identify the entities in the question, traverse the graph to related nodes, then feed both the graph context and the connected text chunks to the model. This is the only one of the three that handles multi-hop reasoning honestly.

Take "Which projects did the engineering team work on that involved the payment system?" That's three entity types and the edges between them: people, projects, systems. Vector search flails; graph RAG walks Team → Members → Projects → Systems and gets there. We built this for a pharma research database where queries like "What drugs interact with compounds tested in phase 2 trials for autoimmune diseases?" meant traversing Drug → Trial → Disease → Related Drugs. Nothing else would have answered it.
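To make the Team → Members → Projects → Systems walk concrete, here is a toy version using a plain dictionary in place of a graph DB. Every node name, relation name, and helper below is invented for the example; in production the same traversal would be a query against Neo4j or Neptune rather than Python loops.

```python
# node -> relation name -> list of neighbor nodes (all data here is made up)
Graph = dict[str, dict[str, list[str]]]

GRAPH: Graph = {
    "engineering": {"has_member": ["ana", "raj"]},
    "ana": {"worked_on": ["checkout-v2"]},
    "raj": {"worked_on": ["search-revamp", "checkout-v2"]},
    "checkout-v2": {"involves": ["payment-system"]},
    "search-revamp": {"involves": ["search-index"]},
}


def follow(graph: Graph, start: str, relations: list[str]) -> set[str]:
    """Walk one hop per relation name, starting from a single node."""
    frontier = {start}
    for rel in relations:
        frontier = {
            neighbor
            for node in frontier
            for neighbor in graph.get(node, {}).get(rel, [])
        }
    return frontier


def projects_touching(graph: Graph, team: str, system: str) -> list[str]:
    """Projects worked on by members of `team` that involve `system`."""
    projects = follow(graph, team, ["has_member", "worked_on"])
    return sorted(p for p in projects if system in graph.get(p, {}).get("involves", []))
```

Calling `projects_touching(GRAPH, "engineering", "payment-system")` answers the example question with a three-hop walk, which is exactly the kind of join a nearest-neighbor lookup over chunks has no way to express.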

The price is steep. You need an entity-extraction pipeline, graph expertise on the team, 3-5x the latency of vector search, and meaningfully higher cost. It earns its keep for citation networks, org-structure questions, financial entity analysis, legal precedent, and drug/disease interactions: places where the relationships ARE the answer. Outside those, it's overkill.

What the numbers actually look like

From our deployments, roughly: vector-only runs 50-150ms and costs $0.10-0.30 per 1000 queries on simple infra; hybrid runs 100-250ms at $0.15-0.40 with a second index to keep; graph-augmented runs 300-800ms at $0.50-1.50 and wants 4-6 services plus a graph DB. On domain-specific queries, user satisfaction lands around 60-70% for vector-only, 80-90% for hybrid, and 85-95% for graph on the relationship-heavy questions it's built for.
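To turn the per-1000-query figures above into a budget line, multiply by your expected volume. The helper below is a back-of-the-envelope sketch that uses only the cost ranges quoted above; the function name and any query volume you feed it are your own assumptions, and it ignores fixed costs such as the extra services a graph setup needs.

```python
# (low, high) cost in USD per 1000 queries, taken from the ranges quoted above
COST_PER_1K = {
    "vector": (0.10, 0.30),
    "hybrid": (0.15, 0.40),
    "graph": (0.50, 1.50),
}


def monthly_cost(architecture: str, queries_per_month: int) -> tuple[float, float]:
    """Return the (low, high) monthly query cost in USD for one architecture."""
    low, high = COST_PER_1K[architecture]
    thousands = queries_per_month / 1000
    return (round(low * thousands, 2), round(high * thousands, 2))
```

At a hypothetical one million queries a month, that works out to roughly $100-300 for vector-only, $150-400 for hybrid, and $500-1,500 for graph-augmented.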

· Vector-only earns its spot on "How do I reset my password?" and general conversational queries.

· Hybrid owns the technical lookups: "Show me error E4201 documentation," "Find GDPR Article 13."

· Graph wins the connected questions: "Which engineers worked on projects with the auth team?", "What papers cite this research and share authors?"

How to choose

Start with hybrid for any real production system: content with technical terms or codes, or anything where you need to clear 70% accuracy and can afford a bit more infra. Reserve vector-only for cases where it's an MVP, the content is conversational, the budget is razor-thin, and you have to ship in days. Go graph when your queries keep asking "who," "which," or "what's connected," you have clean structured entities, you need 90%+ accuracy, and you've got the expertise and budget to back it.

Watch for the vector-only warning signs: users complaining about missing "obvious" answers, queries full of codes and IDs, a specialized domain (legal, medical, financial), or accuracy stuck below 75%. Those are the signals to move up a tier.
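One of those warning signs, queries full of codes and IDs, is easy to measure from a query log. The sketch below is a rough heuristic of our own for illustration: it flags any query containing a token that mixes letters and digits (such as E4201). It will miss codes made only of digits or split by hyphens, so treat the result as a lower bound.

```python
import re

# A "word" that contains at least one letter and at least one digit, e.g. E4201.
CODE_LIKE = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")


def looks_code_like(query: str) -> bool:
    """True if the query contains at least one mixed letter-digit token."""
    return bool(CODE_LIKE.search(query))


def code_like_share(queries: list[str]) -> float:
    """Fraction of queries that contain a code-like token."""
    if not queries:
        return 0.0
    return sum(looks_code_like(q) for q in queries) / len(queries)
```

If the share is high among your failed queries in particular, that is a strong hint the missing piece is a keyword index rather than a better embedding model.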

Stack and migration path

For hybrid on OpenSearch: install it with the k-NN plugin, create an index with both vector and text fields, index documents with their embeddings, and query across both.

· Vector DBs worth knowing: Pinecone, Weaviate, Qdrant, Milvus.

· Hybrid: OpenSearch, Elasticsearch 8+, Vespa.

· Graph: Neo4j, Neptune, TigerGraph.

· Embeddings: OpenAI, Cohere, Sentence Transformers.

To keep cost down, cache embeddings hard, use smaller embedding models for simple content, batch your vector ops, and self-host the vector DB once you're at scale.
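For the OpenSearch setup described above, the request bodies might look like the sketch below. The field names `text` and `embedding`, the helper names, and any dimension you pass in are assumptions for the example; the `knn_vector` field type and `knn` query clause follow the k-NN plugin's documented shape as we understand it, so check them against the docs for the version you run. The helpers only build dictionaries, leaving the choice of HTTP or client library to you.

```python
def index_body(dimension: int) -> dict:
    """Index definition with one text field (BM25) and one vector field (k-NN)."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "embedding": {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


def keyword_query(text: str, size: int = 10) -> dict:
    """BM25 side: a plain match query on the text field."""
    return {"size": size, "query": {"match": {"text": text}}}


def vector_query(vector: list[float], size: int = 10) -> dict:
    """Vector side: approximate nearest neighbors on the embedding field."""
    return {
        "size": size,
        "query": {"knn": {"embedding": {"vector": vector, "k": size}}},
    }
```

One workable pattern is to send the two queries in parallel and fuse the result lists in application code, which keeps the blend weights under your control while you are still tuning them.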

The migration path that works: start vector-only to validate the use case, instrument everything and track failed queries, move to hybrid when accuracy becomes a blocker, and add graph only after you actually see relationship queries piling up. Most teams over-engineer on day one. Start simple, measure, and upgrade when the data says you have to.
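"Instrument everything and track failed queries" can start very small. The sketch below is one hypothetical shape for it: a log that counts every retrieval and keeps the queries that came back empty or with a weak top score. The class name and the `min_score` threshold of 0.3 are placeholders to tune, not values from our deployments.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalLog:
    """Counts retrievals and remembers the queries that look like failures."""

    total: int = 0
    failed: list[str] = field(default_factory=list)

    def record(self, query: str, num_results: int, top_score: float,
               min_score: float = 0.3) -> None:
        """Log one retrieval; it counts as failed if empty or below min_score."""
        self.total += 1
        if num_results == 0 or top_score < min_score:
            self.failed.append(query)

    def failure_rate(self) -> float:
        """Share of logged retrievals that were flagged as failures."""
        return len(self.failed) / self.total if self.total else 0.0
```

Reviewing `failed` every week or so is what tells you whether the next step is hybrid (codes and exact terms) or graph (who, which, what's connected).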

Our honest take: hybrid gives you about 80% of the accuracy gain at 20% of the complexity of graph RAG. Start there unless you have a specific reason not to. And whichever you pick, instrument it from day one: query patterns, failed searches, satisfaction. Let that drive the architecture, not the hype. We've built all three more times than we can count, and that's still the advice.
