RETRIEVAL

Choosing a Vector Database for Production RAG: What Actually Matters

MAY 14, 2025

14 MIN READ


If you've followed the AI infrastructure space over the last two years, you've seen vector databases go from a niche, specialized tool to a crowded market with dozens of options. Pinecone, Weaviate, Qdrant, Milvus, pgvector, Chroma, Redis Vector: each comes with its own benchmark charts showing that it outperforms the competition on the dimensions it was optimized for. Evaluating these options for a production RAG deployment requires cutting through the marketing to the dimensions that actually matter for your workload.

Dimensions That Matter

Query latency at your index size. Vendors publish impressive benchmark numbers at specific index sizes. What matters is latency at your index size, which may be very different. A vector database that returns results in 5ms for 1M vectors but 800ms for 50M vectors is the wrong choice if your corpus is 50M vectors. Get benchmark data at your scale, or run your own benchmarks.
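If you do run your own benchmarks, a small harness like the one below is usually enough to get started. This is a minimal sketch, not a full benchmarking tool: `measure_latency_ms`, `make_queries`, and `fake_search` are hypothetical names introduced here, and `fake_search` is only a placeholder for whatever query call your database client exposes. The point is to report tail percentiles (p95, p99) rather than an average, measured against an index loaded to your real corpus size.

```python
import random
import statistics
import time
from typing import Callable, Dict, List, Sequence


def measure_latency_ms(
    search_fn: Callable[[Sequence[float]], object],
    queries: Sequence[Sequence[float]],
    warmup: int = 5,
) -> Dict[str, float]:
    """Call search_fn once per query and summarize latency in milliseconds."""
    # Warm-up calls let caches and connections settle; they are not recorded.
    for query in queries[:warmup]:
        search_fn(query)

    samples: List[float] = []
    for query in queries:
        start = time.perf_counter()
        search_fn(query)
        samples.append((time.perf_counter() - start) * 1000.0)

    # With n=100 there are 99 cut points: index 49 is p50, 94 is p95, 98 is p99.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "count": len(samples),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(samples),
    }


def make_queries(count: int, dim: int, seed: int = 0) -> List[List[float]]:
    """Generate reproducible random query vectors (stand-ins for real embeddings)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


def fake_search(query: Sequence[float]) -> float:
    # Placeholder work only; replace with your vector database's query call.
    return sum(x * x for x in query)


report = measure_latency_ms(fake_search, make_queries(200, 64))
print(report)
```

Random vectors are a convenience here; real query embeddings, with the filters and top-k values you use in production, will give numbers that are closer to what users will see.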

Hybrid search support. As discussed in our hybrid RAG post, production retrieval quality almost always benefits from combining dense and sparse retrieval. Not all vector databases support BM25 or other sparse retrieval natively. Databases that only support dense retrieval require you to maintain a separate sparse index (Elasticsearch, OpenSearch, or similar), adding operational complexity and a runtime join that adds latency.
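To make the "runtime join" concrete: when dense and sparse results come from two separate systems, something has to merge the two ranked lists. One common way to do that is reciprocal rank fusion (RRF), sketched below. The function name, the document IDs, and the choice of `k = 60` (a conventional default) are assumptions for illustration, not a description of any particular database's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    k: int = 60,
    top_n: int = 10,
) -> List[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1; a document earns 1 / (k + rank) from each list it appears in.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


# Hypothetical results from a dense index and a separate sparse (BM25) index.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits], top_n=3)
print(fused)  # ['doc_a', 'doc_c', 'doc_b']
```

RRF only needs ranks, not comparable scores, which is why it works across two different engines. The fusion step itself is cheap; the cost of the two-system design is the second network round trip and the extra index to keep in sync.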

Metadata filtering performance. Real enterprise RAG queries almost always have metadata filters: only search documents from this department, only search documents modified in the last 90 days, only search documents with this classification level. Metadata filtering performance varies enormously between databases. Some implement it as a pre-filter (filter before searching the vector index), some as a post-filter (search the full index, then filter results), and some as a hybrid. Pre-filtering is generally faster but requires careful index design; post-filtering is simpler but produces poor results when the filter is highly selective, because most of the top-k candidates are discarded and the query can return far fewer results than requested.
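The difference between the two strategies is easiest to see on a toy corpus. The sketch below uses a brute-force cosine search as a stand-in for a real approximate nearest-neighbor index, so it illustrates result quality only, not speed. All names and data here are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Doc = Tuple[str, List[float], Dict[str, str]]  # (id, vector, metadata)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(docs: List[Doc], query: Sequence[float], k: int) -> List[Doc]:
    # Exhaustive scan; a real database would use an ANN index here.
    return sorted(docs, key=lambda d: cosine(d[1], query), reverse=True)[:k]


def pre_filter_search(docs, query, k, keep: Callable[[Dict[str, str]], bool]) -> List[str]:
    # Filter first, then take the k nearest among the documents that remain.
    allowed = [d for d in docs if keep(d[2])]
    return [d[0] for d in top_k(allowed, query, k)]


def post_filter_search(docs, query, k, keep: Callable[[Dict[str, str]], bool]) -> List[str]:
    # Take the k nearest overall, then drop any that fail the filter.
    return [d[0] for d in top_k(docs, query, k) if keep(d[2])]


docs: List[Doc] = [
    ("eng-1", [1.0, 0.0], {"dept": "engineering"}),
    ("eng-2", [0.9, 0.1], {"dept": "engineering"}),
    ("eng-3", [0.8, 0.2], {"dept": "engineering"}),
    ("legal-1", [0.5, 0.5], {"dept": "legal"}),
    ("legal-2", [0.3, 0.7], {"dept": "legal"}),
]


def is_legal(meta: Dict[str, str]) -> bool:
    return meta["dept"] == "legal"


query = [1.0, 0.0]
pre = pre_filter_search(docs, query, 2, is_legal)
post = post_filter_search(docs, query, 2, is_legal)
print(pre)   # ['legal-1', 'legal-2']
print(post)  # []
```

With `k = 2`, the two nearest documents overall are both from engineering, so the post-filter discards everything and returns nothing, even though two matching documents exist. Post-filtering implementations typically compensate by over-fetching (searching for many more than k candidates), which trades latency for recall.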

Operational simplicity. This is consistently underweighted in database evaluations. A database that requires a team of engineers to operate, tune, and monitor is not the right choice for a two-person startup or a company that doesn't have dedicated database infrastructure expertise. Managed cloud offerings reduce operational burden substantially; self-hosted options require significant operational investment.

Dimensions That Are Mostly Marketing

Raw throughput benchmarks. Typical enterprise RAG queries are not a high-throughput workload. Unless you're building a search engine or a system processing thousands of queries per second, throughput is not your bottleneck. Latency and accuracy matter more.

Multimodal claims. Several databases advertise multimodal vector support (images, audio, video alongside text). This is real functionality, but the RAG workflows that benefit from multimodal retrieval are still rare in production enterprise deployments. Don't let multimodal capability sway your decision unless you have a specific multimodal use case.

What Agentica Uses

Agentica's RAG server is built on Qdrant for vector storage, with hybrid search implemented using Qdrant's native sparse vector support combined with BM25 indexing. We chose Qdrant for its strong hybrid search implementation, competitive performance at mid-scale index sizes, and active development. The graph retrieval layer uses a separate Neo4j deployment. This architecture handles the retrieval requirements of Agentica's target use cases — enterprise document corpora of 10K to 10M documents — with acceptable latency and operational complexity.

