Critique of RAG at scale: the Curse of Dimensionality and why retrieval engineering is being skipped

The RAG Problem Nobody Wants to Admit

Every company I talk to is convinced that dumping more documents into their vector store makes their AI smarter. It doesn’t. Past a certain point, it actively makes things worse. And the engineering teams building these systems either don’t know it yet, or they know it and nobody wants to say it out loud.

The math isn’t subtle here. Let me walk through what’s actually happening.

The Curse of Dimensionality Is Not a Theory

In high-dimensional vector space, something counterintuitive happens as your dataset grows. Distances compress. When you’re working with embeddings in 768 or 1024 dimensions, adding tens of thousands of documents causes the mathematical gap between “relevant” and “irrelevant” to shrink toward zero. Your nearest-neighbor search stops finding the right answer and starts returning noise, with complete confidence.
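
To make that concrete, here’s a minimal sketch using random vectors as a stand-in for real embeddings. The dimensions, the point count, and the relative_contrast helper are illustrative choices, not anyone’s production numbers; the point is only to watch the nearest-to-farthest gap collapse as dimensionality climbs.

```python
# Minimal sketch: distance concentration with random vectors standing in for
# real embeddings. The relative contrast between a query's nearest and
# farthest neighbor shrinks as dimensionality grows, which is the
# "relevant vs. irrelevant" gap described above.
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n_points=10_000):
    """(farthest - nearest) / nearest distance for a single random query."""
    points = rng.normal(size=(n_points, dim))
    query = rng.normal(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for dim in (2, 16, 128, 768, 1024):
    print(f"dim={dim:5d}  relative contrast={relative_contrast(dim):8.3f}")
# The contrast falls by orders of magnitude between 2 and 768+ dimensions:
# every point ends up at roughly the same distance from the query.
```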

This is the Curse of Dimensionality applied directly to production RAG. It’s not new math. Bellman named the problem back in the late 1950s. We just conveniently forgot about it when vector databases became fashionable.

The numbers flagged by Stanford researchers working in this space are brutal. At roughly 10,000 documents, vector clusters start overlapping. At 50,000 documents, retrieval precision reportedly drops by 87%. At that scale, semantic search actually performs worse than old-school keyword search. The thing RAG was supposed to replace beats it.

What Actually Happens in Production

I’ve watched this pattern play out repeatedly. A team builds a RAG pipeline, it works beautifully at a few hundred documents, they celebrate, they present the demo, everyone loves it. Then they onboard the full knowledge base: five years of internal documentation, every support ticket, every product spec, every legal memo.

And the system gets dumb. Quietly, gradually dumb. It starts returning answers that are adjacent to correct. It confuses documents from different product lines. It confidently synthesizes contradictory chunks into a single coherent-sounding hallucination.

The team assumes it’s a prompt problem. They tune the prompt. It gets a little better. They assume it’s a chunking problem. They re-chunk. It gets a little better. What they’re actually fighting is a geometric property of the embedding space itself, and no amount of prompt engineering fixes that.

The Engineering Step Everyone Is Skipping

Retrieval engineering is a discipline. It includes things like hierarchical indexing, metadata filtering, hybrid retrieval combining dense and sparse methods, reranking with cross-encoders after the initial retrieval pass, and query decomposition before the vector lookup even runs.
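
Of those, the hybrid piece is the easiest to sketch. Here’s a minimal illustration that fuses BM25 keyword scores with dense cosine similarity; the embed callable, the alpha weight, and the hybrid_search helper are assumptions for the sketch, with the rank_bm25 package supplying the BM25 scoring.

```python
# Minimal sketch of hybrid retrieval: fuse BM25 keyword scores with dense
# cosine similarity. `embed` is a hypothetical stand-in for whatever embedding
# model the pipeline already uses; in practice, build the BM25 index once,
# not on every query.
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def hybrid_search(query, docs, doc_vecs, embed, alpha=0.5, k=10):
    """Return indices of the top-k docs under a weighted BM25 + dense score."""
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    sparse = np.array(bm25.get_scores(query.lower().split()))
    sparse = (sparse - sparse.min()) / (sparse.max() - sparse.min() + 1e-9)

    q = embed(query)
    dense = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    dense = (dense - dense.min()) / (dense.max() - dense.min() + 1e-9)

    fused = alpha * dense + (1 - alpha) * sparse  # tune alpha on a held-out query set
    return np.argsort(fused)[::-1][:k]
```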

Most teams I see skip all of it. They use a vanilla FAISS index or whatever their vector database defaults to, feed in full documents chunked at 512 tokens, and call it done. Then they wonder why the system degrades.

Reranking alone can recover a large portion of the precision you lose at scale. A cross-encoder reranker takes the top 50 retrieved chunks and re-scores them with full attention, not compressed vector math. It’s more expensive per query. It’s worth it. Hybrid retrieval, combining BM25 keyword scoring with semantic search, is also well-documented to outperform either method alone on most real-world corpora. These are not exotic techniques. They’re just not defaults.
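
The reranking pass is just as compact to sketch, assuming the sentence-transformers CrossEncoder wrapper and one of its public MS MARCO checkpoints; the rerank helper and the keep-5 cutoff are illustrative choices, not a recommendation.

```python
# Minimal sketch of the reranking pass: re-score the top candidates from the
# first-stage retrieval with a cross-encoder, then keep only the best few.
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # one public checkpoint

def rerank(query, candidate_chunks, keep=5):
    """candidate_chunks: the ~50 strings returned by vector or hybrid search."""
    scores = reranker.predict([(query, chunk) for chunk in candidate_chunks])
    ranked = sorted(zip(candidate_chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep]]
```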

Why Teams Skip This Step

Honestly, because retrieval feels solved. You add a vector database, you embed your documents, you ship. The demo works. The benchmark looks fine on the small test set. Nobody runs an adversarial evaluation at 50,000 documents before launch, because that would require building a ground-truth evaluation set, which takes time, which nobody budgeted.

The broader problem is that RAG got sold as a solution to hallucination. It was never that. It’s a context injection mechanism. Whether the injected context is accurate and relevant depends entirely on retrieval quality, which degrades predictably and mathematically as corpus size grows. RAG didn’t fix hallucinations. It relocated them from the model to the retrieval layer, where they’re harder to see.

What Good Retrieval Engineering Actually Looks Like

If you’re building a RAG system that will eventually exceed 10,000 documents, you need to plan for this from the start. That means hybrid retrieval with tuned BM25 weights, a reranking stage using a cross-encoder like Cohere Rerank or a fine-tuned BGE model, metadata filtering to constrain the search space before vector similarity runs, and a real offline evaluation harness with precision and recall metrics measured at target corpus size, not demo size.
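
The evaluation harness doesn’t have to be elaborate. Here’s a minimal sketch, assuming a hand-labeled ground-truth set and a search callable that wraps the full pipeline; both names are placeholders for whatever your stack actually exposes.

```python
# Minimal sketch of an offline evaluation harness. `search(query, k)` stands
# in for the full pipeline (hybrid retrieval + reranking), and ground_truth
# maps each query to the set of doc IDs a human marked as relevant.
def evaluate(search, ground_truth, k=5):
    precisions, recalls = [], []
    for query, relevant_ids in ground_truth.items():
        retrieved = search(query, k)                  # list of doc IDs
        hits = len(set(retrieved) & set(relevant_ids))
        precisions.append(hits / k)
        recalls.append(hits / len(relevant_ids))
    n = len(ground_truth)
    return {"precision@k": sum(precisions) / n, "recall@k": sum(recalls) / n}

# Run it at the corpus size you expect in production, not the demo size, and
# keep tracking the numbers as the knowledge base grows.
```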

It also means being honest about when RAG is the wrong tool. For some use cases, a well-structured traditional search system with explicit filtering will outperform a vector database at scale. That’s not a failure. That’s engineering.

The teams that figure this out will build systems that actually hold up in production. The ones that don’t will keep filing tickets wondering why their AI got dumber as the knowledge base grew.
