Critique of RAG at scale: the Curse of Dimensionality and why retrieval engineering is being skipped


The RAG Problem Nobody Wants to Admit

Every company I talk to is convinced that dumping more documents into their vector store makes their AI smarter. It doesn’t. Past a certain point, it actively makes things worse. The math here is not subtle, and the engineering community is mostly pretending it isn’t happening.

Let me explain why.

The Curse of Dimensionality Is Not a Hypothetical

When you build a RAG pipeline, each document gets converted into a high-dimensional vector, typically somewhere around 1,000 to 1,536 dimensions depending on your embedding model. In that space, similarity search works by finding vectors that are “close” to your query vector. Sounds reasonable.

The problem is a well-documented phenomenon in high-dimensional geometry: as dimensionality increases, pairwise distances concentrate, and points become nearly equidistant from one another. Your nearest neighbor is barely closer than your farthest one. The geometry collapses.

In a 1,000-dimensional space, nearly all of the volume concentrates in a thin outer shell, and your data concentrates with it. Everything sits at roughly the same distance from everything else.
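You can see the concentration effect with nothing but the standard library. This is a minimal sketch using random Gaussian points (not real embeddings): it measures the ratio of the nearest to the farthest distance from a query point as dimensionality grows. In low dimensions the ratio is small; in high dimensions it creeps toward 1, which is exactly the "nearest neighbor is barely closer than the farthest" problem.

```python
# Sketch: distance concentration in high dimensions, stdlib only.
# Random Gaussian points stand in for embeddings; the effect is the point.
import math
import random

random.seed(0)

def ratio_nearest_to_farthest(dim, n_points=500):
    """Return min_dist / max_dist from a random query to n_points.

    A ratio near 1.0 means the nearest and farthest neighbors are
    almost the same distance away, i.e. similarity rankings carry
    very little signal.
    """
    query = [random.gauss(0, 1) for _ in range(dim)]
    points = [[random.gauss(0, 1) for _ in range(dim)]
              for _ in range(n_points)]
    dists = [math.dist(query, p) for p in points]
    return min(dists) / max(dists)

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  nearest/farthest = {ratio_nearest_to_farthest(dim):.3f}")
```

Run it and watch the ratio climb with dimensionality. Real embedding spaces are not isotropic Gaussians, so the effect is less extreme in practice, but the direction is the same.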

This is not theoretical. One figure attributed to Stanford’s AI Index puts the drop in semantic-search precision at 87% by the time a corpus reaches 50,000 documents. At that scale, traditional keyword search actually outperforms vector similarity. The system you built to be smarter than grep becomes worse than grep.

The Pattern I Keep Seeing in Real Deployments

A team builds a RAG pipeline. It works beautifully at a few hundred documents. The retrieval is sharp, the answers are good, the demo impresses everyone. They ship it, celebrate, and then onboard the full knowledge base.

Performance quietly degrades. The AI starts hallucinating more, not less. Answers become confident and wrong in that specific way that makes users distrust the whole system.

The team’s first instinct is usually to tune the LLM. Adjust temperature, tweak the prompt, swap models. None of it helps, because the problem is upstream. The retrieval layer is returning noise with the same confidence score it used to return signal.

This is what I’d call “semantic collapse” in practice. The embedding space gets saturated, clusters overlap, and the top-k results returned to your context window are essentially random.

Why Retrieval Engineering Gets Skipped

Here’s my honest read on why this keeps happening: retrieval engineering is boring, it’s hard to demo, and it doesn’t show up in the benchmark your VP cares about.

Vector similarity search feels solved because the libraries are mature. Pinecone, Weaviate, Chroma, pgvector. They’re all polished. You can get a working demo in an afternoon. That success at small scale creates the illusion that the architecture will hold at production scale.

It won’t, not without real engineering on the retrieval side.

What actually needs to happen: chunking strategies that respect document semantics rather than token counts; metadata filtering to constrain the search space before the vector similarity step even runs; re-ranking with a cross-encoder after initial retrieval; hybrid search that combines sparse and dense retrieval; and query expansion to handle the vocabulary mismatch between how users ask questions and how documents are written.

None of that is in the tutorial. All of it matters.
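To make one of those techniques concrete, here is a minimal sketch of hybrid search using reciprocal rank fusion (RRF), one common way to merge a sparse (keyword) ranking with a dense (vector) ranking. The doc IDs and the two rankings below are illustrative stand-ins, not output from a real index.

```python
# Sketch: reciprocal rank fusion (RRF) for hybrid search.
# Combines multiple ranked lists without needing comparable scores.

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in.
    k=60 is the constant from the original RRF paper; it damps the
    influence of any single list's top-ranked result.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a BM25 pass and a vector-similarity pass.
sparse = ["doc_7", "doc_2", "doc_9", "doc_4"]
dense  = ["doc_2", "doc_5", "doc_7", "doc_1"]

print(reciprocal_rank_fusion([sparse, dense]))
```

Docs that rank well in both lists float to the top, which is the whole appeal: BM25 scores and cosine similarities live on incompatible scales, and RRF sidesteps the score-normalization problem entirely by fusing on rank.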

What Good Retrieval Actually Looks Like

The teams getting this right are not just throwing better embeddings at the problem. They’re treating retrieval as a multi-stage pipeline.

Stage one: narrow the candidate set using metadata, filters, or sparse keyword search. Get it down to a few thousand candidates before you do any vector math.

Stage two: vector similarity on the reduced candidate set. Now the geometry problem is manageable.

Stage three: cross-encoder re-ranking on the top 20-50 results. This is computationally expensive, which is why you don’t do it on the full corpus.
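The three stages can be sketched as a small pipeline. This is a toy version with stub data: the corpus, metadata fields, and the stand-in re-ranking score are all invented for illustration, and a real system would swap in a BM25 or metadata index, a real embedding model, and an actual cross-encoder. The shape of the pipeline is the point, not the scores.

```python
# Sketch of the three-stage retrieval pipeline described above.
# Stub scoring stands in for real BM25, embeddings, and a cross-encoder.
import math
import random

random.seed(1)

# Toy corpus: (doc_id, metadata, embedding). Fields are illustrative.
CORPUS = [
    (i, {"team": "billing" if i % 2 else "infra"},
     [random.gauss(0, 1) for _ in range(8)])
    for i in range(1000)
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(query_vec, team, k_dense=50, k_final=5):
    # Stage 1: metadata filter shrinks the candidate set
    # before any vector math runs.
    candidates = [d for d in CORPUS if d[1]["team"] == team]

    # Stage 2: dense similarity, but only over the reduced set.
    candidates.sort(key=lambda d: cosine(query_vec, d[2]), reverse=True)
    top = candidates[:k_dense]

    # Stage 3: expensive re-ranking on a small slice only. A real
    # system would call a cross-encoder here; this stub just reuses
    # the similarity score to show where that call belongs.
    def rerank_score(d):
        return cosine(query_vec, d[2])  # stand-in for a cross-encoder

    top.sort(key=rerank_score, reverse=True)
    return [d[0] for d in top[:k_final]]

query = [random.gauss(0, 1) for _ in range(8)]
print(retrieve(query, team="billing"))
```

Note where the cost lands: the filter is cheap and runs over everything, the vector math runs over thousands, and the expensive re-ranker only ever sees a few dozen candidates.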

That architecture holds at scale. The naive architecture, where the query goes straight to full-corpus ANN search, breaks predictably somewhere around 10,000 to 50,000 documents, depending on the embedding space and domain.

The math is not going to change. The Curse of Dimensionality is not a bug anyone is going to patch. It’s geometry. Your job as an engineer is to build around it.

The industry shipped RAG as a hallucination fix. What it actually is, when done properly, is a retrieval system that happens to feed an LLM. Retrieval systems are hard. They always were. Wrapping one in a chat interface didn’t change that; it just added another failure mode on top.

Build the retrieval layer like it matters. Because right now, for most production RAG systems, it’s the only thing that does.


#RAG #VectorSearch #AIEngineering #MachineLearning #LLM #RetrievalAugmentedGeneration #MLOps

Watch the full breakdown on YouTube

