RAG and agentic search both have a context problem

Executive Summary

RAG and agentic search are two opposing responses to the same problem: getting the right information into an LLM's context window. RAG gives you too little (fragmented chunks). Agentic search gives you too much (information flooding). Both assume the problem is retrieval. It isn't. It's data architecture.

The RAG chunking problem

RAG has become the default architecture for enterprise AI, and the core idea is sound: instead of relying on the model's training data, retrieve relevant documents at query time and inject them into the context. In practice, this means splitting documents into chunks (typically 256-1024 tokens), embedding them as vectors, and retrieving the top matches at inference time.

The problem is in the chunking. When you split a document into fixed-size segments, you cut across semantic boundaries. A paragraph explaining a financial policy gets split mid-sentence. A table loses its headers. A list of conditions gets separated from the clause it qualifies. The retrieved chunk scores well on vector similarity but arrives semantically incomplete.
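To make the boundary problem concrete, here is a minimal sketch of fixed-size chunking. It assumes whitespace-separated words stand in for model tokens, and the refund policy text is invented for illustration.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into chunks of `size` words, ignoring sentence boundaries.

    Assumption: whitespace-separated words approximate tokens. A real
    pipeline would count tokens with the embedding model's tokenizer.
    """
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Hypothetical policy text: a rule followed by the conditions that qualify it.
policy = (
    "Refunds are issued within 30 days. "
    "This applies only if the item is unopened and the receipt is present."
)

chunks = fixed_size_chunks(policy, size=8)
for number, chunk in enumerate(chunks):
    print(number, repr(chunk))
```

The first chunk ends at "This applies", so the rule lands in one chunk and its conditions in the next two. A query about refunds can match the first chunk strongly and still never see the conditions.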

"Vector similarity measures topical relevance, not semantic completeness. A chunk can be maximally similar to a query while being minimally useful for answering it."
— Token Theory

This is a well-known failure mode in the RAG community. Practitioners describe many retrieved contexts as "semantically orphaned": relevant to the topic but missing the surrounding context needed for accurate reasoning. You can increase chunk overlap to mitigate this, but overlap stores and retrieves the same text more than once, which inflates token costs without proportionally improving accuracy.
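The cost of overlap is simple arithmetic. The sketch below assumes a plain sliding window over a single document. The function name and the example sizes are illustrative, not taken from any particular framework.

```python
import math


def overlap_cost(doc_tokens: int, chunk_size: int, overlap: int) -> tuple[int, int]:
    """Return (chunk_count, total_tokens_stored) for a sliding-window chunker.

    Each window starts `chunk_size - overlap` tokens after the previous one,
    so every window after the first repeats `overlap` tokens.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    if doc_tokens <= chunk_size:
        return 1, doc_tokens
    stride = chunk_size - overlap
    count = math.ceil((doc_tokens - chunk_size) / stride) + 1
    total = doc_tokens + (count - 1) * overlap
    return count, total


for overlap in (0, 128, 256):
    count, total = overlap_cost(doc_tokens=10_000, chunk_size=512, overlap=overlap)
    print(f"overlap={overlap}: {count} chunks, {total} tokens stored")
```

For a 10,000-token document cut into 512-token chunks, a 128-token overlap stores 13,200 tokens instead of 10,000, and a 256-token overlap stores 19,728. The duplication grows quickly, while a chunk boundary can still fall in the middle of a clause.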

To be fair, techniques like semantic chunking, GraphRAG, and document hierarchies have made real progress here. But they're ultimately better implementations of the same retrieval paradigm. They improve how you find and cut the data. They don't solve the underlying problem of what that data looks like before the search even starts.

The agentic search flooding problem

Agentic search emerged as a response to RAG's limitations. Instead of retrieving pre-chunked fragments, an agent iteratively searches, reads, and synthesises information, more like how a human researcher would approach a complex question. Systems like Perplexity, OpenAI's deep research mode, and various enterprise agent frameworks take this approach.

You get significantly more context this way, often filling 50-80% of the available window. And this is where a different problem shows up.

[Figure: The Context Spectrum]

Liu et al.'s "Lost in the Middle" paper (published in TACL, 2024) showed that LLMs struggle to use information placed in the middle of long contexts. Performance is highest when the relevant information sits near the beginning or end of the input, and it degrades when the answer is buried among large volumes of supporting text. The model doesn't ignore the information; it just pays less attention to content that isn't near the beginning or end of the window.

Models like Claude can reliably retrieve a single fact from a 100K+ token window, as the "needle in a haystack" benchmarks demonstrate. But there's a gap between finding one fact and reasoning across multiple related facts scattered through a large context. Multi-hop reasoning, which means connecting several pieces of information to reach a conclusion, is where performance degrades as the window fills with noise. The model has everything it needs. It just can't effectively synthesise across it all.

Why both approaches fail at the same point

RAG and agentic search differ in how much they retrieve, but they share a fundamental assumption: that the quality problem can be solved at the retrieval layer. Find the right documents, and the model handles the rest.

This assumption breaks down because retrieval is a search problem, and search is not the same as understanding. A search system can find relevant documents. What it can't do:

Resolve entity conflicts across source systems ("John Smith" in CRM vs "J. Smith" in billing)
Deduplicate information that appears in multiple documents in different forms
Structure hierarchical relationships (this policy overrides that one, this amendment supersedes that clause)
Order temporal sequences correctly (this happened before that, so the later record is the current one)
Distinguish between canonical information and draft or outdated versions

These are data architecture problems, not search problems. They require a layer between the source systems and the context window that understands the shape of the data, not just its relevance.
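A small sketch shows why these steps sit outside retrieval. It assumes a hand-maintained alias table and a status flag on each record. The Record fields, the ALIASES table, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    source: str      # originating system, e.g. "crm" or "billing"
    customer: str    # customer name as written in that system
    field: str       # attribute being described
    value: str
    effective: date  # date the value took effect
    status: str      # "final" or "draft"


# Hypothetical alias table mapping name variants to one canonical entity ID.
ALIASES = {"john smith": "cust-001", "j. smith": "cust-001"}


def resolve(name: str) -> str:
    """Map a source-system name to a canonical ID (entity resolution)."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def current_facts(records: list[Record]) -> dict[tuple[str, str], Record]:
    """Keep one final record per (entity, field): the most recent one."""
    latest: dict[tuple[str, str], Record] = {}
    for record in records:
        if record.status != "final":
            continue  # drafts are not canonical
        key = (resolve(record.customer), record.field)
        if key not in latest or record.effective > latest[key].effective:
            latest[key] = record  # a later effective date supersedes an earlier one
    return latest


records = [
    Record("crm", "John Smith", "address", "1 Old Road", date(2023, 1, 10), "final"),
    Record("billing", "J. Smith", "address", "9 New Street", date(2024, 6, 1), "final"),
    Record("billing", "J. Smith", "address", "9 New Street", date(2024, 6, 1), "final"),
    Record("crm", "John Smith", "address", "5 Maybe Lane", date(2024, 9, 1), "draft"),
]

facts = current_facts(records)
print(facts[("cust-001", "address")].value)
```

All four records would look relevant to a search for this customer's address. Only the resolution, deduplication, and ordering rules reduce them to the single fact the model should see.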

The structured context layer

What we build at Token Theory is a structured context layer: an intermediate data architecture that sits between source systems and the LLM. It does the unglamorous work of entity resolution, deduplication, hierarchical structuring, and token optimisation before any data reaches the model.

The important distinction is that this layer isn't a retrieval system. It's a data product. It's maintained, versioned, and quality-controlled like a database, not generated on-the-fly like a search result. When a model needs context, it queries a pre-structured representation of the organisation's data rather than running a live search across raw source systems.

This trades retrieval flexibility for context quality. You can't ask arbitrary questions across all your data (that's what agentic search is for). But for the queries that matter, the ones your AI system handles hundreds of times a day, the context is precise, deduplicated, correctly ordered, and token-efficient.
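One way to picture the "data product, not search result" distinction is a small versioned store that is published ahead of time and read by key at query time. This is an illustrative sketch under that assumption, not a description of any production system. The ContextStore and ContextEntry names and the sample policy text are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ContextEntry:
    version: int
    text: str


@dataclass
class ContextStore:
    """Pre-structured context, published ahead of time and read by key."""

    entries: dict[str, list[ContextEntry]] = field(default_factory=dict)

    def publish(self, key: str, text: str) -> int:
        """Add a new version for `key` and return its version number."""
        history = self.entries.setdefault(key, [])
        history.append(ContextEntry(version=len(history) + 1, text=text))
        return history[-1].version

    def get(self, key: str, version: Optional[int] = None) -> ContextEntry:
        """Return the latest version, or a specific earlier one.

        Raises KeyError for keys that were never curated: the store answers
        the queries it was built for, not arbitrary ones.
        """
        history = self.entries[key]
        return history[-1] if version is None else history[version - 1]


store = ContextStore()
store.publish("refund_policy", "Refunds: 30 days, unopened, receipt required.")
latest_version = store.publish("refund_policy", "Refunds: 45 days, unopened, receipt required.")

print(store.get("refund_policy").text)             # latest published version
print(store.get("refund_policy", version=1).text)  # earlier version, kept for audit
```

Reading by key rather than by similarity is what gives up flexibility. In exchange, every answer draws on text that was reviewed before it was published, and earlier versions remain available for audit.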

Interested in working together?

Let's discuss what's possible for your organisation.

hello@tokentheory.ai
