Essay · May 2026

The million-token context was always a marketing number

Most long-context workloads still belong in a retrieval system. The narrow cases where the long window is worth the bill.

Updated May 25, 2026

Long context is useful. That's the concession, and million-token windows genuinely do matter for a narrow set of workflows. The trouble is that most teams deploying long context in 2026 are paying many times the cost of RAG for the same answer, because the workloads they point it at never needed synthesis across the full document in the first place.

200K-token query, Opus: $1.00 per request, no cache
Same query, RAG: $0.06 with 4K tokens retrieved
Cost ratio: 17×, long context vs RAG
Synthesis-quality drop: up to 40% above 500K tokens

Picture a 200,000-token document (about 200 pages of policy prose) and three questions of increasing specificity: a broad one, a precise one, and one that needs cross-section synthesis. Long context wins on the broad question and the synthesis question; retrieval matches it on the precise one. At list pricing, the long-context query runs about $1 of input tokens, while the retrieval query, with the right 4K tokens fetched, runs roughly $0.06. The math forces the architecture: retrieval for most of the volume, long context for the questions that depend on synthesis across distant parts of the source.
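The arithmetic behind the 200K-token query in the summary figures can be checked in a few lines. This is a minimal sketch, not a billing tool: the $5-per-million input rate and the $0.06 retrieval cost are the figures quoted in this piece, the function name is illustrative, and output tokens are ignored.

```python
# Input-token cost of sending a whole document with every question.
# Both constants are figures quoted in this piece; output tokens are ignored.

OPUS_INPUT_USD_PER_MILLION = 5.00   # list input price quoted in this piece
RAG_COST_PER_QUERY_USD = 0.06       # retrieval cost quoted in this piece


def input_cost_usd(tokens: int, usd_per_million: float = OPUS_INPUT_USD_PER_MILLION) -> float:
    """Cost of the input tokens for a single request."""
    return tokens * usd_per_million / 1_000_000


long_context = input_cost_usd(200_000)
ratio = long_context / RAG_COST_PER_QUERY_USD

print(f"long context: ${long_context:.2f} per query")
print(f"retrieval:    ${RAG_COST_PER_QUERY_USD:.2f} per query")
print(f"ratio:        {ratio:.0f}x")
```

Swapping in a different token count or rate shows how quickly the gap moves with document size.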

The million-token context is a useful but narrow capability. It pays for itself in a small set of workflows with specific properties, and it is wasted on most production workloads, where retrieval is cheaper and at least as accurate. The marketing sold the long window as a universal replacement when it is really a complementary tool, and this piece makes the case for that distinction. For the head-to-head model comparison, see context windows compared.

The early framing in the labs' marketing was that long-context retrieval is mostly solved on the frontier closed models. The needle-in-haystack benchmarks back that up, but the second-generation benchmarks and the document-scale workload reports tell a different story, with synthesis quality dropping sooner than the needle tests suggest. The labs' own documentation now treats long context as something you query rather than a buffer you summarize in one shot, which is the right framing.

What the benchmarks measured and what they missed

The benchmarks the labs used to demonstrate long-context capability were almost all needle-in-haystack tests. Drop a sentence into a long document, ask the model to find it. By 2025, the frontier models passed these tests at the limits of their context windows with near-perfect recall. The headline charts looked like the problem was solved.
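For readers who have not seen one, a needle-in-haystack case is simple to construct. The sketch below only builds the test document and checks an answer string; the filler sentences, the needle, and the function names are all made up for illustration, and the call to an actual model is left out.

```python
import random


def build_needle_test(filler: list[str], needle: str, depth: float,
                      total_sentences: int, seed: int = 0) -> str:
    """Return a haystack with the needle inserted at a relative depth.

    depth=0.0 puts the needle at the start, depth=1.0 at the end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    rng = random.Random(seed)
    haystack = [rng.choice(filler) for _ in range(total_sentences)]
    haystack.insert(round(depth * total_sentences), needle)
    return " ".join(haystack)


def recalled(model_answer: str, expected: str) -> bool:
    """Pass if the expected fact appears anywhere in the model's answer."""
    return expected.lower() in model_answer.lower()


filler = [
    "The committee met on Tuesday.",
    "Rainfall was average this year.",
    "The road was resurfaced in spring.",
]
needle = "The access code for the archive is 7319."
document = build_needle_test(filler, needle, depth=0.5, total_sentences=1000)
question = "What is the access code for the archive?"

# A real harness would send `document` plus `question` to a model here.
print(needle in document)
```

Note what the check rewards: locating one planted sentence. Nothing in it asks the model to connect two distant passages, which is the property the later benchmarks went after.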

It wasn't. The needle-in-haystack benchmark measures whether the model can find one fact in a large blob of text. What it leaves out is the harder work: synthesizing across distant parts of the blob, holding an argument together while reasoning about scattered evidence, producing a summary that doesn't quietly average across distinct claims. The second-generation long-context benchmarks (NVIDIA's RULER suite, the BABILong dataset, and the LongBench v3 suite) started measuring those harder properties and showed something less flattering. Synthesis quality drops clearly as the context grows, even when retrieval stays high. The advertised window sizes themselves come straight from the providers: 1M on Claude per Anthropic's API docs, 2M on Gemini per Google's models page. For more, see the piece on why the benchmarks themselves stopped telling you anything.

Whether the drop-off above 500K is fixable with better attention scaffolding or whether it is a hard limit of the current architecture is an open research question. The labs aren't committing in public.

The clean picture of the long-context capability: recall holds up well across the window, while synthesis degrades. A model will find the thing you buried in a long context, but reasoning across that context comes out weaker than the same model manages on shorter inputs.

Retrieval accuracy across context positions

Multi-fact synthesis test on Claude Opus, 600K-token document.

First 25% (start): 95%
20-40% range: 88%
50-75% range: 76%
75-100% range: 58%
Last 5% (end): 42%

17× Cost ratio: long context vs retrieval for precise lookup

Cost numbers in this piece use May 2026 pricing per Anthropic's published rates. Earlier in 2026 the long-context-vs-RAG ratio was closer to 50×. Anthropic's April 2026 pricing adjustment narrowed it to roughly 17× for Opus at typical retrieval token counts. Still a wide gap, narrower than the launch-era math.

Where the long window is worth it

Two workflows where the long context is the right tool, and realistically the only one that works.

Exploratory document analysis. Load a document (a research paper, a regulatory filing, a long-form report) and ask iterative questions. What does the document say about X? Okay, where does it argue Y, and what evidence does it cite? Is there tension between the assumptions in chapter three and the conclusions in chapter twelve? That conversation is impossible against even a properly chunked retrieval system, because retrieval surfaces chunks independently and gives the model no way to notice that chapter three and chapter twelve are in conversation with each other.

Code understanding across a medium-sized codebase. Drop a folder of 50 to 200 files into the context, ask where the right place is to add a feature that does X, and the model reads the structure and produces an answer grounded in the actual code. That's the workflow that makes modern coding assistants useful. Without long context, the work needs manual file selection, which requires you to already know roughly what to look for.

Both workflows share one property: the question needs synthesis across distant parts of a single coherent body of text. Because retrieval breaks that body into independent chunks, it has no way to see the relationship — long context does, at a cost the workflow justifies.

A window big enough to hold the document isn't the same as a model that can reason across it. The bill is the same either way.

Where the long window is just expensive retrieval done badly

For precise lookup (find me the section about X, quote me the paragraph that says Y), retrieval wins on every dimension. It costs roughly 17 times less at current Opus pricing (closer to 50 times less before the April 2026 adjustment) and returns faster, and the accuracy on the specific lookup is at least as good, often better, since the model is working with a tight context instead of a long one. Long context can serve these queries too, but only at that multiple of the price of running them against a vector store.

For high-volume Q&A against a fixed corpus (customer support knowledge bases, internal documentation queries, anything serving thousands of requests a day), long context is the wrong architecture. The per-query cost quickly climbs into unsustainable territory, and caching only blunts that; it doesn't fix it. This is retrieval's job.

For corpora past the context window of the largest available model (anything past 1M to 2M tokens, depending on the model), retrieval is mandatory. The long window doesn't stretch forever. Once the corpus is larger than the window, you don't have a choice to make.

The production decision

Most production AI systems today should default to retrieval. The cost dynamics force this for any workload with meaningful volume. The architectural simplicity makes it easier to debug, audit, and improve over time. The accuracy on specific lookups is at least as good as the long-context alternative. For the side-by-side cost math, see RAG vs fine-tuning.

Reach for long context second, specifically for the exploratory and cross-cutting questions retrieval can't answer well. The decision rule that holds up: if you can write the question down as a single sentence, use retrieval; if you have to read across the document to even know what to ask next, that's where long context earns its place.
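The decision rule can be written down as a small routing function. This is a sketch under stated assumptions: the `Workload` fields and the `choose_route` name are hypothetical, the synthesis flag is something a person or an upstream classifier would have to supply, and the 1M-token default window is only a placeholder. The window check reflects the earlier point that a corpus larger than the window leaves no choice.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    RETRIEVAL = "retrieval"
    LONG_CONTEXT = "long_context"


@dataclass
class Workload:
    corpus_tokens: int                    # size of the source material
    needs_cross_section_synthesis: bool   # must the answer connect distant parts of one document?


def choose_route(w: Workload, window_tokens: int = 1_000_000) -> Route:
    """Default to retrieval; use long context only for cross-cutting questions that fit."""
    # A corpus that does not fit in the window can only be served by retrieval.
    if w.corpus_tokens > window_tokens:
        return Route.RETRIEVAL
    # Questions that depend on relationships between distant sections
    # are the case where long context earns its cost.
    if w.needs_cross_section_synthesis:
        return Route.LONG_CONTEXT
    return Route.RETRIEVAL


print(choose_route(Workload(200_000, needs_cross_section_synthesis=False)))
print(choose_route(Workload(200_000, needs_cross_section_synthesis=True)))
print(choose_route(Workload(5_000_000, needs_cross_section_synthesis=True)))
```

The hard part in practice is setting the synthesis flag honestly, not the routing itself.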

The mistake the discourse keeps making is treating these two as rival architectures you have to pick between, when they're really complementary tools that handle different kinds of question. The best AI products use both, at different points in the same architecture, for the workloads each is best suited to.

1. Query. The user's actual question, often short.
2. Retrieve relevant 4K. Vector store returns the most relevant chunks.
3. Send to model. Question plus retrieved context, ~6K tokens total.
4. Answer. Grounded, citable, $0.06 per query.
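The four steps above can be sketched end to end with a toy retriever. Everything here is illustrative: word overlap stands in for the embedding similarity a real vector store would compute, the chunks and function names are invented, and the final model call is omitted.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, chunk: str) -> int:
    """Toy relevance: number of word occurrences shared by query and chunk."""
    shared = Counter(tokenize(query)) & Counter(tokenize(chunk))
    return sum(shared.values())


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Step 2: return the k highest-scoring chunks."""
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]


def build_prompt(query: str, retrieved: list[str]) -> str:
    """Step 3: question plus numbered context, so the answer can cite sources."""
    context = "\n\n".join(f"[{i + 1}] {ch}" for i, ch in enumerate(retrieved))
    return f"Answer using only the numbered context.\n\n{context}\n\nQuestion: {query}"


chunks = [
    "Refunds are issued within 14 days of a returned item being received.",
    "Shipping is free on orders above 50 dollars.",
    "Employees accrue 1.5 vacation days per month.",
]
query = "How many days until refunds are issued?"   # step 1
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
print(prompt)  # step 4 would send this prompt to a model
```

The numbered context is what makes the answer citable: the model can point at `[1]` and a reviewer can check it.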

Document types by cost-to-load and information density. Orange = retrieval-friendly. Black = long-context territory.

Not pretty, but it is how the marketing got framed.

One limit on this argument. The case rests on the dominant production-workload shapes the community has been describing in public: customer-facing knowledge-base Q&A, document analysis, and code understanding. For more specialized workloads — high-frequency signal extraction, or anything where the document changes faster than the prompt — the math shifts, and the right architecture may shift with it. Run your own numbers for your own use case.

Why the marketing said otherwise

The labs had structural reasons to oversell the long window. Retrieval is operationally complex. You need a vector database, an embedding model, a re-ranking pass, a content-chunking strategy, and a maintenance discipline. The long window promised to make all of that go away. The pitch was sticky because it appealed to anyone who didn't want to build the retrieval plumbing.

The labs also had commercial reasons. Long-context queries are expensive. A customer who replaces their RAG pipeline with a long-context loop pays the lab a lot more per query. The math worked for the labs even when it didn't work for the customers.

None of this is a conspiracy. It's just the normal pattern of a new capability being oversold during its first wave of marketing, with the corrective coming later, when people tried to use it as advertised and found it expensive in ways the marketing didn't flag.

The million-token context is a useful capability, but it complements retrieval rather than replacing it, handling a different kind of question at a cost the workflow has to justify. Treating the long window as a universal substitute is the most common architectural mistake in early 2026, and it is the one that produces the most surprising AI bills.

For your production system: default to retrieval and reach for long context only when the question is genuinely cross-cutting in a way retrieval cannot serve. At meaningful volume the cost dynamics make that the only sensible choice, and even when volume is low it remains the architecturally right one.

The labs will eventually correct their public messaging on this. The benchmarks are already moving toward measuring the synthesis-quality property that matters in practice, and the community is already articulating the complementary-tool framing. The marketing will catch up slowly. Until it does, the right move for anyone building on these tools is to ignore the headline window numbers and design the system around the workflows long context serves well.

Frequently asked

Is a million-token context window useful?

Yes, but for a narrow set of workflows. Exploratory document analysis and code understanding across a coherent codebase both benefit. Precise lookup queries do not, since RAG handles those for roughly 6% of the cost at current Opus pricing.

Why is long context so expensive?

Every query pays for every token in the context, every time. A 200K-token document at Opus prices ($5 per million input) costs $1 per question. At 100 questions a day, that's $100 daily versus $6 with retrieval.

When should I use RAG instead of long context?

Always, unless your question requires synthesis across distant parts of a single document. RAG is cheaper, faster, more auditable, and the accuracy on specific lookups is equal or better.

Does caching make long context cheaper?

Yes. Anthropic and Google both offer prompt caching at ~10% of standard input rates. For repeated queries against the same document, caching brings long context within striking distance of RAG. For single-query workloads, it doesn't help.
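A rough daily comparison shows what "within striking distance" means. The sketch assumes the figures quoted in this piece ($5 per million input tokens, cached reads at 10% of that, $0.06 per retrieval query) and simplifies heavily: the cache is assumed to stay warm all day, any cache-write surcharge is ignored, and the question's own tokens are treated as negligible.

```python
INPUT_USD_PER_MILLION = 5.00    # Opus input rate quoted in this piece
CACHED_READ_FRACTION = 0.10     # "~10% of standard input rates", per this piece
RAG_COST_PER_QUERY_USD = 0.06   # retrieval cost quoted in this piece


def daily_cost_with_cache(doc_tokens: int, queries: int) -> float:
    """First query pays the full input price; later ones read the document from cache."""
    if queries <= 0:
        return 0.0
    full = doc_tokens * INPUT_USD_PER_MILLION / 1_000_000
    return full + (queries - 1) * full * CACHED_READ_FRACTION


queries = 100
uncached = queries * 200_000 * INPUT_USD_PER_MILLION / 1_000_000
cached = daily_cost_with_cache(200_000, queries)
rag = queries * RAG_COST_PER_QUERY_USD

print(f"uncached: ${uncached:.2f}  cached: ${cached:.2f}  RAG: ${rag:.2f}")
```

Under these assumptions, 100 questions a day against a 200K-token document come to about $10.90 with caching, against $100 without it and $6 with retrieval. With a single query the function returns the full uncached price, which is the "doesn't help" case.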

What's the failure mode of long context at scale?

Synthesis quality drops as context length grows, even when retrieval recall stays high. Benchmarks like RULER and BABILong measure this. Frontier models lose 20-40% of their reasoning quality past 500K tokens.

Changelog

May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
May 16, 2026 — Originally published.

References

Anthropic, "Claude API Documentation," docs.claude.com, accessed May 2026.
Anthropic, "Pricing," anthropic.com/pricing, accessed May 2026.
Google, "Gemini API models," ai.google.dev/gemini-api/docs/models, accessed May 2026.
NVIDIA, "RULER benchmark," github.com/NVIDIA/RULER, accessed May 2026.
"BABILong dataset," Hugging Face, huggingface.co/datasets/RMT-team/babilong, accessed May 2026.