benchr Issue No. 07

Context windows compared, across four frontier models

When the million-token window pays off, and when it's just expensive retrieval done badly.

Models compared: 4 (frontier and open-weight)
Max advertised: 10M (Llama 4 Scout)
Effective ceiling: 2M (where retrieval still works)
Cost per 200K query: $1.00 (Claude Opus baseline)

Most context-window benchmarks measure the wrong thing. They tell you what fits. They don't tell you what the model can actually find inside.

1M tokens for Claude, per Anthropic's API documentation. 2M for Gemini 3.1 Pro Preview, per Google's Gemini models page. 1M for GPT-5, per OpenAI's platform docs. The marketing pitch behind these numbers has been that more context equals more capability, that retrieval is an outdated coping mechanism for the small-window era, and that long-window models will replace it. The reality of using these tools is more interesting and a lot more boring. The long window is useful in a narrow set of workflows, expensive overhead in most others, and a worse pick than proper retrieval in a third category teams keep trying to force into it.

This piece compares the four serious long-context implementations on the same workload: what they do at the limits of their context windows, where the real degradation begins, and what the bills come to. Per-token costs throughout are verified against Anthropic's pricing page, OpenAI's API pricing, and Google's published rates. Llama 4 weights and license terms are documented at llama.com. The long context pays for itself on exploratory cross-document reasoning, wastes itself on tasks retrieval would handle better, and is most expensive precisely where it's least informative. For the case against using long context as a default, see the million-token marketing piece.

Advertised numbers versus working numbers

Effective context window vs. advertised, per benchr multi-fact synthesis tests, January 2026

| Model | Advertised context | Reliable retrieval zone | Cost per 1M input tokens |
| --- | --- | --- | --- |
| Claude Opus 4.7 | 1M | To ~600K | $5 |
| Gemini 3.1 Pro Preview | 2M | To ~1.2M | $5 |
| GPT-5 | 1M | To ~400K | $10 |
| Llama 4 Maverick | 1M | To ~250K | varies (self-host) |

The "reliable retrieval zone" column is the practical observation benchmarks rarely report: the rough token count past which recall on multi-fact synthesis tasks starts clearly dropping in actual use. The advertised number is the technical max the model will accept. The reliable zone is where the model will actually be useful. Past that, the model still runs, but synthesis quality drops faster than the simple needle-in-haystack benchmarks suggest.
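One way to turn "starts clearly dropping" into a number is to take the largest tested context length at which recall stays within some tolerance of the short-context baseline. The sketch below assumes that definition. The function name, the 0.10 tolerance, and the sample measurements are illustrative placeholders, not benchr's data or method.

```python
def reliable_zone(measurements, tolerance=0.10):
    """Return the largest tested context length (in tokens) reached before
    recall first falls more than `tolerance` below the shortest-context baseline.

    measurements: list of (context_tokens, recall) pairs, recall in [0, 1].
    """
    ordered = sorted(measurements)
    baseline = ordered[0][1]          # recall at the shortest context tested
    zone = ordered[0][0]
    for tokens, recall in ordered:
        if baseline - recall > tolerance:
            break                     # first clear drop: stop extending the zone
        zone = tokens
    return zone


# Hypothetical recall curve, made up for illustration only.
sample = [(50_000, 0.95), (200_000, 0.93), (400_000, 0.90),
          (600_000, 0.88), (800_000, 0.74), (1_000_000, 0.61)]
print(reliable_zone(sample))
```

A stricter tolerance shrinks the zone and a looser one widens it, which is one reason effective-window figures from different sources rarely agree.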

[Figure: advertised window vs. effective retrieval zone for Claude Opus, Gemini 3.1 Pro Preview, GPT-5, and Llama 4 Maverick. Bars show advertised context; orange fills show how much of that you can actually use.]

Gemini 3.1 Pro Preview is the strongest of the four at extreme scale. The 2M window is real, and the retrieval inside it holds up further than the alternatives. Claude is second. GPT-5 sits behind both on synthesis past 400K tokens, despite having the same nominal capacity as Claude. Llama 4 Maverick's million-token window is technically present but degrades much sooner in practice: recall drops clearly past 250K tokens.
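For planning purposes, the two columns can be treated as two separate limits. The sketch below copies the token figures from the table above into lookup tables and classifies a prompt size against them. The dictionary keys are labels invented for this example, not API model identifiers, and the three category names are likewise assumptions.

```python
# Token limits copied from the table above. Keys are labels for this sketch only.
ADVERTISED = {"claude-opus-4.7": 1_000_000, "gemini-3.1-pro-preview": 2_000_000,
              "gpt-5": 1_000_000, "llama-4-maverick": 1_000_000}
RELIABLE = {"claude-opus-4.7": 600_000, "gemini-3.1-pro-preview": 1_200_000,
            "gpt-5": 400_000, "llama-4-maverick": 250_000}


def classify_prompt(model, prompt_tokens):
    """Say where a prompt of `prompt_tokens` lands for `model`."""
    if prompt_tokens > ADVERTISED[model]:
        return "rejected"      # exceeds what the model will accept
    if prompt_tokens > RELIABLE[model]:
        return "degraded"      # accepted, but past the reliable retrieval zone
    return "reliable"


for name in ADVERTISED:
    print(name, classify_prompt(name, 500_000))
```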

[Figure: advertised window vs. effective retrieval, by model. Advertised maximum in outlined black; effective retrieval zone in orange.]

Worth flagging: the effective-retrieval-zone numbers in the table above come from multi-fact synthesis tests, not needle-in-haystack tests. The needle-in-haystack scores would put every model at near-perfect across the advertised window. The gap between the two test families is the thing this whole piece is about.
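The difference between the two test families can be sketched as a grading rule. A needle test passes if one planted fact comes back, while a synthesis test passes only if every required fact appears in the answer. The grader below is a simplified assumption (case-insensitive substring matching, invented function names, made-up facts), not benchr's actual harness.

```python
def needle_pass(answer, needle):
    """Needle-in-haystack grading: did the single planted fact come back?"""
    return needle.lower() in answer.lower()


def synthesis_score(answer, required_facts):
    """Multi-fact grading: fraction of required facts present in the answer."""
    text = answer.lower()
    found = sum(1 for fact in required_facts if fact.lower() in text)
    return found / len(required_facts)


def synthesis_pass(answer, required_facts):
    """A synthesis answer passes only if every required fact is present."""
    return synthesis_score(answer, required_facts) == 1.0


# Made-up example: the answer recovers two of the three facts.
facts = ["42%", "2030 target", "wage growth"]
answer = "The report cites 42% today against a 2030 target."
print(needle_pass(answer, facts[0]))
print(synthesis_score(answer, facts))
print(synthesis_pass(answer, facts))
```

The same answer can pass the needle check and fail the synthesis check, which is how a model can score near-perfect on one family and visibly degrade on the other.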

The worked example

To make this concrete, I ran the same test against a 207-page government implementation report: roughly 280,000 tokens of structured English text with scattered numerical claims and a layered argument. I posed three questions to all four models at full context.

Question one was broad: "What are the three pillars of the document and what does the report say about progress on each?" Claude and Gemini both gave strong answers covering all three pillars with relevant detail. GPT-5 hit two of three pillars in depth and compressed the third. Llama 4 Maverick produced a competent summary that missed one pillar almost entirely and conflated two of the others.

Question two was specific: "What metric does the report use for private sector contribution to GDP, and what are the current and target values?" All four models got the answer right when the relevant section was already loaded. None of them was as efficient at this task as a basic retrieval system would have been. The cost of running the question across the full document, even with caching, was twenty times the cost of running it against a retrieved chunk.

Question three was the one that justified the long context: "Are there internal inconsistencies between the housing-affordability claims in the early chapters and the GDP-mix projections in the later chapters?" Claude flagged a real tension between implied wage growth in the housing section and the labor-mix assumptions in a later chapter. Gemini caught the same tension and identified a second, smaller inconsistency Claude missed. GPT-5 noticed the relationship existed but didn't commit to a clear finding. Llama 4 produced output that didn't engage with the question at this depth.

This is the workflow where long context wins clearly. Cross-section synthesis can't be done well by retrieval, because retrieval surfaces chunks independently and has no way for the model to notice that section A and section M are talking past each other.
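A toy retriever makes the limitation visible. Each chunk is scored against the query on its own, so a chunk that only matters because of its relationship to another chunk never gets a boost. The word-overlap scoring and the three sample chunks below are invented for illustration. Production systems typically use embeddings rather than word overlap, but they usually score each chunk independently in the same way.

```python
def overlap_score(query, chunk):
    """Count the distinct query words that also appear in the chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def top_k(query, chunks, k=1):
    """Rank chunk labels by score; each chunk is scored independently of the others."""
    ranked = sorted(chunks, key=lambda label: overlap_score(query, chunks[label]),
                    reverse=True)
    return ranked[:k]


# Invented chunks standing in for distant sections of a long report.
chunks = {
    "housing": "housing affordability improves as wages rise four percent a year",
    "transport": "rail capacity expands under the transport programme",
    "labor_mix": "projections assume employment shifts toward lower paid service roles",
}

query = "what does the report claim about housing affordability"
print(top_k(query, chunks, k=1))
print(top_k(query, chunks, k=2))
```

In this toy corpus the labor-mix chunk is the one in tension with the housing claim, yet it shares no words with the query and never surfaces, even at k=2. A model reading the whole document at once has at least a chance of noticing the conflict.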

Long context is unbeatable for the cross-document questions you didn't know to ask. It's a waste of money for the precise questions you can already write down.

Claude Opus input cost per request, by context size (at $5 per 1M input tokens)

  • 8K query: $0.04
  • 50K query: $0.25
  • 200K query: $1.00
  • 600K query: $3.00
  • RAG retrieval, same answer from 4K tokens: $0.02
  • Cached prefix: 10% of the standard input price

Context window sizes over time

  1. 2022: 4K, GPT-3.5

    One letter, one email, one short article. That was it.

  2. 2023: 32K, GPT-4

    A short report, a small codebase, a long memo.

  3. 2023: 200K, Claude 2.1

    A novella, a long technical document, a real codebase.

  4. Feb 2024: 1M, Gemini 1.5 Pro

    First mainstream million-token context. A textbook in one prompt.

  5. Apr 2025: 10M, Llama 4 Scout

    The whole codebase, the whole corpus. Effective zone closer to 2M.

One genuine uncertainty: the effective-retrieval-zone numbers reported here come from multi-fact synthesis tests on legal, scientific, and policy documents. They're consistent across those document types in my testing. But I can't promise they generalize to every kind of input (code, structured data, transcripts, conversational logs), where the failure modes might be different. The numbers are a starting point, not a ceiling.

The cost picture

The 280,000-token version of the query, on Claude Opus 4.7, costs about $1.40 per question in input tokens. The same question answered against a proper vector store with the relevant chunks retrieved costs around $0.04. That's a 35× difference. For one question a day, the difference doesn't matter. For 500 questions a day, the difference forces the architecture.
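The arithmetic is easy to check. The sketch below assumes the $5 per 1M input-token rate from the table above and treats the $0.04 retrieval figure as a given per-question cost. It counts input tokens only and ignores output tokens, and the constant and function names are made up for this example.

```python
OPUS_INPUT_USD_PER_M = 5.00         # input rate from the table above
RETRIEVAL_USD_PER_QUESTION = 0.04   # retrieval figure quoted in the text


def input_cost(tokens, usd_per_million=OPUS_INPUT_USD_PER_M):
    """Input-token cost in dollars for one request."""
    return tokens / 1_000_000 * usd_per_million


long_context = input_cost(280_000)
ratio = long_context / RETRIEVAL_USD_PER_QUESTION
print(f"long context: ${long_context:.2f} per question")
print(f"ratio vs retrieval: {ratio:.0f}x")

# Extra daily spend from choosing long context over retrieval.
for per_day in (1, 500):
    gap = per_day * (long_context - RETRIEVAL_USD_PER_QUESTION)
    print(f"{per_day} questions/day: ${gap:.2f} extra per day")
```

At one question a day the gap is pocket change. At 500 a day it is a line item someone will ask about.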

Caching changes this picture a lot. If the same long document gets queried repeatedly, Claude's prompt cache drops the input cost on later queries to roughly 10% of the standard rate. Gemini's caching is comparable in mechanism but cheaper in absolute terms. With caching on, the long-context query against a frequently reused document costs roughly $0.40 per question on Claude. That is still ten times what retrieval would charge, but it is inside the range where the extra cost is worth paying for the workflows where long context wins. For the broader cost picture across workloads, see price per use case.
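To see how caching shifts the input-token math, the sketch below applies the roughly-10% cached rate to the 280,000-token document. It is a lower bound under simplifying assumptions: every query after the first is a cache hit, the first query is billed at the standard rate, and cache-write surcharges, cache expiry, and output tokens are all ignored. Real bills will run higher than these figures.

```python
STANDARD_USD_PER_M = 5.00   # standard input rate from the table above
CACHED_FRACTION = 0.10      # cached reads at roughly 10% of the standard rate
DOC_TOKENS = 280_000


def average_input_cost(queries, doc_tokens=DOC_TOKENS):
    """Average input cost per question when one document is queried `queries` times."""
    full = doc_tokens / 1_000_000 * STANDARD_USD_PER_M
    cached = full * CACHED_FRACTION
    total = full + cached * (queries - 1)   # first query uncached, the rest cache hits
    return total / queries


for n in (1, 10, 100):
    print(f"{n:>3} queries: ${average_input_cost(n):.3f} per question")
```

The average falls toward the cached rate as reuse grows but never reaches it, so the benefit depends heavily on how often the same document is actually queried before the cache expires.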

Anyway. On to the decision rule.

The decision rule

For exploratory questions on a single document, or for cross-section reasoning where the answer might depend on a relationship between distant parts of the source, long context is the right tool. The token cost is high. You're paying for capability retrieval can't provide.

For precise lookup where you can write the question in a sentence, retrieval wins on every dimension. Cost is lower by an order of magnitude. Latency is lower. The accuracy on the specific lookup is at least as good and often better, because the model is working with a tight context window instead of a long one.

For high-volume question answering against a fixed corpus, retrieval is the only sensible architecture. Long context at scale gets prohibitively expensive in ways no amount of caching fully fixes.

For corpora that exceed the context window of any available model, retrieval is mandatory. No choice to make.
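The four rules above can be written down as a small function. This is one reading of them, not a definitive policy: the function name, its parameters, and the priority order (fit first, then volume, then question type) are assumptions made for the sketch, and `context_limit` is taken to be the reliable retrieval zone rather than the advertised maximum.

```python
def choose_architecture(corpus_tokens, context_limit, cross_cutting, high_volume):
    """Apply the four rules above in an assumed priority order.

    corpus_tokens: size of the source material in tokens
    context_limit: reliable retrieval zone of the model you plan to use
    cross_cutting: True if the answer may depend on distant parts of the source
    high_volume:   True for sustained question answering against a fixed corpus
    """
    if corpus_tokens > context_limit:
        return "retrieval"        # corpus does not fit: no choice to make
    if high_volume:
        return "retrieval"        # long context at scale gets too expensive
    if cross_cutting:
        return "long context"     # capability retrieval cannot provide
    return "retrieval"            # precise lookup: cheaper, faster, as accurate


# The worked example: a 280,000-token report, a cross-section question, low volume.
print(choose_architecture(280_000, 600_000, cross_cutting=True, high_volume=False))
```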

The million-token context era has produced a real capability you should use deliberately. The capability isn't a replacement for retrieval. It's a complement that handles a different kind of question. Treating the long window as a universal substitute for retrieval is the most common architectural mistake in early 2026, and it's the mistake that produces the most surprising AI bills.

Among the four models compared here, on the multi-fact synthesis tests I ran, Gemini 3.1 Pro Preview is the strongest long-context implementation right now, with Claude Opus 4.7 a close second. The choice between them depends on the rest of the workload: Gemini for vision-heavy work, Claude for code and honest hedging, with long context as a near-tied capability either way. GPT-5's long context is competent but trails the leaders on synthesis past 400K tokens. Llama 4 Maverick's long context is real but degrades earlier in practice than the closed alternatives. Skip it for serious long-document work today.

For any production system: default to retrieval for most of the workload, and reach for long context only when the question is truly cross-cutting. The cost dynamics make that the only sensible choice at any real volume, and the capability story makes it the architecturally right choice even when volume is low. For the deeper RAG-versus-long-context math, see RAG vs fine-tuning, with the math.

Bottom line

Long context windows are useful narrowly and overpriced broadly. Plan around the effective retrieval zone (Claude ~600K, Gemini ~1.2M, GPT-5 ~400K), not the advertised maximum. Use RAG for precise lookup; it's 50-100× cheaper. Reach for long context only when the question is genuinely cross-cutting.

Frequently asked

Which AI model has the biggest context window?

On paper, Llama 4 Scout, which claims 10 million tokens, though effective retrieval holds only to about 2 million. Among the four models compared here, Gemini 3.1 Pro Preview is the largest at 2 million tokens advertised, and it is the field leader for reliable retrieval at scale.

What's the effective context window for Claude?

Claude Opus 4.7 advertises 1 million tokens. Retrieval stays reliable to about 600K tokens before degrading. Plan around the 600K number for serious document work.

How much does a long-context query actually cost?

A 200K-token query on Claude Opus 4.7 runs about $1 per request in input tokens. The same answer via RAG with 4K retrieved tokens costs around $0.02, a 50× difference. The math forces the architecture at any meaningful volume.

Does prompt caching change the cost story?

Yes. Cached prefixes run at ~10% of the standard input rate. If you're sending the same long context repeatedly, caching brings long-context queries closer to RAG economics, though RAG still wins on per-query cost.

When is long context worth the price?

Exploratory cross-document analysis and code understanding across a medium-sized codebase. Both need synthesis across distant parts of a coherent body of text — work RAG can't do because retrieval breaks the text into independent chunks.

Changelog

  • May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
  • January 22, 2026 — Re-verified effective-retrieval-zone numbers against multi-fact synthesis tests.
  • February 11, 2026 — Originally published.

References

  1. Anthropic, "Claude API Documentation," docs.claude.com, accessed May 2026.
  2. Anthropic, "Pricing," anthropic.com/pricing, accessed May 2026.
  3. Google, "Gemini API models," ai.google.dev/gemini-api/docs/models, accessed May 2026.
  4. OpenAI, "Platform documentation," platform.openai.com/docs, accessed May 2026.
  5. OpenAI, "API Pricing," openai.com/api/pricing, accessed May 2026.
  6. Meta, "Llama," llama.com, accessed May 2026.