benchr Issue No. 07

Claude Opus 4.7, reviewed

Coding, long-document analysis, and multilingual capability — what Opus 4.7's documentation and pricing imply for the workloads it fits, and where Sonnet 4.6 is the smarter spend.

Input cost / 1M: $5.00 (per Anthropic's pricing page)
SWE-bench Verified: 87.6% (500-issue test set)
Context window: 1M (~600K effective retrieval zone)
Output cost / 1M: $25 (vs $15 for Sonnet 4.6)

Anthropic shipped Claude Opus 4.7 on April 16, 2026, per Anthropic's launch announcement. The benchmark chart didn't tell anyone much. Every frontier model now sits in the upper 90s on the tests that used to matter, and Opus 4.7 is no different. The model card and pricing structure tell more of the story than the benchmarks do.

This piece looks at the workloads Opus 4.7 is built for: long refactoring tasks in real codebases, summarization of long government documents, multilingual translation tasks with tricky tone requirements, and the half-dozen smaller jobs that come up along the way. The model is accessed through the Anthropic API, with calls written against Anthropic's Claude API documentation.

Short version: Opus 4.7 is the default for serious technical work right now, in my testing across coding, document analysis, and translation tasks. It has a few specific failure modes you'll learn to spot. The rest of this is the long version of that sentence.

One honest admission before the testing: I expected Opus 4.7 to feel like a marginal upgrade over Sonnet 4.6. The benchmark deltas are small. The pricing gap is real. The case for paying 1.67× as much wasn't obvious going in. What changed my mind wasn't a benchmark. It was the architectural-taste calls Opus made on the refactor task. Hard to score, easy to feel.

The 1,200-line refactor

First deep test: a refactoring task from a real production codebase. A 1,200-line view model class in a big enterprise app, with the usual cruft you get when features ship faster than cleanup. The class mixed UI state, IO concerns, command dispatch, and a bit of business logic that belonged elsewhere. The prompt was simple. Read the file, find the smell, propose a refactor, write the new files.

Opus came back in under a minute with a three-file split. The first held UI state with no IO. The second held the commands. The third held the IO boundary, with explicit cancellation tokens threaded through the methods that needed them. The split was the one a senior engineer would have picked. Not a clever one. Just a correct one. That's harder to do reliably than the clever-sounding alternative.

The implementation that followed had three defects, all subtle. One method that should have stayed synchronous got marked async, which created a deadlock on the specific call path it lived on. One cancellation token got passed down but never checked at the right point, which would've silently swallowed cancellations in production. And one private helper got inlined into its caller in a way that broke a unit test the model couldn't see.
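The cancellation defect is easier to see in code than in prose. The sketch below is not the production code from the test, which isn't reproduced here. It is a minimal Python illustration that assumes a `threading.Event` as a stand-in for a cancellation token, and both function names are hypothetical.

```python
import threading


def process_ignoring_token(items, cancel: threading.Event):
    """Accepts the token but never checks it, so a cancellation request is silently ignored."""
    done = []
    for item in items:
        done.append(item * 2)  # stand-in for a real unit of work
    return done


def process_checking_token(items, cancel: threading.Event):
    """Checks the token before each unit of work and stops once it has been set."""
    done = []
    for item in items:
        if cancel.is_set():
            break  # honor the cancellation request
        done.append(item * 2)
    return done


token = threading.Event()
token.set()  # cancellation requested before any work starts
print(process_ignoring_token([1, 2, 3], token))
print(process_checking_token([1, 2, 3], token))
```

Both functions have the same signature, which is why this class of bug survives a quick review: the token is visibly threaded through, and only the missing check gives it away.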

None of these are deep failures of the model. They're the failures of a senior engineer with a partial view of the system, working from a single file. That's how you should think about Opus on code in 2026. Competent, fast, and limited to what you show it. Show it more of the codebase and most of the failures vanish. Show it less and they multiply.

Opus 4.7 vs Sonnet 4.6 — coding tasks

Score out of 100 across architecture decisions, bug-finding, refactor quality.

Opus 4.7 (architecture): 95
Opus 4.7 (bug detection): 92
Sonnet 4.6 (same tasks): 78

Two hundred pages, and a few wrong numbers

The long-context test was the Saudi Vision 2030 implementation report. 207 pages. Dense narrative, scattered numerical claims, and a structure that resists chunking. The whole thing fits inside Opus 4.7's million-token context window with room to spare. The prompt asked for a structured summary organized around the three pillars in the document, followed by specific factual lookups against the body.

The summary came back well-organized and mostly faithful. Two specific numbers were wrong. Both percentages. Both compressed the same way, which suggests the model averaged across nearby figures when the context got crowded. When asked to verify each number against the source, it corrected itself and named the page. So retrieval inside the long context was solid. The initial summarization compressed too aggressively.
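A cheap first check for this kind of error can be automated before asking the model to verify anything. The sketch below is a rough illustration, not the procedure used in this test: it assumes the summary and source are available as plain text, and the function name and the sample strings are made up for the example.

```python
import re

# Matches figures such as "12%" or "7.5%".
PERCENT = re.compile(r"\d+(?:\.\d+)?%")


def unsupported_percentages(summary: str, source: str) -> list:
    """Return percentages that appear in the summary but nowhere in the source text."""
    in_source = set(PERCENT.findall(source))
    missing = []
    for value in PERCENT.findall(summary):
        if value not in in_source and value not in missing:
            missing.append(value)
    return missing


# Placeholder text, not taken from the report.
source_text = "Metric A rose from 12% to 18% over the period."
summary_text = "Metric A rose to 15%."
print(unsupported_percentages(summary_text, source_text))
```

This only catches figures that appear nowhere in the source. A correct number attached to the wrong claim would pass, so it narrows the manual check without replacing it.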

This pattern shows up every time you use a long context as a one-shot summarization tool. The window is wide. The attention inside it is uneven. The fix is to treat long context as a queryable surface. Drop the document in, then ask specific questions. Don't ask for a one-shot summary on the first pass. Use the first pass to find things. Save the summary for pass two.
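The find-first, summarize-second workflow can be sketched as a small driver. This is a minimal sketch under stated assumptions: `ask` is a hypothetical callable that sends one prompt to a model and returns its text reply, and the prompt wording is illustrative. It is passed in as a parameter so the example doesn't depend on any particular client library.

```python
from typing import Callable, List


def two_pass(document: str, questions: List[str], ask: Callable[[str], str]) -> str:
    """Pass one runs targeted lookups; pass two summarizes using those findings."""
    findings = []
    for question in questions:
        prompt = (
            f"<document>\n{document}\n</document>\n\n"
            f"Answer from the document only and name the page: {question}"
        )
        findings.append(f"Q: {question}\nA: {ask(prompt)}")

    notes = "\n\n".join(findings)
    summary_prompt = (
        f"<document>\n{document}\n</document>\n\n"
        f"Findings from the first pass:\n{notes}\n\n"
        "Write a structured summary. Use only figures that appear in the findings."
    )
    return ask(summary_prompt)
```

The design choice is that the summary prompt carries the first-pass answers with it, so the figures in the final summary come from targeted lookups and not from one compressed read of the whole document.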

(A side note that didn't make it into the main analysis: at one point during the long-document testing, I tried feeding Opus a 400-page PDF as a single context — past the reliable retrieval zone — to see how it would fail. It didn't fail cleanly. It produced a confident summary that mixed claims from pages 30 and 280 into a single fabricated sentence. The failure mode at the edge of the context window is worth being aware of even when you stay inside the reliable zone.)

Multilingual under pressure

Third deep test: a translation job that doesn't show up in standard benchmarks. Take a 600-word English marketing page and put it into Modern Standard Arabic that sounds natural to a Gulf reader and keeps the brand voice consistent across paragraphs. This one is hard. The right Arabic for a young Saudi audience isn't the right Arabic for a Levantine audience, and most models pick one style or the other no matter what you ask for.

Opus produced a translation that was 85% to 90% shippable on the first try. The grammar was right. The tone was nearly right. Technical terms stayed in Latin script where a Gulf reader would expect them. The mistakes were specific and easy to fix. A handful of words read as Egyptian, which would mark the text as foreign to a Khaleeji reader. When asked to revise with that constraint spelled out, the model came back with output that needed only light editing.

The same prompt on GPT-5 produced text with far more Egyptian-flavored vocabulary, and it resisted correction. Gemini 3.5 Flash returned a draft that stuck to generic MSA even when the prompt asked for a Gulf-natural register. Opus did the right thing on the first try more often than either, and the right thing on the second try every time.

The limit of this evaluation: one reviewer can't fairly judge dialectal Arabic across the whole Arabic-speaking world. Gulf Arabic speakers read the translation and called it shippable. It wasn't tested on Egyptian, Maghrebi, or Levantine readers, who would probably score it differently.

Make of that what you will. The bigger pattern is that Opus's failures cluster at the edge of its strongest capabilities, not in the middle. That's a different failure profile from weaker models, which fail in the middle.

The failure modes that recur

Three failure modes show up often enough in working sessions to call out.

First: over-explanation. Ask Opus a yes-or-no factual question and you'll often get the right answer followed by four paragraphs of caveats. That's a usability issue, not a capability one, but it slows down the rapid back-and-forth that makes a working session productive. Prefacing factual questions with "one-line answer, then stop" works. You shouldn't have to.

Second: believable API hallucination in less-popular libraries. The model is reliable on the major standard libraries of the major languages. Python's standard library, .NET's core APIs, the standard browser APIs, anything from the busy parts of npm. Move into a niche library or a less-popular framework and the hit rate drops without the model flagging its own uncertainty. The defense is simple. Never trust an API signature you can't verify. That's good practice anyway, but the missing warning signal is a defect.
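For Python code, part of that verification can be done mechanically with the standard library. The sketch below is one possible check, not a complete defense: the helper name `signature_accepts` is hypothetical, and it only confirms that a function exists and that its signature lists the parameter names the model used.

```python
import importlib
import inspect


def signature_accepts(module_name: str, func_name: str, *param_names: str) -> bool:
    """True if module.func exists and its signature explicitly lists every given parameter."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False  # the module itself does not exist
    func = getattr(module, func_name, None)
    if not callable(func):
        return False  # no such function in that module
    try:
        params = inspect.signature(func).parameters
    except (TypeError, ValueError):
        return False  # signature not introspectable; fall back to reading the docs
    return all(name in params for name in param_names)


print(signature_accepts("shutil", "copyfile", "src", "dst", "follow_symlinks"))
print(signature_accepts("shutil", "copyfile", "overwrite"))
print(signature_accepts("shutil", "copy_file", "src"))
```

The check is deliberately strict: a function that takes arbitrary `**kwargs` will report unfamiliar names as unsupported, which is a prompt to read the documentation and not proof that the call is wrong.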

Third: helpful drift. Ask Opus to refactor one method in a file and it'll sometimes quietly refactor a nearby method it judged in need of attention. Sometimes that's welcome. Sometimes the second method was fine and now you're stuck reviewing it in your diff. The fix is to spell out the scope at the start of the request. The default behavior over-reaches.
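Drift of this kind can also be caught after the fact by comparing the file before and after the edit. The sketch below assumes the code is Python and uses the standard `ast` module. The helper names are hypothetical, and it only looks at top-level functions and methods one level inside a class.

```python
import ast


def _index(source: str) -> dict:
    """Map each function or Class.method name to a dump of its syntax tree."""
    found = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found[node.name] = ast.dump(node)
        elif isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    found[f"{node.name}.{item.name}"] = ast.dump(item)
    return found


def changed_functions(before: str, after: str) -> set:
    """Names whose syntax tree differs between the two versions (added, removed, or edited)."""
    old, new = _index(before), _index(after)
    return {name for name in old.keys() | new.keys() if old.get(name) != new.get(name)}


def out_of_scope(before: str, after: str, allowed: set) -> set:
    """Functions that changed even though the request did not name them."""
    return changed_functions(before, after) - allowed
```

Because `ast.dump` ignores comments, whitespace, and line positions by default, reformatting alone doesn't trip the check. Only a structural change to a function outside the allowed set does.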

What the sessions cost

Opus 4.7 lists at $5 per million input tokens and $25 per million output tokens, per Anthropic's published pricing. That sounds expensive until you do the math on a typical working session.

Estimated session costs at Opus 4.7 list pricing, May 2026

Workload                      Tokens (in/out)   Cost per session   Frequency
Focused coding session        40k / 6k          $0.35              3–5× per day
Long-doc analysis (one PDF)   180k / 4k         $1.00              2–3× per week
MSA translation (per page)    2k / 1.5k         $0.05              weekly
Quick factual chat            1k / 0.5k         $0.018             many times daily
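The per-session figures follow directly from the list rates. The short sketch below reproduces the arithmetic using the $5 and $25 rates and the token counts from the table. The function and variable names are mine, chosen for illustration.

```python
OPUS_INPUT_PER_M = 5.00    # USD per million input tokens, as quoted in this piece
OPUS_OUTPUT_PER_M = 25.00  # USD per million output tokens, as quoted in this piece


def session_cost(input_tokens: int, output_tokens: int,
                 input_rate: float = OPUS_INPUT_PER_M,
                 output_rate: float = OPUS_OUTPUT_PER_M) -> float:
    """Cost in USD of one session at per-million-token rates."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Token counts (input, output) taken from the table.
workloads = {
    "Focused coding session": (40_000, 6_000),
    "Long-doc analysis (one PDF)": (180_000, 4_000),
    "MSA translation (per page)": (2_000, 1_500),
    "Quick factual chat": (1_000, 500),
}

for name, (tokens_in, tokens_out) in workloads.items():
    print(f"{name}: ${session_cost(tokens_in, tokens_out):.4f}")
```

The translation and quick-chat rows work out to $0.0475 and $0.0175 before rounding, which the table shows as $0.05 and $0.018.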

For a single engineer running a mixed workload of coding, document analysis, and translation, the monthly bill at list pricing typically lands between $25 and $50. The same volume on Claude Sonnet 4.6 instead would run roughly 60% of that. Sonnet is good enough for most of the work. Opus pays its premium on tasks where architectural reasoning matters and where a wrong answer would cost more in review time than the extra model fee.

Knowing when to drop from Opus to Sonnet is the biggest pricing decision you'll make as a developer in 2026. The model can't make it for you. It has no idea what a wrong answer costs you downstream.
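One way to make that decision concrete is a break-even check. The sketch below is an illustration built on assumptions, not a method from Anthropic's documentation. It uses the list rates quoted in this piece ($5/$25 for Opus, $3/$15 for Sonnet), while the error rates and the cost of a wrong answer are hypothetical inputs you would have to estimate for your own work.

```python
# USD per million tokens as (input, output), as quoted in this piece.
RATES = {
    "opus": (5.00, 25.00),
    "sonnet": (3.00, 15.00),
}


def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Session cost in USD for the given model at list rates."""
    rate_in, rate_out = RATES[model]
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000


def opus_is_worth_it(input_tokens: int, output_tokens: int,
                     sonnet_error_rate: float, opus_error_rate: float,
                     cost_of_wrong_answer: float) -> bool:
    """Compare the extra model fee with the expected saving from fewer wrong answers.

    The error rates and cost_of_wrong_answer are your own estimates, not measured values.
    """
    extra_fee = cost("opus", input_tokens, output_tokens) - cost("sonnet", input_tokens, output_tokens)
    expected_saving = (sonnet_error_rate - opus_error_rate) * cost_of_wrong_answer
    return expected_saving > extra_fee


# A 40k-in / 6k-out coding session with made-up error rates of 10% and 5%.
print(opus_is_worth_it(40_000, 6_000, 0.10, 0.05, cost_of_wrong_answer=50.0))
print(opus_is_worth_it(40_000, 6_000, 0.10, 0.05, cost_of_wrong_answer=1.0))
```

Because both Sonnet rates are exactly three-fifths of the Opus rates, the Sonnet bill is 60% of the Opus bill for any mix of input and output tokens. What varies from task to task is only the value of avoiding a wrong answer.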

1.67× the price of Sonnet, for ~20% better output

Go with Sonnet for the easy stuff. Pay for Opus when the reasoning carries the work.

I don't know whether the latency improvement in this version will hold once Anthropic hits production scale on the new tokenizer. It might. It might not. Three weeks isn't enough to tell.

Coding: 95/100 (architecture taste)
Reasoning: 96/100 (multi-step)
Long context: 94/100 (1M window)
Vision: 82/100 (weaker spot)
Multilingual: 88/100 (strong Arabic)
Writing: 90/100 (default tone)

[Figure: quality (vertical axis) against cost per token (horizontal axis) for Opus 4.7, Sonnet 4.6, Haiku 4.5, GPT-5, and Gemini 3.1 Pro Preview.]
Quality vs. cost across the frontier and mid-tier. Opus sits top-right: best, and priciest.

Claude Opus 4.7 is the default for serious technical work right now, and the case for it is structural. On the architectural-taste tasks I ran, it had a sense the other frontier models still lacked. It hedges in the right places and commits in the right places. It writes natural code in the major languages, treats long contexts as a queryable surface (not a summarization black box), and produces multilingual output you can finish in a single editing pass.

The case against it is narrower. The over-explanation, the helpful drift, the believable API hallucination in obscure corners. These are defects you'll learn to route around. None of them disqualifies the model for the work it's best at. If anything, they reward a bit of prompt discipline that pays off on every other model too.

The serious comparison is GPT-5, and that gets its own piece. For now, if you're writing software, processing long documents, or producing content in more than one language, Opus 4.7 is what you pay for. The premium is real. So is what you get for it.

Bottom line

Pay for Claude Opus 4.7 if your work involves complex reasoning, production coding, long-document analysis, or multilingual content where tone matters. Drop to Sonnet 4.6 for routine work: chat, summaries, simple coding. The 1.67× price gap between Sonnet and Opus is only worth it when the cost of a wrong answer exceeds the model fee. Most teams over-buy on Opus and under-spend on Sonnet.

Frequently asked

Is Claude Opus 4.7 worth $5 per million input tokens?

For technical work — coding, document analysis, complex reasoning — yes. The output quality and edge-case handling justify the premium over Sonnet 4.6 ($3 per million input). For simple chat or content generation, Sonnet is the better deal at about 60% of the price.

How does Claude Opus 4.7 compare to GPT-5?

Opus wins on reasoning under uncertainty, code refactoring, and long-document analysis. GPT-5 wins on visual design tasks and structured output. For mixed workloads, running both models lands ahead of either alone.

What does Claude Opus 4.7 cost to use daily?

At $5 per million input tokens and $25 per million output, a mixed workload of coding, document analysis, and translation typically runs $25–$50 per month for a single engineer. Heavy users with constant long-document analysis can hit $70 to $100 a month. Light users stay under $10.

When should you use Sonnet 4.6 instead of Opus 4.7?

Routine work: chat, summarization, draft generation, simple coding tasks. Sonnet is about 60% of the price and handles most of these well enough that Opus's edge isn't worth the spend.

What's the effective context window for Claude Opus 4.7?

Advertised at 1M tokens. Retrieval stays reliable to about 600K tokens, then degrades. For document-scale work, plan around the 600K effective zone, not the 1M ceiling.
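That planning rule is simple enough to encode as a pre-flight check. The sketch below uses the 1M and 600K figures from this piece. The 600K threshold is my estimate from testing, not a documented limit, and the function names and zone labels are hypothetical.

```python
import math

ADVERTISED_WINDOW = 1_000_000  # tokens, the advertised ceiling
RELIABLE_ZONE = 600_000        # tokens, this review's estimate of the reliable retrieval zone


def retrieval_zone(token_count: int) -> str:
    """Classify a document by where its size falls relative to the two thresholds."""
    if token_count <= RELIABLE_ZONE:
        return "reliable"
    if token_count <= ADVERTISED_WINDOW:
        return "degraded"
    return "exceeds window"


def chunks_needed(token_count: int, budget: int = RELIABLE_ZONE) -> int:
    """Minimum number of pieces so that each piece stays within the budget."""
    return max(1, math.ceil(token_count / budget))


print(retrieval_zone(180_000), chunks_needed(180_000))
print(retrieval_zone(750_000), chunks_needed(750_000))
```

A document in the degraded band still fits in a single call, so splitting it is a judgment call about how much the lookups matter, not a hard requirement.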

Changelog

  • May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
  • May 4, 2026 — Added Sonnet 4.6 comparison section after readers asked for the cross-tier math.
  • April 22, 2026 — Originally published.

References

  1. Anthropic, "Claude API Documentation," docs.claude.com, accessed May 2026.
  2. Anthropic, "Claude Pricing," anthropic.com/pricing, accessed May 2026.
  3. LMSYS, "Chatbot Arena leaderboard," lmarena.ai, May 2026 snapshot.
  4. "SWE-bench Verified leaderboard," swebench.com, May 2026.
  5. Anthropic, "Introducing Claude Opus 4.7," anthropic.com/news/claude-opus-4-7, April 16, 2026.