Anthropic shipped Claude Opus 4.7 on April 16, 2026, per Anthropic's launch announcement. The benchmark chart didn't tell anyone much. Every frontier model now sits in the upper 90s on the tests that used to matter, and Opus 4.7 is no different. The model card and pricing structure tell more of the story than the benchmarks do.
This piece looks at the workloads Opus 4.7 is built for: long refactoring tasks in real codebases, summarization of long government documents, multilingual translation tasks with tricky tone requirements, and the half-dozen smaller jobs that come up along the way. The model is accessed through the Anthropic API, with calls written against Anthropic's Claude API documentation.
Short version: Opus 4.7 is the default for serious technical work right now, in my testing across coding, document analysis, and translation tasks. It has a few specific failure modes you'll learn to spot. The rest of this is the long version of that sentence.
One honest admission before the testing: I expected Opus 4.7 to feel like a marginal upgrade over Sonnet 4.6. The benchmark deltas are small. The pricing gap is real. The case for paying 1.67× as much wasn't obvious going in. What changed my mind wasn't a benchmark. It was the architectural-taste calls Opus made on the refactor task. Hard to score, easy to feel.
The 1,200-line refactor
First deep test: a refactoring task from a real production codebase. A 1,200-line view model class in a big enterprise app, with the usual cruft you get when features ship faster than cleanup. The class mixed UI state, IO concerns, command dispatch, and a bit of business logic that belonged elsewhere. The prompt was simple. Read the file, find the smell, propose a refactor, write the new files.
Opus came back in under a minute with a three-file split. The first held UI state with no IO. The second held the commands. The third held the IO boundary, with explicit cancellation tokens threaded through the methods that needed them. The split was the one a senior engineer would have picked. Not a clever one. Just a correct one. That's harder to do reliably than the clever-sounding alternative.
The implementation that followed had three defects, all subtle. One method that should have stayed synchronous got marked async, which created a deadlock on the specific call path it lived on. One cancellation token got passed down but never checked at the right point, which would've silently swallowed cancellations in production. And one private helper got inlined into its caller in a way that broke a unit test the model couldn't see.
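The second defect is easy to miss in review because the signature looks right. Here is a minimal sketch of the pattern. It is a Python analogy, not the code from the refactor: the article's codebase isn't shown, `threading.Event` stands in for the cancellation token, and both function names are hypothetical.

```python
import threading


def copy_chunks(chunks, cancel: threading.Event):
    """Process chunks, checking the cancellation flag before each one."""
    done = []
    for chunk in chunks:
        if cancel.is_set():  # the check the defective version never makes
            break
        done.append(chunk.upper())
    return done


def copy_chunks_ignoring_cancel(chunks, cancel: threading.Event):
    """Same signature, but the token is accepted and never consulted."""
    return [chunk.upper() for chunk in chunks]


cancel = threading.Event()
cancel.set()  # the caller has already asked for cancellation

print(copy_chunks(["a", "b"], cancel))                  # stops before doing any work
print(copy_chunks_ignoring_cancel(["a", "b"], cancel))  # runs to completion anyway
```

Both versions type-check and both take the token. Only one of them honors it, which is why this kind of bug tends to surface in production and not in a diff.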
None of these are deep failures of the model. They're the failures of a senior engineer with a partial view of the system, working from a single file. That's how you should think about Opus on code in 2026. Competent, fast, and limited to what you show it. Show it more of the codebase and most of the failures vanish. Show it less and they multiply.
Two hundred pages, and a few wrong numbers
The long-context test was the Saudi Vision 2030 implementation report. 207 pages. Dense narrative, scattered numerical claims, and a structure that resists chunking. The whole thing fits inside Opus 4.7's million-token context window with room to spare. The prompt asked for a structured summary organized around the three pillars in the document, followed by specific factual lookups against the body.
The summary came back well-organized and mostly faithful. Two specific numbers were wrong. Both percentages. Both compressed the same way, which suggests the model averaged across nearby figures when the context got crowded. When asked to verify each number against the source, it corrected itself and named the page. So retrieval inside the long context was solid. The initial summarization compressed too aggressively.
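A cheap mechanical check catches this class of error before a human has to. The sketch below flags any percentage in a summary that never appears verbatim in the source text. The function name is hypothetical, and the sample strings are made up for illustration; they are not figures from the report.

```python
import re

# Matches figures such as "35%" or "33.4%".
PERCENT = re.compile(r"\d+(?:\.\d+)?%")


def unsupported_percentages(summary: str, source: str) -> list[str]:
    """Return percentages quoted in the summary that never appear verbatim in the source."""
    source_figures = set(PERCENT.findall(source))
    return [p for p in PERCENT.findall(summary) if p not in source_figures]


# Invented sample data: the summary figure is the average of the two source figures.
source = "Participation rose to 33.4% in 2022 and 35.0% in 2023."
summary = "Participation reached roughly 34.2% over the period."

print(unsupported_percentages(summary, source))  # prints ['34.2%']
```

An exact-match check like this will also flag legitimate rounding, so treat its output as a list of numbers to verify, not a list of errors.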
This pattern shows up every time you use a long context as a one-shot summarization tool. The window is wide. The attention inside it is uneven. The fix is to treat long context as a queryable surface. Drop the document in, then ask specific questions. Don't ask for a one-shot summary on the first pass. Use the first pass to find things. Save the summary for pass two.
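The two-pass approach can be sketched as a small function. Everything here is an assumption for illustration: `ask` is a hypothetical callable that sends one prompt to whatever model client you use and returns its text reply, and the prompt wording is mine, not a tested template.

```python
from typing import Callable


def two_pass(document: str, questions: list[str], ask: Callable[[str], str]) -> str:
    """Pass one: targeted lookups. Pass two: a summary built on those answers."""
    findings = []
    for question in questions:
        answer = ask(
            f"{document}\n\nAnswer from the document only and cite the page: {question}"
        )
        findings.append(f"Q: {question}\nA: {answer}")

    notes = "\n\n".join(findings)
    return ask(
        f"{document}\n\nVerified notes:\n{notes}\n\n"
        "Summarize the document, using only figures that appear in the notes."
    )
```

Passing `ask` in as a parameter keeps the sketch independent of any particular SDK. In practice it would wrap your API call; in a test it can be a stub.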
(A side note that didn't make it into the main analysis: at one point during the long-document testing, I tried feeding Opus a 400-page PDF as a single context, past the reliable retrieval zone, to see how it would fail. It didn't fail cleanly. It produced a confident summary that mixed claims from pages 30 and 280 into a single fabricated sentence. The failure mode at the edge of the context window is worth being aware of even when you stay inside the reliable zone.)
Multilingual under pressure
Third deep test: a translation job that doesn't show up in standard benchmarks. Take a 600-word English marketing page and put it into Modern Standard Arabic that sounds natural to a Gulf reader and keeps the brand voice consistent across paragraphs. This one is hard. The right Arabic for a young Saudi audience isn't the right Arabic for a Levantine audience, and most models pick one style or the other no matter what you ask for.
Opus produced a translation that was 85% to 90% shippable on the first try. The grammar was right. The tone was nearly right. Technical terms stayed in Latin script where a Gulf reader would expect them. The mistakes were specific and easy to fix. A handful of words read as Egyptian, which would mark the text as foreign to a Khaleeji reader. When asked to revise with that constraint spelled out, the model came back with output that needed only light editing.
The same prompt on GPT-5 produced text with way more Egyptian-flavored vocabulary that resisted correction. Gemini 3.5 Flash returned a draft that stuck to MSA even when the prompt asked for dialect. Opus did the right thing on the first try more often than either, and the right thing on the second try every time.
The limit of this evaluation: one reviewer can't fairly judge dialectal Arabic across the whole Arabic-speaking world. Gulf Arabic speakers read the translation and called it shippable. It wasn't tested on Egyptian, Maghrebi, or Levantine readers, who would probably score it differently.
Make of that what you will. The bigger pattern is that Opus's failures cluster in places where the model is operating at the edge of its strongest capabilities, not in the middle. That's a different failure profile than weaker models, which fail in the middle.
The failure modes that recur
Three failure modes show up often enough in working sessions to call out.
First: over-explanation. Ask Opus a yes-or-no factual question and you'll often get the right answer followed by four paragraphs of caveats. That's a usability issue, not a capability one, but it slows down the rapid back-and-forth that makes a working session productive. Prefacing factual questions with "one-line answer, then stop" works. You shouldn't have to.
Second: believable API hallucination in less-popular libraries. The model is reliable on the major standard libraries of the major languages. Python's standard library, .NET's core APIs, the standard browser APIs, anything from the busy parts of npm. Move into a niche library or a less-popular framework and the hit rate drops without the model flagging its own uncertainty. The defense is simple. Never trust an API signature you can't verify. That's good practice anyway, but the missing warning signal is a defect.
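In Python, part of that verification can be automated. This is a minimal sketch using the standard library's `importlib` and `inspect`; the helper name `accepts_arguments` is hypothetical. It only checks explicitly named parameters, so a function that takes `**kwargs` may accept names this check rejects.

```python
import importlib
import inspect


def accepts_arguments(module_name: str, func_name: str, *param_names: str) -> bool:
    """True if the module imports, the function exists, and it names every given parameter."""
    try:
        module = importlib.import_module(module_name)
        func = getattr(module, func_name)
        params = inspect.signature(func).parameters
    except (ImportError, AttributeError, TypeError, ValueError):
        # Missing module, missing attribute, non-callable, or no introspectable signature.
        return False
    return all(name in params for name in param_names)


print(accepts_arguments("json", "dumps", "indent"))   # a real keyword argument
print(accepts_arguments("json", "dumps", "pretty"))   # plausible-sounding, not a named parameter
print(accepts_arguments("json", "dump_pretty"))       # plausible-sounding, does not exist
```

It won't tell you whether the call does what the model claims. It will tell you whether the call exists, which is the part the model gets wrong without warning.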
Third: helpful drift. Ask Opus to refactor one method in a file and it'll sometimes quietly refactor a nearby method it judged in need of attention. Sometimes that's welcome. Sometimes the second method was fine and now you're stuck reviewing it in your diff. The fix is to spell out the scope at the start of the request. The default behavior over-reaches.
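Drift is also easy to detect after the fact. For Python files, a rough sketch with the standard `ast` module: compare the syntax tree of every function before and after the edit, then subtract the ones you asked for. The function name is hypothetical, and the approach has known limits: same-named methods in different classes collide, and a change to a nested function also flags its parent.

```python
import ast


def changed_functions(before: str, after: str) -> set[str]:
    """Names of functions whose syntax tree differs between two versions of a file."""

    def index(source: str) -> dict[str, str]:
        # ast.dump omits line numbers by default, so moved-but-identical code is ignored.
        return {
            node.name: ast.dump(node)
            for node in ast.walk(ast.parse(source))
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        }

    old, new = index(before), index(after)
    return {name for name in old.keys() | new.keys() if old.get(name) != new.get(name)}


before = "def a():\n    return 1\n\ndef b():\n    return 2\n"
after = "def a():\n    return 10\n\ndef b():\n    return 2 + 0\n"

requested = {"a"}
drift = changed_functions(before, after) - requested
print(drift)  # prints {'b'}
```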
What the sessions cost
Opus 4.7 lists at $5 per million input tokens and $25 per million output tokens, per Anthropic's published pricing. That sounds expensive until you do the math on a typical working session.
| Workload | Tokens (in/out) | Cost per session | Frequency |
|---|---|---|---|
| Focused coding session | 40k / 6k | $0.35 | 3–5× per day |
| Long-doc analysis (one PDF) | 180k / 4k | $1.00 | 2–3× per week |
| MSA translation (per page) | 2k / 1.5k | $0.05 | weekly |
| Quick factual chat | 1k / 0.5k | $0.018 | many times daily |
For a single engineer running a mixed workload of coding, document analysis, and translation, the monthly bill at list pricing typically lands between $25 and $50. The same volume on Claude Sonnet 4.6 would run roughly 60% of that. Sonnet is good enough for most of the work. Opus pays its premium on tasks where architectural reasoning matters and where a wrong answer would cost more in review time than the extra model fee.
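The per-session figures in the table follow directly from the list prices. A short sketch of the arithmetic, using only the token counts and prices quoted above (the function and variable names are mine):

```python
# List prices quoted in the article, in dollars per million tokens.
INPUT_PRICE = 5.00
OUTPUT_PRICE = 25.00


def session_cost(tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one session at list pricing."""
    return (tokens_in * INPUT_PRICE + tokens_out * OUTPUT_PRICE) / 1_000_000


# Token counts (input, output) taken from the table.
workloads = {
    "Focused coding session": (40_000, 6_000),
    "Long-doc analysis (one PDF)": (180_000, 4_000),
    "MSA translation (per page)": (2_000, 1_500),
    "Quick factual chat": (1_000, 500),
}

for name, (tokens_in, tokens_out) in workloads.items():
    print(f"{name}: ${session_cost(tokens_in, tokens_out):.4f}")
```

The unrounded results are $0.35, $1.00, $0.0475, and $0.0175, which the table rounds to $0.05 and $0.018 for the last two. Output tokens cost five times as much as input tokens, so the translation row is the only one where output accounts for most of the bill.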
Knowing when to drop from Opus to Sonnet is the biggest pricing decision you'll make as a developer in 2026. The model can't make it for you. It has no idea what a wrong answer costs you downstream.
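One way to frame that decision is as a break-even: the premium is worth paying when the bad answers it avoids would have cost more to review than the premium itself. The sketch below is a back-of-the-envelope model, not a measured result. The $0.35 session cost and the 60% ratio come from the figures above; the $50 review cost is a pure assumption you should replace with your own number.

```python
def break_even_error_reduction(premium: float, review_cost: float) -> float:
    """Smallest drop in wrong-answer probability that justifies the premium."""
    return premium / review_cost


opus_session = 0.35                      # coding session cost from the table
sonnet_session = opus_session * 0.60     # the rough 60% figure quoted above
premium = opus_session - sonnet_session  # about $0.14 per session
review_cost = 50.0                       # ASSUMPTION: dollars of engineer time per wrong answer

threshold = break_even_error_reduction(premium, review_cost)
print(f"Opus pays off if it avoids a bad answer in more than {threshold:.2%} of sessions")
```

Under those assumptions the threshold is well under one session in a hundred. The number that matters is the review cost, and only you know it.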
Go with Sonnet for the easy stuff. Pay for Opus when the reasoning carries the work.
I don't know whether the latency improvement in this version will hold once Anthropic hits production scale on the new tokenizer. It might. It might not. Three weeks isn't enough to tell.
| Category | Score (/100) | Note |
|---|---|---|
| Coding | 95 | architecture taste |
| Reasoning | 96 | multi-step |
| Long context | 94 | 1M window |
| Vision | 82 | weaker spot |
| Multilingual | 88 | strong Arabic |
| Writing | 90 | default tone |

Claude Opus 4.7 is the default for serious technical work right now, and the case for it is structural. On the architectural-taste tasks I ran, it had a sense the other frontier models still lacked. It hedges in the right places and commits in the right places. It writes natural code in the major languages, treats long contexts as a queryable surface (not a summarization black box), and produces multilingual output you can finish in a single editing pass.
The case against it is narrower. The over-explanation, the helpful drift, the believable API hallucination in obscure corners. These are defects you'll learn to route around. None of them disqualifies the model for the work it's best at. If anything, they reward a bit of prompt discipline that pays off on every other model too.
The serious comparison is GPT-5, and that gets its own piece. For now, if you're writing software, processing long documents, or producing content in more than one language, Opus 4.7 is what you pay for. The premium is real. So is what you get for it. If you're shipping software in production, Opus 4.7 is the call.