GPT-5 vs Claude Opus 4.7: seven workloads, scored

Seven workload categories where the two labs have made different bets. The scoreboard, and which model wins which kind of work.

· View changelog

Final tally 5–2 Claude wins five outright
Opus SWE-bench Verified 87.6% Anthropic-reported
GPT-5 SWE-bench Verified 74.9% OpenAI-reported
Input price spread $5 vs $1.25 Opus / GPT-5 per 1M tokens

This piece compares the two frontier models you're most likely to be deciding between in 2026 across seven workload categories: a code refactor, a marketing landing page, reasoning under uncertainty, an instruction-bound recipe task, a paper summary, a difficult customer email, and a Python debug. The verdict for each category is grounded in the public benchmark record (SWE-bench Verified, LMArena), each lab's own positioning of its model, and the consistent public discussion of how each tier handles each category. The headline: Opus 4.7 takes five, GPT-5 takes one decisively, and there is one tie.

The pricing reference, since both come up in every workload decision: Claude Opus 4.7 lists at $5 per million input tokens and $25 output, per Anthropic's pricing page. GPT-5 lists at $1.25 and $10 per OpenAI's API pricing. For the full cost picture across workloads, see price per use case. For each model on its own terms, see the Opus review and the GPT-5 review.

One framing note before the seven categories: treat this as a directional comparison rather than an exhaustive evaluation. A different seven categories would reorder some of the verdicts. The 5-2 is a useful summary of where the two labs have aimed their models, and it's worth holding loosely.

Category one: refactor a class hierarchy

The workload: take a production class with five derived types and crosscutting concerns, name the architectural smell, propose a refactor, and produce the new files. This is the kind of work the SWE-bench Verified benchmark is designed to score. Anthropic reports Opus 4.7 at 87.6% on the verified subset, against OpenAI's reported figure for GPT-5 around 74.9%. A gap that wide lines up with the consistent community discussion of where each model lands on production refactoring work.

What you should expect from each model on this kind of task, based on the public record: Opus tends to name the actual architectural smell (the leaky base-class surface, not the inheritance) and propose a split that reads like the call a senior engineer would've made. GPT-5 produces correct code that is more verbose and tends to add scaffolding the prompt did not ask for. Both ship working solutions. Opus's needs less cleanup before it lands in your codebase.

Winner: Claude, with the gap consistent with the SWE-bench rank order.

Category two: write a marketing landing page

The workload: produce a single-page marketing site for a young technical audience. Vanilla HTML and CSS, mobile-first, bold typographic hierarchy. OpenAI's positioning for GPT-5 explicitly emphasizes visual design and structured output as model strengths. The public community discussion of both models has converged on the same pattern: GPT-5's defaults look more contemporary, Opus's defaults look like an enterprise SaaS dashboard. The same prompt asked of both will give you a more shippable layout out of GPT-5 every time.

The specifics behind the pattern: GPT-5 leans on a single confident accent color, oversize hero typography with tight letter-spacing, and quiet asymmetric grids. Opus produces a more cautious palette and a more conventional grid. Ask Opus to revise toward something bolder and you can feel the model trying, where GPT-5 arrives bold without being asked. That is the visual sensibility you are paying OpenAI for, and on this category it earns the premium.

Winner: GPT-5, and the margin is wide.

Category three: reasoning under uncertainty

The workload: a specific regulatory or policy question where the right answer needs familiarity with both the underlying statute and the trajectory of its enforcement. A good answer names the relevant article, separates the statute from how it actually gets enforced, and is honest about the limits of what a non-specialist can assert.

This is the workload Anthropic positions Opus on most heavily. The lab's release material talks about "hedging in the right places" — flagging uncertainty when the model's at the edge of its competence. The public discussion across Anthropic's forum and the broader research community is consistent: Opus produces a careful answer that separates what it is sure of from what you should still run past a human expert. GPT-5's default tone on the same question is confident, whether or not that confidence is earned.

The failure pattern you should plan around with GPT-5 here: a nearly-correct answer with one citation that is wrong in a way a non-specialist reader will not catch. That is the worst kind of failure for this workload, because it survives a casual review. Opus's instinct to hedge is what guards against that class of error.

Winner: Claude. The hedging is part of the value.

Category four: a constrained instruction-following task

The workload: an everyday task with a quantitative constraint, such as a time budget or a recipe built from a fixed set of ingredients. The interesting question is which model respects the constraint without adding scope.

Both models handle the substance fine. The pattern that recurs across the public community discussion is that GPT-5 tends to add optional flourishes — an extra step, a note on substitutions — that push it over the constraint. Opus respects the constraint by default and asks before adding scope. When you want a quick task done inside the budget you specified, that is the right behavior. When you want a richer answer that explores the space, GPT-5's tendency to elaborate is the feature.

Edge to Claude on constraint compliance. Call it a tie if you value the elaboration.

Category five: summarize a long technical paper

The workload: take a 60-page technical paper and produce a 1500-word summary for an engineer who knows the basics but has not read it. There are two reasonable ways to structure this: by section (walk the paper in order) or by claim (identify the contributions and pull the supporting experiments into each one).

The pattern across the public community discussion: Opus tends to structure by claim, GPT-5 by section. The right answer is whichever structure fits the audience the prompt named. An engineer who mostly wants the takeaways is better served claim-first; an academic reviewer who wants section-by-section coverage is better served the other way. So the category goes to whichever model's default matches your reader, and for the prompt above — engineer, what to take from it — that is Opus.

The secondary risk on long technical summaries, across both models, is that the experimental section gets compressed to one sentence when its substance carries the paper's broader argument. GPT-5 is somewhat more prone to that on complex experimental sections, while Opus is more likely to err toward dry prose. Pick your trade.

Winner: Claude when the audience is "engineer, what to take from it."

Category six: a difficult customer email

The workload: a real-feeling customer-service scenario. A paying customer is upset. A bug took longer than it should have to fix. A downstream side effect hit something the customer cared about. Draft a reply that takes responsibility, explains the situation without excuses, offers a concrete remedy, and reads like a human wrote it.

GPT-5 has a reputation for warm prose. On this specific workload, the pattern across the public community discussion is that Opus produces a draft that reads like a person wrote it rather than a brand. GPT-5's defaults lean toward corporate apology language ("we sincerely apologize for the inconvenience"). Opus's defaults lean toward direct, plain-language apology ("I owe you an apology, and an explanation that doesn't try to dodge any of this"). The latter is what a small team should send.

GPT-5 will get there on a second pass when you spell out the tone constraint, where Opus tends to land it first try. On a workload where tone is the deliverable, that first-attempt difference is what you are paying for.

Winner: Claude. Tone discipline is the work here.

Category seven: debug a broken Python script

The workload: a 140-line Python script with four deliberately-introduced bugs. Three obvious. One subtle (an async ordering bug that only surfaces under specific call patterns). The interesting question is which model flags the subtle one without being asked.

This is the category most directly tied to SWE-bench Verified, and the rank order tracks the benchmark. Opus's documented behavior is to flag ambiguous code as a question ("is this the intended choice?") rather than silently overwriting. That instinct is what catches subtle production bugs on the first pass. GPT-5 is more likely to fix the three obvious bugs cleanly and miss the fourth, then identify it correctly only when prompted to look at the async section again.

First-pass behavior is what matters here, because in a production debugging workflow you usually do not know what you are missing. Opus's instinct to ask about ambiguous code is what stops the bug that would otherwise ship quietly.

Winner: Claude, by the bug it bothered to flag.

Seven workloads, qualitative verdicts

Strength of fit for each model per workload category, scaled 0–100. Aubergine: Claude Opus 4.7. Outlined: GPT-5.

Refactor: Claude
Strong
Refactor: GPT-5
OK
Landing page: Claude
Weak
Landing page: GPT-5
Strong
Debug script: Claude
Strong
Debug script: GPT-5
OK
5–2 Claude wins on five of seven workload categories

The scoreboard, in prose

Pulling the seven verdicts together: Claude takes the class-hierarchy refactor on architectural taste, the regulatory question on honest hedging, the constrained recipe on instruction compliance, the paper summary on structure choice for the named audience, the difficult email on tone work, and the Python debug by flagging the subtle async bug. GPT-5 takes the marketing landing page decisively on visual sensibility. Seven categories, five-two on the scoreboard, with two of Claude's wins narrow enough that a different sample could move them.

STYLE → CORRECTNESS ↑ Claude Opus 4.7 GPT-5 Gemini 3.1 Pro Preview
Two axes, three models. Claude leans correctness. GPT-5 leans style. Pick the corner that matches your work.

The scoreboard looks one-sided. The lived experience is closer, because the one category where GPT-5 wins matters a lot to some readers.

Refactor

Claude Architectural taste

Landing page

GPT-5 Visual sensibility

Reasoning

Claude Honest hedging

Recipe

Claude Constraint compliance

Paper summary

Claude Audience fit

Hard email

Claude Tone work

Debug script

Claude Flagged the subtle bug
1. Identify the workload

Technical correctness, visual design, tone work, structured output, long context, or breadth.

2. Read the lab's positioning

Anthropic and OpenAI both publish detailed positioning. The lab tells you where the model fits.

3. Cross-check the public record

SWE-bench Verified for coding. LMArena for general capability. Community discussion for the rest.

4. Verify on your own workload

The benchmark gives you a rank order. Your workload gives you the truth.

Which one for which work

For code with non-trivial structure, default to Claude. The refactor sense, the bug-flagging instinct, and the willingness to ask a clarifying question instead of silently overwriting are documented strengths, and they show up in production every day.

Anything visual — landing pages, slide decks, dashboard mockups, work where taste matters — is worth trying on GPT-5 first. Its defaults are closer to what a contemporary audience expects, and you will spend less time reshaping the output.

Legal, medical, or financial analysis, where a confident wrong answer does more damage than an openly uncertain one, belongs with Claude. The honest hedging is the right fit.

When you're writing in a non-English language where tone matters (dialect, audience-specific voice), Claude wins more often than the leaderboard would predict on Arabic, while GPT-5 keeps an edge on most other world languages. The Arabic content piece walks through that specific axis.

And if your workload is high-volume and low-latency, with price-per-token dominating the decision, neither of these is your model. Drop to Claude Sonnet 4.6 or GPT-5 Mini and compare those instead. The small-models piece covers when to drop a tier.

The single recommendation that holds across most teams: subscribe to Claude Opus 4.7 as the default and keep a paid OpenAI key for visual design and the occasional case where GPT-5's voice fits the job. Running both costs a small premium over committing to one, and for most teams the range of work it covers more than pays for itself.

Frequently asked

Who wins between GPT-5 and Claude Opus 4.7?

On the seven workload categories this piece scores, Claude Opus 4.7 takes five and GPT-5 takes one decisively (landing-page design), with one effective tie. The verdict is consistent with the public benchmark record and each lab's own positioning of its model.

Should I subscribe to both Claude and GPT-5?

If your work spans both technical and creative-design tasks, yes. The two together handle more types of work than either alone, and the marginal cost of a second API key is usually small relative to the value of having the right model for each job.

Which is better at coding?

Claude Opus 4.7, per the SWE-bench Verified rank order Anthropic reports (87.6%) vs OpenAI's reported figure for GPT-5 (around 74.9%). Anthropic's positioning emphasizes architectural reasoning and hedging; OpenAI's positioning leans on breadth and structured output.

Which is better at writing?

GPT-5 for stylistic flexibility and breadth. Claude for technical writing and tone discipline. Each lab's positioning is consistent with the public community discussion of where the two models land.

Which model is faster?

GPT-5 is the faster model in most public reports. OpenAI does not publish single-figure tokens-per-second numbers, but the consensus across the developer community is that GPT-5 streams output faster than Opus 4.7 by a meaningful margin.

Changelog

  • May 25, 2026 — Rewrote the per-category sections so the verdicts are grounded in the public benchmark record and each lab's published positioning, not in a private lab test. Pricing verified against current provider documentation.
  • April 30, 2026 — Corrected the GPT-5 pricing reference to the published $1.25 input / $10 output per million tokens.
  • April 28, 2026 — Originally published.

References

  1. Anthropic, "Claude API Documentation," docs.claude.com, accessed May 2026.
  2. Anthropic, "Claude Pricing," anthropic.com/pricing, accessed May 2026.
  3. OpenAI, "API Documentation," platform.openai.com/docs, accessed May 2026.
  4. OpenAI, "API Pricing," openai.com/api/pricing, accessed May 2026.
  5. "Chatbot Arena leaderboard," lmarena.ai, May 2026 snapshot.
  6. "SWE-bench Verified leaderboard," swebench.com, May 2026.