OpenAI shipped GPT-5 in August 2025, per OpenAI's launch post. GPT-5 was a big step up from GPT-4o, and nobody disputes that. The real question is whether you should still pay for it once Claude Opus 4.7 and the Gemini 3 family are also on the table. For specific work, yes. This piece walks through which work, drawing on the public benchmark record, OpenAI's own positioning, and the comparison patterns already visible in public discussion of the three frontier labs' models.
GPT-5 is the most stylistically flexible of the three serious frontier models in early 2026 and the most confident on visual and structured output. It is also the most likely to produce smooth-but-wrong answers on technical questions outside its strongest zones. None of that disqualifies it, but all of it shapes where you should put it in your stack.
Where OpenAI positions the model
OpenAI's launch material for GPT-5 emphasizes three things you should weight carefully. The first is reasoning quality on math and structured problem-solving (the company reports strong MATH benchmark performance and improved chain-of-thought behavior). The second is multimodal breadth, covering vision, audio, and text in a single model, with vision quality positioned as best-in-class. The third is "conversational warmth," the company's term for how the model handles open-ended interaction.
Those positioning claims tell you where the lab thinks the model will hold up. They are not independent evidence that it will hold up on your workload. Put GPT-5 on a job inside its three positioned strengths and you are paying for what OpenAI built. Put it on a job outside those zones and you are betting against the lab's own design intent. That bet sometimes pays off, but it shouldn't be your default.
The pattern on code: looks right, runs wrong
The most useful public signal for coding is SWE-bench Verified, which scores models on whether they can close real GitHub issues against production repositories. OpenAI's reported figure for GPT-5 is 74.9% on the verified subset (per OpenAI's launch material, cross-checked against the swebench.com leaderboard). Claude Opus 4.7 sits in the high 80s on the same benchmark, per Anthropic's reported figure. That gap of more than ten points tracks a pattern visible across the open community discussion of both models, and it is too wide to write off as benchmark noise.
The pattern: GPT-5 produces code that the compiler accepts and you have to second-guess. The structure looks right, and the function signatures use real APIs in reasonable ways. The bug is usually a misunderstanding of an edge case, an off-by-one in a slicing operation, or a misused concurrency primitive that never surfaces on the happy path. Nothing catastrophic, but all of it needs a careful reader to catch.
If your workflow is "draft fast, review by hand," that pattern is workable. It costs you when the habit is to ship what the model produces with only light review. For production work where correctness matters more than speed, Claude is the safer call. The Opus 4.7 review goes into the architectural-taste call that the rank order reflects.
The compiler isn't the reviewer that matters here. GPT-5's bugs clear the build and wait for a human to notice; Claude tends to flag its own shaky spots first.
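To make the failure shape concrete, here is a constructed illustration of an off-by-one in a slicing operation. It is not captured model output, and the function names are hypothetical. Both versions run without error, and only the second returns every window.

```python
def windows_buggy(items, size):
    """Return contiguous windows of `size` items.

    Looks plausible and raises nothing, but the range stops one step
    early, so the final window is silently dropped.
    """
    return [items[i:i + size] for i in range(len(items) - size)]


def windows_fixed(items, size):
    """Return every contiguous window of `size` items."""
    # There are len(items) - size + 1 valid start positions.
    return [items[i:i + size] for i in range(len(items) - size + 1)]


data = [1, 2, 3, 4]
print(windows_buggy(data, 2))  # [[1, 2], [2, 3]]
print(windows_fixed(data, 2))  # [[1, 2], [2, 3], [3, 4]]
```

A test that only checks the first window, or only checks that the result is a list of lists, passes on both versions. Catching this takes a reviewer or a test that asserts on the last element.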
Writing, design, and conversational breadth
The category where GPT-5 most clearly leads is open-ended writing and design. Ask the model to write in a specific voice — a 1950s detective story, say, or a 19th-century historian — and the first draft captures the tone more reliably than the alternatives. Claude's output can be better after a second pass, but GPT-5's needs less iteration to land somewhere you can ship.
That translates into a working advantage for any task where stylistic flexibility matters more than correctness or hedging: drafting copy in a brand voice nobody has documented, writing narrative prose with a deliberate atmosphere, producing decks where the words carry as much weight as the data, anything in that family. GPT-5 will give you the most useful first attempt of the three frontier models on that kind of work.
The same flexibility shows up across world languages. GPT-5 produces native-feeling output across Spanish, French, Portuguese, German, Italian, Japanese, Korean, and Mandarin, and handles a wider tail of lower-resource languages than the public benchmarks suggest. The exception you should plan around: dialectal Arabic, where the model defaults to Egyptian vocabulary even when the prompt asks for a Gulf dialect. The Arabic content piece covers where each frontier model lands on that specific axis.
Niche technical hallucination
Outside the busy corners of major languages and libraries, GPT-5 will sometimes give you a confident-looking signature that does not exist, or describe the behavior of a deprecated API as if it were current. This is a known pattern across all frontier models. It shows up more on GPT-5 than on Claude because the GPT-5 default tone is confident, and the confidence is not always calibrated to how well-grounded the answer is.
Never trust an API claim you can't verify against the docs. That is good practice for any model, but the missing hedge on GPT-5 is reason to be more disciplined about it here than on Claude, which flags uncertainty more often when it is operating at the edges. If your work involves a lot of small libraries, niche frameworks, or older APIs, that calibration gap is the cost you pay for the speed and breadth GPT-5 brings everywhere else.
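One cheap first check, before opening the docs, is to ask the installed library whether the claimed function and parameters exist at all. The sketch below is a minimal version of that idea using only the standard library. The helper name `check_api_claim` is hypothetical, and it inspects whatever version is installed locally, so it complements the documentation rather than replacing it.

```python
import importlib
import inspect


def check_api_claim(module_name, attr_path, expected_params=()):
    """Check a claimed callable against the locally installed library.

    Returns (exists, missing):
      exists  - False if the module or attribute cannot be found.
      missing - claimed parameter names the real signature does not
                declare, or None if the signature cannot be introspected.
    """
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False, list(expected_params)

    # Walk dotted paths such as "JSONEncoder.encode".
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return False, list(expected_params)
        obj = getattr(obj, part)

    try:
        declared = set(inspect.signature(obj).parameters)
    except (TypeError, ValueError):
        # Some C-implemented callables expose no signature; the name
        # exists, but its parameters have to be checked in the docs.
        return True, None

    return True, [p for p in expected_params if p not in declared]


print(check_api_claim("json", "dumps", ["indent", "sort_keys"]))  # (True, [])
print(check_api_claim("json", "dumps", ["pretty"]))               # (True, ['pretty'])
print(check_api_claim("json", "dump_pretty"))                     # (False, [])
```

This only confirms that a name and its parameters exist. It says nothing about deprecated behavior or semantics, which still need the documentation.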
What it costs
GPT-5 lists at $1.25 per million input tokens and $10 per million output tokens through the OpenAI API, per OpenAI's API pricing. That undercuts both Claude Sonnet 4.6 and Claude Opus 4.7 on a per-token basis. For most working sessions GPT-5 is the cheaper call, which means the decision usually comes down to capability fit rather than the bill. For high-volume work, GPT-5 Mini at $0.25 / $2.00 per million tokens is the fairer comparison; it competes with Sonnet for cost-sensitive workloads. The price-per-use-case piece walks through which tier fits which workload at the per-token level.
| Tier | Input ($/M tokens) | Output ($/M tokens) | Notes |
|---|---|---|---|
| GPT-5 | $1.25 | $10 | Standard frontier tier |
| GPT-5 Mini | $0.25 | $2.00 | Distilled model for volume |
| GPT-5 (Batch API) | $0.625 | $5.00 | ~50% off, 24h turnaround |
The Batch API is consistently underused. For any workload that doesn't need a synchronous response, such as overnight document processing or bulk classification, the discount is money you're leaving on the table. If your pipeline tolerates the 24-hour turnaround and you are not using batch, that is the first lever to pull.
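The arithmetic is simple enough to sketch. The snippet below uses the list prices quoted in this piece and assumes the roughly 50% batch discount applies equally to input and output tokens, and to both tiers. The tier labels are dictionary keys chosen for illustration, not necessarily the API's model identifiers, and the workload size is hypothetical.

```python
# Per-million-token list prices quoted in this article (USD).
PRICES = {
    "gpt-5": {"input": 1.25, "output": 10.00},
    "gpt-5-mini": {"input": 0.25, "output": 2.00},
}

# Assumption: the ~50% batch discount applies to input and output alike.
BATCH_DISCOUNT = 0.5


def estimate_cost(tier, input_tokens, output_tokens, batch=False):
    """Return the estimated USD cost of a workload at list price."""
    rates = PRICES[tier]
    cost = (input_tokens / 1_000_000) * rates["input"]
    cost += (output_tokens / 1_000_000) * rates["output"]
    return cost * (1 - BATCH_DISCOUNT) if batch else cost


# Hypothetical overnight job: 2M input tokens, 0.5M output tokens.
for tier in PRICES:
    sync = estimate_cost(tier, 2_000_000, 500_000)
    batched = estimate_cost(tier, 2_000_000, 500_000, batch=True)
    print(f"{tier}: ${sync:.2f} synchronous, ${batched:.2f} batched")
```

On that hypothetical job, GPT-5 comes to $7.50 synchronous and $3.75 batched, and GPT-5 Mini to $1.50 and $0.75. Output tokens dominate the bill at these rates, so an estimate is only as good as your guess at response length.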
Where GPT-5 earns its keep
GPT-5 is the right pick for you in a few categories.
Visual and design-heavy work is the clearest one: landing pages, presentation decks, marketing layouts. The model's aesthetic defaults are more contemporary and more confident than the alternatives, and the output needs less reshaping to land somewhere shippable. If your work involves a lot of HTML/CSS or layout reasoning, reach for GPT-5.
Next, open-ended writing where stylistic flexibility is the central requirement — voice-matching, atmospheric prose, brand work. GPT-5 captures tone on the first try more reliably than Claude. If your team ships copy more than it ships code, make GPT-5 the default and keep Claude as the backup for the technical content.
And multilingual work outside dialectal Arabic. For languages where Claude has not been specifically tuned, GPT-5 produces output with more polish on the first attempt. When your audience reads a language other than English, it is the safer first pick, with the dialectal Arabic exception called out above.
Where to skip it
Skip GPT-5 on production code review, where catching a subtle bug on the first pass matters more than smoothness. The confident-wrong failure pattern is the worst kind of failure for working software, and Claude's hedging instincts fit the work better.
Skip it, too, on reasoning under uncertainty where honesty is part of what you're paying for — legal review, medical questions, financial analysis. On those workloads, a confident answer that turns out to be wrong does more damage than an honest "not sure."
And skip it on long-context synthesis where the model needs to hold coherence across hundreds of thousands of tokens. GPT-5 is competent in long context, but Claude is better at the cross-section synthesis work that separates a useful long-context response from a recap of one section.
The verdict that has not moved
GPT-5 is the second frontier model worth paying for in 2026. It earns its position by being faster than the alternatives on most workloads, more stylistically flexible, broader across languages, and more visually opinionated. What keeps it out of the top slot is the calibration gap on technical questions, plus a code failure mode — compiles but runs wrong — that's exactly the wrong shape for production work, where a bad answer tends to survive review.
For most teams, the recommendation is the same as it has been since launch coverage: Claude Opus 4.7 as the default for technical work, GPT-5 as the supplementary subscription for the categories where it leads — design, writing, multilingual breadth, structured output. Two API keys and one bill is the working stack benchr keeps coming back to.
If you're forced to pick one for cost reasons, let the work decide. Teams that write more than they code should take GPT-5; teams that code more than they write should take Claude. Most readers sit in the middle, and for them the both-models setup is the answer.