Review·January 2026

GPT-5, reviewed

Name: GPT-5, reviewed
Item: GPT-5
Rating: 4.2
Author: benchr

Where GPT-5 differs from Claude in ways that matter: speed and breadth, plus the cracks that show on niche technical questions.

Updated May 25, 2026 · View changelog · Figures verified against official sources, 30 May 2026

Launched Aug '25 Per OpenAI's launch post

Input cost / 1M $1.25 $10 output, cached input cheaper

SWE-Bench Verified 74.9% Per OpenAI-reported figure

Context 400K Advertised

OpenAI shipped GPT-5 in August 2025, per OpenAI's launch post. GPT-5 was a big step up from GPT-4o, and nobody disputes that. The real question is whether you should still pay for it once Claude Opus 4.7 and the Gemini 3 family are also on the table. For specific work, yes. This piece walks through which work, drawing on the public benchmark record, OpenAI's own positioning, and the comparison patterns already visible in the open between the three frontier labs.

GPT-5 is the most stylistically flexible of the three serious frontier models in early 2026 and the most confident on visual and structured output. It is also the most likely to produce smooth-but-wrong answers on technical questions outside its strongest zones. None of that disqualifies it, but all of it shapes where you should put it in your stack.

Where OpenAI positions the model

OpenAI's launch material for GPT-5 emphasizes three things you should weight carefully. The first is reasoning quality on math and structured problem-solving (the company reports strong MATH benchmark performance and improved chain-of-thought behavior). The second is multimodal breadth, covering vision, audio, and text in a single model, with vision quality positioned as best-in-class. The third is "conversational warmth," the company's term for how the model handles open-ended interaction.

Those positioning claims tell you where the lab thinks the model will hold up. They're not independent verification of your workload. Put GPT-5 on a job inside its three positioned strengths and you are paying for what OpenAI built. Put it on a job outside those zones and you are betting against the lab's own design intent — a bet that sometimes pays off, but shouldn't be your default.

The pattern on code: looks right, runs wrong

The most useful public signal for coding is SWE-bench Verified, which scores models on whether they can close real GitHub issues against production repositories. OpenAI's reported figure for GPT-5 sits around 74.9% on the verified subset (per OpenAI's launch material, cross-checked against the swebench.com leaderboard). Claude Opus 4.7 sits in the high 80s on the same benchmark per Anthropic's reported figure. That gap tracks a pattern visible across the open community discussion of both models, and it is too wide to write off as benchmark noise.

The pattern: GPT-5 produces code that the compiler accepts and you have to second-guess. The structure looks right, and the function signatures use real APIs in reasonable ways. The bug is usually a misunderstanding of an edge case, an off-by-one in a slicing operation, or a misused concurrency primitive that never surfaces on the happy path. Nothing catastrophic, but all of it needs a careful reader to catch.

If your workflow is "draft fast, review by hand," that pattern is workable. It costs you when the habit is to ship what the model produces with only light review. For production work where correctness matters more than speed, Claude is the safer call. The Opus 4.7 review goes into the architectural-taste call that the rank order reflects.

The compiler isn't the reviewer that matters here. GPT-5's bugs clear the build and wait for a human to notice; Claude tends to flag its own shaky spots first.

Writing, design, and conversational breadth

The category where GPT-5 most clearly leads is open-ended writing and design. Ask the model to write in a specific voice — a 1950s detective story, say, or a 19th-century historian — and the first draft captures the tone more reliably than the alternatives. Claude's output can be better after a second pass, but GPT-5's needs less iteration to land somewhere you can ship.

That translates into a working advantage for any task where stylistic flexibility matters more than correctness or hedging: drafting copy in a brand voice nobody has documented, writing narrative prose with a deliberate atmosphere, producing decks where the words carry as much weight as the data, anything in that family. GPT-5 will give you the most useful first attempt of the three frontier models on that kind of work.

The same flexibility shows up across world languages. GPT-5 produces native-feeling output across Spanish, French, Portuguese, German, Italian, Japanese, Korean, and Mandarin, and handles a wider tail of lower-resource languages than the public benchmarks suggest. The exception you should plan around: dialectal Arabic, where the model defaults to Egyptian vocabulary even when the prompt asks for a Gulf dialect. The Arabic content piece covers where each frontier model lands on that specific axis.

Niche technical hallucination

Outside the busy corners of major languages and libraries, GPT-5 will sometimes give you a confident-looking signature that does not exist, or describe the behavior of a deprecated API as if it were current. This is a known pattern across all frontier models. It shows up more on GPT-5 than on Claude because the GPT-5 default tone is confident, and the confidence is not always calibrated to how well-grounded the answer is.

Never trust an API claim you can't verify against the docs. That is good practice for any model, but the missing hedge on GPT-5 is reason to be more disciplined about it here than on Claude, which flags uncertainty more often when it is operating at the edges. If your work involves a lot of small libraries, niche frameworks, or older APIs, that calibration gap is the cost you pay for the speed and breadth GPT-5 brings everywhere else.

What it costs

GPT-5 lists at $1.25 per million input tokens and $10 per million output tokens through the OpenAI API, per OpenAI's API pricing. That undercuts both Claude Sonnet 4.6 and Claude Opus 4.7 on a per-token basis. For most working sessions GPT-5 is the cheaper call, which means the decision usually comes down to capability fit rather than the bill. For high-volume work, GPT-5 Mini at $0.25 / $2.00 per million tokens is the fairer comparison; it competes with Sonnet for cost-sensitive workloads. The price-per-use-case piece walks through which tier fits which workload at the per-token level.

OpenAI GPT-5 tier pricing, January 2026, per OpenAI API pricing
Tier	Input ($/M tokens)	Output ($/M tokens)	Notes
GPT-5	$1.25	$10	Standard frontier tier
GPT-5 Mini	$0.25	$2.00	Distilled model for volume
GPT-5 (Batch API)	$5	$25	~50% off, 24h turnaround

The Batch API is consistently underused. For any workload that doesn't need a synchronous response — overnight document processing or bulk classification, for instance — the discount is money you're leaving on the floor. If your pipeline tolerates the 24-hour turnaround and you are not using batch, that is the first lever to pull.

Where GPT-5 earns its keep

A few categories where GPT-5 is the right pick for you.

Visual and design-heavy work is the clearest one: landing pages, presentation decks, marketing layouts. The model's aesthetic defaults are more contemporary and more confident than the alternatives, and the output needs less reshaping to land somewhere shippable. If your work involves a lot of HTML/CSS or layout reasoning, reach for GPT-5.

Next, open-ended writing where stylistic flexibility is the central requirement — voice-matching, atmospheric prose, brand work. GPT-5 captures tone on the first try more reliably than Claude. If your team ships copy more than it ships code, make GPT-5 the default and keep Claude as the backup for the technical content.

And multilingual work outside Arabic. For languages where Claude has not been specifically tuned, GPT-5 produces output with more polish on the first attempt. Anywhere your reader is reading a language other than English, it is the safer first pick, with the Arabic exception called out above.

Where to skip it

Skip GPT-5 on production code review, where catching a subtle bug on the first pass matters more than smoothness. The confident-wrong failure pattern is the worst kind of failure for working software, and Claude's hedging instincts fit the work better.

Skip it, too, on reasoning under uncertainty where honesty is part of what you're paying for — legal review, medical questions, financial analysis. On those workloads, a confident answer that turns out to be wrong does more damage than an honest "not sure."

And skip it on long-context synthesis where the model needs to hold coherence across hundreds of thousands of tokens. GPT-5 is competent in long context, but Claude is better at the cross-section synthesis work that separates a useful long-context response from a recap of one section.

OpenAI's flagship release cadence, 2022 through 2025. Roughly one major model per year, with the gap from GPT-4o to GPT-5 the longest.

The verdict that has not moved

GPT-5 is the second frontier model worth paying for in 2026. It earns its position by being faster than the alternatives on most workloads, more stylistically flexible, broader across languages, and more visually opinionated. What keeps it out of the top slot is the calibration gap on technical questions, plus a code failure mode — compiles but runs wrong — that's exactly the wrong shape for production work, where a bad answer tends to survive review.

For most teams, the recommendation is the same as it has been since launch coverage: Claude Opus 4.7 as the default for technical work, GPT-5 as the supplementary subscription for the categories where it leads — design, writing, multilingual breadth, structured output. Two API keys and one bill is the working stack benchr keeps coming back to.

If you're forced to pick one for cost reasons, let the work decide. Teams that write more than they code should take GPT-5; teams that code more than they write should take Claude. Most readers sit in the middle, and for them the both-models setup is the answer.

Frequently asked

Is GPT-5 better than Claude Opus 4.7?

Not overall. OpenAI positions GPT-5 on visual design, structured output, and breadth. Anthropic positions Claude Opus on coding, reasoning under uncertainty, and long-document analysis. The public leaderboards (SWE-bench Verified, LMArena) are consistent with both positioning claims.

How much does GPT-5 cost?

$1.25 per million input tokens and $10 per million output. GPT-5 Mini drops to $0.25 / $2.00 for high-volume work. The Batch API offers roughly half off with a 24-hour turnaround.

What's the context window on GPT-5?

400K tokens advertised. As with all long-context models, retrieval is more reliable than one-shot summarization across the full window. Verify on your workload before relying on it past the middle of the range.

When should you choose GPT-5 over the alternatives?

Visual design tasks (landing pages, layouts), structured output (JSON, XML), math-heavy reasoning, and conversational English where stylistic flexibility matters. Pick something else when calibrated hedging on technical questions is part of the value.

Does GPT-5 hallucinate?

On niche technical questions, the default tone is confident even when the answer is not well-grounded. That is a known pattern across all frontier models, but it shows up more on GPT-5 than on Claude. Verify API signatures and citations against primary sources.

Changelog

May 25, 2026 — Rewrote sections that previously narrated original lab tests; the article now grounds its verdict in published benchmarks, OpenAI's positioning, and the public leaderboard record. Pricing verified against current OpenAI documentation.
January 22, 2026 — Re-verified pricing against the OpenAI pricing page.
January 8, 2026 — Updated throughput notes after OpenAI's January API update.
January 4, 2026 — Originally published.

References

OpenAI, "API Documentation," platform.openai.com/docs, accessed May 2026.
OpenAI, "API Pricing," openai.com/api/pricing, accessed May 2026.
"Chatbot Arena leaderboard," lmarena.ai, May 2026 snapshot.
OpenAI, "Introducing GPT-5," openai.com/index/introducing-gpt-5, August 2025.
"SWE-bench Verified leaderboard," swebench.com, May 2026.