The question with GPT-5.5 isn't whether it's good. It's whether moving off GPT-5 earns its keep, because OpenAI raised the API price significantly to get there. GPT-5.5 lists at $5 input and $30 output per million tokens; GPT-5 lists at $1.25 input and $10 output, per official OpenAI model docs. That puts GPT-5.5 at roughly 4× GPT-5's input price and 3× its output price. That's a real line-item change for anyone running volume, and it reframes the decision: not "is GPT-5.5 better" but "is it better on the work you do day to day, by enough to justify that price difference."
The honest read is that the upgrade is narrow and aimed. OpenAI concentrated it where models have been weakest and where the money is heading, agentic coding and computer use: the model writing and debugging code, operating software, and grinding through multi-step tool use until a task is finished. If that's your workload, the case is strong. If it isn't, the case gets thin fast. benchr's earlier GPT-5 review lands on a both-models stack for most teams; GPT-5.5 doesn't overturn that so much as sharpen where the OpenAI key earns its slot.
What changed under the hood
OpenAI pitches GPT-5.5 as "a new class of intelligence" for agentic coding and professional knowledge work. Strip the marketing and the concrete claim is this: the model is better at staying on a task across many steps, reading output, deciding the next action, and not losing the thread halfway through a long tool-use loop. That's the failure mode that has kept agents from being trustworthy, and it's the axis where a small win compounds across a session.
The two headline numbers back the positioning, with a caveat worth stating up front. On Terminal-Bench 2.0, which scores command-line agent tasks, OpenAI reports 82.7%. On OSWorld-Verified, the computer-use benchmark where a model drives a real desktop environment, it reports 78.7%. Both are OpenAI's own evals, relayed by third-party write-ups citing the launch announcement rather than read off a neutral leaderboard. So weight them as vendor figures: directionally useful, not independently confirmed.
| Benchmark | GPT-5.5 | GPT-5 |
|---|---|---|
| Terminal-Bench 2.0 (agentic coding / terminal) | 82.7% | Not reported on this version |
| OSWorld-Verified (computer use) | 78.7% | Not reported on this version |
| SWE-bench Pro | 58.6% | Not reported on this version |
A note on what that table can and can't show. The agentic and computer-use benchmarks here track the version of the test OpenAI ran for the GPT-5.5 launch; matching GPT-5 figures on the same test versions weren't part of the verified record, so the comparison rows are blank rather than guessed. The SWE-bench Pro figure of 58.6% is reporter-relayed and shows up mostly in competitor comparisons, not as a headline OpenAI metric, so read it as a rough placement, not a clean head-to-head. There's also a widely circulated SWE-bench Verified figure of 88.7% floating around third-party leaderboards; it isn't confirmed in OpenAI's announcement, so it's left out here on purpose.
Price up 3× to 4×. Did the value?
This is the crux. At $5 / $30 per million tokens, GPT-5.5 runs roughly 4× GPT-5's input price and 3× its output price. OpenAI describes GPT-5.5 as more token-efficient than its predecessors, so the effective cost on a given task doesn't scale up proportionally with the sticker. How much that helps depends entirely on your workload, though, and it's the first thing to measure before you switch.
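To make the sticker-versus-effective-cost point concrete, here is a minimal sketch of the per-task arithmetic at the list prices quoted in this review. The token counts are hypothetical, the helper names are invented for illustration, and the break-even calculation assumes input and output tokens shrink by the same factor, which a real workload may not do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """List price in USD per million tokens."""
    input_per_m: float
    output_per_m: float


# List prices as quoted in this review.
GPT_5 = Price(input_per_m=1.25, output_per_m=10.00)
GPT_5_5 = Price(input_per_m=5.00, output_per_m=30.00)


def task_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one task at list price, ignoring caching and batch discounts."""
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


def break_even_token_ratio(old: Price, new: Price,
                           input_tokens: int, output_tokens: int) -> float:
    """Fraction of the old token count the new model can use before it costs more.

    Assumption: input and output tokens shrink by the same factor.
    """
    return (task_cost(old, input_tokens, output_tokens)
            / task_cost(new, input_tokens, output_tokens))


if __name__ == "__main__":
    # Hypothetical agent task: 200k input tokens, 20k output tokens.
    print(task_cost(GPT_5, 200_000, 20_000))
    print(task_cost(GPT_5_5, 200_000, 20_000))
    print(break_even_token_ratio(GPT_5, GPT_5_5, 200_000, 20_000))
```

For that hypothetical task the list-price cost is $0.45 on GPT-5 and $1.60 on GPT-5.5, so token efficiency alone would have to cut usage to about 28% of the GPT-5 figure to break even. Swap in token counts from your own logs before drawing any conclusion.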
For agent work, the math can favor the upgrade even at the higher rate. A model that finishes a multi-step task in one clean run can cost less overall than a lower-priced model that stalls, backtracks, and burns a second full attempt. The retry tax is where agent budgets quietly bleed out. If GPT-5.5's steadier long-loop behavior cuts your failed-run rate, the per-token premium can come out ahead on the invoice that matters, the one for completed work. benchr's GPT-5 versus Claude Opus comparison walks through how those task-level economics swing the call between frontier models, and the same logic applies inside the OpenAI lineup.
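The retry tax can be put into numbers with a simple model. The sketch below assumes every attempt costs the same and succeeds independently with a fixed probability, and that you retry until the task completes, so the expected cost per completed task is the attempt cost divided by the success rate. All dollar figures and success rates are hypothetical placeholders, not measurements of either model.

```python
def expected_cost_per_completion(attempt_cost: float, success_rate: float) -> float:
    """Expected USD spent per completed task when retrying until success.

    Assumption: attempts are independent and each succeeds with `success_rate`,
    so the expected number of attempts is 1 / success_rate.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return attempt_cost / success_rate


def break_even_success_rate(old_cost: float, old_rate: float, new_cost: float) -> float:
    """Success rate the pricier model needs to match the cheaper one per completed task.

    A result above 1.0 means retries alone cannot justify the premium.
    """
    if not 0.0 < old_rate <= 1.0:
        raise ValueError("old_rate must be in (0, 1]")
    return old_rate * new_cost / old_cost


if __name__ == "__main__":
    # Hypothetical: the cheaper model costs $0.50 per attempt and completes 25% of runs;
    # the pricier model costs $1.50 per attempt.
    print(expected_cost_per_completion(0.50, 0.25))   # cheaper model, per completed task
    print(break_even_success_rate(0.50, 0.25, 1.50))  # rate the pricier model must reach
    print(expected_cost_per_completion(1.50, 0.90))   # pricier model at a 90% success rate
```

With those placeholder inputs the cheaper model averages $2.00 per completed task, so the pricier one breaks even at a 75% success rate and comes out ahead above it. The same formula also shows the limit of the argument: if the cheaper model already completes most runs, the break-even rate climbs past 100% and the premium has to be justified some other way, such as lower token use per attempt.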
For everything that isn't agentic, the premium is harder to defend. Chat, short prompts, classification, single-shot drafting: none of it leans on the long-loop strength GPT-5.5 was built around, so you'd be paying the agentic-coding tax on work that doesn't use it. For that profile, GPT-5 stays the better-value pick, and the cached-input rate of $0.50 per million on GPT-5.5 only matters if you're reusing large fixed contexts. If your bill is sensitive, benchr's guide to cutting token usage moves the needle more than the model swap will.
Two "5.5" models you must not confuse
One trap is worth flagging because it's easy to fall into. There are two separate releases wearing the 5.5 badge. The first is the flagship, GPT-5.5 and the higher-end GPT-5.5 Pro, announced April 23, 2026; it's the high-end coding and pro-work model this review covers, priced at $5 / $30 (Pro at $30 / $180). The second is GPT-5.5 Instant, released May 5, 2026 as the new ChatGPT default that replaced GPT-5.3 Instant. Instant is a fast everyday chat model, exposed in the API as "chat-latest"; paid users keep GPT-5.3 Instant available for roughly three months.
The reason this matters for an upgrade decision: if you're a ChatGPT user, you may already be on GPT-5.5 Instant by default without ever touching the flagship. Seeing "5.5" in your chat client doesn't mean you're running the model whose Terminal-Bench and OSWorld scores headline this page. The flagship lives in the API and in Codex, and has rolled out to paid ChatGPT tiers (Plus, Pro, Business, Enterprise), with GPT-5.5 Pro limited to Pro, Business, and Enterprise.
Who should move, and who should wait
Go with GPT-5.5 if you're building or running agents: a model in a terminal loop, a coding agent closing tickets across a repo, or a computer-use setup driving software through a multi-step job. That's the whole point of the release, and it's where the vendor numbers and the design intent line up. Pair it with the Batch tier ($2.50 / $15.00) for any non-interactive agent run that tolerates a delayed return, and the premium gets easier to swallow.
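As a rough illustration of what the Batch tier does to a non-interactive workload, the sketch below prices the same job at the standard and Batch rates quoted in this review. The job size is a hypothetical example, and `job_cost` is an illustrative helper, not part of any OpenAI SDK.

```python
# (input, output) USD per million tokens for GPT-5.5, as quoted in this review.
STANDARD_RATES = (5.00, 30.00)
BATCH_RATES = (2.50, 15.00)


def job_cost(rates: tuple[float, float], runs: int,
             input_tokens_per_run: int, output_tokens_per_run: int) -> float:
    """USD cost of `runs` identical agent runs at the given per-million rates."""
    input_rate, output_rate = rates
    per_run = input_tokens_per_run * input_rate + output_tokens_per_run * output_rate
    return runs * per_run / 1_000_000


if __name__ == "__main__":
    # Hypothetical nightly job: 500 runs, 100k input and 10k output tokens each.
    standard = job_cost(STANDARD_RATES, 500, 100_000, 10_000)
    batch = job_cost(BATCH_RATES, 500, 100_000, 10_000)
    print(standard, batch, standard - batch)
```

For that hypothetical job the standard tier comes to $400 and the Batch tier to $200. Since both Batch rates are exactly half the standard ones, the saving is 50% whatever the input/output mix, provided the run can wait for a delayed return.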
Reach for GPT-5.5 Pro only when a task is hard enough that a higher success rate is worth $30 / $180 per million: gnarly debugging, high-stakes multi-step work where a second failed attempt costs more than the token premium. For most teams it's a specialist tool you call deliberately, not a default.
Stick with GPT-5 when your work is conversational, short-form, or single-shot, where GPT-5.5's long-loop edge never gets exercised and you'd be paying three to four times the per-token price for headroom you don't touch. And if you're weighing GPT-5.5 against the other frontier labs rather than against its own predecessor, the agentic-coding race is a close one right now. benchr's Claude Opus 4.8 review covers the strongest competing position on terminal and computer-use work, and the two are tight enough that you should test both on your own tasks before committing a quarter's worth of agent spend to either.
The verdict
GPT-5.5 is the cleanest agentic-coding and computer-use model OpenAI has shipped, and the vendor benchmarks point the same direction the price does, toward agents. As a targeted upgrade it earns a high mark. As a blanket "replace GPT-5 everywhere" move, it doesn't make the case, because the gains don't show up on work that isn't multi-step tool use, and the per-token price rises 3× to 4× regardless.
The buying advice is simple. Route your agent and coding-loop traffic to GPT-5.5, keep cheaper-per-token GPT-5 for chat and single-shot work, and reserve GPT-5.5 Pro for the handful of tasks where a higher hit rate beats a bigger invoice. Measure the failed-run rate before and after the switch; that number, not the leaderboard, tells you whether the upgrade paid for itself.
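One way to express that routing rule and the before-and-after measurement is sketched below. The `Task` fields, the `route` function, and the tier labels are illustrative assumptions, not OpenAI API model identifiers, and how you classify a task as multi-step or high-stakes is left to your own pipeline.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Display labels only; these are not API model identifiers."""
    GPT_5 = "GPT-5"
    GPT_5_5 = "GPT-5.5"
    GPT_5_5_PRO = "GPT-5.5 Pro"


@dataclass(frozen=True)
class Task:
    multi_step_tool_use: bool  # agent loop, coding loop, or computer use
    high_stakes: bool = False  # a failed attempt costs more than the token premium


def route(task: Task) -> Tier:
    """Apply the review's buying advice as a simple rule."""
    if not task.multi_step_tool_use:
        return Tier.GPT_5          # chat and single-shot work stay on the cheaper model
    if task.high_stakes:
        return Tier.GPT_5_5_PRO    # reserved for the few tasks where hit rate dominates
    return Tier.GPT_5_5


def failed_run_rate(outcomes: list[bool]) -> float:
    """Share of runs that did not complete; each entry is True for a completed run."""
    if not outcomes:
        raise ValueError("need at least one run")
    return outcomes.count(False) / len(outcomes)


if __name__ == "__main__":
    print(route(Task(multi_step_tool_use=True)).value)
    # Hypothetical run logs from before and after a switch.
    before = [True, False, False, True]
    after = [True, True, False, True]
    print(failed_run_rate(before), failed_run_rate(after))
```

Compare the two rates on the same task mix and over enough runs that the difference isn't noise; a handful of runs like the placeholder lists above is far too few to act on.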