Review·May 2026

Gemini 3.1 Pro, reviewed

Name: Gemini 3.1 Pro, reviewed
Item: Gemini 3.1 Pro
Rating: 4.4
Author: benchr

The reasoning leap is real. What to watch is the long-context bill and how fast a 3.5 Pro could supersede it.

Updated May 30, 2026 · View changelog · Figures verified against official sources, 30 May 2026

Skip the hype cycle and look at one number. On ARC-AGI-2, a test built to resist memorization, Gemini 3 Pro scored 31.1%. About three months later, Gemini 3.1 Pro scores 77.1% on the same benchmark, a result Google reports and ARC Prize has verified. That is not a tuning bump. It is more than double, and it lands alongside a GPQA Diamond score in the mid-90s. The reasoning leap is real. What to watch is the long-context bill, and how fast this model may be superseded.

Both of those caveats are worth taking seriously before you wire 3.1 Pro into anything load-bearing. The model is still labeled preview, its pricing has a cliff that long agent runs will fall off of, and Google has already shown its hand on what comes next. None of that erases the capability gain. It just means the buying decision is about timing as much as quality.

77.1% ARC-AGI-2, up from 31.1% on Gemini 3 Pro about three months earlier. Google-reported, verified by ARC Prize.

A reasoning jump you can measure

ARC-AGI-2 is the headline because it is the hardest benchmark to fake. The tasks are abstract visual puzzles designed so that pattern-matching from training data doesn't carry you. A model has to reason about novel structure on the spot. Going from 31.1% to 77.1% on that test, in a single point release, is the kind of move that usually takes a full generation. ARC Prize verifying the number matters too, because it means this isn't a vendor figure standing alone.

The supporting scores point the same way. GPQA Diamond, a graduate-level science question set, comes in around 94.3% with no tools. Humanity's Last Exam lands near 44.4% without tools, and SWE-bench Verified around 80.6%. Two honest flags on those three: they are "no tools" numbers where noted, which is the harder condition, and unlike ARC-AGI-2 and the pricing, they are sourced from reports citing Google's model card rather than read off the official card directly. Treat them as vendor-reported and strong, not independently audited.

Gemini 3.1 Pro reasoning benchmarks vs Gemini 3 Pro, per Google and ARC Prize
Benchmark	Gemini 3 Pro	Gemini 3.1 Pro	Source note
ARC-AGI-2	31.1%	77.1%	Google-reported, ARC Prize verified
GPQA Diamond (no tools)	—	94.3%	Vendor-reported via model-card citation
Humanity's Last Exam (no tools)	—	44.4%	Vendor-reported via model-card citation
SWE-bench Verified	—	80.6%	Vendor-reported via model-card citation

One thing this model is not is chatty. It would be easy to assume that bigger reasoning means longer answers, but Google's own framing is that 3.1 Pro uses fewer output tokens than Gemini 3 Pro Preview while delivering more reliable results. Output efficiency is positioned as improved. That matters at the $12-to-$18 output rates, because the cost risk here doesn't come from the model rambling. It comes from the pricing structure.

The pricing cliff at 200K tokens

Gemini 3.1 Pro is priced in two tiers by prompt size, per Google's Gemini API pricing page. Up to 200K input tokens, standard rates are $2 input and $12 output per million. Above 200K, they step up to $4 input and $18 output. Batch jobs run cheaper, $1/$6 under the line and $2/$9 over it. So far that reads like normal long-context pricing.

The detail that bites is what triggers the higher rate, and what it applies to. Once your total input context crosses 200K tokens, the entire request gets billed at the long-context rate, input and output both. Not just the tokens past 200K. The whole thing. A run that would have cost $12 per million output at 199K input costs $18 per million output at 201K, on every output token it generates.

This is the single most useful thing to internalize before you deploy. A 1M-token window invites you to fill it, and the pricing punishes exactly that habit. Anyone running agentic workloads should watch input size like a budget line. The practical fixes are the usual ones, and they pay off harder here than on flat-priced models: trim context aggressively and lean on retrieval rather than full-document stuffing. benchr's guide to cutting token usage covers the mechanics, and price per use case works through when a tiered model like this beats a flat one for your specific workload.

For a model whose marketing leans on the million-token window, the cliff deserves a clear-eyed read. A big context window is a capability, not a license to use all of it on every call. The gap between what the window allows and what the bill rewards is a recurring theme, and one benchr has written about in the context of how million-token claims get marketed.

What it's good for right now

The clean fit for Gemini 3.1 Pro is hard reasoning work that fits comfortably under 200K input tokens. Graduate-level science questions, abstract problem solving, the kind of task where ARC-AGI-2 and GPQA Diamond are decent proxies. On those, this is among the strongest models available in May 2026, and the output-efficiency gain means you're not paying a verbosity tax on top of the reasoning. The 1M context window is there when you need to reach for it, with the pricing caveat front of mind.

Availability is broad for a preview. It runs through the Gemini API in Google AI Studio, plus Android Studio, Google Antigravity, and the Gemini CLI, and it's in preview on Vertex AI and Gemini Enterprise. In the consumer Gemini app, the higher limits roll out to Google AI Pro and Ultra subscribers, and it shows up in NotebookLM for those tiers. Free access exists through AI Studio under free-tier rate limits and in the consumer app, but the standout limits are gated to paid plans. So it's reachable cheaply for trials, just not unlimited for free.

Why timing is the catch

Two facts should temper how permanently you commit to this model. First, it is still a preview. The API model ID is gemini-3.1-pro-preview, it carried that label through late May 2026, and there's no confirmed general-availability date. Preview models can change behavior or pricing before they lock in, so building a production default on one is a calculated bet.

Second, Google has already shown what's next, and it isn't far off. At Google I/O on May 19, 2026, Google launched Gemini 3.5 Flash to general availability and said Gemini 3.5 Pro is in testing and arriving the following month. Google now positions the newer 3.5 Flash as beating 3.1 Pro on coding, agentic, and multimodal benchmarks. On the leaderboard front, GPT-5.5 is reported to lead ARC-AGI-2 at around 85%, above 3.1 Pro's 77.1%. So 3.1 Pro's time at the top of Google's lineup may be measured in weeks, not quarters.

A genuine capability leap and a short shelf life are not a contradiction. They're the normal shape of a fast-moving release schedule.

The verdict

Gemini 3.1 Pro earns its score on reasoning. The ARC-AGI-2 jump to a verified 77.1% is one of the cleaner capability gains of the year, the GPQA Diamond number backs it up, and the output-token efficiency means the quality doesn't arrive bundled with bloat. Go with it for hard reasoning tasks that stay under 200K input tokens, where the standard $2/$12 rate applies and the model is at its strongest.

Plan around the pricing cliff if your work is long-context or agentic, because crossing 200K input reprices the entire request and the bill climbs faster than you'd guess from the window size. And skip pinning it as a permanent default unless you're comfortable on a preview model that Google's own roadmap is about to leapfrog. A 3.5 Pro is imminent, 3.5 Flash already beats it on coding and agentic work, and GPT-5.5 sits ahead on the ARC-AGI-2 leaderboard. Stick with 3.1 Pro when the reasoning gain is the thing you need today and the prompts stay short; revisit the moment 3.5 Pro ships. For a direct benchmark scorecard between these two models, see Gemini 3.1 Pro vs GPT-5.5.

Frequently asked

Is Gemini 3.1 Pro's reasoning jump over Gemini 3 Pro real?

Yes. ARC-AGI-2 went from 31.1% on Gemini 3 Pro to 77.1% on Gemini 3.1 Pro in about three months, a result Google reports and ARC Prize has verified. It pairs that with a top-tier GPQA Diamond score around 94.3% with no tools, though that figure is sourced from Google-citing third parties rather than read off the official model card directly.

How much does Gemini 3.1 Pro cost?

Pricing is tiered by prompt size, per Google's Gemini API pricing page. Standard rates are $2 input and $12 output per million tokens for prompts up to 200K tokens, rising to $4 input and $18 output above 200K. Batch is $1/$6 under 200K and $2/$9 above. The catch: once total input crosses 200K, the entire request including output is billed at the higher long-context rate.

Is Gemini 3.1 Pro verbose?

No. Google's framing is the opposite. It states that Gemini 3.1 Pro uses fewer output tokens than Gemini 3 Pro Preview while delivering more reliable results, so output efficiency is positioned as improved. The cost risk is not verbosity; it is the long-context pricing tier, where a single prompt over 200K input pushes the whole request to the higher output rate.

Is Gemini 3.1 Pro a finished, generally available model?

Not yet. The API model ID is gemini-3.1-pro-preview, and it was still labeled preview as of late May 2026. There is no confirmed general-availability date. Treat it as a strong but provisional release rather than a locked-in long-term default.

Could Gemini 3.1 Pro be superseded soon?

Likely. At Google I/O on May 19, 2026 Google launched Gemini 3.5 Flash and said Gemini 3.5 Pro is in testing and arriving next month. Google already positions the newer 3.5 Flash as beating 3.1 Pro on coding, agentic, and multimodal benchmarks. On current leaderboards, GPT-5.5 is also reported to lead ARC-AGI-2 at around 85%, above 3.1 Pro's 77.1%.

Changelog

May 30, 2026 — Originally published. ARC-AGI-2 77.1%, the 1M context window, tiered pricing, and the gemini-3.1-pro-preview model ID verified against Google's Gemini API pricing page, Gemini API changelog, and Google's verified announcement (ARC-AGI-2 cross-checked with ARC Prize). GPQA Diamond, HLE, and SWE-bench figures labeled vendor-reported, sourced from reports citing Google's model card.

References

Google, "Gemini API pricing," ai.google.dev/gemini-api/docs/pricing, accessed May 2026.
Google, "Gemini API changelog," ai.google.dev/gemini-api/docs/changelog, accessed May 2026.
Google Cloud, "Gemini 3.1 Pro on Gemini CLI, Gemini Enterprise, and Vertex AI," cloud.google.com, Feb 2026.
Google, "Gemini 3.1 Pro," blog.google, Feb 2026.
Google DeepMind, "Gemini 3.1 Pro model card," deepmind.google, accessed May 2026.
Constellation Research, "Google launches Gemini 3.1 Pro," constellationr.com, Feb 2026.
ARC Prize, "Leaderboard," arcprize.org/leaderboard, accessed May 2026.
9to5Google, "Google I/O 2026 news," 9to5google.com, May 19, 2026.