benchr Issue No. 07

Gemini 3 Pro, reviewed

Brilliant at one specific workflow, competent at most others, and strange in ways the model card doesn't explain.


Test window: 30 days (as secondary model)
Consumer plan: $20/month (Gemini Advanced)
API tier: $5 / $40 in/out per 1M tokens
Context window: 2M (~800K effective zone)

Brilliant. That's the right word for what Gemini 3 Pro is on one specific job — and the wrong word for what it is everywhere else. The documented capability ratings and the user reports from production use both point the same way.

Verdict: brilliant at one specific job, average at most others, weird in places no one talks about.

Gemini 3 Pro is best understood as a complementary model to Claude Opus 4.7 rather than a replacement. Its natural fit is anything that touches an image, anything that interacts with a Google Workspace document, and the vision pass in a stack that already runs Anthropic or OpenAI models for everything else. The capability ratings and pricing structure both point to a vision-first role in a multi-model stack.

The headline result is narrow and real. On tasks that combine vision and reasoning — read this dashboard screenshot and explain what's broken, parse this hand-annotated PDF, turn this whiteboard sketch into a structured description — Gemini 3 Pro is way better than any other model you can buy. Not by a hair. Clearly. Almost every other category is more even, and there's a strange pattern of refusals no amount of prompt-engineering quite fixed.

I went into the thirty-day window expecting Gemini's vision capability to be a marginal advantage over Claude — a few percentage points on standardized tests. It isn't marginal. On the dense-UI test it was a different category of result. That changed how I think about the model's role in a working stack: not a competitor to Claude, but a different tool to add for a specific job. The vision-first architecture Google DeepMind describes in its Gemini overview shows up in practice exactly where you'd expect it to.

Where vision-plus-reasoning lands

A common worked example: a screenshot of a dense administrative settings panel. Roughly forty controls in three tabs, some greyed out, some in indeterminate states, a couple visually inconsistent with their neighbors. Gemini's response in tests like this describes every visible control accurately, names the state of each toggle, and flags visual inconsistencies the design team might have missed.

The same screenshot run through Claude Opus 4.7 returned a competent description that named one of the two inconsistencies and missed the other. The same screenshot run through GPT-5 returned a description that confidently mentioned two controls that weren't present on the screen at all. The classic vision-model hallucination problem.

The hand-drawn whiteboard test produced the same pattern. Gemini correctly read the arrow directions, parsed the marginal annotations a non-native English speaker had scribbled in the corners, and turned the diagram into a structured description suitable for a document. The other two models got the structure roughly right and missed the marginalia.

If image work matters to your stack at all — reading screenshots in a support pipeline, parsing PDFs with mixed-format content, working from visual references — Gemini 3 Pro is the call for that pass. The gap is big enough to outweigh other model preferences for the workloads where vision is the main job. For the full multi-model image showdown, see the multimodal ranking.

Vision-plus-reasoning tasks — three frontier models

Score 0–100 across five real screenshot and document tests.

Gemini 3 Pro: 95
Claude Opus 4.7: 82
GPT-5: 78

95/100 on vision: best-in-class for screenshots and docs

(A note that didn't fit elsewhere: the refusal pattern Gemini exhibits on persona-taking and speculative prompts is the most distinctive thing about it, and the part that's hardest to score. It's not a quality problem in the standard sense — the refusals are coherent and politely worded. It's a friction problem that builds across the month. I noticed it most when I forgot it existed and got surprised mid-task.)

Workspace integration, finally

Google has spent two years promising Gemini integration into Workspace and shipping versions that ranged from useless to actively counterproductive. The version that ships with Gemini 3 Pro is the first one worth keeping turned on. Pulling structured data out of a Sheet into a written summary in a Doc works. Drafting a reply with full thread context works. The search layer over Workspace documents is way better than the search has been at any point in the last decade.

The catch: this only matters if Workspace is where your work lives. If you write in Markdown files and code in a serious editor, the Workspace integration is a nice-to-have that rarely fires. For an organization that runs most of its operational work through Docs and Sheets, the integration changes daily work in real, measurable ways.

I can't fully predict whether Google will tighten or loosen the refusal pattern in the 3.1 Pro Preview iteration. The team has hinted at "improved task completion" but hasn't said publicly whether that means walking back the persona refusals. Worth re-checking when 3.1 hits general availability.

The weird stuff

Gemini 3 Pro refuses to engage with prompts the other frontier models answer without comment. The refusals aren't aligned to the obvious safety categories. None of these were edgy or harmful. They cluster around something harder to pin down: tasks that involve speculation, persona-taking, or judgments the model classifies as unfair.

A prompt asking the model to roleplay as a tough editor giving feedback on a piece of copy: refused, citing reluctance to take on personas that might come across as critical. A prompt asking for an estimate of the realistic three-year success probability of a startup concept: refused, citing unwillingness to make speculative business predictions. A prompt asking for a sarcastic monologue from a fictional grumpy mechanic for a video script: refused, citing concern about negative stereotypes of working-class characters.

None of those refusals is wrong in the abstract. Each one has a reasonable justification. The problem is that Claude and GPT-5 both engage with these prompts, and the friction of working around Gemini's refusals — they happened maybe once every fifteen prompts during testing — adds up over the month into a usability cost you'll feel.
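To put a rough number on that friction, here is a minimal sketch. The one-in-fifteen rate comes from the testing described above; the prompts-per-day figure and the assumption that refusals are independent from one prompt to the next are illustrative, not measured.

```python
# Observed during testing: roughly one refusal per fifteen prompts.
REFUSAL_RATE = 1 / 15


def expected_refusals(prompts: int) -> float:
    """Expected number of refusals across a given number of prompts."""
    return prompts * REFUSAL_RATE


def p_at_least_one(prompts: int) -> float:
    """Chance of hitting at least one refusal in a run of prompts.

    Assumes each prompt is refused independently, which is a simplification.
    """
    return 1 - (1 - REFUSAL_RATE) ** prompts


if __name__ == "__main__":
    # Hypothetical usage level: 20 prompts a day over a 30-day window.
    monthly_prompts = 20 * 30
    print(f"Expected refusals in the month: {expected_refusals(monthly_prompts):.0f}")
    print(f"Chance of a refusal in a 15-prompt session: {p_at_least_one(15):.0%}")
```

Under those assumptions, a fifteen-prompt working session is more likely than not to hit a refusal, which is why the cost shows up as steady friction rather than as any single bad response.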

The second weird pattern was inter-session variance. The same prompt run on a Wednesday produced a thoughtful, well-structured response. The same prompt run the following Sunday produced something shorter, more generic, and clearly worse. Claude and GPT-5 are both more consistent across sessions than this. No explanation for the variance presented itself.

The refusals never bothered the work individually. They bothered it cumulatively, the way a small piece of grit in a shoe doesn't matter until you've walked a mile.

Vision: 95/100 (top of the field)
Long context: 92/100 (2M window)
Multilingual: 91/100 (strong Arabic)
Reasoning: 88/100 (solid, not top)
Writing: 84/100 (workable)
Coding: 82/100 (weakest spot)

1. Screenshot in: UI capture, photo, or scanned PDF.
2. Gemini reads it: native OCR + control-state recognition.
3. Reason over the result: connects image features to your question.
4. Structured output: JSON, table, or natural-language answer.

  1. Mar 2023 Bard launches

    Google's first public LLM chat product. Not great.

  2. Dec 2023 Gemini 1

    First model branded as Gemini. Ultra, Pro, Nano tiers.

  3. Feb 2024 Gemini 1.5 Pro

    First million-token context window in production.

  4. Dec 2024 Gemini 2

    Better multimodal, faster inference, lower price.

  5. Dec 2025 Gemini 3 Pro

    2M context, vision win, Workspace integration that finally works.


What the bills come to

Gemini 3 Pro through the AI Studio API costs $5 per million input tokens and $40 per million output, per Google's Gemini API models documentation (verified against Google Cloud's Vertex AI pricing for enterprise use). That matches Claude Opus 4.7 on input and runs higher on output ($5 / $25), and it undercuts GPT-5 at $10 / $50 on both. For a large-context multimodal workload, those differences matter. Thousands of images per day add up fast.

Frontier-tier API pricing, January 2026, per provider docs
Model            Input ($/M tokens)  Output ($/M tokens)  Best at
Gemini 3 Pro     $5                  $40                  Vision, Workspace
Claude Opus 4.7  $5                  $25                  Code, long context, honest hedging
GPT-5            $10                 $50                  Visual design, conversational warmth
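A quick way to see what those rates mean for a real bill is to plug in a workload. The sketch below uses the per-million-token prices from the table; the request volume and the token counts per request are hypothetical placeholders, so swap in your own numbers.

```python
# USD per 1M tokens (input, output), copied from the pricing table.
PRICES = {
    "Gemini 3 Pro": (5.0, 40.0),
    "Claude Opus 4.7": (5.0, 25.0),
    "GPT-5": (10.0, 50.0),
}


def daily_cost(model: str, requests_per_day: int,
               input_tokens: int, output_tokens: int) -> float:
    """Estimated daily API spend in USD for a uniform workload."""
    in_price, out_price = PRICES[model]
    per_request = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_request * requests_per_day


if __name__ == "__main__":
    # Hypothetical workload: 5,000 image requests a day,
    # 2,000 input tokens and 500 output tokens per request.
    for model in PRICES:
        cost = daily_cost(model, 5000, 2000, 500)
        print(f"{model}: ${cost:.2f}/day")
```

The output rate carries more weight as responses get longer, so a pipeline that asks for terse structured output will see a different ranking than one that asks for long prose descriptions.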

The Gemini Advanced consumer plan at $20/month is a no-brainer if you live in Workspace. If you only open Workspace a couple of times a week for shared documents, the integration is nice but not the main reason to subscribe. The API is more useful than the consumer plan.

A few gaps in this review, named once and moved past. Sustained agent loops weren't tested. The Gemini-in-Search experience is a different product and wasn't evaluated. The video-understanding work was touched only briefly. Coding work was sampled but not made the focus, because Opus already wins that category cleanly and Gemini's coding output during the window was competent but not better.

Gemini 3 Pro is the right tool for one specific kind of work: anything that pairs an image with a question. The gap to the alternatives on screenshot understanding, hand-drawn diagrams, photo OCR, and Arabic-language document images is large and consistent. For that work, this is the only correct pick in late 2025.

For general-purpose work — writing, coding, long-form reasoning — Gemini is competent. It isn't better than the alternatives. The refusals are a friction tax that adds up. The session-to-session variance is a defect Google will presumably fix. If you can only run one model, Opus 4.7 stays the better default.

At $20 a month, Gemini Advanced pays for itself the moment your work touches a screenshot or a Workspace doc.

One framing note: the verdicts here reflect what I saw in my testing during a thirty-day window. The model is moving fast — the 3.1 Pro Preview release is already out, the refusal patterns may tighten, the pricing may shift. Re-test before relying on these conclusions past the next quarterly release.

Bottom line

Subscribe to Gemini 3 Pro if your work touches images, screenshots, or Workspace documents. The vision capability is best-in-class, the 2M context window is real, and the $20/month consumer plan pays for itself the first time you need it. For everything else, use Claude or GPT-5 — Gemini's coding and writing are average, and the refusal patterns are a real friction tax.
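In a multi-model setup, that recommendation reduces to a small routing rule. The sketch below is one way to express it; the `Task` fields and the model labels are hypothetical names chosen for illustration, not identifiers from any provider's API.

```python
from dataclasses import dataclass

# Placeholder labels, not verified API model IDs.
VISION_MODEL = "gemini-3-pro"
DEFAULT_MODEL = "claude-opus-4.7"


@dataclass
class Task:
    prompt: str
    has_image: bool = False          # screenshot, photo, scanned PDF, etc.
    touches_workspace: bool = False  # reads or writes a Docs/Sheets document


def pick_model(task: Task) -> str:
    """Send image and Workspace tasks to the vision model, everything else to the default."""
    if task.has_image or task.touches_workspace:
        return VISION_MODEL
    return DEFAULT_MODEL


if __name__ == "__main__":
    print(pick_model(Task("What is broken in this dashboard?", has_image=True)))
    print(pick_model(Task("Refactor this function.")))
```

Keeping the rule this blunt is deliberate: the review's finding is that the gap is large on vision and small elsewhere, so a finer-grained router would add complexity without much payoff.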

Frequently asked

Is Gemini 3 Pro worth using as your main AI model?

Only if your work is vision-heavy. Gemini 3 Pro is best-in-class on screenshots, document images, and anything that combines an image with reasoning. For text-only coding and writing, Claude Opus 4.7 and GPT-5 are both better.

How much does Gemini 3 Pro cost?

$5 per million input tokens and $40 per million output through the AI Studio API. That is cheaper than GPT-5 ($10 / $50) and level with Claude Opus 4.7 on input, though Claude's $25 output rate is lower. The consumer Gemini Advanced plan is $20 per month.

What's Gemini 3 Pro's context window?

2 million tokens advertised, the largest of any closed-source model. Effective retrieval holds to about 800K tokens, then degrades. Still the largest reliable retrieval zone in the field.
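One practical use of those two numbers is a pre-flight check before sending a long prompt. This is a minimal sketch: the thresholds are the figures quoted in this review, and the function name and return labels are made up for illustration. How you count tokens is left to whatever tokenizer or estimate you already use.

```python
ADVERTISED_WINDOW = 2_000_000  # tokens, as advertised
EFFECTIVE_ZONE = 800_000       # tokens, approximate reliable-retrieval limit seen in testing


def classify_context(token_count: int) -> str:
    """Label a prompt size against the advertised and effective limits."""
    if token_count <= EFFECTIVE_ZONE:
        return "reliable"
    if token_count <= ADVERTISED_WINDOW:
        return "degraded"
    return "too_large"


if __name__ == "__main__":
    for size in (250_000, 1_200_000, 2_500_000):
        print(size, classify_context(size))
```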

Why does Gemini 3 Pro refuse certain prompts?

It declines persona-taking, speculative business predictions, and tasks it classifies as unfair to a category of people. The refusals are reasonable individually but accumulate into real friction: about once every fifteen prompts in my testing.

Should I use Gemini 3 Pro for coding?

No. It's competent but trails both Claude Opus 4.7 and GPT-5 on production-quality code work. Use it for the vision pass in a multi-model setup and keep your default model for everything else.

Changelog

  • May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
  • January 22, 2026 — Updated pricing to reflect Google's January adjustments.
  • March 1, 2026 — Originally published.

References

  1. Google, "Gemini API models documentation," ai.google.dev/gemini-api/docs/models, accessed May 2026.
  2. Google, "Gemini API changelog," ai.google.dev/gemini-api/docs/changelog, accessed May 2026.
  3. Google Cloud, "Vertex AI generative AI pricing," cloud.google.com/vertex-ai/generative-ai/pricing, accessed May 2026.
  4. "Chatbot Arena leaderboard," lmarena.ai, May 2026 snapshot.
  5. Google DeepMind, "Gemini," deepmind.google/technologies/gemini, accessed May 2026.