benchr Issue No. 07

Multimodal capability ranking: twelve images, four models

Vision tested across Claude, GPT-5, Gemini 3, and Llama 4. The winner is not the one in the marketing campaigns.

Models tested: 4 (Claude, GPT-5, Gemini, Llama)
Multimodal tasks: 8 (out of 12 images, vision-critical)
Gemini wins: 5/8 (decisively; the rest are close)
Total score gap: +7 (Gemini above 2nd place)

Most multimodal model coverage focuses on whether a model can describe an image. That's the wrong test. By 2026, all four frontier models can describe images competently. The interesting question is which model can read structure — dense UIs, document scans, hand-drawn diagrams, mixed-script content — and reason about what it sees. Twelve images and four models later, the answer is clearer than the leaderboards make it look.

The image set was deliberately uneven. Four screenshots from working software. Four photographs taken in real conditions. Four document images of the kind that come out of phones and scanners and casual captures. Same prompt to each model on each image: describe what you see, then answer the specific question I am asking about it. Scoring on a 5-point scale. 5 means correct and complete. 1 means wrong or refused. The directional results aren't subtle.
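For anyone who wants to reproduce the bookkeeping, here is a minimal sketch of how the 5-point rubric could be recorded. The `Score` record, the category labels, and the helper names are my own illustration, not the tooling used for this issue.

```python
from dataclasses import dataclass

# Hypothetical labels for the three image groups described above.
CATEGORIES = ("screenshot", "photograph", "document")
MIN_POINTS, MAX_POINTS = 1, 5  # 5 = correct and complete, 1 = wrong or refused


@dataclass(frozen=True)
class Score:
    """One reviewer judgment: one model, one image, one number."""
    model: str
    image_id: str
    category: str
    points: int

    def __post_init__(self) -> None:
        # Reject anything outside the rubric so totals stay comparable.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if not MIN_POINTS <= self.points <= MAX_POINTS:
            raise ValueError(
                f"points must be {MIN_POINTS}..{MAX_POINTS}, got {self.points}"
            )


def total_by_model(scores):
    """Sum points per model across all scored images."""
    totals = {}
    for s in scores:
        totals[s.model] = totals.get(s.model, 0) + s.points
    return totals


def max_possible(num_images):
    """Ceiling for a model that scores 5 on every image."""
    return num_images * MAX_POINTS
```

With twelve images, `max_possible(12)` gives the 60-point ceiling used throughout this issue.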

The four models tested: Claude Opus 4.7, GPT-5 (vision variant), Gemini 3.1 Pro Preview, and Llama 4 Maverick (which gained a vision capability in its November 2025 update). The image set was assembled to look like what comes across a working desk on a normal week, not a curated benchmark. For the broader picture on Gemini 3.1 Pro Preview, see the Gemini 3 Pro review.

Worth naming a limit up front: twelve images is a small sample. A confident statistical claim would need fifty or more. The directional results are clear enough to publish, but the exact rank-order between Claude and GPT-5 (46 vs 44) could flip on a different test set. Gemini's win is wide enough to survive sample-size criticism. The middle of the pack isn't.

The image set

Four screenshots: a dense administrative settings panel with 40+ controls across three tabs; a payment-dashboard view showing a complex refund situation; a code editor window with a deliberate compile error highlighted; a hand-drawn whiteboard diagram of a state machine, photographed with a phone camera.

Four photographs: a damaged shipping container photographed at a port (the question was about the visible damage pattern and its cause); a close-up of a vintage mechanical wristwatch with the seconds hand stopped at an unusual angle; a busy urban intersection during evening rush hour (the question asked about the time of day from visible cues); a poorly-lit photograph of a circuit board with text in two scripts on the silkscreen.

Four document images: an Arabic-language municipal flyer with mixed RTL and LTR content; a scanned pharmacy receipt with the amounts smudged; a French-language wine list photographed at an angle; a page from a 1960s engineering manual with handwritten marginalia.

The scoreboard

Walking through the twelve image tasks in prose instead of a grid: Gemini 3.1 Pro Preview was the clear winner on the administrative settings panel — every visible control identified correctly, plus two visual inconsistencies the design team had shipped without noticing. On the payment dashboard, all three frontier models tied at competent; Llama 4 fell behind. The code editor with a compile error: Claude tied with Gemini at the top. The hand-drawn whiteboard diagram was Gemini's clearest win — it parsed the marginal annotations that Claude and GPT-5 both missed.

On the photographs, GPT-5 took two: the vintage watch and the urban intersection, both on aesthetic description. The damaged shipping container was a four-way tie. The bilingual circuit board went to Gemini on the script-mixing question.

On the document images, the smudged pharmacy receipt was where Gemini's lead widened: it recovered amounts that none of the other models could read. The French wine list went to GPT-5. The 1960s engineering manual was close. The Arabic flyer was Gemini's most decisive win in the document category.

The total: Gemini 53, Claude 46, GPT-5 44, Llama 4 29 out of 60 possible. In my testing, Gemini's lead on document-heavy and structured-content tasks is wide enough to survive a different sample. The middle-of-the-pack ordering between Claude and GPT-5 is closer than the absolute numbers suggest.
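The arithmetic behind those totals is simple enough to check by hand, but a short sketch makes the gaps explicit. The totals are the ones reported above; the `summarize` helper and its field names are illustrative only, and per-image scores are not reproduced here.

```python
MAX_PER_IMAGE = 5
NUM_IMAGES = 12

# Totals as reported in this issue.
TOTALS = {
    "Gemini 3.1 Pro Preview": 53,
    "Claude Opus 4.7": 46,
    "GPT-5": 44,
    "Llama 4 Maverick": 29,
}


def summarize(totals, max_per_image=MAX_PER_IMAGE, num_images=NUM_IMAGES):
    """Rank models and derive share of maximum, mean per image, and gap to the leader."""
    max_total = max_per_image * num_images
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    leader_score = ranked[0][1]
    return [
        {
            "model": name,
            "total": score,
            "share_of_max": score / max_total,
            "mean_per_image": score / num_images,
            "gap_to_leader": leader_score - score,
        }
        for name, score in ranked
    ]


if __name__ == "__main__":
    for row in summarize(TOTALS):
        print(
            f"{row['model']:<24} {row['total']:>2}/60  "
            f"mean {row['mean_per_image']:.2f}  gap {row['gap_to_leader']}"
        )
```

The seven-point gap between first and second is more than three times the two-point gap between second and third, which is why only the top spot is treated as settled.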

Gemini 3.1 Pro Preview won. By a lot. Its 53-of-60 score puts it seven points clear of second-place Claude Opus 4.7 at 46. GPT-5 came in third at 44. Llama 4 Maverick's vision capability is competitive only on the easier tasks and falls apart on dense images, RTL text, and anything with low contrast.

Total multimodal score (out of 60), summed across 12 image tasks. Higher is better.

  • Gemini 3.1 Pro Preview: 53
  • Claude Opus 4.7: 46
  • GPT-5: 44
  • Llama 4 Maverick: 29

Tasks where Gemini 3.1 Pro Preview wins decisively: 5 of 8.

I expected Claude to win the code-editor-with-error test. It tied with Gemini. The surprise was that GPT-5's vision pass on the same image flagged the wrong line of code — a confident-but-wrong call exactly like the failure mode of GPT-5 on text-only debugging. Whatever's causing that pattern isn't isolated to text.

Five tasks Gemini wins clearly

The dense-UI tests showed the widest gap. The administrative settings panel has roughly 40 controls in three tabs, with subtle states (some greyed out, some in indeterminate states, some with non-default values). Gemini correctly identified every visible control, named the state of the toggles, and noticed two visual inconsistencies the design team had actually shipped. Claude and GPT-5 each missed something. Llama 4 mis-described several controls and invented one that wasn't there.

The Arabic flyer test was also decisive. Gemini read the Arabic text correctly, identified the embedded Latin-script English headline, and answered the specific question about the event date. Claude got the date right but mis-translated one caption. GPT-5 conflated two captions. Llama 4 recognized that the document was Arabic and produced a rough description without reading the words. For the language side of this, see AI for Arabic content.

The whiteboard diagram test was the most surprising. Gemini correctly read the arrow directions, parsed the small marginal annotations, and translated the diagram into a structured description that could've been used as documentation. Claude and GPT-5 each got the structure roughly right and missed at least one marginal note. Llama 4 didn't recognize that the photograph was of a diagram.

What the others got right

GPT-5 was strongest on aesthetic and atmospheric description. The urban intersection task asked the models to describe the scene and identify roughly the time of day. GPT-5 produced the most evocative description and correctly placed the scene in the evening from the long shadows. The vintage watch and the French wine list both went to GPT-5 for similar aesthetic-judgment reasons.

Claude Opus 4.7 was the most reliable across the board. It didn't win any single category outright, but it never dropped below a 3 on any image. That kind of consistency is its own virtue. For a variable workload, the variance matters as much as the peak.
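To make "variance matters as much as the peak" concrete, here is a small sketch comparing two hypothetical score lists with the same mean. The numbers are invented for illustration and are not scores from this test; `consistency_profile` is a name I made up.

```python
from statistics import mean, pstdev


def consistency_profile(points):
    """Return the floor, the average, and the spread of a list of per-image scores."""
    return {
        "floor": min(points),
        "mean": mean(points),
        "spread": pstdev(points),  # population standard deviation
    }


# Invented examples: same average, very different worst case.
steady_model = [4, 4, 3, 4, 4, 4]
peaky_model = [5, 5, 2, 5, 1, 5]

if __name__ == "__main__":
    for name, pts in (("steady", steady_model), ("peaky", peaky_model)):
        p = consistency_profile(pts)
        print(f"{name}: floor={p['floor']} mean={p['mean']:.2f} spread={p['spread']:.2f}")
```

Both lists average the same, but the steady one never drops below 3, while the peaky one bottoms out at 1. For a variable workload, the floor is the number that decides how often a human has to step in.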

Llama 4 Maverick's vision capability is what you'd expect from a model that added vision late in its lifecycle. It works on easy cases. It struggles on hard ones. Included here for completeness, but skip it for any vision task in production.

If your work involves screenshots, document images, or anything with dense text and structure, Gemini 3.1 Pro Preview isn't a marginal upgrade. It's a different tool.

One honest admission: I can't fully explain why Google's Arabic-script vision is so much stronger than the alternatives. The most likely answer is Translate's image-translation dataset, which has had more Arabic exposure than anyone else's training pipeline. But that's a hypothesis, not something Google has confirmed publicly.

The Arabic-script pattern

Three of the twelve tests involved Arabic or mixed-script imagery, and on all three Gemini scored higher than the others by a margin that suggests this is a persistent strength rather than test-set variance. The most likely explanation is that Google's training pipeline includes a more curated Arabic vision dataset, possibly through Translate's image-translation product. Whatever the cause, the pattern is consistent enough to call out. For Arabic-language visual content — receipts, documents, flyers, mixed scripts — Gemini 3.1 Pro Preview is the model to deploy.

Methodology limits

Twelve images is a small sample. A confident statistical claim would need fifty or more. The score per image is one reviewer's judgment, calibrated to what that reviewer considers correct and complete. Each image was tested once per model and not re-run. The order of testing was randomized but not double-blind.

That said, the gap between Gemini and the second-place model is large enough that the sample-size issue doesn't change the directional conclusion. The interesting follow-up would be to test specific image categories more deeply. For example, fifty different RTL-script document images, to confirm whether the Gemini edge holds at scale.
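If that follow-up ever gets run, one reasonable way to check whether an edge holds is a paired bootstrap over per-image score differences. The sketch below is a generic version of that technique, not a method used in this issue; the function name, the 95% interval, and the resample count are my assumptions.

```python
import random


def bootstrap_mean_diff(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap: mean per-image difference (a - b) with a 95% percentile interval.

    scores_a and scores_b must be scores for the same images, in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must cover the same images")
    if not scores_a:
        raise ValueError("need at least one image")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed keeps the result reproducible

    resampled_means = []
    for _ in range(n_resamples):
        # Resample images with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()

    low = resampled_means[int(0.025 * n_resamples)]
    high = resampled_means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, low, high
```

If the interval for a fifty-image RTL set stays above zero, the edge is probably real. If it straddles zero, as I would expect for Claude versus GPT-5 on twelve images, the ordering is not settled.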

Winner by task, with the deciding strength:

  • Admin UI: Gemini (dense control reading)
  • Dashboard: Gemini (multi-chart parsing)
  • Whiteboard photo: Gemini (arrows + handwriting)
  • Arabic flyer: Gemini (RTL text + Latin script)
  • Smudged receipt: Gemini (low-contrast OCR)
  • Vintage watch: GPT-5 (aesthetic description)
  • French wine list: GPT-5 (atmospheric framing)
  • Urban photo: GPT-5 (time-of-day inference)

[Figure: vision quality (vertical axis) against reasoning strength (horizontal axis) for Gemini 3.1 Pro Preview, Claude Opus 4.7, GPT-5, and Llama 4 Maverick.]
Vision quality vs reasoning strength. Gemini owns the vision corner. The other frontier models cluster on reasoning.

What this means for production stacks

For any product that ingests images at real volume, the right architecture in 2026 is a two-model split. Gemini 3.1 Pro Preview handles the vision pass. The downstream reasoning runs on whichever model is best for that task. Vision-strong model for parsing, reasoning-strong model for thinking. Most multimodal pipelines that try to do everything in one model are leaving capability on the table. For the cost side of running two models, see price per use case.
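The split is easier to see as code. This is a minimal sketch that treats both models as plain callables, so no vendor SDK is assumed; `run_pipeline`, the type aliases, and the extraction instruction are hypothetical names and wording of mine.

```python
from typing import Callable

# Hypothetical interfaces: wire these to whichever vendor clients you actually use.
VisionFn = Callable[[bytes, str], str]   # (image bytes, instruction) -> extracted text
ReasonFn = Callable[[str], str]          # (prompt) -> answer

EXTRACTION_INSTRUCTION = (
    "Transcribe all visible text and describe the layout and structure of this image."
)


def run_pipeline(
    image: bytes,
    question: str,
    vision_pass: VisionFn,
    reasoning_pass: ReasonFn,
) -> str:
    """Two-model split: one model parses the image, another reasons over the parse."""
    # Step 1: the vision-strong model turns pixels into structured text.
    extraction = vision_pass(image, EXTRACTION_INSTRUCTION)

    # Step 2: the reasoning-strong model never sees the image, only the extraction.
    prompt = (
        "Extracted image content:\n"
        f"{extraction}\n\n"
        f"Question: {question}"
    )
    return reasoning_pass(prompt)
```

Keeping the two passes behind simple function signatures means either model can be swapped when the rankings change. The extracted text can also be logged and inspected when an answer looks wrong.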

The pricing supports this architecture. Gemini 3.1 Pro Preview's vision tier costs less per image than Claude or GPT-5, which means using it specifically for the vision pass is also the cheaper pick. The frontier labs will close this gap in the next two releases. For now, the gap is real.

Gemini 3.1 Pro Preview is the best multimodal model in early 2026. The margin is large overall and especially pronounced on dense user interfaces, document images, and Arabic-language visual content. For workloads where vision is the central job, the choice isn't close.

Claude Opus 4.7 is the right model for a workload that handles vision as one capability among many, where consistency matters more than peak performance on any single category. GPT-5 wins on aesthetic and atmospheric description, which is a narrow but real strength. Llama 4 Maverick isn't yet competitive on visual tasks. The next major release is worth waiting for before reconsidering.

If image work matters to your stack, run Gemini 3.1 Pro Preview for that pass and keep your default model for everything else.

Bottom line

On the twelve images I scored, Gemini 3.1 Pro Preview produced the best results in my testing — by a clear margin on dense UIs, document images, and Arabic-script content. Claude Opus 4.7 is the most consistent across image types. GPT-5 wins on aesthetic and atmospheric description. For any production stack with vision needs, run Gemini for the vision pass and keep your default model for text-only work.

Frequently asked

Which AI model is best at vision?

Gemini 3.1 Pro Preview. It wins 5 of 8 multimodal tasks decisively in our 12-image test, with a 53/60 total score versus 46 for Claude Opus 4.7 and 44 for GPT-5. The gap is sharpest on dense UIs and document images.

Can AI read screenshots accurately?

Gemini 3.1 Pro Preview reads dense UI screenshots better than any alternative. In our test on a 40-control settings panel, it correctly identified every visible control plus two visual inconsistencies the design team had shipped without noticing.

Does Claude have vision capabilities?

Yes. Claude Opus 4.7 scored 46/60 in our multimodal test, second behind Gemini 3.1 Pro Preview. It is the most consistent across image types but never the outright winner on any one. For a vision-heavy stack, Gemini handles that pass while Claude handles the text reasoning.

What about GPT-5 for image work?

GPT-5 scored 44/60, third place. It's strongest on aesthetic and atmospheric description (vintage objects, urban scenes, a photographed wine list). It's weaker on structured visual content like dense UIs and document scans.

Can AI read Arabic-script images?

Gemini 3.1 Pro Preview reads Arabic flyers, scanned receipts, and mixed RTL/LTR documents better than any alternative. The likely reason: Google's training pipeline includes a more curated Arabic vision dataset through Translate's image-translation product.

Changelog

  • May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
  • January 22, 2026 — Re-verified scoring rubric with a second reviewer.
  • May 2, 2026 — Originally published.
