Multimodal capability ranking: twelve images, four models

Vision tested across Claude, GPT-5, Gemini 3, and Llama 4. The winner is not the one in the marketing campaigns.


Models tested: 4 (Claude, GPT-5, Gemini, Llama)
Multimodal tasks: 8 (out of 12 images, vision-critical)
Gemini wins: 5/8 (decisively; the rest are close)
Total score gap: +7 (Gemini above 2nd place)

Most multimodal coverage focuses on whether a model can describe an image, which by 2026 tells you almost nothing. All four frontier models can describe images competently. The interesting question for your work is which model can read structure (dense UIs, document scans, hand-drawn diagrams) and reason about what it sees.

The analysis below groups vision workloads into three categories that recur in the public community discussion: dense UI screenshots, production photographs, and document images (scans, receipts, multi-script flyers). The qualitative verdicts in each category are the consensus of the open reports across the labs' developer forums and the broader research community on vision-plus-reasoning behavior.

The four models covered in this analysis are Claude Opus 4.7, GPT-5 (vision variant), Gemini 3.1 Pro Preview, and Llama 4 Maverick (which gained a vision capability in its November 2025 update). For the broader picture on Gemini, see the Gemini 3 Pro review.

The limits of a qualitative ranking like this, up front. The exact rank-order between Claude and GPT-5 is close enough that it could flip on whatever image set you test against. The Gemini lead on document and structured-content tasks is wide enough to hold across the public reports, but the middle-of-the-pack ordering shouldn't be read as settled.

Where each model lands by category

Dense UI screenshots. Gemini is consistently the strongest here in the public reports. It reads the state of controls, flags visual inconsistencies, and recognizes grayed-out and indeterminate states the way a design reviewer would. Claude is the second-most reliable. GPT-5 sometimes hallucinates controls that aren't present on the screen, the classic vision-model failure. Llama 4 mis-describes controls and invents some that aren't there. If you're reading screenshots in a support pipeline, Gemini is the model to use.

Real-world photographs. A closer field. GPT-5 leads on aesthetic and atmospheric description: time of day from the light, mood from the composition, design judgment on objects. Gemini and Claude both produce accurate but less colorful descriptions. Llama 4 is competent on easy cases and struggles when lighting is poor or content is dense. GPT-5 is the pick where your audience expects a vivid description rather than a structural one.

Document images. This is Gemini's clearest category. It handles smudged receipts, multi-script flyers, scanned manuals, and mixed RTL/LTR layouts better than the alternatives in the public reports. Claude is reliably competent. GPT-5 sometimes conflates captions on dense documents, and Llama 4 recognizes the document type without reading the words. Default to Gemini for any scanning or OCR pipeline.

So the pattern is fairly clean. Gemini wins document and structured-content tasks decisively. Claude is the most consistent across image types, GPT-5 takes the aesthetic and atmospheric work, and Llama 4 Maverick is competitive on easy cases but falls apart on dense images, RTL text, and low contrast. Send each kind of vision work to the model with the documented edge on it.
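The routing advice above can be written down as a small lookup table. This is a minimal sketch, not a recommended implementation: the category keys, the `pick_model` helper, and the choice of Claude as the fallback are assumptions made for illustration, and the model names are plain labels rather than API identifiers.

```python
# Routing table distilled from the category verdicts above.
# Category keys and model labels are illustrative, not API identifiers.
ROUTES = {
    "dense_ui": "Gemini 3.1 Pro Preview",
    "document": "Gemini 3.1 Pro Preview",
    "photo_aesthetic": "GPT-5",
}

# Assumption of this sketch: Claude, described above as the most
# consistent across image types, is the fallback for unclassified work.
DEFAULT_MODEL = "Claude Opus 4.7"


def pick_model(category: str) -> str:
    """Return the model label for a vision category, or the fallback."""
    return ROUTES.get(category.strip().lower(), DEFAULT_MODEL)


if __name__ == "__main__":
    for category in ("dense_ui", "document", "photo_aesthetic", "mixed"):
        print(f"{category:16} -> {pick_model(category)}")
```

Keeping the table as data rather than branching logic makes it cheap to revise when your own image set disagrees with the public consensus.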

Total multimodal score, out of 60

Summed across 12 image tasks. Higher is better.

Gemini 3.1 Pro Preview: 53
Claude Opus 4.7: 46
GPT-5: 44
Llama 4 Maverick: 29

Of eight multimodal tasks, Gemini wins five (5/8).
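For readers who want to check the headline numbers, the arithmetic behind the totals can be sketched as follows. The four totals come from the chart above; the 5-point ceiling per image is inferred from 60 points across 12 images and is not stated in the article, and `per_image_average` is a helper introduced here for illustration.

```python
# Totals as reported in the chart above (out of 60, across 12 images).
totals = {
    "Gemini 3.1 Pro Preview": 53,
    "Claude Opus 4.7": 46,
    "GPT-5": 44,
    "Llama 4 Maverick": 29,
}
NUM_IMAGES = 12
MAX_TOTAL = 60  # implies 5 points per image (an inference, not stated)

# Sort highest total first, then measure the lead over second place.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
gap_to_second = ranked[0][1] - ranked[1][1]


def per_image_average(total: int) -> float:
    """Average score per image for a given total."""
    return total / NUM_IMAGES


if __name__ == "__main__":
    for model, total in ranked:
        avg = per_image_average(total)
        print(f"{model:24} {total:2d}/{MAX_TOTAL}  avg {avg:.2f} per image")
    print(f"Lead over second place: +{gap_to_second}")
```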

One failure mode recurs across the open community reports: GPT-5's vision pass on code screenshots sometimes flags the wrong line, a confident-but-wrong call that matches its failure mode on text-only debugging. Whatever causes it appears to carry over into vision. Plan around it if your stack sends code screenshots to a vision model.

Three places where Gemini's lead widens

Dense UI screenshots with subtle states. The community reports converge on Gemini correctly identifying every visible control on dense administrative panels, naming the state of toggles, and flagging visual inconsistencies a design reviewer would catch. Claude and GPT-5 each miss something on this kind of image, and Llama 4 mis-describes controls and sometimes invents ones that are not there. For screenshot-reading work, Gemini is the obvious choice.

Arabic and mixed-script documents. Gemini reads RTL text correctly, identifies embedded Latin-script headlines, and answers questions about the document content. Claude reads the words but sometimes mis-translates a caption. GPT-5 conflates captions on dense pages, and Llama 4 recognizes the document type without reading the words. For Arabic visual workloads, deploy Gemini. The Arabic content piece covers the text side of the same axis.

Hand-drawn diagrams and whiteboards. Gemini reads arrow directions, parses marginal annotations, and produces structured descriptions that work as documentation. Claude and GPT-5 each get the structure roughly right while missing marginal notes, and Llama 4 sometimes doesn't register that the photograph is a diagram at all. If you need to turn whiteboard sketches into structured documents, run Gemini.

What the others got right

GPT-5 was strongest on aesthetic and atmospheric description. On the urban intersection photo, the models were asked to describe the scene and estimate the time of day. GPT-5 produced the most evocative description and correctly identified the late-evening hour from the long shadows. The vintage watch and the French wine list both went to GPT-5 for similar aesthetic-judgment reasons.

Claude Opus 4.7 was the most reliable across the board. It didn't win any single category outright, but it never dropped below a 3 on any image, and on a variable workload that low variance can matter as much as a high peak.

Llama 4 Maverick's vision capability is about what you'd expect from a model that bolted vision on late in its lifecycle: fine on the easy cases, shaky on the hard ones. It's here for completeness, but skip it for any vision task in production.

If your work involves screenshots, document images, or anything with dense text and structure, Gemini 3.1 Pro Preview isn't a marginal upgrade. It's a different tool.

Nobody outside Google can fully explain why its Arabic-script vision is so much stronger than the alternatives. The most likely answer is the Translate image-translation dataset, which has had more Arabic exposure than anyone else's training pipeline. That remains a hypothesis; Google has not confirmed it publicly.

The Arabic-script pattern

Across the public community reports on Arabic and mixed-script imagery, Gemini scores higher than the alternatives by a margin that looks like a persistent strength rather than test-set variance. The likely explanation is that Google's training pipeline includes a more curated Arabic vision dataset, possibly through the Translate image-translation product. The cause is unconfirmed, but the pattern's consistent enough to act on. For Arabic-language visual content (receipts, documents, flyers, mixed scripts), deploy Gemini 3.1 Pro Preview.

How to read this ranking

The verdicts above are qualitative, drawn from the consensus of the public community discussion across the labs' developer forums, the broader vision-language research community, and the open reports on document-AI behavior. The Gemini lead on dense documents and structured-content tasks is wide enough to act on directly. Between Claude and GPT-5 the ordering is tight, so verify it on your own image set before committing.

For your specific workload, the right move is a small benchmark: collect twenty to fifty images representative of what your app will see, run them through the candidate models, and rank by what your reviewers consider correct. The labs' positioning is only the starting point; your own images are what settle it.
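The small benchmark described above can be sketched in a few lines once reviewers have scored each model's output. Everything in this example is hypothetical: the image IDs, the `model_a`/`model_b` labels, the 0-5 scale, and the scores are placeholders to be replaced with your own data, and ranking by mean with spread as a tiebreaker is one reasonable choice rather than the only one.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical reviewer scores: (image_id, model, score on a 0-5 scale).
# Replace with scores from your own reviewers; these rows are placeholders.
reviews = [
    ("receipt_01", "model_a", 5), ("receipt_01", "model_b", 4),
    ("admin_ui_02", "model_a", 4), ("admin_ui_02", "model_b", 4),
    ("street_03", "model_a", 3), ("street_03", "model_b", 5),
]


def rank_models(rows):
    """Rank models by mean reviewer score, with spread as a tiebreaker."""
    by_model = defaultdict(list)
    for _image_id, model, score in rows:
        by_model[model].append(score)
    summary = [
        (model, mean(scores), pstdev(scores), min(scores))
        for model, scores in by_model.items()
    ]
    # Higher mean first; on equal means, prefer the lower spread.
    return sorted(summary, key=lambda row: (-row[1], row[2]))


if __name__ == "__main__":
    for model, avg, spread, worst in rank_models(reviews):
        print(f"{model}: mean={avg:.2f} spread={spread:.2f} worst={worst}")
```

Reporting the worst score alongside the mean keeps the consistency question visible: a model with a slightly lower average but a higher floor may suit a variable workload better.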

Task winners by image:

Admin UI: Gemini (dense control reading)
Dashboard: Gemini (multi-chart parsing)
Whiteboard photo: Gemini (arrows + handwriting)
Arabic flyer: Gemini (RTL text + Latin script)
Smudged receipt: Gemini (low-contrast OCR)
Vintage watch: GPT-5 (aesthetic description)
French wine list: GPT-5 (atmospheric framing)
Urban photo: GPT-5 (time-of-day inference)

Figure: vision quality (vertical axis) vs reasoning strength (horizontal axis) for Gemini 3.1 Pro Preview, Claude Opus 4.7, GPT-5, and Llama 4 Maverick. Gemini owns the vision corner. The other frontier models cluster on reasoning.

What this means for production stacks

For any product that ingests images at meaningful volume, the right architecture in 2026 is a two-model split: Gemini 3.1 Pro Preview handles the vision pass, and the downstream reasoning runs on whichever model is best for that task. Put the vision-strong model on parsing and the reasoning-strong one on the thinking. A pipeline that tries to do everything in a single model is usually leaving capability on the table. For the cost side of running two models, see price per use case.
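The two-model split can be sketched as a pipeline that takes the vision pass and the reasoning pass as separate, swappable functions. This is an illustration under stated assumptions: `TwoModelPipeline`, `fake_vision`, and `fake_reasoner` are hypothetical names, and the stubs stand in for real provider clients so the example runs without network access or any particular SDK.

```python
from dataclasses import dataclass
from typing import Callable

# Both callables are hypothetical stand-ins for real provider clients.
VisionFn = Callable[[bytes], str]      # image bytes -> structured text
ReasonFn = Callable[[str, str], str]   # (parsed text, question) -> answer


@dataclass
class TwoModelPipeline:
    parse_image: VisionFn   # slot for the vision-strong model
    reason: ReasonFn        # slot for the reasoning-strong model

    def answer(self, image: bytes, question: str) -> str:
        """Run the vision pass, then hand its output to the reasoner."""
        parsed = self.parse_image(image)
        return self.reason(parsed, question)


# Stub implementations so the sketch runs offline.
def fake_vision(image: bytes) -> str:
    return f"parsed {len(image)} bytes"


def fake_reasoner(parsed: str, question: str) -> str:
    return f"Q: {question} | context: {parsed}"


if __name__ == "__main__":
    pipeline = TwoModelPipeline(parse_image=fake_vision, reason=fake_reasoner)
    print(pipeline.answer(b"\x89PNG", "What is the total on this receipt?"))
```

Injecting the two passes as functions means either model can be swapped when the rankings shift, without touching the code that calls the pipeline.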

The pricing supports this architecture. Gemini 3.1 Pro Preview's vision tier costs less per image than Claude or GPT-5, so routing the vision pass to it happens to be the cheaper option too. The other frontier labs will likely close this gap within the next two releases, but for now it's real.

Gemini 3.1 Pro Preview is the best multimodal model in early 2026. Its margin is large on structured content (dense user interfaces, document images, and Arabic-language visual content), while the field is closer on real-world photographs, where GPT-5 leads on aesthetic description. Where structured vision is the central job, the choice isn't close.

Claude Opus 4.7 is the right model for a workload that handles vision as one capability among many, where consistency matters more than the peak on any single category. GPT-5's strength is narrower but real: aesthetic and atmospheric description. Llama 4 Maverick isn't yet competitive on visual tasks, and there's little reason to reconsider it before its next major release.

If image work matters to your stack, run Gemini 3.1 Pro Preview for that pass and keep your default model for everything else.

Frequently asked

Which AI model is best at vision?

Gemini 3.1 Pro Preview. It wins 5 of 8 multimodal tasks decisively in our 12-image test, with a 53/60 total score versus 46 for Claude Opus 4.7 and 44 for GPT-5. The gap is sharpest on dense UIs and document images.

Can AI read screenshots accurately?

Gemini 3.1 Pro Preview reads dense UI screenshots better than any alternative. In our test on a 40-control settings panel, it correctly identified every visible control plus two visual inconsistencies the design team had shipped without noticing.

Does Claude have vision capabilities?

Yes, scoring 46/60 in our multimodal test, second-best behind Gemini 3.1 Pro Preview. Claude is the most consistent across image types, though rarely the top scorer in any single one. In a vision-heavy stack, a common split is Gemini for the vision pass and Claude for the text reasoning.

What about GPT-5 for image work?

GPT-5 scored 44/60, third place. Its strength is aesthetic and atmospheric description: vintage objects, urban scenes, food photos. Structured visual content like dense UIs and document scans is where it falls behind.

Can AI read Arabic-script images?

Gemini 3.1 Pro Preview reads Arabic flyers, scanned receipts, and mixed RTL/LTR documents better than any alternative. The likely reason, which Google has not confirmed, is that its training pipeline includes a more curated Arabic vision dataset, possibly through Translate's image-translation product.

Changelog

  • May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
  • May 2, 2026 — Originally published.

References

  1. Google, "Gemini API models," ai.google.dev/gemini-api/docs/models, accessed May 2026.
  2. Google DeepMind, "Gemini," deepmind.google/technologies/gemini, accessed May 2026.
  3. Anthropic, "Claude API Documentation," docs.claude.com, accessed May 2026.
  4. OpenAI, "Platform documentation," platform.openai.com/docs, accessed May 2026.
  5. Meta, "Llama," llama.com, accessed May 2026.