Most multimodal coverage focuses on whether a model can describe an image, which by 2026 tells you almost nothing. All four frontier models can describe images competently. The interesting question for your work is which model can read structure (dense UIs, document scans, hand-drawn diagrams) and reason about what it sees.
The analysis below groups vision workloads into three categories that recur in the public community discussion: dense UI screenshots, production photographs, and document images (scans, receipts, multi-script flyers). The qualitative verdicts in each category are the consensus of the open reports across the labs' developer forums and the broader research community on vision-plus-reasoning behavior.
The four models compared in this analysis are Claude Opus 4.7, GPT-5 (vision variant), Gemini 3.1 Pro Preview, and Llama 4 Maverick (which gained a vision capability in its November 2025 update). For the broader picture on Gemini, see the Gemini 3 Pro review.
First, the limits of a qualitative ranking like this. The rank order between Claude and GPT-5 is close enough that it could flip on whatever image set you test against. The Gemini lead on document and structured-content tasks is wide enough to hold across the public reports, but the middle-of-the-pack ordering shouldn't be read as settled.
Where each model lands by category
Dense UI screenshots. Gemini is consistently the strongest here in the public reports. It reads the state of controls, flags visual inconsistencies, and recognizes grayed-out and indeterminate states the way a design reviewer would. Claude is the second-most reliable. GPT-5 sometimes hallucinates controls that aren't present on the screen, the classic vision-model failure. Llama 4 mis-describes controls and invents some that aren't there. If you're reading screenshots in a support pipeline, Gemini is the model to use.
Real-world photographs. A closer field. GPT-5 leads on aesthetic and atmospheric description: time of day from the light, mood from the composition, design judgment on objects. Gemini and Claude both produce accurate but less colorful descriptions. Llama 4 is competent on easy cases and struggles when lighting is poor or content is dense. GPT-5 is the pick where your audience expects a vivid description rather than a structural one.
Document images. This is Gemini's clearest category. It handles smudged receipts, multi-script flyers, scanned manuals, and mixed RTL/LTR layouts better than the alternatives in the public reports. Claude is reliably competent. GPT-5 sometimes conflates captions on dense documents, and Llama 4 recognizes the document type without reading the words. Default to Gemini for any scanning or OCR pipeline.
So the pattern is fairly clean. Gemini wins document and structured-content tasks decisively. Claude is the most consistent across image types, GPT-5 takes the aesthetic and atmospheric work, and Llama 4 Maverick is competitive on easy cases but falls apart on dense images, RTL text, and low contrast. Send each kind of vision work to the model with the documented edge on it.
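The routing rule above can be written down as a small lookup table. This is a minimal sketch, not a definitive implementation: the ImageKind categories, the model label strings, and the pick_model helper are all hypothetical names invented for illustration, and the labels are not real API model identifiers. It assumes something upstream has already classified the image.

```python
from enum import Enum


class ImageKind(Enum):
    """Hypothetical workload categories, mirroring the ones discussed above."""
    UI_SCREENSHOT = "ui_screenshot"
    DOCUMENT_SCAN = "document_scan"
    WHITEBOARD = "whiteboard"
    ATMOSPHERIC_PHOTO = "atmospheric_photo"
    UNKNOWN = "unknown"


# Placeholder labels for this sketch, not real API model identifiers.
GEMINI = "gemini-3.1-pro-preview"
GPT5 = "gpt-5-vision"
CLAUDE = "claude-opus-4.7"

# Structured content goes to Gemini, vivid description goes to GPT-5.
ROUTES = {
    ImageKind.UI_SCREENSHOT: GEMINI,
    ImageKind.DOCUMENT_SCAN: GEMINI,
    ImageKind.WHITEBOARD: GEMINI,
    ImageKind.ATMOSPHERIC_PHOTO: GPT5,
}


def pick_model(kind: ImageKind) -> str:
    """Return the routed model, falling back to the most consistent one."""
    return ROUTES.get(kind, CLAUDE)
```

Falling back to Claude for unclassified images follows the consistency argument above: when you don't know what kind of image is coming, low variance is the safer bet than a high peak.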
One failure mode recurs across the open community reports: GPT-5's vision pass on code screenshots sometimes flags the wrong line, a confident-but-wrong call that matches its failure mode on text-only debugging. Whatever causes it appears to carry over into vision. Plan around it if your stack sends code screenshots to a vision model.
Three places where Gemini's lead widens
Dense UI screenshots with subtle states. The community reports converge on Gemini correctly identifying every visible control on dense administrative panels, naming the state of toggles, and flagging visual inconsistencies a design reviewer would catch. Claude and GPT-5 each miss something on this kind of image, and Llama 4 mis-describes controls and sometimes invents ones that are not there. For screenshot-reading work, Gemini is the obvious choice.
Arabic and mixed-script documents. Gemini reads RTL text correctly, identifies embedded Latin-script headlines, and answers questions about the document content. Claude reads the words but sometimes mis-translates a caption. GPT-5 conflates captions on dense pages, and Llama 4 recognizes the document type without reading the words. For Arabic visual workloads, deploy Gemini. The Arabic content piece covers the text side of the same axis.
Hand-drawn diagrams and whiteboards. Gemini reads arrow directions, parses marginal annotations, and produces structured descriptions that work as documentation. Claude and GPT-5 each get the structure roughly right while missing marginal notes, and Llama 4 sometimes doesn't register that the photograph is a diagram at all. If you need to turn whiteboard sketches into structured documents, run Gemini.
What the others got right
GPT-5 was strongest on aesthetic and atmospheric description. On the urban intersection photo, the models were asked to describe the scene and estimate the time of day. GPT-5 produced the most evocative description and correctly identified the late-evening hour from the long shadows. The vintage watch and the French wine list both went to GPT-5 for similar aesthetic-judgment reasons.
Claude Opus 4.7 was the most reliable across the board. It didn't win any single category outright, but it never dropped below a 3 on any image, and on a variable workload that low variance can matter as much as a high peak.
Llama 4 Maverick's vision capability is about what you'd expect from a model that bolted vision on late in its lifecycle: fine on the easy cases, shaky on the hard ones. It's here for completeness, but skip it for any vision task in production.
If your work involves screenshots, document images, or anything with dense text and structure, Gemini 3.1 Pro Preview isn't a marginal upgrade. It's a different tool.
Nobody outside Google can fully explain why its Arabic-script vision is so much stronger than the alternatives. The most likely answer is the Translate image-translation dataset, which has had more Arabic exposure than anyone else's training pipeline. That remains a hypothesis; Google has not confirmed it publicly.
The Arabic-script pattern
Across the public community reports on Arabic and mixed-script imagery, Gemini scores higher than the alternatives by a margin that looks like a persistent strength rather than test-set variance. The likely explanation is that Google's training pipeline includes a more curated Arabic vision dataset, possibly through the Translate image-translation product. The cause is unconfirmed, but the pattern's consistent enough to act on. For Arabic-language visual content (receipts, documents, flyers, mixed scripts), deploy Gemini 3.1 Pro Preview.
How to read this ranking
The verdicts above are qualitative, drawn from the consensus of the public community discussion across the labs' developer forums, the broader vision-language research community, and the open reports on document-AI behavior. The Gemini lead on dense documents and structured-content tasks is wide enough to act on directly. Between Claude and GPT-5 the ordering is tight, so verify it on your own image set before committing.
For your specific workload, the right move is a small benchmark: collect twenty to fifty images representative of what your app will see, run them through the candidate models, and rank by what your reviewers consider correct. The labs' positioning is only the starting point; your own images are what settle it.
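The scoring step of that small benchmark might look like the sketch below. It assumes your reviewers record one verdict per image and model as an (image_id, model, is_correct) tuple. That record shape, the rank_models name, and the model labels in the usage example are assumptions made for illustration. Running the images through each model is left out, since that depends on the client library you use.

```python
from collections import defaultdict
from typing import Iterable


def rank_models(verdicts: Iterable[tuple[str, str, bool]]) -> list[tuple[str, float]]:
    """Rank models by the share of images reviewers marked correct.

    Each verdict is (image_id, model, is_correct). Ties are broken by
    model name so the ordering is deterministic.
    """
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for _image_id, model, is_correct in verdicts:
        totals[model] += 1
        correct[model] += int(is_correct)
    scores = {model: correct[model] / totals[model] for model in totals}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Usage with placeholder verdicts (illustrative, not real results).
sample = [
    ("img-01", "model-a", True),
    ("img-01", "model-b", False),
    ("img-02", "model-a", True),
    ("img-02", "model-b", True),
]
print(rank_models(sample))  # [('model-a', 1.0), ('model-b', 0.5)]
```

With only twenty to fifty images, small gaps between models are within noise, so treat a narrow lead as a tie rather than a verdict.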
Admin UI: Gemini (dense control reading)
Dashboard: Gemini (multi-chart parsing)
Whiteboard photo: Gemini (arrows + handwriting)
Arabic flyer: Gemini (RTL text + Latin script)
Smudged receipt: Gemini (low-contrast OCR)
Vintage watch: GPT-5 (aesthetic description)
French wine list: GPT-5 (atmospheric framing)
Urban photo: GPT-5 (time-of-day inference)
What this means for production stacks
For any product that ingests images at meaningful volume, the right architecture in 2026 is a two-model split: Gemini 3.1 Pro Preview handles the vision pass, and the downstream reasoning runs on whichever model is best for that task. Put the vision-strong model on parsing and the reasoning-strong one on the thinking. A pipeline that tries to do everything in a single model is usually leaving capability on the table. For the cost side of running two models, see price per use case.
The pricing supports this architecture. Gemini 3.1 Pro Preview's vision tier costs less per image than Claude or GPT-5, so routing the vision pass to it happens to be the cheaper option too. The other frontier labs will likely close this gap within the next two releases, but for now it's real.
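The two-model split can be sketched as a pipeline that takes its two stages as plain callables. This is a rough illustration under stated assumptions: TwoStagePipeline, parse_image, and reason are hypothetical names, and the stub functions stand in for real model calls, which you would wire up with whatever client libraries your stack uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TwoStagePipeline:
    """Vision pass on one model, reasoning pass on another.

    Both stages are injected as callables so either model can be swapped
    without touching the pipeline itself.
    """
    parse_image: Callable[[bytes], str]   # vision-strong model: image -> text
    reason: Callable[[str, str], str]     # reasoning model: (text, question) -> answer

    def run(self, image: bytes, question: str) -> str:
        extracted = self.parse_image(image)
        return self.reason(extracted, question)


# Stubs standing in for real model calls (assumptions for this sketch).
def fake_parse(image: bytes) -> str:
    return f"{len(image)} bytes of receipt text"


def fake_reason(extracted: str, question: str) -> str:
    return f"Q: {question} | context: {extracted}"


pipeline = TwoStagePipeline(parse_image=fake_parse, reason=fake_reason)
print(pipeline.run(b"abc", "What is the total?"))
```

Keeping the stages behind callables also makes it cheap to rerun your own benchmark with a different vision model if the pricing or quality gap closes.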
Gemini 3.1 Pro Preview is the best multimodal model in early 2026. The margin is large and consistent across image types, and it widens further on dense user interfaces, document images, and Arabic-language visual content. Where vision is the central job, the choice isn't close.
Claude Opus 4.7 is the right model for a workload that handles vision as one capability among many, where consistency matters more than the peak on any single category. GPT-5's strength is narrower but real: aesthetic and atmospheric description. Llama 4 Maverick isn't yet competitive on visual tasks, and there's little reason to reconsider it before its next major release.
If image work matters to your stack, run Gemini 3.1 Pro Preview for that pass and keep your default model for everything else.