Most multimodal model coverage focuses on whether a model can describe an image. That's the wrong test. By 2026, all four frontier models can describe images competently. The interesting question is which model can read structure — dense UIs, document scans, hand-drawn diagrams, mixed-script content — and reason about what it sees. Twelve images and four models later, the answer is clearer than the leaderboards make it look.
The image set was deliberately uneven. Four screenshots from working software. Four photographs taken in real conditions. Four document images of the kind that come out of phones, scanners, and casual captures. Each model got the same prompt on each image: describe what you see, then answer the specific question I am asking about it. Each response was scored on a 5-point scale, where 5 means correct and complete and 1 means wrong or refused. The directional results aren't subtle.
The four models tested: Claude Opus 4.7, GPT-5 (vision variant), Gemini 3.1 Pro Preview, and Llama 4 Maverick (which gained a vision capability in its November 2025 update). The image set was assembled to look like what comes across a working desk on a normal week, not a curated benchmark. For the broader picture on Gemini 3.1 Pro Preview, see the Gemini 3 Pro review.
Worth naming a limit up front: twelve images is a small sample. A confident statistical claim would need fifty or more. The directional results are clear enough to publish, but the exact rank-order between Claude and GPT-5 (46 vs 44) could flip on a different test set. Gemini's win is wide enough to survive sample-size criticism. The middle of the pack isn't.
The image set
Four screenshots: a dense administrative settings panel with 40+ controls across three tabs; a payment-dashboard view showing a complex refund situation; a code editor window with a deliberate compile error highlighted; a hand-drawn whiteboard diagram of a state machine, photographed with a phone camera.
Four photographs: a damaged shipping container photographed at a port (the question was about the visible damage pattern and its cause); a close-up of a vintage mechanical wristwatch with the seconds hand stopped at an unusual angle; a busy urban intersection during evening rush hour (the question asked about the time of day from visible cues); a poorly-lit photograph of a circuit board with text in two scripts on the silkscreen.
Four document images: an Arabic-language municipal flyer with mixed RTL and LTR content; a scanned pharmacy receipt with the amounts smudged; a French-language wine list photographed at an angle; a page from a 1960s engineering manual with handwritten marginalia.
The scoreboard
Walking through the twelve image tasks in prose instead of a grid: Gemini 3.1 Pro Preview was the clear winner on the administrative settings panel, with every visible control identified correctly, plus two visual inconsistencies the design team had shipped without noticing. On the payment dashboard, Gemini, Claude, and GPT-5 tied at competent; Llama 4 fell behind. On the code editor with a compile error, Claude tied with Gemini at the top. The hand-drawn whiteboard diagram was Gemini's clearest win: it parsed the marginal annotations that Claude and GPT-5 each missed at least in part.
On the photographs, GPT-5 took two: the vintage watch and the urban intersection, both on aesthetic description. The damaged shipping container was a four-way tie. The bilingual circuit board went to Gemini on the script-mixing question. On the document images, the smudged pharmacy receipt was where Gemini's lead widened: it recovered amounts that none of the other models could read. The French wine list went to GPT-5. The 1960s engineering manual was close. The Arabic flyer was Gemini's most decisive win in the document category.
The total: Gemini 53, Claude 46, GPT-5 44, Llama 4 29 out of 60 possible. In my testing, Gemini's lead on document-heavy and structured-content tasks is wide enough to survive a different sample. The middle-of-the-pack ordering between Claude and GPT-5 is closer than the absolute numbers suggest.
Gemini 3.1 Pro Preview won. By a lot. The 53-of-60 score is a real margin over the second-place Claude Opus 4.7 at 46. GPT-5 came in third at 44. Llama 4 Maverick's vision capability is competitive only on the easier tasks and falls apart on dense images, RTL text, and anything with low contrast.
I expected Claude to win the code-editor-with-error test. It tied with Gemini. The surprise was that GPT-5's vision pass on the same image flagged the wrong line of code — a confident-but-wrong call exactly like the failure mode of GPT-5 on text-only debugging. Whatever's causing that pattern isn't isolated to text.
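The totals above reduce to simple arithmetic, and it helps to see them as per-image averages and rank gaps rather than raw sums. A minimal sketch, using only the four totals reported here; the variable and function names are my own:

```python
# Totals reported in the scoreboard: 12 images, each scored 1 to 5.
IMAGES = 12
MAX_PER_IMAGE = 5
totals = {
    "Gemini 3.1 Pro Preview": 53,
    "Claude Opus 4.7": 46,
    "GPT-5": 44,
    "Llama 4 Maverick": 29,
}

max_total = IMAGES * MAX_PER_IMAGE  # 60 possible points


def summarize(totals, images, max_total):
    """Return (model, total, mean per image, share of maximum) rows, best first."""
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, t, t / images, t / max_total) for name, t in ranked]


rows = summarize(totals, IMAGES, max_total)
for name, total, mean_score, share in rows:
    print(f"{name:<24} {total:>2}/{max_total}  mean {mean_score:.2f}  {share:.0%}")

# Point gap between each model and the one ranked directly below it.
gaps = [rows[i][1] - rows[i + 1][1] for i in range(len(rows) - 1)]
print(gaps)
```

The 2-point gap between Claude and GPT-5 is the size of one image scored two points differently, which is why that ordering could flip on another test set. The 7-point gap above it would take several images moving in the same direction.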
Five tasks Gemini wins clearly
The dense-UI tests showed the widest gap. The administrative settings panel has roughly 40 controls in three tabs, with subtle states (some greyed out, some in indeterminate states, some with non-default values). Gemini correctly identified every visible control, named the state of the toggles, and noticed two visual inconsistencies the design team had actually shipped. Claude and GPT-5 each missed something. Llama 4 mis-described several controls and invented one that wasn't there.
The Arabic flyer test was also decisive. Gemini read the Arabic text correctly, identified the embedded Latin-script English headline, and answered the specific question about the event date. Claude got the date right but mis-translated one caption. GPT-5 conflated two captions. Llama 4 recognized that the document was Arabic and produced a rough description without reading the words. For the language side of this, see AI for Arabic content.
The whiteboard diagram test was the most surprising. Gemini correctly read the arrow directions, parsed the small marginal annotations, and translated the diagram into a structured description that could've been used as documentation. Claude and GPT-5 each got the structure roughly right and missed at least one marginal note. Llama 4 didn't recognize that the photograph was of a diagram.
What the others got right
GPT-5 was strongest on aesthetic and atmospheric description. The urban intersection asked the models to describe the scene and identify roughly the time of day. GPT-5 produced the most evocative description and correctly identified the late-evening hour from the long shadows. The vintage watch and the French wine list both went to GPT-5 for similar aesthetic-judgment reasons.
Claude Opus 4.7 was the most reliable across the board. It didn't win any single category outright, but it never dropped below a 3 on any image. That kind of consistency is its own virtue. For a variable workload, the variance matters as much as the peak.
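To make the consistency point concrete, here is a small sketch comparing two score profiles with the same total. The per-image scores below are invented for illustration and belong to two hypothetical models; they are not the scores from this test.

```python
from statistics import pstdev

# Invented 12-image score profiles for two hypothetical models.
# NOT data from this review: both are constructed to sum to the same total.
steady = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
spiky = [5, 5, 5, 1, 5, 5, 5, 1, 5, 5, 5, 1]


def profile(scores):
    """Summarize a score list by total, worst case, and spread."""
    return {
        "total": sum(scores),
        "min": min(scores),
        "spread": round(pstdev(scores), 2),  # population standard deviation
    }


steady_profile = profile(steady)
spiky_profile = profile(spiky)
print(steady_profile)
print(spiky_profile)
```

Both profiles land on the same total, but the second one fails outright on a quarter of the images. A leaderboard total hides that difference; the minimum and the spread expose it.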
Llama 4 Maverick's vision capability is what you'd expect from a model that added vision late in its lifecycle. It works on easy cases. It struggles on hard ones. Included here for completeness, but skip it for any vision task in production.
If your work involves screenshots, document images, or anything with dense text and structure, Gemini 3.1 Pro Preview isn't a marginal upgrade. It's a different tool.
One honest admission: I can't fully explain why Google's Arabic-script vision is so much stronger than the alternatives. The most likely answer is Translate's image-translation dataset, which has had more Arabic exposure than anyone else's training pipeline. But that's a hypothesis, not something Google has confirmed publicly.
The Arabic-script pattern
Three of the twelve tests involved Arabic or mixed-script imagery, and on all three Gemini scored higher than the others by a margin that suggests this is a persistent strength rather than test-set variance. The most likely explanation is that Google's training pipeline includes a more curated Arabic vision dataset, possibly through Translate's image-translation product. Whatever the cause, the pattern is consistent enough to call out. For Arabic-language visual content — receipts, documents, flyers, mixed scripts — Gemini 3.1 Pro Preview is the model to deploy.
Methodology limits
Twelve images is a small sample. A confident statistical claim would need fifty or more. The score per image is one reviewer's judgment, calibrated to what that reviewer considers correct and complete. Each image was tested once per model and not re-run. The order of testing was randomized but not double-blind.
That said, the gap between Gemini and the second-place model is large enough that the sample-size issue doesn't change the directional conclusion. The interesting follow-up would be to test specific image categories more deeply, for example with fifty different RTL-script document images, to confirm whether the Gemini edge holds at scale.
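For anyone repeating the test, a randomized run order like the one mentioned above could be generated along these lines. This is one possible approach, not a record of how the order was actually produced for this article; the image IDs, the seed, and the function name are placeholders.

```python
import random

MODELS = [
    "Claude Opus 4.7",
    "GPT-5",
    "Gemini 3.1 Pro Preview",
    "Llama 4 Maverick",
]


def build_schedule(image_ids, models, seed):
    """Return (image_id, model) runs with the model order shuffled per image."""
    rng = random.Random(seed)  # a fixed seed makes the schedule reproducible
    schedule = []
    for image_id in image_ids:
        order = list(models)  # copy so the caller's list is not mutated
        rng.shuffle(order)
        schedule.extend((image_id, model) for model in order)
    return schedule


# Placeholder IDs standing in for the twelve test images.
image_ids = [f"img{n:02d}" for n in range(1, 13)]
schedule = build_schedule(image_ids, MODELS, seed=2026)
print(len(schedule))
```

Shuffling per image keeps any one model from always being scored first or last. It does not make the scoring blind; that would also require hiding the model names from the reviewer.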
Task | Winner | Deciding strength
Admin UI | Gemini | Dense control reading
Dashboard | Gemini | Multi-chart parsing
Whiteboard photo | Gemini | Arrows + handwriting
Arabic flyer | Gemini | RTL text + Latin script
Smudged receipt | Gemini | Low-contrast OCR
Vintage watch | GPT-5 | Aesthetic description
French wine list | GPT-5 | Atmospheric framing
Urban photo | GPT-5 | Time-of-day inference
What this means for production stacks
For any product that ingests images at real volume, the right architecture in 2026 is a two-model split. Gemini 3.1 Pro Preview handles the vision pass. The downstream reasoning runs on whichever model is best for that task. Vision-strong model for parsing, reasoning-strong model for thinking. Most multimodal pipelines that try to do everything in one model are leaving capability on the table. For the cost side of running two models, see price per use case.
The pricing supports this architecture. Gemini 3.1 Pro Preview's vision tier costs less per image than Claude or GPT-5, which means using it specifically for the vision pass is also the cheaper pick. The other frontier labs will probably close this gap within the next two releases. For now, the gap is real.
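The two-model split described above can be sketched as a thin pipeline that keeps the vision pass and the reasoning pass behind separate interfaces. Everything here is hypothetical: the type aliases, the class, and the two stub functions stand in for real model clients, and no vendor API is being shown.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces; real model clients would be wrapped to fit them.
VisionPass = Callable[[bytes, str], str]   # (image bytes, instruction) -> description
ReasoningPass = Callable[[str, str], str]  # (description, question) -> answer


@dataclass
class TwoModelPipeline:
    vision: VisionPass
    reasoning: ReasoningPass

    def answer(self, image: bytes, question: str) -> str:
        # Step 1: the vision-strong model turns pixels into text.
        description = self.vision(
            image, "Describe the structure and transcribe all visible text."
        )
        # Step 2: the reasoning-strong model works only from that text.
        return self.reasoning(description, question)


# Stub implementations so the sketch runs without any network calls.
def fake_vision(image: bytes, instruction: str) -> str:
    return f"[description of {len(image)}-byte image]"


def fake_reasoning(description: str, question: str) -> str:
    return f"Answer to {question!r} based on {description}"


pipeline = TwoModelPipeline(vision=fake_vision, reasoning=fake_reasoning)
result = pipeline.answer(b"\x89PNG", "What is the refund status?")
print(result)
```

Keeping the two passes behind plain callables means either model can be swapped without touching the other, which matters if the gap between vendors closes.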
Gemini 3.1 Pro Preview is the best multimodal model in early 2026. The margin is large, holds across screenshots and document images, and is especially pronounced on dense user interfaces and Arabic-language visual content. For workloads where vision is the central job, the choice isn't close.
Claude Opus 4.7 is the right model for a workload that handles vision as one capability among many, where consistency matters more than peak performance on any single category. GPT-5 wins on aesthetic and atmospheric description, which is a narrow but real strength. Llama 4 Maverick isn't yet competitive on visual tasks. The next major release is worth waiting for before reconsidering.
If image work matters to your stack, run Gemini 3.1 Pro Preview for that pass and keep your default model for everything else.