benchr — A reference for AI model selection

Featured

GPT-5 versus Claude Opus 4.7 across seven coding and writing tasks

A head-to-head comparison covering code review, landing-page copy, citation accuracy, recipe writing, and email drafting. Claude leads on five of the seven; the two GPT-5 wins are not small.

Comparison · April 28, 2026

Reviews

01
Claude Opus 4.7

Strong on architecture decisions and long-document analysis. Vision is its weaker side.
02
GPT-5

The visual-design winner. Still strong on math and structured output.
03
Gemini 3 Pro

Excels at vision + reasoning. Average elsewhere. Deprecated by Google in March 2026 in favor of 3.1 Pro.

Comparisons

04
Coding assistants

Cursor, GitHub Copilot, Windsurf, and Cody on the same Markdown-exporter task.
05
Voice models

ElevenLabs, OpenAI Whisper, and Cartesia Sonic on latency, accuracy, and naturalness.
06
Context windows

Advertised window versus the retrieval zone where the model actually finds information.

Analysis

07
Price per use case

The cheapest model for chat, coding, RAG, agents, classification, and summarization.
08
The open-weight tier

Where Llama 4, Mistral Large 2, DeepSeek-V3.1, and Qwen 3 stand against the closed labs.
09
Why benchmarks stopped telling you

MMLU is saturated above 90%. The benchmarks worth tracking now.

Compare models directly

An interactive reference covering pricing, benchmarks, context windows, and capability ratings for 14 frontier and open-weight models.

Open the comparison tool

All 18 articles in the archive →

Recent model releases →

GPT-5 versus Claude Opus 4.7 across seven coding and writing tasks

Reviews

Claude Opus 4.7

GPT-5

Gemini 3 Pro

Comparisons

Coding assistants

Voice models

Context windows

Analysis

Price per use case

The open-weight tier

Why benchmarks stopped telling you

Compare models directly