Which AI model has the best value in 2026?

The intelligence-vs-price quadrant shows DeepSeek V4-Pro and Gemini 3.5 Flash in the 'cheap + capable' zone. They deliver frontier-grade capability at a fraction of the cost of Claude Opus or GPT-5.5.

How does the benchmark explorer work?

Drag the sliders to set how much you care about coding, reasoning, writing, vision, long context, and multilingual. The list re-ranks all 19 models in real time using a weighted average of those capability scores.

Charts · Updated June 2026

AI model charts

Two ways to find the right model. The scatter shows where every model sits on intelligence vs price. The explorer lets you reweight dimensions to match your actual workload.

Data from models.json. Rankings are computed from the data, never paid placements.

Intelligence vs price

Each dot is a model. Y-axis: a 0–100 capability score (coding 40% + reasoning 40% + writing 20%), where coding is SWE-bench Verified and reasoning is GPQA Diamond, using official figures where the provider published them and benchr estimates otherwise. X-axis: blended price per million tokens (the average of the input and output prices). Top-left = most capability per dollar. Click any dot to read the review.
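As a rough illustration of how one dot's coordinates could be derived from the description above, here is a minimal sketch. The field names, the `Model` class, and the sample numbers are hypothetical placeholders, not the actual models.json schema or real model data, and the sketch assumes every dimension is already on a 0–100 scale.

```python
from dataclasses import dataclass


@dataclass
class Model:
    # All field names are assumptions for illustration only.
    name: str
    coding: float        # SWE-bench Verified score, 0-100
    reasoning: float     # GPQA Diamond score, 0-100
    writing: float       # editorial rating, assumed 0-100
    input_price: float   # USD per 1M input tokens
    output_price: float  # USD per 1M output tokens


def capability(m: Model) -> float:
    """Default Y-axis: coding 40% + reasoning 40% + writing 20%."""
    return 0.40 * m.coding + 0.40 * m.reasoning + 0.20 * m.writing


def blended_price(m: Model) -> float:
    """X-axis: simple average of input and output price per 1M tokens."""
    return (m.input_price + m.output_price) / 2


# Placeholder values, not a real model.
demo = Model("example-model", coding=70.0, reasoning=80.0, writing=60.0,
             input_price=1.0, output_price=3.0)
point = (blended_price(demo), capability(demo))  # (x, y) on the scatter
```

A dot further up and to the left has a higher `capability` for a lower `blended_price`, which is what the chart means by capability per dollar.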


Axis ranges scale to the data — no model is clamped to an edge. Pricing from official provider docs (June 2026); self-hosted / free models are staggered just off the $0 axis so their labels don't pile up.

Benchmark explorer

Drag the sliders to weight what matters to you. The ranking updates instantly. Zero out anything you don't care about — coding-only shops can drop writing and multilingual to zero. Weight a dimension a model has no data for (e.g. Vision for a text-only model) and it's scored 0 there, so the ranking honestly reflects your weights.
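The re-ranking behaviour described above can be sketched as follows. This is an assumption-laden illustration, not the page's actual code: it assumes slider weights are normalized by their sum, and the two models and their scores are invented for the example.

```python
def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted average of dimension scores; a missing dimension counts as 0."""
    total = sum(weights.values())
    if total == 0:
        return 0.0  # every slider at zero: nothing to rank on
    return sum(w * scores.get(dim, 0.0) for dim, w in weights.items()) / total


def rank(models: dict, weights: dict) -> list:
    """Model names ordered from highest to lowest weighted score."""
    return sorted(models, key=lambda name: weighted_score(models[name], weights),
                  reverse=True)


# Hypothetical models; the text-only one has no vision score at all.
models = {
    "text-only-model": {"coding": 80, "reasoning": 80, "writing": 70},
    "multimodal-model": {"coding": 70, "reasoning": 70, "writing": 70, "vision": 90},
}

default_weights = {"coding": 40, "reasoning": 40, "writing": 20, "vision": 0}
vision_heavy = {"coding": 40, "reasoning": 40, "writing": 20, "vision": 50}

default_order = rank(models, default_weights)
vision_order = rank(models, vision_heavy)
```

With the default weights the text-only model leads; once Vision is weighted, its missing score is treated as 0 and the multimodal model moves ahead.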

Coding 40%

Reasoning 40%

Writing 20%

Vision 0%

Long context 0%

Multilingual 0%


How this score is computed. Each model's number is a weighted average (0–100) of the dimensions you've set above. Coding uses SWE-bench Verified and Reasoning uses GPQA Diamond — real, sourced benchmarks (official figures where the provider published them, a benchr estimate where they didn't, marked "est"). Writing, Vision, Long context, and Multilingual are benchr editorial ratings, not lab benchmarks — see the methodology. A dimension a model has no data for counts as 0, never dropped. Bars are scaled to the current leader; scores show one decimal so close models stay distinct.
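The display rules in that note (bars scaled to the current leader, scores shown to one decimal) might look something like this. The function names and sample scores are hypothetical, and the real page may scale or round differently.

```python
def bar_widths(scores: dict) -> dict:
    """Bar width as a percentage of the leader's score (leader = 100%)."""
    leader = max(scores.values(), default=0.0)
    if leader <= 0:
        return {name: 0.0 for name in scores}
    return {name: 100.0 * s / leader for name, s in scores.items()}


def label(score: float) -> str:
    """One-decimal label, so close scores do not collapse to the same integer."""
    return f"{score:.1f}"


# Placeholder scores, not real model results.
scores = {"model-a": 78.4, "model-b": 77.6, "model-c": 39.2}
widths = bar_widths(scores)
labels = {name: label(s) for name, s in scores.items()}
```

Here `model-a` and `model-b` would both round to 78 as whole numbers, but their one-decimal labels remain different.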

Also useful

→ Full ranked index with benchr Rating
→ Cost calculator: real monthly cost by usage
→ Model recommender: answer three questions
→ Side-by-side comparison

Frequently asked questions

Which AI model has the best capability-per-dollar?

In June 2026, DeepSeek V4-Pro tops the scatter chart, with near-frontier coding and reasoning at under $0.44 per 1M input tokens. Gemini 3.5 Flash follows, with strong multimodal capability at $1.50/$9.00. Both sit in the "cheap + capable" top-left zone.

What does the Y-axis (capability) measure?

The default Y-axis is a weighted blend: coding (40%, SWE-bench Verified) + reasoning (40%, GPQA Diamond) + writing (20%, a benchr editorial rating). Use the sliders in the explorer to change the weighting to match your workload.

Why are Llama/Phi plotted at zero price?

They're free open-weight models. If you self-host, the API cost is $0, so your cost is infrastructure rather than tokens. That puts them at the left edge of the chart, staggered just off the $0 axis so their labels don't pile up.

Can I embed or share these charts?

Yes. All charts read from the public models.json file. You can iframe any tool page directly; the JSON data is free to use under CC BY 4.0.