benchr Issue No. 07

benchr · Issue No. 07

A reference for AI model selection

Pricing, benchmarks, and use-case fit for Claude, GPT, Gemini, Llama, Mistral, and the open-weight tier — sourced from official provider documentation.

Reviews

  1. 01

    Claude Opus 4.7

    Strong on architecture decisions and long-document analysis. Vision is its weaker side.

  2. 02

    GPT-5

    The visual-design winner. Still strong on math and structured output.

  3. 03

    Gemini 3 Pro

    Excels at vision + reasoning. Average elsewhere. Deprecated by Google in March 2026 in favor of 3.1 Pro.

Comparisons

  1. 04

    Coding assistants

    Cursor, GitHub Copilot, Windsurf, and Cody on the same Markdown-exporter task.

  2. 05

    Voice models

    ElevenLabs, OpenAI Whisper, and Cartesia Sonic on latency, accuracy, and naturalness.

  3. 06

    Context windows

    Advertised window versus the retrieval zone where the model actually finds information.

Analysis

  1. 07

    Price per use case

    The cheapest model for chat, coding, RAG, agents, classification, and summarization.

  2. 08

    The open-weight tier

    Where Llama 4, Mistral Large 2, DeepSeek-V3.1, and Qwen 3 stand against the closed labs.

  3. 09

    Why benchmarks stopped telling you

    MMLU is saturated above 90%. The benchmarks worth tracking now.

Compare models directly

An interactive reference covering pricing, benchmarks, context windows, and capability ratings for 14 frontier and open-weight models.

Open the comparison tool

All 18 articles in the archive →

Recent model releases →