benchr Issue No. 07

Open-weight AI models: a guide

Llama 4, Mistral Large 2, DeepSeek-V3.1, Qwen 3, plus the small-model tier and what it takes to run them yourself.

What this guide covers

This guide pulls together benchr's coverage of the open-weight tier in 2026: the frontier-class open models (Llama 4, Mistral Large 2, DeepSeek-V3.1, Qwen 3), the small-model tier (Phi-4 mini, Gemma 3, the new Phi variants), and the hardware question — what does it actually take to run any of these yourself, and when is that the right call instead of paying for an API.

The frontier-open tier

The small-model tier

  • Review · Feb 2026

    Small language models, in working use

    Phi-4 mini, Gemma 3, and the workloads where sub-10B parameter models quietly win. 96% classification accuracy on a 1,200-email test set — at one-tenth the cost of the frontier.

Running them yourself

  • Essay · Mar 2026

    Running models on your own machine

    Hardware, software, real tokens-per-second on three quantizations. When local is worth it versus paying for an API. 220 tok/s for Phi-4 mini on an M3 Max.

Which open model should you use?

For most production workloads that need an open-weight model with a permissive license, Qwen 3 235B MoE is the default pick: Apache 2.0, broad multilingual range, code understanding that's competitive with closed mid-tier models, and a manageable hosting footprint in its MoE configuration.

If you need the best open-weight math and code performance, and your licensing and data-residency requirements can accept the DeepSeek License v3 terms, DeepSeek-V3.1 beats Qwen 3 on math and code benchmarks and is the cheapest hosted endpoint in the field.

For small-model workloads — classification, extraction, routing — start with Phi-4 mini (3.8B, MIT). It fits in 16GB of RAM, runs at 100+ tokens per second on a consumer laptop, and hits 94% accuracy on the email-classification test against a 96% Sonnet 4.6 baseline.
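
As a rough check on the 16GB figure, weight memory can be estimated as parameter count times bits per weight, divided by eight. The sketch below applies that to a 3.8B-parameter model at three precisions. The function name and the chosen bit widths are illustrative assumptions, not benchr's measurement setup. The estimate covers weights only: KV cache, activations, and runtime overhead add to it, and real quantization formats often store slightly more than their nominal bits per weight.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same nominal bit width.
    Ignores KV cache, activations, and runtime overhead.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Hypothetical precisions chosen for illustration.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(3.8, bits):.1f} GB")
```

Even at 16-bit precision the weights come to under 8 GB by this estimate, which is consistent with the model fitting on a 16GB machine with room left for the context cache.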

For the open-vs-closed cost question, see the AI costs guide. The open-weight tier delivers roughly 70% of frontier capability for about 10% of the price, but the gap is wider on agent loops and long-context retrieval than on the headline benchmarks.
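
To make that trade-off concrete, the two rough figures above can be combined into a capability-per-dollar ratio. This is back-of-envelope arithmetic on the guide's approximate numbers. It treats "capability" as a single scalar, which is a simplification, and the function name is hypothetical.

```python
def capability_per_dollar_ratio(capability_fraction: float, price_fraction: float) -> float:
    """Capability per unit of spend, relative to a frontier model scored at 1.0.

    Both inputs are fractions of the frontier model's capability and price.
    """
    if price_fraction <= 0:
        raise ValueError("price_fraction must be positive")
    return capability_fraction / price_fraction


# The guide's rough figures: about 70% of the capability at about 10% of the price.
ratio = capability_per_dollar_ratio(0.70, 0.10)
print(f"{ratio:.1f}x capability per dollar relative to the frontier")
```

A ratio near 7 is an average over headline benchmarks. On workloads where the gap is wider, such as agent loops and long-context retrieval, the capability fraction to plug in would be lower.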