What this guide covers
This guide pulls together benchr's coverage of the open-weight tier in 2026: the frontier-class open models (Llama 4, Mistral Large 2, DeepSeek-V3.1, Qwen 3), the small-model tier (Phi-4 mini, Gemma 3, the new Phi variants), and the hardware question — what does it actually take to run any of these yourself, and when is that the right call instead of paying for an API.
The frontier-open tier
-
The open-weight tier right now: Llama 4, Mistral, Qwen, DeepSeek
Where open weights have caught up to closed models, and the two categories where they still haven't. License clarity, code quality, multilingual range, and the cost-performance frontier.
The small-model tier
-
Small language models, in working use
Phi-4 mini, Gemma 3, and the workloads where sub-10B parameter models quietly win. 96% classification accuracy on a 1,200-email test set — at one-tenth the cost of the frontier.
Running them yourself
-
Running models on your own machine
Hardware, software, real tokens-per-second on three quantizations. When local is worth it versus paying for an API. 220 tok/s for Phi-4 mini on an M3 Max.
Which open model should you use?
For most production workloads where you need an open-weight model with a permissive license, Qwen 3 235B MoE is the default pick: Apache 2.0, broad multilingual range, code understanding that's competitive with closed mid-tier models, and a manageable hosting footprint at the MoE configuration.
If you need the best open-weight math and code performance and your data-residency story can accept the DeepSeek License v3 terms, DeepSeek-V3.1 beats Qwen on those specific benchmarks and is the cheapest hosted endpoint in the field.
For small-model workloads — classification, extraction, routing — start with Phi-4 mini (3.8B, MIT). It fits in 16GB of RAM, runs at 100+ tokens per second on a consumer laptop, and hits 94% accuracy on the email-classification test against a 96% Sonnet 4.6 baseline.
For the open-vs-closed cost question, see the AI costs guide. The open-weight tier hits roughly 70% of frontier capability for about 10% of the price, but the gap is sharper on agent loops and long-context retrieval than on the headline benchmarks.