"Open-weight has caught up." That's the take heard most often in 2026, usually from people who haven't tried running an open model in a serious production workflow. The opposite take, that open-weight is still years behind, comes from people who haven't looked at the leaderboards in a while. Both fail the same way: they average across categories that have moved at very different speeds. Open weights have closed the gap on general conversational use, on isolated code generation, and on high-resource multilingual reasoning. Where they remain behind is long-context retrieval at extreme scale and reliable tool use inside agent loops. The four models covered here (Llama 4 in Maverick and Scout configurations, Mistral Large 2, Qwen 3, DeepSeek-V3.1) show the unevenness in different ways.
The lineup: Llama 4 in its two main shipped configurations (Maverick, a 400B-parameter mixture-of-experts model with about 17B active, and Scout, a 109B mixture-of-experts model), Mistral Large 2 at 123B dense parameters, Qwen 3 in both the 72B dense and 235B MoE variants, and DeepSeek-V3.1 at 671B MoE with roughly 37B active per token. The verdicts below are grounded in each lab's documented capability claims and the consistent open community discussion of how each model behaves in production. If you want to run any of these on your own hardware, see running models on your own machine.
One caveat before the lineup. The headline benchmark improvements on DeepSeek-V3.1 vs V3.0 (especially on math) are genuine and reproducible on the published evals. Whether that translates into the same delta on your specific math workload (finance modeling, hard reasoning under uncertainty) is the kind of thing only your own test can tell you. The leaderboard gap tends to look larger than what shows up at the workload level.
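A workload-level check of this kind can be very small. The sketch below is one hypothetical way to compare two model versions on your own cases; the `Model` and `Case` types, the stand-in models, and the toy arithmetic cases are all assumptions made for illustration, not any lab's evaluation harness.

```python
from typing import Callable

# A "model" is any callable mapping a prompt to a response (assumed interface).
Model = Callable[[str], str]
# A case pairs a prompt with a checker that decides whether a response passes.
Case = tuple[str, Callable[[str], bool]]


def pass_rate(model: Model, cases: list[Case]) -> float:
    """Fraction of cases whose response satisfies the case's checker."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def workload_delta(old: Model, new: Model, cases: list[Case]) -> float:
    """Pass-rate difference (new minus old) on the same workload cases."""
    return pass_rate(new, cases) - pass_rate(old, cases)


# Stand-in models for illustration only; real ones would call an endpoint.
def fake_old(prompt: str) -> str:
    return "4" if prompt == "2+2" else "unsure"


def fake_new(prompt: str) -> str:
    return {"2+2": "4", "3*3": "9"}.get(prompt, "unsure")


# Toy cases; a real suite would use prompts drawn from your own workload.
CASES: list[Case] = [
    ("2+2", lambda r: r.strip() == "4"),
    ("3*3", lambda r: r.strip() == "9"),
    ("10/4", lambda r: r.strip() == "2.5"),
    ("7-5", lambda r: r.strip() == "2"),
]

if __name__ == "__main__":
    print(workload_delta(fake_old, fake_new, CASES))
```

The number that matters is the delta on cases you wrote, which may be smaller or larger than the published one.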
Llama 4
Meta released Llama 4 in April 2025, per Meta AI's Llama 4 announcement, in two main configurations. Maverick is the 400B-parameter mixture-of-experts model (about 17B active per token) intended for serious GPU deployment. Scout is the 109B mixture-of-experts variant that activates roughly 17B parameters per forward pass and runs on a single 80GB H100 with sensible quantization. Weights and downloads are available at llama.com.
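The single-GPU claim for Scout comes down to simple arithmetic on weight storage. The sketch below is a rough estimate under stated assumptions: it counts weights only, uses decimal gigabytes, and reserves a hypothetical 10 GB of headroom for KV cache and activations (that figure is an illustrative guess, not a measured value). Note that a mixture-of-experts model must keep all experts resident, so total parameters, not active parameters, set the memory bill.

```python
def weight_memory_gb(total_params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    bytes_per_weight = bits_per_weight / 8
    # (params_billions * 1e9 params) * bytes_per_weight / (1e9 bytes per GB):
    # the two 1e9 factors cancel.
    return total_params_billions * bytes_per_weight


def fits_on_gpu(
    total_params_billions: float,
    bits_per_weight: int,
    gpu_memory_gb: float = 80.0,
    headroom_gb: float = 10.0,  # assumed allowance for KV cache and activations
) -> bool:
    """True if weights plus the assumed headroom fit in GPU memory."""
    needed = weight_memory_gb(total_params_billions, bits_per_weight) + headroom_gb
    return needed <= gpu_memory_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weight_memory_gb(109, bits)
        print(f"Scout at {bits}-bit: {size:.1f} GB, fits 80 GB: {fits_on_gpu(109, bits)}")
```

Under these assumptions, Scout's weights take about 54.5 GB at 4-bit and fit on an 80 GB card, while the 8-bit (109 GB) and 16-bit (218 GB) versions do not.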
Maverick is the strongest open-weight model on hard reasoning tasks at the end of 2025. On chain-of-thought problems and structured multi-step reasoning, it's the open-side model to reach for. Scout is the workhorse, trading some peak capability for far more accessible hardware requirements. The instruction-tuned variants of both are good, if slightly less refined than the base models, a pattern consistent with how Meta has been tuning lately.
The license is the Llama 4 Community License: permissive for almost everyone, with a clause requiring a separate license from Meta for services with more than 700M monthly active users. That clause is irrelevant to a small team or a solo developer. At a large company, read the license carefully against the specific deployment context.
Mistral Large 2 is the current production release as of May 2026, but Mistral has been hinting at a successor for months. By the time you read this, there may be a Large 3 or equivalent. The methodology here applies regardless: score the new model against the same tests when it ships.
Mistral Large 2
Mistral released Large 2 in July 2024 at 123B dense parameters, per Mistral's Large 2 announcement. Mistral is still the open-weights lab with the strongest house style: a clean preference for structured output and a willingness to commit to opinions instead of hedging endlessly. Its European-language work is clearly stronger than the alternatives. The context window sits at 128K tokens, but the model uses what context it has unusually well.
The license is the Mistral Research License. Permissive for research and personal use, with separate commercial terms required for paid deployments. It's not as clean as Apache 2.0, but the terms are straightforward and predictable. If your deployment is internal and non-commercial, you can use Mistral Large 2 today without further negotiation. For commercial use, contact Mistral.
Qwen 3
Alibaba released Qwen 3 in April 2025 across several variants, with the official rundown at qwen.ai and model cards hosted on Hugging Face under the Qwen organization. The two worth your attention are the 72B dense model and the 235B MoE that activates about 22B parameters per token. Qwen 3 is the strongest open-weight model on Chinese-language work, and one of the better ones on Arabic and Japanese. The code understanding is competitive with mid-tier closed models in a way that surprises anyone who only knows Qwen by its earlier reputation.
Most variants ship under Apache 2.0, the cleanest license in the lineup. The model's instruction-following tends to drift back to its preferred output shape after a few turns of conversation, which is a real limitation in agentic workflows. On single-shot or short-conversation use, though, the quality matches or beats the alternatives across most categories.
DeepSeek-V3.1
DeepSeek released V3.1 in late 2025 as a refinement of the V3 line, with releases and docs at deepseek.com. It is a 671B-parameter mixture-of-experts model that activates roughly 37B parameters per forward pass. DeepSeek has built the most aggressive open-weights story of any current lab: detailed technical reports, model cards that publish the numbers instead of marketing language, and hosted-endpoint pricing well below the Western alternatives.
For coding and math, DeepSeek-V3.1 is competitive with Claude Opus 4.7 on isolated tasks, and its reasoning quality on math problems is the strongest in the open-weights field. The English-language writing sits at the top tier too. The weaknesses are tool use, which is less reliable than the closed alternatives, and safety-tuning depth: refusals are clearly lighter than what Western users may expect from a frontier model.
The license is the MIT-style DeepSeek License. Permissive with use-case restrictions worth reading if the deployment touches anything sensitive.
Three places where open has caught up
In three categories the open-weight tier is close enough to the closed models that capability shouldn't decide the choice. License, cost, and deployment preferences should.
The first is general knowledge and conversational reasoning at typical lengths. The top open-weight models are within striking distance of the closed frontier on chat-style use, factual questions, and structured reasoning that fits in a single context window. The leaderboards capture this accurately, even if they miss the categories further down. For more on the leaderboard problem, see why benchmarks stopped telling you anything.
Code generation on isolated tasks is the second. Given a self-contained programming problem with clear requirements, DeepSeek-V3.1 and Qwen 3 produce output that matches the closed models in quality most of the time. The gap only shows up at architectural scale, on multi-file refactors and the design decisions that span a production codebase. For the bread-and-butter task of writing a competent function, the open models are good enough.
The third is multilingual capability in high-resource languages. The top open models compete strongly across European languages, Chinese, Japanese, and increasingly Arabic, and Qwen 3 specifically pushes the Chinese frontier ahead of any closed model you can buy. For organizations doing serious multilingual work, the open-weight tier has become a genuine first choice rather than a fallback.
The capabilities open weights are slowest to match are exactly the ones the closed labs poured the most engineering into. The hardest gaps to close are the ones worth the most money.
Two places where closed still wins
In two categories the open-weight tier is clearly behind the closed alternatives. For serious production deployments in either, stick with closed.
The first is long-context retrieval at extreme scale. The closed models (Claude Opus 4.7, GPT-5, Gemini 3.5 Flash, Gemini 3.1 Pro) have put enormous engineering effort into making their million-token contexts usable: recall stays high, hallucinations stay low, and the model will quote rather than summarize when asked. Open-weight models with similar nominal context windows show clear drops past the 500K-token mark. Recall falls off, false synthesis creeps in, and the gap to closed-model performance widens with every additional 100K tokens of input.
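Recall at depth is straightforward to probe yourself with a needle-in-a-haystack test: bury one fact at a chosen position in filler text and ask for it back. The sketch below is a minimal, hypothetical version. The function names, the filler sentence, and the stand-in `truncating_model` (which simply ignores everything past its first 2,000 characters to mimic recall falling off) are assumptions for illustration; a real test would call the model under evaluation and use far longer inputs.

```python
from typing import Callable

Model = Callable[[str], str]  # assumed interface: prompt in, response out


def build_haystack(needle: str, filler: str, total_sentences: int, depth: float) -> str:
    """Place the needle sentence at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * total_sentences
    sentences.insert(round(depth * total_sentences), needle)
    return " ".join(sentences)


def recall_at_depths(
    model: Model,
    needle: str,
    question: str,
    expected: str,
    depths: list[float],
    total_sentences: int,
    filler: str = "The sky was grey that day.",
) -> dict[float, bool]:
    """For each depth, report whether the model's answer contains the expected fact."""
    results: dict[float, bool] = {}
    for depth in depths:
        haystack = build_haystack(needle, filler, total_sentences, depth)
        prompt = f"{haystack}\n\nQuestion: {question}"
        results[depth] = expected in model(prompt)
    return results


def truncating_model(prompt: str) -> str:
    """Stand-in that only 'sees' the first 2,000 characters of its input."""
    visible = prompt[:2000]
    return "4417" if "4417" in visible else "not found"


if __name__ == "__main__":
    print(recall_at_depths(
        truncating_model,
        needle="The vault code is 4417.",
        question="What is the vault code?",
        expected="4417",
        depths=[0.0, 0.5, 1.0],
        total_sentences=200,
    ))
```

Sweeping both depth and total length gives the recall curve that nominal context-window figures do not show.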
The second is reliable tool use and agent behavior. The closed labs have spent the better part of a year tuning their frontier models to behave consistently inside agent loops: call this tool, parse the response, decide the next action, recover gracefully from errors. Open-weight models can do all of this in principle, but in practice they need a lot more scaffolding to stay on task and recover from tool failures without getting stuck. For any production workflow that involves multi-step tool use, the closed models stay clearly ahead.
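The scaffolding in question usually amounts to a bounded loop that feeds tool errors back to the model rather than crashing. The sketch below shows one minimal shape such a loop could take. Everything in it is hypothetical: the `Policy` interface stands in for a model deciding the next action, `make_flaky_lookup` is a toy tool that fails once, and `scripted_policy` is a hard-coded substitute for real model output.

```python
from typing import Callable

Tool = Callable[[str], str]
# A policy maps the transcript so far to ("call", tool_name, argument)
# or ("finish", final_answer, ""). In practice this would wrap a model.
Policy = Callable[[list[str]], tuple[str, str, str]]


def run_agent(policy: Policy, tools: dict[str, Tool], max_steps: int = 8) -> str:
    """Run a bounded tool loop, recording errors so the policy can recover."""
    transcript: list[str] = []
    for _ in range(max_steps):
        kind, name, arg = policy(transcript)
        if kind == "finish":
            return name
        tool = tools.get(name)
        if tool is None:
            # Surface the mistake to the policy instead of aborting the run.
            transcript.append(f"error: unknown tool {name!r}")
            continue
        try:
            transcript.append(f"result: {tool(arg)}")
        except Exception as exc:
            transcript.append(f"error: {exc}")
    raise RuntimeError("agent exceeded max_steps without finishing")


def make_flaky_lookup() -> Tool:
    """Toy tool that times out on its first call, then succeeds."""
    calls = {"n": 0}

    def lookup(key: str) -> str:
        calls["n"] += 1
        if calls["n"] == 1:
            raise TimeoutError("lookup timed out")
        return f"{key}=42"

    return lookup


def scripted_policy(transcript: list[str]) -> tuple[str, str, str]:
    """Stand-in for a model: retry after errors, finish on the first result."""
    if transcript and transcript[-1].startswith("result: "):
        return ("finish", transcript[-1].removeprefix("result: "), "")
    return ("call", "lookup", "answer")


if __name__ == "__main__":
    print(run_agent(scripted_policy, {"lookup": make_flaky_lookup()}))
```

The step cap is the part that keeps a confused model from looping forever; how many retries to allow is a judgment call per workload.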
Llama 4 Maverick: 400B, Community License · Reasoning
Mistral Large 2: 123B, Research License · EU langs
Qwen 3 235B MoE: 235B, Apache 2.0 · Multilingual
DeepSeek-V3.1: 671B, MIT · Code + math
- Feb 2024: Mistral Large. First serious open-weight competitor to GPT-4.
- Jul 2024: Llama 3.1 405B. Meta's first frontier-class open model.
- Dec 2024: DeepSeek-V3. Open MoE that closed the cost gap.
- Apr 2025: Llama 4 Maverick / Scout. Frontier reasoning + 10M context tier.
- Apr 2025: Qwen 3 235B. Apache-licensed, strong Arabic and Asian language support.
- Dec 2025: DeepSeek-V3.1. Refinement of V3, even tighter code+math benchmarks.
The comparison table
| Model | Parameters | License | Best at | Avoid for |
|---|---|---|---|---|
| Llama 4 Maverick | 400B MoE | Llama 4 Community | Hard reasoning, top open tier | Agent loops, long-doc retrieval |
| Llama 4 Scout | 109B MoE | Llama 4 Community | Single-GPU deployment | Anything needing top accuracy |
| Mistral Large 2 | 123B dense | Mistral Research | European languages, prose style | Long context, multi-file code |
| Qwen 3 235B MoE | 235B (22B active) | Apache 2.0 | Chinese, multilingual, code | Strict format compliance |
| DeepSeek-V3.1 | 671B (37B active) | MIT-style | Code, math, cost-sensitive use | Safety-critical applications |
Granite (IBM's openly-licensed line) and Phi (Microsoft's small-model family) aren't in this survey. Granite is solid for enterprise text work but doesn't compete at the frontier. Phi gets its own piece in the small-model review.
The decision rule
If you're building anything that has to run inside a regulated environment with no data leaving your network, open weights are effectively the only option on the table. Whatever capability gap exists is worth absorbing to avoid the compliance problem of sending data to a closed API.
If the unit economics of your workload are dominated by per-token cost (high-volume inference, batch document processing, anything serving thousands of requests per minute), DeepSeek-V3.1 on a hosted endpoint or Qwen 3 on your own hardware will beat the closed alternatives by an order of magnitude on dollars per query.
If your workload depends on the model reliably calling tools, navigating agent loops, or maintaining coherence across hundreds of thousands of tokens, stay on closed. The gap is genuine and it isn't closing as fast as the headline capability gap.
When there's no strong prior either way, prototype on a closed model for development speed, then re-test the production path on Qwen 3 235B or DeepSeek-V3.1 before scaling. Often the open model will work fine and save you money that adds up over time. Often enough, you'll hit a specific failure mode that justifies the closed-model premium. Which way it breaks depends on the use case more than on any rule of thumb.
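The rule above can be written down as a small function, which is sometimes useful for making the precedence explicit. This is a hypothetical encoding, not a formal policy: the `Workload` fields and the order in which conflicting conditions are resolved (compliance first, then capability gaps, then cost) are assumptions layered on top of the prose.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Illustrative workload traits; field names are assumptions."""
    data_must_stay_on_network: bool = False
    needs_reliable_tool_use: bool = False
    needs_extreme_long_context: bool = False
    cost_dominated: bool = False


def recommend(w: Workload) -> str:
    """Apply the decision rule; the precedence order is an assumed tie-break."""
    if w.data_must_stay_on_network:
        # Compliance outranks any capability gap.
        return "open weights, self-hosted"
    if w.needs_reliable_tool_use or w.needs_extreme_long_context:
        return "closed model"
    if w.cost_dominated:
        return "open weights: DeepSeek-V3.1 hosted or Qwen 3 self-hosted"
    return "prototype on closed, then re-test on Qwen 3 235B or DeepSeek-V3.1"


if __name__ == "__main__":
    print(recommend(Workload(cost_dominated=True)))
    print(recommend(Workload(needs_reliable_tool_use=True, cost_dominated=True)))
```

A workload that is both cost-dominated and tool-heavy lands on closed here only because of the assumed ordering; teams with different priorities would reorder the checks.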
Open-weight models in late 2025 are good enough to be the right answer for most workloads that don't depend on long-context retrieval at extreme scale or on reliable agent behavior. The capability gap has closed on the bread-and-butter work: conversational use, isolated code generation, high-resource multilingual reasoning. And the license terms are workable: Qwen's Apache 2.0 is clean enough for confident commercial deployment, and Mistral's terms are predictable once a commercial agreement is in place.
The two categories where closed still leads happen to be where most production money goes, and that overlap is no coincidence. The closed labs have prioritized the workflows that generate the highest-value revenue, and the open-weights labs have followed at a small but persistent distance. Whether that gap closes in 2026 depends mostly on whether the open-weights labs decide to focus on the same engineering work the closed labs have been doing for a year, which isn't yet clear.
If you have to pick one open-weight model for 2026 deployment, the default is Qwen 3 235B MoE. Its Apache 2.0 license, multilingual range, code competence, and architectural maturity make it the most versatile of the four. Others win on specifics: DeepSeek-V3.1 on raw cost-performance per the public reports, Llama 4 Maverick at the top end of reasoning, Mistral Large 2 on European languages and the cleanness of its prose. Let the workload pick, not the brand.