1,200 support emails. 18 categories. Phi-4 mini classified them locally at 94% accuracy. Claude Sonnet 4.7 over the API scored 96% on the same set. Two-point gap. The local side cost nothing. The API side would've cost about $16 a day at that volume.
One comparison isn't the case for small language models. The case is the dozens of similar comparisons that play out the same way in real workloads every day. For workloads that involve classification, extraction, routing, or anything else with a tight latency budget and a forgiving accuracy ceiling, Phi-4 mini and Gemma 3 deserve a careful look the frontier-model discourse rarely gives them.
This piece covers the best sub-10B-parameter models at the start of 2026, plus a worked example with real timing data. These are the models worth building a production pipeline on when cost, privacy, or latency dominate the constraints. They're the wrong models when the workload depends on multi-step reasoning or wide world knowledge.
(A side note before the category boundaries: I almost included Mistral 7B in this piece because it still shows up in production at companies running self-hosted infrastructure. It got cut because the current open-weight tier — Phi-4 mini, Gemma 3 9B, Qwen 3 7B — has materially better quality and the same hardware footprint. If you're still running Mistral 7B in 2026 and the work is going well, you're probably fine. But the upgrade path is real.)
What "small" means here
Anything under 10B parameters. Microsoft ships Phi-4 (Azure blog) at 14B and Phi-4 mini at 3.8B, with the mini model card on Hugging Face. Google's Gemma 3 has a 9B and a 27B. Qwen 3 (qwen.ai) has small variants down to 1.5B. The 4B-to-9B band is the sweet spot. It fits in 16GB of RAM with sensible quantization and runs fast enough on a recent laptop to feel interactive. For the hardware side of running this yourself, see running models on your own machine.
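The 16GB claim is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below estimates memory for the weights alone as parameters times bits per weight. The function name and the choice of 16-, 8-, and 4-bit widths are illustrative assumptions, not tied to any particular runtime.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Parameter counts as quoted in the text; bit widths are illustrative.
for name, params in [("Phi-4 mini", 3.8), ("Gemma 3 9B", 9.0)]:
    for bits in (16, 8, 4):
        size = weight_memory_gb(params, bits)
        print(f"{name} at {bits}-bit: about {size:.1f} GB of weights")
```

By this estimate a 9B model needs about 18 GB of weights at 16-bit, which does not fit in 16GB, and about 4.5 GB at 4-bit, which does. The figure covers weights only. The KV cache, activations, and the runtime itself add to it, so treat the result as a lower bound.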
On the frontier scale, a 4B model is a rounding error. Vendors don't publish parameter counts for models like Claude Opus 4.7, but they are widely assumed to be far larger and far more expensive to train. Catching up to the frontier isn't the goal. Competence on a narrow band of tasks is: tasks where frontier capability isn't needed and where small means much faster, much cheaper, and easier to control.
I went into this piece expecting Gemma 3 9B to clearly beat Phi-4 mini on multilingual work — that's the marketing framing for both products. Gemma did win the multilingual scoring. The gap was smaller than I expected. Phi-4 mini's performance on Spanish and French structured-extraction tasks was close enough to make the size advantage matter more than I'd predicted.
Three workloads where small models beat the frontier
Classification and extraction. The email example above generalizes. For classification and structured extraction over well-defined categories, small models land within a few points of frontier accuracy (94% versus 96% in the email test) at a small fraction of the cost. The two-point gap isn't worth the API bill.
Routing and triage. The model decides where a request should go. Which API to call, which downstream model to invoke, which template to apply. Small models excel here because the task is simple, the latency budget is tight, and the cost of getting it wrong is recoverable. A small model in the router slot lets you save the frontier models for the requests that actually need them.
On-device or private inference. Anything where the data can't leave the device — health records, internal corporate documents, anything covered by a strict residency rule — and where the capability ceiling is acceptable. A 9B Gemma 3 running locally on a corporate laptop is more useful than a frontier model your team isn't allowed to use.
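The first two workloads above can be sketched together: a small model picks a label, and a dispatch table decides what happens next. This is a minimal sketch, not the pipeline from the email test. The category names, the `generate` callable (standing in for whatever local inference call you use), and the handler functions are all hypothetical.

```python
from typing import Callable

# Hypothetical labels; a real deployment would use its own category list.
CATEGORIES = ["billing", "refund", "login_issue", "other"]


def build_prompt(text: str, categories: list[str]) -> str:
    """Ask for exactly one label so the reply is easy to validate."""
    labels = ", ".join(categories)
    return (
        f"Classify the message into exactly one of: {labels}.\n"
        "Reply with the category name only.\n\n"
        f"Message:\n{text}"
    )


def classify(
    text: str,
    generate: Callable[[str], str],
    categories: list[str] = CATEGORIES,
    fallback: str = "other",
) -> str:
    """Return a known category, or the fallback if the reply is off-list."""
    reply = generate(build_prompt(text, categories))
    answer = reply.strip().strip(".\"'").lower()
    for category in categories:
        if answer == category.lower():
            return category
    return fallback


def make_router(
    generate: Callable[[str], str],
    handlers: dict[str, Callable[[str], str]],
    escalate: Callable[[str], str],
) -> Callable[[str], str]:
    """Build a function that classifies a request and dispatches it."""

    def handle(text: str) -> str:
        label = classify(text, generate)
        # Labels without a cheap handler are escalated to the larger model.
        return handlers.get(label, escalate)(text)

    return handle


# Stub standing in for a local model call (assumption for the demo).
router = make_router(
    generate=lambda prompt: " Refund.\n",
    handlers={"refund": lambda text: "refund-template"},
    escalate=lambda text: "sent to frontier model",
)
print(router("I was charged twice, please refund me."))
```

Validating the reply against a fixed label list is the design choice that matters here: an off-list answer falls back to a safe default instead of propagating free text downstream, which keeps a wrong answer recoverable.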
Where small loses
Multi-step reasoning. Ask a 4B model to chain three or four logical steps and the failure rate jumps sharply. The model can do each step individually. It just loses coherence across the chain. Frontier models hold the chain together more reliably, which matters a lot for any task that requires planning.
World knowledge. Small models simply know less. Ask Phi-4 mini an obscure question about regional tax regulations or the history of a niche programming language, and the answer will be confident, smooth, and often wrong. That's the area where parameter count maps most directly to knowledge breadth. No clever workaround.
Long-context retrieval. Most small models advertise 128K-token windows, but retrieval quality at the high end of that window is far worse than what frontier models manage. For any work that needs deep reasoning over a long document, a small model is the wrong tool. The context-window piece covers the long-context picture in detail.
The 4B-to-9B sweet spot isn't the frontier. It's the workhorse. The frontier models are for the problems no workhorse can carry.
The two worth picking
Phi-4 mini (Microsoft, released late 2025) at 3.8B parameters. The strongest small model on structured reasoning and instruction-following at that size. Microsoft's training-data strategy (synthetic data filtered for educational value) is a real edge on tasks where the input looks like a textbook problem or a structured business document. The license (MIT) is the cleanest available.
Then there's the larger Phi-4 at 14B. It's not in the "small" bucket strictly, but it sits on the boundary and is worth pairing with mini if your workload mixes simple and structured-reasoning tasks. Same MIT license.
Gemma 3 9B (Google, released October 2025). Best raw multilingual capability in the small-model class, including clearly better Arabic than expected. The Gemma Terms license is permissive enough for commercial use with sensible restrictions. The instruction-tuned variant follows specified formats more reliably than the base.
| Model | Params | License | Best at |
|---|---|---|---|
| Phi-4 mini | 3.8B | MIT | Classification, extraction, structured tasks |
| Phi-4 | 14B | MIT | Structured reasoning at the edge of "small" |
| Gemma 3 9B | 9B | Gemma Terms | Multilingual workhorse, on-device chat |
| Qwen 3 7B | 7B | Apache 2.0 | Code in small footprint, Chinese |
| Llama 4 8B | 8B | Llama 4 Community | General-purpose, ecosystem familiarity |
Picks by task:
| Task | Model | Notes |
|---|---|---|
| Classification | Phi-4 mini | Email, support tickets |
| Extraction | Phi-4 mini | Structured fields from text |
| Routing | Phi-4 mini | Decide which API to call |
| Summarization | Phi-4 | Short docs, single pass |
| Multilingual | Gemma 3 9B | Arabic-decent |
| Code helper | Qwen 3 7B | Coding, small footprint |

[Figure: two-tier routing flow. Input: email, ticket, document, query. Phi-4 mini classifies and decides the path. ~90% of cases: zero API cost. ~10% of cases: pay for what matters.]
Worth flagging: the small-model accuracy numbers are workload-specific. For inbox triage with 18 well-defined categories, Phi-4 mini hits 94%. For free-form sentiment analysis on social media text, where the labels hinge on fine-grained nuance, I've seen the same model drop to 78%. Treat the 94% number as a ceiling, not a floor.
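Because the numbers move this much between workloads, it is worth measuring accuracy on your own labeled sample before committing. The sketch below computes overall and per-category accuracy against human labels. The function name and the toy labels are illustrative assumptions, not the harness used for the figures above.

```python
from collections import defaultdict


def accuracy_report(
    predictions: list[str], gold: list[str]
) -> tuple[float, dict[str, float]]:
    """Return overall accuracy and accuracy per gold category."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("need equal-length, non-empty prediction and gold lists")
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for predicted, actual in zip(predictions, gold):
        totals[actual] += 1
        if predicted == actual:
            hits[actual] += 1
    overall = sum(hits.values()) / len(gold)
    per_category = {label: hits[label] / totals[label] for label in totals}
    return overall, per_category


# Toy data for illustration only.
overall, per_category = accuracy_report(
    predictions=["billing", "refund", "billing", "refund"],
    gold=["billing", "refund", "refund", "refund"],
)
print(f"overall: {overall:.2f}")
for label, score in sorted(per_category.items()):
    print(f"  {label}: {score:.2f}")
```

The per-category breakdown is the useful part. A headline number can hide one or two categories where the small model fails badly, and those are the ones worth escalating to a larger model.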
A concrete production scenario
A real example to make the trade less abstract. An inbox-classification pipeline that previously ran through Claude Sonnet 4.7 got rebuilt to run on Phi-4 mini locally. The setup classifies incoming emails into 18 priority categories.
Before-and-after numbers:
| Metric | Sonnet via API | Phi-4 mini local |
|---|---|---|
| Cost per email | ~$0.004 | ~$0 (electricity) |
| End-to-end latency | ~800 ms | ~60 ms |
| Accuracy vs. human labels | 96% | 94% |
| Data leaves premises | Yes | No |
Accuracy dropped two points. Cost dropped to basically zero. Latency dropped by more than an order of magnitude. The data residency story changed from "leaves the network" to "stays put."
For this workload, the trade is obvious.
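The table's per-email figures scale in a straightforward way. The sketch below turns them into a daily API bill and a latency ratio. The 4,000 emails per day is an assumed volume chosen for illustration, and the local side is treated as zero marginal cost, as in the table.

```python
# Figures from the before-and-after table above.
API_COST_PER_EMAIL_USD = 0.004
API_LATENCY_MS = 800
LOCAL_LATENCY_MS = 60


def daily_api_cost(emails_per_day: int) -> float:
    """API spend per day at the table's per-email cost."""
    return emails_per_day * API_COST_PER_EMAIL_USD


def latency_ratio() -> float:
    """How many times faster the local path is, end to end."""
    return API_LATENCY_MS / LOCAL_LATENCY_MS


emails_per_day = 4_000  # assumption for illustration
print(f"API cost: about ${daily_api_cost(emails_per_day):.2f} per day")
print(f"Local path is about {latency_ratio():.0f}x faster")
```

At that assumed volume the API side comes to roughly $16 a day, and the latency ratio of about 13x is where "more than an order of magnitude" comes from.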
For a sales-lead routing system where a misclassification has dollar consequences, the trade would tip the other way and the frontier API would stay. Small models open a different operating point on the cost-accuracy curve. The right question isn't which model is better. It's which operating point fits the workload. For the broader pricing picture across workloads, see price per use case.
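One way to make "which operating point fits" concrete is to put a dollar figure on a misclassification and compare expected cost per item. This is a simplified sketch, assuming a single average cost per error and using the accuracy and cost figures from the table. The function names are illustrative.

```python
def expected_cost(inference_cost: float, accuracy: float, cost_per_error: float) -> float:
    """Expected cost per item: inference plus the average cost of mistakes."""
    return inference_cost + (1 - accuracy) * cost_per_error


def break_even_error_cost(
    cost_a: float, accuracy_a: float, cost_b: float, accuracy_b: float
) -> float:
    """Error cost at which the pricier, more accurate option A matches option B."""
    if accuracy_a == accuracy_b:
        raise ValueError("accuracies are equal; there is no break-even point")
    return (cost_a - cost_b) / (accuracy_a - accuracy_b)


# Table figures: API at ~$0.004 and 96%, local at ~$0 and 94%.
threshold = break_even_error_cost(0.004, 0.96, 0.0, 0.94)
print(f"break-even cost per misclassification: about ${threshold:.2f}")

for error_cost in (0.05, 5.00):  # illustrative values on either side
    api = expected_cost(0.004, 0.96, error_cost)
    local = expected_cost(0.0, 0.94, error_cost)
    cheaper = "local" if local < api else "API"
    print(f"error cost ${error_cost:.2f}: {cheaper} is cheaper per item")
```

Under these assumptions the crossover sits at roughly $0.20 per misclassified item. A mis-filed support email likely costs less than that, and a misrouted sales lead likely costs far more, which matches the intuition in the text.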
One gap worth naming. None of these small models were fine-tuned on workload-specific data, which would probably close part of the accuracy gap on the support-email task. Possibly enough to recover the two-point drop. The multimodal variants weren't tested here either. Both are open questions for follow-up.
Small language models aren't the future of frontier capability. They're the future of production AI infrastructure.
The workloads they're winning — classification, extraction, routing, on-device inference — are exactly the workloads that account for most of the API spend in real businesses. A company running millions of inference calls a day through a frontier model when 90% of those calls could be served by a 4B model is leaving real money on the table.
The right architecture for any organization with serious volume is two-tier. A frontier model for the requests that justify it. A small model (fine-tuned where useful) in front of everything else. The cost dynamics and the latency wins are too big to ignore once volume is real.
For English-only structured work, go with Phi-4 mini. For multilingual work, go with Gemma 3 9B. Both are good enough that the question isn't whether to use them. It's where in the stack to use them. The frontier models keep the prestige. The small models do the work.