Review · February 2026

Small language models, in working use

Phi-4 mini, Gemma 3, and the workloads where sub-10B parameter models quietly win.

Updated May 25, 2026

Sweet spot: 4–9B (parameter range tested)

RAM needed: 16GB (for 4-bit quantization)

Accuracy ceiling: 96% (vs. human labels, classification)

Cost per task: $0.01 (electricity only, self-hosted)

1,200 support emails. 18 categories. Phi-4 mini classified them locally at 94% accuracy. Claude Sonnet 4.7 over the API scored 96% on the same set. Two-point gap. The local side cost only electricity. The API side would've cost about $16 a day at that volume.

One comparison isn't the case for small language models. The case is the dozens of similar comparisons that play out the same way in real workloads every day. For workloads that involve classification, extraction, routing, or anything else with a tight latency budget and a forgiving accuracy ceiling, Phi-4 mini and Gemma 3 deserve a careful look the frontier-model discourse rarely gives them.

This piece covers the best sub-10B-parameter models at the start of 2026, plus a worked example with real timing data. These are the models worth building a production pipeline on when cost, privacy, or latency dominate the constraints. They're the wrong models when the workload depends on multi-step reasoning or wide world knowledge.

(A side note before the category boundaries: I almost included Mistral 7B in this piece because it still shows up in production at companies running self-hosted infrastructure. It got cut because the current open-weight tier — Phi-4 mini, Gemma 3 9B, Qwen 3 7B — has materially better quality and the same hardware footprint. If you're still running Mistral 7B in 2026 and the work is going well, you're probably fine. But the upgrade path is real.)

What "small" means here

Anything under 10B parameters. Microsoft ships Phi-4 (Azure blog) at 14B and Phi-4 mini at 3.8B, with the mini model card on Hugging Face. Google's Gemma 3 has a 9B and a 27B. Qwen 3 (qwen.ai) has small variants down to 1.5B. The 4B-to-9B band is the sweet spot. It fits in 16GB of RAM with sensible quantization and runs fast enough on a recent laptop to feel interactive. For the hardware side of running this yourself, see running models on your own machine.
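The 16GB claim is easy to sanity-check with back-of-envelope arithmetic: weights take roughly parameters × bits per weight ÷ 8 bytes. The sketch below is my own estimate, not a vendor formula. It counts weights only and ignores the KV cache, activations, and runtime overhead, so treat the results as a lower bound. The function name is made up for illustration.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes).

    Assumption: every weight is stored at the same bit width. Real
    quantization formats add a little extra for scales and metadata.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Parameter counts are the ones quoted in this article.
for name, size_b in [("Phi-4 mini", 3.8), ("Gemma 3 9B", 9.0), ("Phi-4", 14.0)]:
    print(f"{name}: ~{weight_memory_gb(size_b, 4.0):.1f} GB of weights at 4 bits")
```

At a flat 4 bits that works out to about 1.9 GB for Phi-4 mini and 4.5 GB for a 9B model. Mixed formats such as Q4_K_M average somewhat more than 4 bits per weight, so real files run a bit larger, but both still leave plenty of headroom on a 16GB machine.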

On the frontier scale, a 4B model is a rounding error. Claude Opus 4.7 is several hundred times larger by parameter count and many orders of magnitude larger by training cost. Catching up to the frontier isn't the goal. Competence on a narrow band of tasks is. Tasks where frontier capability isn't needed and where small means way faster, way cheaper, and easier to control.

I went into this piece expecting Gemma 3 9B to clearly beat Phi-4 mini on multilingual work — that's the marketing framing for both products. Gemma did win the multilingual scoring. The gap was smaller than I expected. Phi-4 mini's performance on Spanish and French structured-extraction tasks was close enough to make the size advantage matter more than I'd predicted.

Three workloads where small models beat the frontier

Classification and extraction. The email example above generalizes. For classification and structured extraction, small models land within a few points of frontier accuracy (94% against 96% in the email test) at roughly one-tenth the cost. A two-point gap isn't worth ten times the API bill.
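Part of what makes classification forgiving for a small model is that the output space is closed: the answer is one of a fixed set of labels, so anything else can be caught and sent to review. A minimal sketch of that pattern follows. The label names are placeholders (this piece doesn't list the 18 real categories), and `generate` is a hypothetical callable standing in for whatever local inference call you use.

```python
from typing import Callable, Sequence


def build_prompt(email_text: str, labels: Sequence[str]) -> str:
    """Build a prompt that asks for exactly one label from a closed set."""
    options = "\n".join(f"- {label}" for label in labels)
    return (
        "Classify the email into exactly one of these categories.\n"
        f"{options}\n\n"
        f"Email:\n{email_text}\n\n"
        "Answer with the category name only."
    )


def classify(
    email_text: str,
    labels: Sequence[str],
    generate: Callable[[str], str],  # hypothetical: prompt in, raw model text out
    fallback: str = "needs_review",
) -> str:
    """Map raw model output onto the closed label set, or fall back."""
    raw = generate(build_prompt(email_text, labels))
    answer = raw.strip().strip(".").lower()
    by_lower = {label.lower(): label for label in labels}
    return by_lower.get(answer, fallback)


# Placeholder labels and a stub model, for illustration only.
LABELS = ["billing", "outage", "feature_request"]
print(classify("My invoice is wrong", LABELS, lambda prompt: " Billing.\n"))
print(classify("Hello?", LABELS, lambda prompt: "I think this is a greeting"))
```

Anything the model says that isn't an exact label lands in the fallback bucket instead of quietly polluting downstream counts.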

Routing and triage. The model decides where a request should go. Which API to call, which downstream model to invoke, which template to apply. Small models excel here because the task is simple, the latency budget is tight, and the cost of getting it wrong is recoverable. A small model in the router slot lets you save the frontier models for the requests that actually need them.

On-device or private inference. Anything where the data can't leave the device — health records, internal corporate documents, anything covered by a strict residency rule — and where the capability ceiling is acceptable. A 9B Gemma 3 running locally on a corporate laptop is more useful than a frontier model your team isn't allowed to use.

Small models, email classification accuracy (percent agreement against human labels, 1,200-email test set)
Model	Accuracy
Phi-4 mini (3.8B)	94%
Phi-4 (14B)	95%
Gemma 3 9B	93%
Qwen 3 7B	92%
Claude Sonnet (API)	96%
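For reference, "percent agreement against human labels" is the simplest possible metric: the share of items where the model's label matches the human one. A minimal sketch, with a function name of my own choosing:

```python
from typing import Sequence


def percent_agreement(predicted: Sequence[str], human: Sequence[str]) -> float:
    """Percentage of items where the predicted label equals the human label."""
    if len(predicted) != len(human):
        raise ValueError("predicted and human label lists must be the same length")
    if not human:
        raise ValueError("need at least one labeled example")
    matches = sum(p == h for p, h in zip(predicted, human))
    return 100.0 * matches / len(human)


# Synthetic labels sized like the test set: 1,128 matches out of 1,200.
human = ["a"] * 1200
predicted = ["a"] * 1128 + ["b"] * 72
print(percent_agreement(predicted, human))
```

On a 1,200-email set, each percentage point is 12 emails, so the gap between 94% and 96% is 24 emails.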

Where small loses

Multi-step reasoning. Ask a 4B model to chain three or four logical steps and the failure rate jumps sharply. The model can do each step individually. It just loses coherence across the chain. Frontier models hold the chain together more reliably, which matters a lot for any task that requires planning.

World knowledge. Small models simply know less. Ask Phi-4 mini an obscure question about regional tax regulations or the history of a niche programming language, and the answer will be confident, smooth, and often wrong. That's the area where parameter count maps most directly to knowledge breadth. No clever workaround.

Long-context retrieval. Most small models advertise 128K-token windows, but retrieval quality at the high end of that window is far worse than the frontier models'. For any work that needs deep reasoning over a long document, a small model is the wrong tool. The context-window piece covers the long-context picture in detail.

The 4B sweet spot isn't the frontier. It's the workhorse. The frontier models are for the problems no workhorse can carry.

The two worth picking

Phi-4 mini (Microsoft, released late 2025) at 3.8B parameters. The strongest small model on structured reasoning and instruction-following at that size. Microsoft's training-data strategy (synthetic data filtered for educational value) is a real edge on tasks where the input looks like a textbook problem or a structured business document. The license (MIT) is the cleanest available.

Then there's the larger Phi-4 at 14B. It's not in the "small" bucket strictly, but it sits on the boundary and is worth pairing with mini if your workload mixes simple and structured-reasoning tasks. Same MIT license.

Gemma 3 9B (Google, released October 2025). Best raw multilingual capability in the small-model class, including clearly better Arabic than expected. The Gemma Terms license is permissive enough for commercial use with sensible restrictions. The instruction-tuned variant follows specified formats more reliably than the base.

Small open-weight models, benchr survey, January 2026
Model	Params	License	Best at
Phi-4 mini	3.8B	MIT	Classification, extraction, structured tasks
Phi-4	14B	MIT	Structured reasoning at the edge of "small"
Gemma 3 9B	9B	Gemma Terms	Multilingual workhorse, on-device chat
Qwen 3 7B	7B	Apache 2.0	Code in small footprint, Chinese
Llama 4 8B	8B	Llama 4 Community	General-purpose, ecosystem familiarity

94% email classification accuracy (Phi-4 mini, local)

Task	Model	Notes
Classification	Phi-4 mini	Email, support tickets
Extraction	Phi-4 mini	Structured fields from text
Routing	Phi-4 mini	Decide which API to call
Summarization	Phi-4	Short docs, single pass
Multilingual	Gemma 3 9B	Decent Arabic
Code helper	Qwen 3 7B	Coding in a small footprint

1. Incoming work. Email, ticket, document, query.
2. Small-model routing. Phi-4 mini classifies and decides the path.
3. Simple? Handle locally. ~90% of cases. Zero API cost.
4. Complex? Escalate to Opus. ~10% of cases. Pay for what matters.
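The four steps above can be sketched as a single routing function. This is an illustration under assumptions, not the pipeline I ran. It assumes the local classifier returns a confidence score alongside its label, and it escalates on low confidence or on a label flagged as complex. The threshold, the label names, and both stub classifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Set, Tuple


@dataclass
class Outcome:
    label: str
    handled_by: str  # "local" or "frontier"


def route(
    item: str,
    classify_local: Callable[[str], Tuple[str, float]],  # returns (label, confidence)
    classify_frontier: Callable[[str], str],
    escalate_labels: Set[str],
    min_confidence: float = 0.8,  # assumed threshold; tune on your own data
) -> Outcome:
    """Classify locally first; escalate only when the local result looks risky."""
    label, confidence = classify_local(item)
    if label in escalate_labels or confidence < min_confidence:
        return Outcome(classify_frontier(item), "frontier")
    return Outcome(label, "local")


# Stubs standing in for real model calls.
def fake_local(item: str) -> Tuple[str, float]:
    return ("billing", 0.95) if "invoice" in item else ("other", 0.40)


def fake_frontier(item: str) -> str:
    return "legal_escalation"


print(route("Question about my invoice", fake_local, fake_frontier, {"legal_escalation"}))
print(route("Something long and ambiguous", fake_local, fake_frontier, {"legal_escalation"}))
```

The split between local and escalated traffic falls out of the threshold and the escalation list, so the ~90/10 ratio is something to measure on your own workload rather than assume.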

Worth flagging: the small-model accuracy numbers are workload-specific. For inbox triage with 18 well-defined categories, Phi-4 mini hits 94%. For free-form sentiment analysis on social media text, where the labels turn on fine-grained nuance, I've seen the same model drop to 78%. The 94% number is a ceiling, not a floor.

A concrete production scenario

A real example to make the trade less abstract. An inbox-classification pipeline that previously ran through Claude Sonnet 4.7 got rebuilt to run on Phi-4 mini locally. The setup classifies incoming emails into 18 priority categories.

Before-and-after numbers:

Sonnet via API vs. Phi-4 mini local, classification workload, January 2026
Metric	Sonnet via API	Phi-4 mini local
Cost per email	~$0.004	~$0 (electricity)
End-to-end latency	~800 ms	~60 ms
Accuracy vs. human labels	96%	94%
Data leaves premises	Yes	No

Accuracy dropped two points. Cost dropped to basically zero. Latency dropped by more than an order of magnitude. The data residency story changed from "leaves the network" to "stays put."
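The arithmetic behind those numbers is simple enough to check in a few lines. The constants below come straight from the table. The helper names are mine, and the per-email price is the table's approximate figure, so read the outputs as rough estimates.

```python
API_COST_PER_EMAIL = 0.004  # dollars, the "~$0.004" from the table
API_LATENCY_MS = 800
LOCAL_LATENCY_MS = 60


def daily_api_cost(emails_per_day: int, cost_per_email: float = API_COST_PER_EMAIL) -> float:
    """API spend per day at a given email volume."""
    return emails_per_day * cost_per_email


def implied_daily_volume(daily_cost: float, cost_per_email: float = API_COST_PER_EMAIL) -> float:
    """Email volume that would produce a given daily API bill."""
    return daily_cost / cost_per_email


speedup = API_LATENCY_MS / LOCAL_LATENCY_MS
print(f"Latency speedup: ~{speedup:.1f}x")
print(f"API cost for 1,200 emails: ${daily_api_cost(1200):.2f}")
print(f"Volume implied by $16/day: ~{implied_daily_volume(16.0):,.0f} emails")
```

The latency ratio comes out a little over 13x. At this per-email price, the 1,200-email test set would cost under $5 to run through the API, and a $16-a-day bill corresponds to roughly 4,000 emails a day.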

For this workload, the trade is obvious.

For a sales-lead routing system where a misclassification has dollar consequences, the trade would tip the other way and the frontier API would stay. Small models open a different operating point on the cost-accuracy curve. The right question isn't which model is better. It's which operating point fits the workload. For the broader pricing picture across workloads, see price per use case.

One gap worth naming. None of these small models were fine-tuned on workload-specific data, which would probably close part of the accuracy gap on the support-email task. Possibly enough to recover the two-point drop. The multimodal variants weren't tested here either. Both are open questions for follow-up.

Small language models aren't the future of frontier capability. They're the future of production AI infrastructure.

The workloads they're winning — classification, extraction, routing, on-device inference — are exactly the workloads that account for most of the API spend in real businesses. A company running millions of inference calls a day through a frontier model when 90% of those calls could be served by a 4B model is leaving real money on the table.
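To put a number on "real money," here's a sketch of the blended cost when a share of calls moves to a local model. The figures in the example are assumptions for illustration: one million calls a day, the ~$0.004 per-call API price from the classification table, a local cost of roughly zero, and 90% of calls served locally. Your per-call price will differ by workload.

```python
def blended_cost_per_call(frontier_cost: float, local_cost: float, local_share: float) -> float:
    """Average cost per call when local_share of calls run on the local model."""
    if not 0.0 <= local_share <= 1.0:
        raise ValueError("local_share must be between 0 and 1")
    return local_share * local_cost + (1.0 - local_share) * frontier_cost


def daily_savings(
    calls_per_day: int, frontier_cost: float, local_cost: float, local_share: float
) -> float:
    """Daily spend on an all-frontier setup minus daily spend on the two-tier setup."""
    all_frontier = calls_per_day * frontier_cost
    two_tier = calls_per_day * blended_cost_per_call(frontier_cost, local_cost, local_share)
    return all_frontier - two_tier


# Assumed inputs: 1M calls/day, $0.004 per API call, ~$0 local, 90% served locally.
print(f"${daily_savings(1_000_000, 0.004, 0.0, 0.9):,.0f} saved per day")
```

Under those assumptions the all-frontier bill is about $4,000 a day and the two-tier bill about $400, so the difference is roughly $3,600 a day.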

The right architecture for any organization with serious volume is two-tier. A frontier model for the requests that justify it. A small model (fine-tuned where useful) in front of everything else. The cost dynamics and the latency wins are too big to ignore once volume is real.

For English-only structured work, go with Phi-4 mini. For multilingual work, go with Gemma 3 9B. Both are good enough that the question isn't whether to use them. It's where in the stack to use them. The frontier models keep the prestige. The small models do the work.

One framing note before the conclusion: the recommendations here are based on the tests I ran in January 2026. The model landscape shifts fast, so re-test before relying on these conclusions past the next quarterly release.

Bottom line

For classification, extraction, routing, and any high-volume low-stakes workload, the small-model tier wins on cost and latency. Phi-4 mini at 3.8B is the English-only default. Gemma 3 9B is the multilingual default. The frontier models stay relevant for complex reasoning and long-context work where small models still trail.

Frequently asked

Are small language models good enough for production?

Yes, for the workloads they're good at: classification, extraction, routing, structured-output generation. On the email-classification test, Phi-4 mini hits 94% accuracy against human labels, two points below Claude Sonnet 4.7. That two-point gap doesn't justify ten times the API cost.

Which small model should I start with?

Phi-4 mini at 3.8B parameters (MIT license, Microsoft) for English-only structured work. Gemma 3 9B (Google) for multilingual workloads. Both run on 16GB of RAM with sensible quantization.

Can I run small models on a laptop?

Yes. Phi-4 mini at Q4_K_M quantization runs at 220+ tokens per second on an M3 Max and fits in under 3GB of RAM. A modern laptop with 16GB of RAM handles it comfortably while doing other work.

What's the accuracy gap between small models and frontier models?

On classification: 2 percentage points. On structured extraction: 3-5 points. On multi-step reasoning: 15-25 points — this is where small models fall apart and you need the frontier tier.

When should you NOT use a small language model?

Multi-step reasoning, long-context retrieval past 32K tokens, broad world knowledge, and anything where a wrong answer has dollar consequences. Use the frontier models for those calls; route the cheap workloads to small.

Changelog

May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
January 22, 2026 — Corrected Phi-4 variants list to reflect actual Microsoft release (mini 3.8B and Phi-4 14B; no 4B or 7B variants exist).
February 25, 2026 — Originally published.

References

Microsoft Azure, "Phi-4 announcement," azure.microsoft.com/en-us/blog/phi-4, accessed May 2026.
Microsoft, "Phi-4-mini-instruct model card," huggingface.co/microsoft/Phi-4-mini-instruct, accessed May 2026.
Google, "Gemma," ai.google.dev/gemma, accessed May 2026.
Alibaba, "Qwen," qwen.ai, accessed May 2026.
"Hugging Face Open LLM Leaderboard," huggingface.co/spaces/open-llm-leaderboard, accessed May 2026.