RAG won the architecture war in 2024. Most teams (likely yours) just haven't admitted it yet.
Take a pipeline that produces structured changelog entries from pull request descriptions. Base-model RAG hits format-valid output about 95% of the time. A fine-tune on 800 examples pushes it to 99.4% and makes the residual failures predictable enough for a validator to catch. Training cost about $180, and inference is now free per call. That's the kind of swap where fine-tuning earns its place — one of only three places in 2026 where it does.
The rest of the time, RAG wins — on cost, on operational simplicity, and on auditability. The pages that follow are an argument for that default, with the three exceptions named in full and the cost math behind each. For the broader cost picture across workloads, see the price-per-use-case table.
Early framing in the open-source discussion suggested this analysis would argue more strongly for fine-tuning than it does. The cases the community keeps surfacing turn out to be either RAG cases in disguise or workloads where prompt engineering closes the gap. So the taxonomy below stops at three, which is as far as the evidence honestly reaches.
What each approach does
RAG, in its basic form, retrieves relevant information at query time and stuffs it into the prompt. The weights stay frozen; the change is purely contextual, giving the model new facts to work with on each query.
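As a rough illustration of "retrieve, then stuff into the prompt", here is a minimal sketch. It assumes a toy bag-of-words similarity in place of a real embedding model and vector store, and the helper names (`tokenize`, `retrieve`, `build_prompt`) and the sample documents are hypothetical, not taken from any particular library.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercased bag-of-words counts; a stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list[str], k: int = 3) -> str:
    """Put the retrieved chunks into the prompt; the model weights are untouched."""
    top = retrieve(query, chunks, k)
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(top))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


docs = [
    "Refunds are processed within 5 business days.",
    "The office is closed on public holidays.",
    "Refunds require an order number.",
]
prompt = build_prompt("How long do refunds take to be processed?", docs, k=2)
```

The resulting `prompt` string is what would be sent to the base model. Updating the system's knowledge means editing `docs`, not retraining anything.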
Fine-tuning adjusts the model's weights based on a training set of input-output pairs. The model permanently learns to produce outputs of a particular shape, style, or set of constraints. Whatever facts you teach it during training get baked in, but anything that comes up afterward stays invisible to the fine-tune.
These two approaches usually get framed as alternatives, but they only compete in specific situations. For most workloads they're solving different problems, and the one each handles is the one the other can't touch.
Why RAG wins most of the time
Three reasons, in order of weight.
RAG handles updates gracefully. In most actual businesses the knowledge base changes weekly, and re-indexing a vector store on a fresh batch of documents takes about 20 minutes. Re-training a fine-tuned model on the same new data runs several hours and costs hundreds of dollars per iteration. That gap in operational effort is hard to overstate.
RAG is auditable. You can inspect the retrieved chunks for each query, and when the model produces a wrong answer the cause traces back to either the retrieval step or the generation step, so you know where to debug. A fine-tuned model gives you none of that. When it's wrong you're guessing at why, and the only lever you have is more training, which may or may not fix the underlying problem.
The cost math heavily favors RAG at sub-millions-of-queries-per-day volumes. The base model's per-token cost (verified against Anthropic's pricing page and OpenAI's API pricing) is real but stable. Retrieval adds a few hundred milliseconds and a fraction of a cent, so total per-query cost stays under a cent for most workloads. Fine-tuning carries a real up-front cost that only amortizes at very high query volumes. For the long-context alternative, see context windows compared.
One honest admission: I am not certain the three-case taxonomy below is complete. These are the three cases where fine-tuning earns its keep. Fourth and fifth cases probably exist (agent-routing models that need to hit a specific decision distribution, for example) but the community hasn't tested them carefully enough to write about with confidence.
The three cases where fine-tuning wins
Each of these is named because each is a concrete scenario where the answer is fine-tuning, and the cost of choosing RAG instead is real.
Case one: strict output format compliance. Your application needs the model to produce a precisely structured output every single time: a JSON schema with no deviation, say, or a structured table with exact column ordering. With prompting and few-shot examples, the major frontier models get this right around 95% of the time, and for some applications the remaining 5% is unrecoverable. Fine-tuning on 500 to 2,000 examples can push compliance to 99%+ and make the residual failure modes predictable enough to handle with a simple validator.
A concrete example: a pipeline producing structured changelog entries from pull request descriptions. The base-model approach got the format wrong often enough to need downstream cleanup on roughly one in twenty entries. The fine-tuned approach reaches 99.4% schema-valid output with the residual 0.6% caught by a simple validator. Training cost was about $180. The operational simplification has been worth a lot more.
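To make "a simple validator" concrete, here is a minimal sketch using only the standard library. The article does not specify the changelog schema, so the field names in `REQUIRED_FIELDS` and the values in `ALLOWED_TYPES` are hypothetical placeholders.

```python
import json

# Hypothetical schema: the article does not list the actual changelog fields.
REQUIRED_FIELDS = {"type": str, "scope": str, "summary": str, "pr": int}
ALLOWED_TYPES = {"added", "changed", "fixed", "removed"}


def validate_entry(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the entry is schema-valid."""
    try:
        entry = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(entry, dict):
        return ["top-level value is not an object"]

    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(entry[field], bool) or not isinstance(entry[field], expected):
            problems.append(f"wrong type for {field}")

    extra = set(entry) - set(REQUIRED_FIELDS)
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")
    if isinstance(entry.get("type"), str) and entry["type"] not in ALLOWED_TYPES:
        problems.append(f"unknown type: {entry['type']}")
    return problems
```

Entries that come back with a non-empty list can be retried or routed to manual cleanup. The point of the fine-tune is that this path is hit rarely and fails in predictable ways.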
Case two: domain-locked voice or style. Your application needs the model to write in a specific voice no amount of prompting reliably enforces — brand voice for marketing copy, or a legal team's writing conventions, or a code-comment style that has to stay consistent across a large codebase. Fine-tuning on a curated set of examples of the desired voice produces output that drifts less and needs less editing than prompting alone.
The key word is reliably. Prompting can get the right voice 80% of the time, where fine-tuning reaches 95%+. If the residual gap is expensive to live with — every output edited and reviewed by hand — the math tips toward fine-tuning quickly.
Case three: latency-critical hot paths. Your application has a query path with a strict latency budget (a few hundred milliseconds end-to-end) and the retrieval step in RAG eats too much of it. A fine-tuned model with the relevant knowledge in its weights can serve the request without the retrieval round-trip. For real-time apps like voice assistants and in-game NPCs, that's often the only viable architecture.
The trade-off is real: the fine-tuned model is now a snapshot in time, and any knowledge update means re-training. That's fine when the underlying knowledge changes slowly, and a dealbreaker when it changes every week.
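Before concluding that retrieval eats too much of the budget, it helps to measure it. The sketch below times a retrieval callable and reports its approximate 95th-percentile latency as a share of the end-to-end budget. The function name `retrieval_share` and its parameters are illustrative assumptions, and `retrieve` stands in for whatever call your own stack uses to query its vector store.

```python
import time
from typing import Callable


def retrieval_share(
    retrieve: Callable[[str], object],
    query: str,
    budget_ms: float,
    runs: int = 20,
) -> dict:
    """Time `retrieve` over several runs and compare it to the latency budget."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        retrieve(query)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Approximate p95: index into the sorted samples, clamped to the last one.
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return {"p95_ms": p95, "budget_ms": budget_ms, "share": p95 / budget_ms}
```

If the reported share is small, the latency problem likely lives elsewhere (generation, network hops), and removing retrieval through a fine-tune would not fix it.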
Where teams go wrong is stretching these three cases to cover a problem that only looks like one of them.
The case people keep asking about
The most-asked question is some variant of: I have a corpus of internal company documents. Should I fine-tune a model on them or build RAG? The answer is almost always RAG. What settles it is the use case, not the corpus itself.
If the use case is letting employees ask questions about the documents, go with RAG. The knowledge keeps changing, you want updates to be easy, and you want to be able to audit where each answer came from.
If the use case is generating documents in the company's writing style, fine-tune. Here the style is the central requirement, while the underlying knowledge can still be supplied through context.
If the use case is both, the answer is RAG plus a light fine-tune on style — the fine-tuned model carries the voice while the retrieval layer supplies the facts.
Cost breakdown, January 2026 prices
| Approach | Setup cost | Per-query cost | Update cost |
|---|---|---|---|
| RAG on Claude Sonnet 4.6 | ~$200 (DB + dev time) | $0.04 / query | $10 (re-embed batch) |
| RAG on GPT-5 Mini | ~$200 | $0.004 / query | $10 |
| Fine-tune of GPT-4o-mini (1k examples) | ~$25 + dev | $0.001 / query | ~$25 per re-train |
| Fine-tune of Llama 3.1 8B (1k examples) | ~$60 GPU time + dev | $0 (self-hosted) | ~$60 per re-train |
| Fine-tune of Claude (enterprise tier) | Several thousand | Variable | Several thousand |
The interesting line is the second-from-bottom. A fine-tune of a small open-weight model gives you a zero-marginal-cost inference path on your own hardware. For high-volume, narrow workloads, that's the cost-optimal architecture in 2026. The trade-off is the operational burden of running the inference yourself, covered in running models on your own machine.
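The table can be turned into a quick total-cost comparison. This is a sketch under stated assumptions: it uses the dollar figures from the rows above, ignores the "+ dev" time on the fine-tune rows and any self-hosting hardware or operations cost, and treats every knowledge update as one re-embed or one re-train. The names `total_cost`, `APPROACHES`, and `compare` are illustrative.

```python
def total_cost(setup: float, per_query: float, update: float,
               queries: int, updates: int) -> float:
    """Setup cost plus per-query cost plus the cost of each knowledge update."""
    return setup + per_query * queries + update * updates


# Figures copied from the table above; dev time and hosting hardware are excluded.
APPROACHES = {
    "RAG on GPT-5 Mini": {"setup": 200, "per_query": 0.004, "update": 10},
    "Fine-tune of GPT-4o-mini": {"setup": 25, "per_query": 0.001, "update": 25},
    "Fine-tune of Llama 3.1 8B": {"setup": 60, "per_query": 0.0, "update": 60},
}


def compare(queries: int, updates: int) -> dict[str, float]:
    """Total cost per approach for a given query volume and update count."""
    return {
        name: round(total_cost(queries=queries, updates=updates, **params), 2)
        for name, params in APPROACHES.items()
    }


weekly_updates = compare(queries=100_000, updates=52)
stable_knowledge = compare(queries=1_000_000, updates=4)
```

With these inputs, weekly updates at 100,000 queries leave RAG cheapest ($1,120 against $1,425 and $3,180), while a million queries with only four updates favors the self-hosted fine-tune ($300 against $4,240 for RAG). The crossover is driven by update frequency as much as by volume.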
RAG flow: a question or instruction comes in, the vector store finds the K most-relevant chunks (typically 3–5 chunks, 4K tokens total), and the model returns a grounded, citable answer at $0.04 per query.
| Question | Approach | Why |
|---|---|---|
| Knowledge changes weekly? | RAG | Re-embed in 20 min |
| Strict output format? | Fine-tune | Push to 99%+ compliance |
| Specific voice/style? | Fine-tune | Prompting only gets 80% |
| Sub-300ms latency? | Fine-tune | No retrieval round-trip |
| Cross-document synthesis? | Long context | Worth the cost |
| Auditability matters? | RAG | Inspect retrieved chunks |

It's not pretty, but it works.
The default sequence
For a typical small team building a domain-specific AI feature, the recommended sequence:
- Start with base-model RAG on Claude Sonnet 4.6 or GPT-5 Mini. Measure failure modes.
- If failures concern facts or staleness, improve retrieval.
- If failures concern format compliance, try few-shot prompting first. If that doesn't close the gap, fine-tune.
- If failures concern style, prompt-engineer aggressively first. If that fails, fine-tune on a curated style corpus.
- If failures concern latency, profile the retrieval step before assuming a fine-tune is the answer.
This sequence ships faster, costs less, and produces a system you can debug. The mistake is starting with fine-tuning because it sounds more sophisticated, when what you want is the thing that works.
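The sequence above can be written down as a small triage function. This is only a sketch of the decision order described in the list; the `Failure` enum, the `cheaper_fix_tried` flag, and the returned strings are illustrative names, not part of any tool.

```python
from enum import Enum


class Failure(Enum):
    FACTS = "facts or staleness"
    FORMAT = "format compliance"
    STYLE = "style"
    LATENCY = "latency"


def next_step(failure: Failure, cheaper_fix_tried: bool = False) -> str:
    """Map an observed failure mode of base-model RAG to the next thing to try."""
    if failure is Failure.FACTS:
        return "improve retrieval"
    if failure is Failure.FORMAT:
        # Few-shot prompting comes first; fine-tune only if it did not close the gap.
        return "fine-tune" if cheaper_fix_tried else "try few-shot prompting"
    if failure is Failure.STYLE:
        if cheaper_fix_tried:
            return "fine-tune on a curated style corpus"
        return "prompt-engineer aggressively"
    # Latency: measure before assuming a fine-tune is the answer.
    return "profile the retrieval step"
```

Note that fine-tuning only appears behind `cheaper_fix_tried=True`, which mirrors the ordering in the list: it is the second move, never the first.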
Two gaps to flag before the close. The distillation feature described in OpenAI's platform docs, meant to make it cheap to fine-tune a small model on the outputs of a larger one, wasn't stress-tested here. A controlled comparison of fine-tuning approaches (LoRA versus full fine-tuning versus prompt tuning versus distillation) is also pending. The working wisdom is that LoRA is enough and a lot cheaper, but that deserves its own piece.
RAG wins almost every time. If you're building a knowledge-grounded AI feature in 2026 and you haven't built the RAG version first, you're optimizing for the wrong thing. At sub-millions-of-queries-per-day scale, the operational simplicity and auditability point one way, and the per-query cost dynamics point the same way.
Fine-tuning wins in three cases, and only three: when output format compliance has to hit 99%+ reliability, when voice or style is central and prompting can't reliably enforce it, and when latency budgets forbid the retrieval round-trip. Even then the answer is usually a combination, with each technique carrying the part of the job it handles best.
If your team is fine-tuning because someone said you should, stop. Audit the actual failure modes of the base model on the task, and pick the right tool for what's broken. Most of the time the culprit turns out to be the retrieval or the prompt, sometimes the evaluation itself. Fine-tuning is a real option, just a much smaller slice of production AI work than the discourse suggests.