RAG won the architecture war in 2024. Most teams just haven't admitted it yet.
Consider a pipeline that produces structured changelog entries from pull request descriptions. Base-model RAG hits format-valid output about 95% of the time. A fine-tune on 800 examples pushes it to 99.4% and makes the residual failures predictable enough for a validator to catch. Training cost about $180. Inference is now free per call. That's the kind of swap where fine-tuning earns its place. And one of only three places in 2026 where it does.
The rest of the time, RAG wins. The cost numbers favor it. The operational simplicity favors it. The auditability favors it. The pages that follow are an argument for that default, with the three exceptions named in full and the cost math behind each. For the broader cost picture across workloads, see the price-per-use-case table.
I expected this piece to argue more strongly for fine-tuning than it ended up arguing. Going in, I had three production cases where fine-tuning had paid off. I went looking for more cases during the research. The cases I found were either RAG cases in disguise or workloads where prompt engineering closed the gap. The three-case taxonomy below isn't maximum coverage — it's the honest count.
What each approach actually does
RAG, in its basic form, retrieves relevant information at query time and stuffs it into the prompt. The model's weights don't change. The capability change is contextual. The model gets new facts to work with on each query.
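A minimal sketch of that retrieve-then-stuff step, assuming a toy bag-of-words similarity in place of a real embedding model and vector store. The chunk texts, the prompt wording, and every function name here are hypothetical; the point is only that the model's input changes while its weights do not.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercased bag-of-words; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list[str], k: int = 3) -> str:
    """Stuff the retrieved chunks into the prompt, numbered so they can be cited."""
    context = "\n".join(
        f"[{i + 1}] {c}" for i, c in enumerate(retrieve(query, chunks, k))
    )
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical knowledge base; in practice these come from your documents.
CHUNKS = [
    "Refunds are processed within 14 days of the return request.",
    "The office is closed on public holidays.",
    "Enterprise plans include priority support.",
]

if __name__ == "__main__":
    print(build_prompt("How long do refunds take?", CHUNKS, k=1))
```

Updating the system's knowledge means editing `CHUNKS` (or re-indexing the real store); nothing about the model itself is touched.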
Fine-tuning adjusts the model's weights based on a training set of input-output pairs. The model permanently learns to produce outputs of a particular shape, in a particular style, or against a particular set of constraints. New facts taught via fine-tuning are baked in. New facts that come up after training are invisible to the fine-tune.
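As a rough illustration of what "a training set of input-output pairs" looks like on disk, here is a sketch that serializes pairs to JSONL and holds some back for evaluation. The `input`/`output` field names are a generic assumption, not any provider's required schema, and the 10% holdout is an arbitrary choice.

```python
import json
import random


def to_jsonl(pairs: list[tuple[str, str]]) -> str:
    """One JSON object per line; field names are illustrative only."""
    return "\n".join(json.dumps({"input": i, "output": o}) for i, o in pairs)


def split_pairs(
    pairs: list[tuple[str, str]], holdout_fraction: float = 0.1, seed: int = 0
) -> tuple[list[tuple[str, str]], list[tuple[str, str]]]:
    """Shuffle deterministically and return (train, holdout)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


if __name__ == "__main__":
    # Hypothetical examples: PR description in, structured entry out.
    pairs = [(f"PR description {n}", f'{{"summary": "change {n}"}}') for n in range(10)]
    train, holdout = split_pairs(pairs)
    print(to_jsonl(train))
    print(f"{len(train)} training pairs, {len(holdout)} held out")
```

The held-out pairs are what let you measure whether the fine-tune actually learned the output shape rather than memorizing the training set.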
These two approaches usually get framed as alternatives. They're alternatives only in specific situations. For most workloads, RAG addresses a problem fine-tuning can't solve, and fine-tuning addresses a problem RAG can't solve.
Why RAG wins most of the time
Three reasons, in order of weight.
RAG handles updates gracefully. In most real businesses, the knowledge base changes weekly. Re-indexing a vector store on a fresh batch of documents is a 20-minute job. Re-training a fine-tuned model on new data is a several-hour job that costs tens to hundreds of dollars per iteration. The operational asymmetry is big.
RAG is auditable. You can inspect the retrieved chunks for each query. When the model produces a wrong answer, the cause traces to either the retrieval step or the generation step, and you can debug accordingly. Fine-tuned models are opaque. When they're wrong, the cause is a guess, and your only response is more training, which may or may not fix the underlying problem.
The cost math heavily favors RAG at sub-millions-of-queries-per-day volumes. The base model's per-token cost (verified against Anthropic's pricing page and OpenAI's API pricing) is real but stable. Retrieval adds a few hundred milliseconds and a fraction of a cent. Total per-query cost stays between a fraction of a cent and a few cents, depending on the base model. Fine-tuning has a real up-front cost that only amortizes at very high query volumes. For the long-context alternative, see context windows compared.
One honest admission: I'm not certain the three-case taxonomy below is complete. These are the three cases I've seen fine-tuning earn its keep in production work. Fourth and fifth cases probably exist — agent-routing models that need to hit a specific decision distribution, for example — but I haven't tested them carefully enough to write about with confidence.
The three cases where fine-tuning wins
Each of these is named because each is a real scenario where the answer is fine-tuning, and the cost of choosing RAG instead is real.
Case one: strict output format compliance. Your application needs the model to produce a precisely-structured output every single time. A JSON schema with no deviation, a domain-specific markup format, a structured table with exact column ordering. With prompting and few-shot examples, the major frontier models get this right around 95% of the time. The remaining 5% is unrecoverable for some applications. Fine-tuning on 500 to 2,000 examples can push compliance to 99%+ and make the residual failure modes predictable enough to handle with a simple validator.
A real example: a pipeline producing structured changelog entries from pull request descriptions. The base-model approach got the format wrong often enough to need downstream cleanup on roughly one in twenty entries. The fine-tuned approach reaches 99.4% schema-valid output with the residual 0.6% caught by a simple validator. Training cost was about $180. The operational simplification has been worth a lot more.
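The "simple validator" could look something like the sketch below. The changelog schema (field names, allowed `type` values) is a hypothetical stand-in, since the real one isn't given here; what matters is that every output is checked and the compliance rate is measurable.

```python
import json

# Hypothetical schema for a changelog entry: field name -> expected type.
REQUIRED_FIELDS = {"type": str, "scope": str, "summary": str, "breaking": bool}
ALLOWED_TYPES = {"added", "changed", "fixed", "removed"}


def validate_entry(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the entry is schema-valid."""
    try:
        entry = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(entry, dict):
        return ["top-level value is not an object"]

    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected):
            problems.append(f"wrong type for {field}")
    extra = set(entry) - set(REQUIRED_FIELDS)
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")
    if isinstance(entry.get("type"), str) and entry["type"] not in ALLOWED_TYPES:
        problems.append(f"unknown type: {entry['type']}")
    return problems


def compliance_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that pass validation."""
    if not outputs:
        return 0.0
    return sum(1 for o in outputs if not validate_entry(o)) / len(outputs)


if __name__ == "__main__":
    good = '{"type": "fixed", "scope": "api", "summary": "Handle empty body", "breaking": false}'
    bad = '{"type": "fixed", "summary": "Handle empty body"}'
    print(validate_entry(bad))
    print(compliance_rate([good, bad]))
```

Running `compliance_rate` over the same held-out inputs before and after the fine-tune is one way to get figures comparable to the 95% and 99.4% quoted above.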
Case two: domain-locked voice or style. Your application needs the model to write in a specific voice no amount of prompting reliably enforces. Brand voice for marketing copy. A legal team's writing conventions. A code-comment style consistent across a large codebase. Fine-tuning on a curated set of examples of the desired voice produces output that drifts less and needs less editing than prompting alone.
The key word is reliably. Prompting can get the right voice 80% of the time. Fine-tuning reaches 95%+. If the cost of the residual gap is high (every output edited, every output reviewed), the math tips toward fine-tuning quickly.
Case three: latency-critical hot paths. Your application has a query path with a strict latency budget (a few hundred milliseconds end-to-end) and the retrieval step in RAG eats too much of it. A fine-tuned model with the relevant knowledge in its weights can serve the request without the retrieval round-trip. For real-time apps — voice assistants, in-game NPCs, certain financial workflows — that's the only viable architecture.
The trade-off is real. The fine-tuned model is now a snapshot in time, and any knowledge update needs re-training. For latency-critical apps where the knowledge changes slowly, that's acceptable. For apps where the knowledge changes weekly, it isn't.
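Whether retrieval really "eats too much" of the budget is a measurement, not an assumption. Below is a small sketch of that check; the 300 ms budget and the 120 ms and 220 ms stage timings are made-up illustrative numbers, and `time_call` is a hypothetical helper you would wrap around your own retrieval and generation calls.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def time_call(fn: Callable[[], T]) -> tuple[T, float]:
    """Run fn once and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000.0


def retrieval_share(retrieval_ms: float, budget_ms: float) -> float:
    """Fraction of the end-to-end latency budget consumed by retrieval."""
    return retrieval_ms / budget_ms


def fits_budget(retrieval_ms: float, generation_ms: float, budget_ms: float) -> bool:
    """True if retrieval plus generation stays within the budget."""
    return retrieval_ms + generation_ms <= budget_ms


if __name__ == "__main__":
    budget_ms = 300.0      # assumed end-to-end budget
    retrieval_ms = 120.0   # assumed measurement of the retrieval round-trip
    generation_ms = 220.0  # assumed measurement of model generation

    print(f"retrieval uses {retrieval_share(retrieval_ms, budget_ms):.0%} of the budget")
    print("with retrieval:   ", fits_budget(retrieval_ms, generation_ms, budget_ms))
    print("without retrieval:", fits_budget(0.0, generation_ms, budget_ms))
```

If the pipeline fits the budget only with retrieval removed, that is the signal this case describes. If it misses the budget either way, the bottleneck is generation and a fine-tune won't fix it.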
The three cases for fine-tuning are real. The mistake is to apply fine-tuning to a problem that only resembles one of them and is actually a RAG problem in disguise.
The case people keep asking about
The most-asked question is some variant of: I have a corpus of internal company documents. Should I fine-tune a model on them or build RAG? The answer is almost always RAG. The corpus isn't what determines the answer. The use case is.
If the use case is letting employees ask questions about the documents, go with RAG. The knowledge changes. You want audit. Updates need to be easy.
If the use case is generating documents in the company's writing style, fine-tune. The style is the central requirement. The underlying knowledge can still be supplied via context.
If the use case is both, the answer is RAG plus a light fine-tune on style. The fine-tuned model handles voice. The retrieval layer handles facts. Each approach does what it's best at.
Cost breakdown, January 2026 prices
| Approach | Setup cost | Per-query cost | Update cost |
|---|---|---|---|
| RAG on Claude Sonnet 4.7 | ~$200 (DB + dev time) | $0.04 / query | $10 (re-embed batch) |
| RAG on GPT-5 Mini | ~$200 | $0.008 / query | $10 |
| Fine-tune of GPT-4o-mini (1k examples) | ~$25 + dev | $0.001 / query | ~$25 per re-train |
| Fine-tune of Llama 4 8B (1k examples) | ~$60 GPU time + dev | $0 (self-hosted) | ~$60 per re-train |
| Fine-tune of Claude (enterprise tier) | Several thousand | Variable | Several thousand |
The interesting line is the second-from-bottom. A fine-tune of a small open-weight model gives you a zero-marginal-cost inference path on your own hardware. For high-volume, narrow workloads, that's the cost-optimal architecture in 2026. The trade-off is the operational burden of running the inference yourself, covered in running models on your own machine.
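To make the table's trade-offs concrete, here is a back-of-the-envelope total-cost sketch using the figures from four of its rows. It deliberately ignores developer time and the hardware and operations cost of self-hosting, both of which the table leaves unpriced, so treat the output as a rough comparison rather than a budget. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approach:
    name: str
    setup: float       # one-time cost in dollars
    per_query: float   # dollars per query
    per_update: float  # dollars per re-embed or re-train

    def total_cost(self, queries: int, updates: int) -> float:
        """Setup plus usage plus knowledge updates over the period."""
        return self.setup + self.per_query * queries + self.per_update * updates


# Figures copied from the table above; dev time and self-hosting
# hardware are NOT included, which flatters the fine-tune rows.
RAG_SONNET = Approach("RAG on Claude Sonnet 4.7", 200.0, 0.04, 10.0)
RAG_MINI = Approach("RAG on GPT-5 Mini", 200.0, 0.008, 10.0)
FT_4O_MINI = Approach("Fine-tune of GPT-4o-mini", 25.0, 0.001, 25.0)
FT_LLAMA = Approach("Fine-tune of Llama 4 8B", 60.0, 0.0, 60.0)
ALL = [RAG_SONNET, RAG_MINI, FT_4O_MINI, FT_LLAMA]


def cheapest(approaches: list[Approach], queries: int, updates: int) -> Approach:
    """Approach with the lowest total cost for the given volume and update count."""
    return min(approaches, key=lambda a: a.total_cost(queries, updates))


if __name__ == "__main__":
    for queries, updates in [(10_000, 52), (1_000_000, 4)]:
        best = cheapest(ALL, queries, updates)
        print(f"{queries:>9,} queries, {updates:>2} updates -> {best.name}")
```

On these numbers, a low-volume workload with weekly updates (10,000 queries, 52 updates) comes out cheapest on RAG with GPT-5 Mini, while a high-volume workload with quarterly updates (1,000,000 queries, 4 updates) comes out cheapest on the self-hosted Llama fine-tune. The update count moves the answer as much as the query count does.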
The RAG request path: a question or instruction comes in; the vector store finds the K most-relevant chunks (typically 3–5 chunks, 4K tokens total); the model returns a grounded, citable answer at about $0.04 per query.

| Question | Approach | Why |
|---|---|---|
| Knowledge changes weekly? | RAG | Re-embed in 20 min |
| Strict output format? | Fine-tune | Push to 99%+ compliance |
| Specific voice/style? | Fine-tune | Prompting only gets 80% |
| Sub-300ms latency? | Fine-tune | No retrieval round-trip |
| Cross-document synthesis? | Long context | Worth the cost |
| Auditability matters? | RAG | Inspect retrieved chunks |

It's not pretty, but it works.
The default sequence
For a typical small team building a domain-specific AI feature, the recommended sequence:
- Start with base-model RAG on Claude Sonnet 4.7 or GPT-5 Mini. Measure failure modes.
- If failures concern facts or staleness, improve retrieval.
- If failures concern format compliance, try few-shot prompting first. If that doesn't close the gap, fine-tune.
- If failures concern style, prompt-engineer aggressively first. If that fails, fine-tune on a curated style corpus.
- If failures concern latency, profile the retrieval step before assuming a fine-tune is the answer.
This sequence ships faster, costs less, and produces a system you can debug. The mistake is starting with fine-tuning because it sounds more sophisticated. Sophistication isn't the goal. A system that works is the goal.
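The sequence above can be written down as a small triage table: tally the failures you measured, find the dominant category, and escalate only when the cheaper step has already been tried. This is a sketch of the decision logic under that reading; the enum, the step strings, and the `attempts` counter are illustrative names, not part of any tool.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Failure(Enum):
    FACTS = "facts or staleness"
    FORMAT = "format compliance"
    STYLE = "style"
    LATENCY = "latency"


# Ordered from cheapest intervention to most expensive.
NEXT_STEPS = {
    Failure.FACTS: ["improve retrieval"],
    Failure.FORMAT: ["try few-shot prompting", "fine-tune"],
    Failure.STYLE: ["prompt-engineer aggressively", "fine-tune on a curated style corpus"],
    Failure.LATENCY: ["profile the retrieval step", "consider a fine-tune"],
}


def dominant_failure(labels: list[Failure]) -> Optional[Failure]:
    """Most common failure category among labeled bad outputs, if any."""
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


def next_step(failure: Failure, attempts: int = 0) -> str:
    """Recommended action after `attempts` earlier steps failed to close the gap."""
    steps = NEXT_STEPS[failure]
    return steps[min(attempts, len(steps) - 1)]


if __name__ == "__main__":
    observed = [Failure.FORMAT, Failure.FACTS, Failure.FORMAT]
    worst = dominant_failure(observed)
    if worst is not None:
        print(worst.value, "->", next_step(worst))
        print(worst.value, "(still failing) ->", next_step(worst, attempts=1))
```

Note that fine-tuning never appears as a first step for any category, and never at all for factual failures.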
Two gaps to flag before the close. The distillation feature on OpenAI's platform docs, meant to make it cheap to fine-tune a small model on the outputs of a larger one, wasn't stress-tested here. A controlled comparison of fine-tuning approaches (LoRA versus full versus prompt tuning versus distillation) is also pending. The working wisdom is that LoRA is enough and a lot cheaper, but that deserves its own piece.
RAG wins almost every time. If you're building a knowledge-grounded AI feature in 2026 and you haven't built the RAG version first, you're optimizing for the wrong thing. The operational simplicity, the auditability, and the per-query cost dynamics all favor RAG at sub-millions-of-queries-per-day scale.
Fine-tuning wins in three cases that I can vouch for. When output format compliance has to hit 99%+ reliability. When voice or style is central and prompting can't reliably enforce it. When latency budgets forbid the retrieval round-trip. In each case, the answer is usually a combination: fine-tune for what fine-tuning does well, retrieve for what retrieval does well.
If your team is fine-tuning because someone said you should, stop. Audit the actual failure modes of the base model on the task, and pick the right tool for what's actually broken. Most of the time, what's broken is the retrieval, the prompt, or the evaluation. Fine-tuning is a real option, but it's a smaller fraction of real-world AI work than the discourse suggests.