Essay·April 2026

RAG vs fine-tuning, with the math

Cost numbers across both approaches, and the three specific scenarios where fine-tuning still pays off.

Updated May 25, 2026

RAG per query: $0.04 (Sonnet 4.6 with retrieval)
Long-context, same query: $1.00 (200K tokens, Opus)
First-token latency: 300 ms (RAG round-trip + model)
Fine-tune cost: $180 (for a 1K-example dataset)

RAG won the architecture war in 2024. Most teams (likely yours) just haven't admitted it yet.

Take a pipeline that produces structured changelog entries from pull request descriptions. Base-model RAG hits format-valid output about 95% of the time. A fine-tune on 800 examples pushes it to 99.4% and makes the residual failures predictable enough for a validator to catch. Training cost about $180, and inference is now free per call. That's the kind of swap where fine-tuning earns its place — one of only three places in 2026 where it does.

The rest of the time, RAG wins — on cost, on operational simplicity, and on auditability. The pages that follow are an argument for that default, with the three exceptions named in full and the cost math behind each. For the broader cost picture across workloads, see the price-per-use-case table.

The early framing of the open-source discussion expected this analysis to argue more strongly for fine-tuning than it does. The cases the community keeps surfacing turn out to be either RAG cases in disguise or workloads where prompt engineering closes the gap. So the taxonomy below stops at three, which is as far as the evidence honestly reaches.

What each approach does

RAG, in its basic form, retrieves relevant information at query time and stuffs it into the prompt. The weights stay frozen; the change is purely contextual, giving the model new facts to work with on each query.
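To make the retrieve-then-stuff mechanic concrete, here is a minimal sketch. It assumes a toy word-count similarity in place of a real embedding model and vector store, and the function names and sample chunks are hypothetical, not taken from any library or from the pipeline discussed in this essay.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query (toy stand-in for a vector store)."""
    q_vec = Counter(tokenize(query))
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(q_vec, Counter(tokenize(chunk))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, chunks: list[str], k: int = 3) -> str:
    """Stuff the retrieved chunks into the prompt; the model weights are untouched."""
    context = "\n".join(f"- {chunk}" for chunk in retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge base for illustration only.
CHUNKS = [
    "Refunds are processed within 14 days of the return request.",
    "The office is closed on public holidays.",
    "Support tickets are answered within one business day.",
]

prompt = build_prompt("How long do refunds take?", CHUNKS, k=1)
```

Updating the system's knowledge here means editing `CHUNKS`; nothing about the model changes, which is the property the rest of the argument leans on.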

Fine-tuning adjusts the model's weights based on a training set of input-output pairs. The model permanently learns to produce outputs of a particular shape, style, or set of constraints. Whatever facts you teach it during training get baked in, but anything that comes up afterward stays invisible to the fine-tune.

These two approaches usually get framed as alternatives, but they only compete in specific situations. For most workloads they're solving different problems, and the one each handles is the one the other can't touch.

Why RAG wins most of the time

Three reasons, in order of weight.

RAG handles updates gracefully. In most actual businesses the knowledge base changes weekly, and re-indexing a vector store on a fresh batch of documents takes about 20 minutes. Re-training a fine-tuned model on the same new data runs several hours and costs hundreds of dollars per iteration. That gap in operational effort is hard to overstate.

RAG is auditable. You can inspect the retrieved chunks for each query, and when the model produces a wrong answer the cause traces back to either the retrieval step or the generation step, so you know where to debug. A fine-tuned model gives you none of that. When it's wrong you're guessing at why, and the only lever you have is more training, which may or may not fix the underlying problem.
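One way to picture that debugging split is a logged record per call plus a crude attribution check. This is a sketch under stated assumptions: the `AuditRecord` shape and the substring match are hypothetical simplifications, and a real evaluation would likely need a more careful matcher.

```python
from dataclasses import dataclass


@dataclass
class AuditRecord:
    """One logged RAG call: what was asked, what was retrieved, what came back."""
    query: str
    retrieved_chunks: list[str]
    answer: str


def locate_failure(record: AuditRecord, expected_fact: str) -> str:
    """Attribute a wrong answer to the retrieval step or the generation step.

    Uses a plain case-insensitive substring check, which is an assumption
    made for brevity rather than a recommended matching strategy.
    """
    fact = expected_fact.lower()
    if fact in record.answer.lower():
        return "ok"
    if any(fact in chunk.lower() for chunk in record.retrieved_chunks):
        return "generation"  # the fact reached the prompt, the model missed it
    return "retrieval"  # the fact never reached the prompt


record = AuditRecord(
    query="How long do refunds take?",
    retrieved_chunks=["The office is closed on public holidays."],
    answer="Refunds take about a month.",
)
stage = locate_failure(record, "14 days")
```

In this made-up record the expected fact is absent from the retrieved chunks, so the check points at retrieval rather than generation.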

The cost math heavily favors RAG at sub-millions-of-queries-per-day volumes. The base model's per-token cost (verified against Anthropic's pricing page and OpenAI's API pricing) is genuine but stable. Retrieval adds a few hundred milliseconds and a fraction of a cent on top, so total per-query cost stays between a fraction of a cent and a few cents for most workloads, depending on the base model. Fine-tuning carries a real up-front cost that only amortizes at very high query volumes. For the long-context alternative, see context windows compared.

[Chart: RAG vs fine-tuning vs long-context, five dimensions, each scored out of 100. Higher is better; lower-is-better dimensions (cost, latency) are inverted for the chart. Labeled points: cost score for RAG, fine-tune, and long-context; flexibility for RAG; format compliance for fine-tune; synthesis for long-context.]

RAG is 25× cheaper than long-context for precise lookup.

One honest admission: I am not certain the three-case taxonomy below is complete. These are the three cases where fine-tuning earns its keep. Fourth and fifth cases probably exist (agent-routing models that need to hit a specific decision distribution, for example) but the community hasn't tested them carefully enough to write about with confidence.

The three cases where fine-tuning wins

Each of these is named because each is a concrete scenario where the answer is fine-tuning, and the cost of choosing RAG instead is real.

Case one: strict output format compliance. Your application needs the model to produce a precisely structured output every single time: a JSON schema with no deviation, say, or a structured table with exact column ordering. With prompting and few-shot examples, the major frontier models get this right around 95% of the time, and for some applications the remaining 5% is unrecoverable. Fine-tuning on 500 to 2,000 examples can push compliance to 99%+ and make the residual failure modes predictable enough to handle with a simple validator.

A concrete example: a pipeline producing structured changelog entries from pull request descriptions. The base-model approach got the format wrong often enough to need downstream cleanup on roughly one in twenty entries. The fine-tuned approach reaches 99.4% schema-valid output with the residual 0.6% caught by a simple validator. Training cost was about $180. The operational simplification has been worth a lot more.
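A "simple validator" of the kind mentioned above might look like the following sketch. The field names and allowed values are hypothetical, since the essay does not specify the changelog schema; only the standard-library `json` module is used.

```python
import json

# Hypothetical schema: the actual changelog fields are not given in the essay.
ALLOWED_TYPES = {"added", "changed", "fixed", "removed"}
REQUIRED_FIELDS = {"type": str, "summary": str, "pr_number": int}


def validate_entry(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the entry is schema-valid."""
    try:
        entry = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(entry, dict):
        return ["top-level value is not an object"]

    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(entry[field], bool) or not isinstance(entry[field], expected):
            problems.append(f"wrong type for {field}")
    extra = set(entry) - set(REQUIRED_FIELDS)
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")
    if isinstance(entry.get("type"), str) and entry["type"] not in ALLOWED_TYPES:
        problems.append(f"unknown type: {entry['type']}")
    return problems


good = validate_entry('{"type": "fixed", "summary": "Handle empty input", "pr_number": 12}')
bad = validate_entry('{"type": "fixed", "summary": "Handle empty input"}')
```

Entries that return a non-empty list can be retried or routed to manual cleanup, which is only practical when failures are as rare and as predictable as the fine-tuned case describes.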

Case two: domain-locked voice or style. Your application needs the model to write in a specific voice no amount of prompting reliably enforces — brand voice for marketing copy, or a legal team's writing conventions, or a code-comment style that has to stay consistent across a large codebase. Fine-tuning on a curated set of examples of the desired voice produces output that drifts less and needs less editing than prompting alone.

The key word is reliably. Prompting can get the right voice 80% of the time, where fine-tuning reaches 95%+. If the residual gap is expensive to live with — every output edited and reviewed by hand — the math tips toward fine-tuning quickly.
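How quickly the math tips can be sketched as a break-even calculation. The 80% and 95% rates come from the paragraph above; the fine-tune cost and the cost per manual edit in the example call are hypothetical placeholders to swap for your own figures.

```python
def break_even_outputs(
    fine_tune_cost: float,
    cost_per_edit: float,
    prompt_rate: float = 0.80,
    tuned_rate: float = 0.95,
) -> float:
    """Number of outputs after which avoided manual edits pay for the fine-tune."""
    avoided_edits_per_output = tuned_rate - prompt_rate
    if avoided_edits_per_output <= 0 or cost_per_edit <= 0:
        raise ValueError("fine-tuning must reduce edits, and edits must cost something")
    return fine_tune_cost / (avoided_edits_per_output * cost_per_edit)


# Hypothetical inputs: a $180 training run and $2 of reviewer time per edit.
outputs_needed = break_even_outputs(fine_tune_cost=180.0, cost_per_edit=2.0)
```

Under those assumed inputs the fine-tune pays for itself after roughly 600 outputs; a higher editing cost or a wider reliability gap pulls that number down further.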

Case three: latency-critical hot paths. Your application has a query path with a strict latency budget (a few hundred milliseconds end-to-end) and the retrieval step in RAG eats too much of it. A fine-tuned model with the relevant knowledge in its weights can serve the request without the retrieval round-trip. For real-time apps like voice assistants and in-game NPCs, that's often the only viable architecture.

The trade-off is real: the fine-tuned model is now a snapshot in time, and any knowledge update means re-training. That's fine when the underlying knowledge changes slowly, and a dealbreaker when it changes every week.
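Before concluding that retrieval breaks the budget, it helps to measure each stage. The sketch below uses the standard-library `time.perf_counter`; the stage timings in the example call are hypothetical, and `timed_ms` and `retrieval_share` are illustrative helper names.

```python
import time
from typing import Callable


def timed_ms(step: Callable[[], object]) -> float:
    """Run one pipeline step and return its wall-clock duration in milliseconds."""
    start = time.perf_counter()
    step()
    return (time.perf_counter() - start) * 1000.0


def retrieval_share(retrieval_ms: float, model_ms: float, budget_ms: float) -> dict:
    """Summarize how much of the latency budget each stage consumes."""
    total = retrieval_ms + model_ms
    return {
        "total_ms": total,
        "within_budget": total <= budget_ms,
        "retrieval_fraction": retrieval_ms / total if total else 0.0,
        # True when dropping the retrieval round-trip alone would meet the budget.
        "fits_without_retrieval": model_ms <= budget_ms,
    }


# Hypothetical measurements against a 300 ms budget.
report = retrieval_share(retrieval_ms=120.0, model_ms=250.0, budget_ms=300.0)
```

In this made-up case the pipeline misses the budget with retrieval and meets it without, which is the situation where case three applies. If `fits_without_retrieval` were false, removing retrieval would not help and the model itself would be the bottleneck.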

Where teams go wrong is stretching these three cases to cover a problem that only looks like one of them.

The case people keep asking about

The most-asked question is some variant of: I have a corpus of internal company documents. Should I fine-tune a model on them or build RAG? The answer is almost always RAG. What settles it is the use case, not the corpus itself.

If the use case is letting employees ask questions about the documents, go with RAG. The knowledge keeps changing, you want updates to be easy, and you want to be able to audit where each answer came from.

If the use case is generating documents in the company's writing style, fine-tune. Here the style is the central requirement, while the underlying knowledge can still be supplied through context.

If the use case is both, the answer is RAG plus a light fine-tune on style — the fine-tuned model carries the voice while the retrieval layer supplies the facts.

Cost breakdown, January 2026 prices

RAG vs. fine-tune cost, January 2026 prices
Approach	Setup cost	Per-query cost	Update cost
RAG on Claude Sonnet 4.6	~$200 (DB + dev time)	$0.04 / query	$10 (re-embed batch)
RAG on GPT-5 Mini	~$200	$0.004 / query	$10
Fine-tune of GPT-4o-mini (1k examples)	~$25 + dev	$0.001 / query	~$25 per re-train
Fine-tune of Llama 3.1 8B (1k examples)	~$60 GPU time + dev	$0 (self-hosted)	~$60 per re-train
Fine-tune of Claude (enterprise tier)	Several thousand	Variable	Several thousand

The interesting line is the second-from-bottom. A fine-tune of a small open-weight model gives you a zero-marginal-cost inference path on your own hardware. For high-volume, narrow workloads, that's the cost-optimal architecture in 2026. The trade-off is the operational burden of running the inference yourself, covered in running models on your own machine.
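The table's three cost columns combine into a simple total-cost model. The sketch below copies two rows from the table and assumes, as the table does, that dev time and hosting are excluded; the weekly-update figure of 52 refreshes per year is an illustrative assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Approach:
    name: str
    setup: float       # one-time cost in USD
    per_query: float   # marginal cost per query in USD
    per_update: float  # cost of one knowledge refresh in USD

    def total(self, queries: int, updates: int) -> float:
        """Total spend over a period with the given query and update counts."""
        return self.setup + self.per_query * queries + self.per_update * updates


# Figures copied from the table above; dev time and hosting are not included.
RAG_MINI = Approach("RAG on GPT-5 Mini", 200.0, 0.004, 10.0)
FT_MINI = Approach("Fine-tune of GPT-4o-mini", 25.0, 0.001, 25.0)


def break_even_queries(a: Approach, b: Approach, updates: int) -> Optional[float]:
    """Query count at which a and b cost the same, or None if they never cross."""
    fixed_gap = (b.setup + b.per_update * updates) - (a.setup + a.per_update * updates)
    marginal_gap = a.per_query - b.per_query
    if marginal_gap == 0:
        return None
    crossover = fixed_gap / marginal_gap
    return crossover if crossover > 0 else None


# Assumed scenario: one knowledge refresh per week for a year.
weekly_updates = break_even_queries(RAG_MINI, FT_MINI, updates=52)
no_updates = break_even_queries(RAG_MINI, FT_MINI, updates=0)
```

With weekly refreshes, the fine-tune's re-training bill keeps RAG ahead until roughly 200,000 queries a year on these numbers. With no refreshes at all there is no crossover, because the fine-tune is cheaper from the first query. The update frequency, not the per-query price, is what moves the answer.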

1. User query: a question or instruction.
2. Embed → search: the vector store finds the K most relevant chunks.
3. Retrieve top chunks: typically 3–5 chunks, 4K tokens total.
4. Generate with context: a grounded, citable answer at $0.04 per query.

Knowledge changes weekly? RAG: re-embed in 20 min.
Strict output format? Fine-tune: push to 99%+ compliance.
Specific voice/style? Fine-tune: prompting only gets 80%.
Sub-300ms latency? Fine-tune: no retrieval round-trip.
Cross-document synthesis? Long context: worth the cost.
Auditability matters? RAG: inspect retrieved chunks.

It's not pretty, but it works.

The default sequence

For a typical small team building a domain-specific AI feature, the recommended sequence:

Start with base-model RAG on Claude Sonnet 4.6 or GPT-5 Mini. Measure failure modes.
If failures concern facts or staleness, improve retrieval.
If failures concern format compliance, try few-shot prompting first. If that doesn't close the gap, fine-tune.
If failures concern style, prompt-engineer aggressively first. If that fails, fine-tune on a curated style corpus.
If failures concern latency, profile the retrieval step before assuming a fine-tune is the answer.
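The sequence above can be written down as a small lookup that returns the cheapest untried remedy for each failure mode. This is a sketch of the escalation order only. The names are hypothetical, and two entries are readings rather than quotes: the second latency step and the fallback when every remedy is exhausted.

```python
from enum import Enum


class Failure(Enum):
    FACTS = "facts or staleness"
    FORMAT = "format compliance"
    STYLE = "style"
    LATENCY = "latency"


# Remedies per failure mode, cheapest first, following the sequence above.
NEXT_STEPS = {
    Failure.FACTS: ["improve retrieval"],
    Failure.FORMAT: ["few-shot prompting", "fine-tune"],
    Failure.STYLE: ["aggressive prompt engineering", "fine-tune on a curated style corpus"],
    # The second latency step is an interpretation, not stated outright above.
    Failure.LATENCY: ["profile the retrieval step", "consider a fine-tune"],
}


def next_step(failure: Failure, already_tried: int = 0) -> str:
    """Return the next remedy to try, given how many have already failed."""
    steps = NEXT_STEPS[failure]
    if already_tried >= len(steps):
        # Assumed fallback when every listed remedy has been exhausted.
        return "re-examine the evaluation"
    return steps[already_tried]


first = next_step(Failure.FORMAT)
second = next_step(Failure.FORMAT, already_tried=1)
```

Fine-tuning never appears as a first step in the table, which is the point of the ordering.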

This sequence ships faster, costs less, and produces a system you can debug. The mistake is starting with fine-tuning because it sounds more sophisticated, when what you want is the thing that works.

Two gaps to flag before the close. The distillation feature described in OpenAI's platform docs, meant to make it cheap to fine-tune a small model on the outputs of a larger one, wasn't stress-tested here. A controlled comparison of fine-tuning approaches (LoRA versus full fine-tuning versus prompt tuning versus distillation) is also pending. The working wisdom is that LoRA is enough and a lot cheaper, but that deserves its own piece.

RAG wins almost every time. If you're building a knowledge-grounded AI feature in 2026 and you haven't built the RAG version first, you're optimizing for the wrong thing. At sub-millions-of-queries-per-day scale, the operational simplicity and auditability point one way, and the per-query cost dynamics point the same way.

Fine-tuning wins in three cases, and only three: when output format compliance has to hit 99%+ reliability, when voice or style is central and prompting can't reliably enforce it, and when latency budgets forbid the retrieval round-trip. Even then the answer is usually a combination, with each technique carrying the part of the job it handles best.

If your team is fine-tuning because someone said you should, stop. Audit the actual failure modes of the base model on the task, and pick the right tool for what's broken. Most of the time the culprit turns out to be the retrieval or the prompt, sometimes the evaluation itself. Fine-tuning is a real option, just a much smaller slice of production AI work than the discourse suggests.

Frequently asked

RAG or fine-tuning — which should I use?

RAG almost every time. The cost dynamics, the operational simplicity, and the auditability all favor RAG. Fine-tuning wins in three specific cases: strict format compliance, domain-locked voice, and sub-300ms latency requirements.

How much cheaper is RAG vs fine-tuning?

Per-query, RAG runs about $0.04 on Sonnet 4.6 with retrieval. A fine-tuned small model runs at essentially zero marginal cost after a training run of roughly $60 to $180. Fine-tuning wins on per-query cost at scale, but only for the workloads it's right for.

When does fine-tuning beat RAG?

Three cases. One: strict output format where compliance must hit 99%+ (RAG hits ~95% on prompting alone). Two: a specific voice or style prompting can't reliably enforce. Three: latency hot paths where the retrieval round-trip is too slow.

Can I combine RAG and fine-tuning?

Yes, and you usually should. Fine-tune for voice and format. Use RAG for facts. Each approach handles what it's best at. Most production systems that use fine-tuning correctly are running it alongside RAG, not instead of it.

How long does it take to fine-tune a model?

Several hours on a typical 1,000-example dataset. Cost: $25-$180 depending on the base model. Re-training on updated data costs the same again. RAG re-indexing on the same knowledge update takes 20 minutes and a few cents.

Changelog

May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
April 17, 2026 — Originally published.

References

OpenAI, "Platform documentation," platform.openai.com/docs, accessed May 2026.
OpenAI, "API Pricing," openai.com/api/pricing, accessed May 2026.
Anthropic, "Claude API Documentation," docs.claude.com, accessed May 2026.
Anthropic, "Pricing," anthropic.com/pricing, accessed May 2026.