"Expensive model" and "cheap model" are useless categories.
A four-figure surprise invoice from a frontier-model API is a lesson most founders learn once. The model wasn't the problem. The same model that fit a back-of-envelope estimate can produce that invoice once a real production workload hits it, because the cost of a frontier model is really a property of the workload it runs. Below is a working table of what each workload shape costs, with the math behind every line.
All prices below are the published per-million-token rates from each lab's pricing sheet, verified against Anthropic's pricing page, OpenAI's API pricing, and Google's Gemini API models page. The exact numbers change often. The trade-offs themselves hold steady, and that pattern is the part worth internalizing.
The five commercial models
| Model | Input ($/M) | Output ($/M) | Cache write ($/M) | Cache read ($/M) |
|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | $6.25 | $0.50 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $3.75 | $0.30 |
| GPT-5 | $1.25 | $10.00 | n/a | $0.125 (auto) |
| GPT-5 Mini | $0.25 | $2.00 | n/a | $0.025 (auto) |
| Gemini 3.1 Pro Preview | $2.00 | $12.00 | $2.50 | $0.20 |
Two structural facts shape everything that follows. The first is the output multiplier: in the table above, output tokens cost five to eight times as much as input tokens. The second is prompt caching: where it exists, cached reads run at roughly a tenth of the standard input cost, so any workload that repeatedly sends the same long context should have caching turned on. On a high-volume system the difference runs into real money.
The numbers below assume well-formed prompts without conversational history bloat. In production chat, the biggest hidden cost driver is usually the unbounded context that grows turn over turn rather than the choice of model itself. Most teams with surprise bills traced them back to context growth.
Chat-style workloads are cheap
A typical conversational exchange (user asks a question, model responds, conversation history grows across a few turns) averages roughly 2,000 input tokens and 500 output tokens per turn. On Claude Sonnet 4.6, that's $0.014 per turn. On Claude Opus, $0.022. On GPT-5 Mini, $0.0015. On Gemini 3.1 Pro Preview, $0.010.
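The per-turn figures above are plain arithmetic on the pricing table. A minimal sketch, assuming standard uncached rates; the `PRICES` dictionary and `call_cost` function are illustrative names, not part of any SDK:

```python
# Published per-million-token rates copied from the pricing table (USD).
# Each entry is (input rate, output rate).
PRICES = {
    "Claude Opus 4.7": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "GPT-5": (1.25, 10.00),
    "GPT-5 Mini": (0.25, 2.00),
    "Gemini 3.1 Pro Preview": (2.00, 12.00),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one call at standard (uncached) rates."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    # A chat turn as described: 2,000 input tokens, 500 output tokens.
    for model in PRICES:
        print(f"{model}: ${call_cost(model, 2_000, 500):.4f} per turn")
```

Swapping in a different token shape, such as 8,000 in and 800 out, gives the per-query cost for any other workload at the same rates.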
For a chat product with 10,000 daily active users averaging five turns each per day, that works out to between $75 and $1,100 daily, depending on model choice. Annualized: roughly $27,000 to $400,000. At this scale your model choice matters more than your user count.
The hidden cost trap is unbounded conversation history. By turn 20 of a long conversation, the context can have grown to 10,000+ tokens, and the per-turn cost has tripled even though the marginal value to the user hasn't changed. Summarizing or truncating history aggressively fixes it, though most chat products never get around to it.
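Truncation is the simpler of the two fixes. A rough sketch of a sliding-window trim, assuming messages are dictionaries with `role` and `content` keys and that four characters per token is a good enough estimate; a real system would use the provider's tokenizer, and `truncate_history` is a hypothetical helper:

```python
from typing import Dict, List

Message = Dict[str, str]  # assumed shape: {"role": ..., "content": ...}


def estimate_tokens(text: str) -> int:
    """Crude estimate of about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def truncate_history(messages: List[Message], max_tokens: int) -> List[Message]:
    """Keep a leading system message plus the most recent turns that fit the budget."""
    if not messages:
        return []
    head = [messages[0]] if messages[0]["role"] == "system" else []
    tail = messages[len(head):]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in head)
    kept: List[Message] = []
    for message in reversed(tail):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break  # everything older than this is dropped
        kept.append(message)
        budget -= cost
    return head + list(reversed(kept))
```

Calling this before every request puts a ceiling on input tokens per turn, so the turn-20 cost stays close to the turn-2 cost. Summarizing the dropped turns instead of discarding them costs an extra call but preserves more context.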
RAG gets expensive faster than you'd expect
A standard retrieval-augmented generation pattern (embed the query, retrieve five chunks from a vector store, stuff them into the prompt, generate a response) averages roughly 8,000 input tokens and 800 output tokens per query. On Opus, $0.06 per query. On Sonnet, $0.036. On GPT-5 Mini, $0.0036. For the broader argument on when to use RAG vs. long context, see RAG vs fine-tuning.
At 1,000 queries per day, the daily cost ranges from $4 to $60. At 100,000 queries per day (a scale a popular product can reach faster than you expect) the daily cost ranges from $360 to $6,000.
Two specific things drive up the bill. First, RAG queries often pass through a re-ranking step that adds a model call. That's another input-token pass over the candidate chunks. Second, many production RAG systems retrieve more chunks than they need on the theory that more context can't hurt. It can. Every extra chunk is more input tokens on every single query.
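The effect of chunk count is easy to put numbers on. A sketch under assumed sizes: 1,400 tokens per chunk and 1,000 tokens of system prompt plus query, chosen only because five chunks then reproduce the 8,000-token input used above. `rag_query_cost` is an illustrative name:

```python
def rag_query_cost(
    chunks: int,
    input_rate: float,               # USD per million input tokens
    output_rate: float,              # USD per million output tokens
    tokens_per_chunk: int = 1_400,   # assumed average chunk size
    overhead_tokens: int = 1_000,    # assumed system prompt + user query
    output_tokens: int = 800,
) -> float:
    """Return the USD cost of one RAG query for a given number of retrieved chunks."""
    input_tokens = overhead_tokens + chunks * tokens_per_chunk
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    # Sonnet 4.6 rates from the pricing table: $3 in, $15 out.
    for k in (3, 5, 10):
        per_query = rag_query_cost(k, 3.00, 15.00)
        print(f"{k} chunks: ${per_query:.4f}/query, ${per_query * 100_000:,.0f}/day at 100k queries")
```

Under these assumptions, going from five chunks to ten on Sonnet moves the per-query cost from $0.036 to $0.057, a difference of $2,100 per day at 100,000 queries.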
Almost everyone underestimates the input side of the bill, which is doubly unfortunate, because the input is the easiest part of the whole prompt to rein in.
Batch document processing
The pattern: take a corpus of N documents, send each through the model, get back a structured response. Classification, extraction, and summarization, all at scale. The cost is linear in N: for each document, input tokens times the input rate plus output tokens times the output rate, summed over all N documents.
Where this gets surprising is when retries and partial failures get forgotten. A naive batch pipeline that re-runs every failed call is fine when the failure rate is 1%. It's a budget catastrophe when the failure rate is 8%, which isn't unusual when running against a noisy upstream service. The cost of the retries on a bad timeout policy can match or exceed the cost of the primary run.
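How much retries cost depends on whether failures are independent or persistent. The sketch below models both cases, assuming every attempt is billed in full whether or not it succeeds; that billing assumption and the `retry_multiplier` name are illustrative, not taken from any provider's documentation:

```python
def retry_multiplier(failure_rate: float, max_retries: int, persistent: bool = False) -> float:
    """Expected billed attempts per document, relative to one clean run.

    persistent=False: each attempt fails independently with probability failure_rate.
    persistent=True: a failing document fails on every attempt (for example, it
    always exceeds the timeout), so it burns the whole retry budget.
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    if persistent:
        # The failing fraction pays for every retry on top of the first attempt.
        return 1.0 + failure_rate * max_retries
    # Attempt i + 1 only happens if the first i attempts all failed.
    return sum(failure_rate ** i for i in range(max_retries + 1))
```

Under the persistent model, an 8% failure rate with 12 retries gives a multiplier of 1.96, close to doubling the bill. With independent failures the same rate adds under 9%, which is why the retry cap and the timeout policy matter more than the raw failure rate.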
Cap the retries, then use the batch API tiers most providers offer at roughly 50% off the standard price, with turnaround within 24 hours. For any workload that isn't latency-sensitive, that discount is close to free, and most teams never claim it. If the workload is small enough to fit on a local 4B model, see small language models for an even cheaper option.
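The batch arithmetic fits in one function. A minimal sketch, assuming a flat discount applied to both input and output tokens; `batch_cost` is an illustrative name and the 50% figure is the rough number quoted above, not a guaranteed rate:

```python
def batch_cost(
    n_docs: int,
    input_tokens_per_doc: int,
    output_tokens_per_doc: int,
    input_rate: float,            # USD per million input tokens
    output_rate: float,           # USD per million output tokens
    batch_discount: float = 0.0,  # 0.5 means 50% off the standard price
) -> float:
    """Return the USD cost of processing n_docs documents, one call each."""
    per_doc = (
        input_tokens_per_doc * input_rate + output_tokens_per_doc * output_rate
    ) / 1_000_000
    return n_docs * per_doc * (1.0 - batch_discount)


if __name__ == "__main__":
    # 10,000 documents at 50k in / 2k out on Sonnet 4.6 rates ($3 / $15).
    print(f"standard: ${batch_cost(10_000, 50_000, 2_000, 3.00, 15.00):,.2f}")
    print(f"batch:    ${batch_cost(10_000, 50_000, 2_000, 3.00, 15.00, 0.5):,.2f}")
```

For that example corpus the standard run comes to $1,800 and the discounted run to $900.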
(A side note: the $50+ agent session figure in the per-call table near the end of this piece isn't hypothetical. The community has seen concrete instances over the past year of agent loops running for hours before someone noticed. A per-day spending cap catches some of them; others run until the API key hits its monthly limit. Set hard caps before you ship the loop.)
Agent loops are the genuine danger
Agent workflows are where the surprising bills come from, because the cost shape is hostile to estimation. An agent that makes three tool calls and returns is cheap; one that gets stuck in a loop and makes 200 tool calls before timing out is two orders of magnitude more expensive. Both happen routinely. The cheap path is the one you designed; the expensive one shows up when something upstream breaks, the agent doesn't notice, and it keeps retrying with subtly different prompts.
The schema-change incident is common enough to recognize. A database schema changes, and the tool the agent depends on starts returning errors. The agent keeps calling the tool, getting errors, asking the model what to do, and trying again. The model says try again with a different approach, and the loop runs for hours before a human notices.
The defenses are well-known and worth implementing: hard token caps per session, hard wall-clock timeouts, automatic alerts when per-session cost crosses a threshold, and a per-day spending cap at the provider level. Turn on Anthropic's spending limits and OpenAI's usage limits even when the system feels overcautious. Especially then. For more on agent failure modes, see AI agents, eighteen months in.
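The first three defenses can live in one small guard object that the agent loop updates after every model call. A sketch, assuming the loop can read input and output token counts from each response; `SessionBudget` and `BudgetExceeded` are hypothetical names, and provider-level caps are still needed on top:

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a session crosses one of its hard limits."""


@dataclass
class SessionBudget:
    max_tokens: int       # hard token cap per session
    max_seconds: float    # hard wall-clock timeout
    max_cost_usd: float   # per-session cost threshold
    input_rate: float     # USD per million input tokens
    output_rate: float    # USD per million output tokens
    tokens_used: int = 0
    cost_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Add one model call's usage, then enforce every limit."""
        self.tokens_used += input_tokens + output_tokens
        self.cost_usd += (
            input_tokens * self.input_rate + output_tokens * self.output_rate
        ) / 1_000_000
        self.check()

    def check(self) -> None:
        """Raise BudgetExceeded if any limit has been crossed."""
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token cap hit: {self.tokens_used} tokens")
        if time.monotonic() - self.started_at > self.max_seconds:
            raise BudgetExceeded("wall-clock timeout hit")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost cap hit: ${self.cost_usd:.2f}")
```

Calling `record` after each response and `check` before each tool call means a stuck loop stops at the first limit it crosses. Catching `BudgetExceeded` at the top of the session is also a natural place to fire the alert.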
(One related trap: most teams the community has seen don't verify their cache-hit rate. They turn on caching in the SDK config, ship, and assume it's working. The real rate often runs lower than expected, because caches match on an exact prompt prefix, and a prompt that drifts token by token across sessions (a timestamp here, a session ID there) quietly breaks the match. Audit the bill against the cache-hit metric. If the system pays full price on every call, the discount isn't landing.)
Caching changes the math a lot
Prompt caching is the biggest cost-saver introduced in the past two years, and somehow still the least talked about. If your workload sends a large fixed context with each request (a long system prompt, say, or a fixed set of reference documents), caching that prefix drops the input cost on later calls to roughly 10% of standard. For a RAG system with a 5,000-token system prompt firing 100,000 times a day, that is 500 million prefix tokens a day billed at $0.50 instead of $5.00 per million on Opus, so the savings on Opus alone come to about $2,250 per day, or roughly $820,000 per year.
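The savings arithmetic generalizes to any row of the pricing table. A sketch that ignores the cache-write premium (small when the cache stays warm) and takes the hit rate as a parameter; `daily_cache_savings` is an illustrative name, and the example uses the Sonnet 4.6 rates:

```python
def daily_cache_savings(
    prefix_tokens: int,
    calls_per_day: int,
    input_rate: float,       # USD per million tokens, standard input
    cache_read_rate: float,  # USD per million tokens, cached read
    hit_rate: float = 1.0,   # fraction of calls that actually hit the cache
) -> float:
    """Return USD saved per day on the fixed prefix, ignoring cache-write costs."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    cached_millions = prefix_tokens * calls_per_day * hit_rate / 1_000_000
    return cached_millions * (input_rate - cache_read_rate)


if __name__ == "__main__":
    # Sonnet 4.6: $3.00 standard input, $0.30 cached read.
    for rate in (1.0, 0.4):
        saved = daily_cache_savings(5_000, 100_000, 3.00, 0.30, hit_rate=rate)
        print(f"hit rate {rate:.0%}: ${saved:,.0f}/day")
```

With those rates, a 5,000-token prefix sent 100,000 times a day saves $1,350 per day at a perfect hit rate and $540 per day at a 40% hit rate.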
This gets implemented far less often than it should. Teams either don't know caching exists, aren't sure which prefix to cache, or have wired it up wrong and never see the discount. Audit your bill against your cache-hit rate. If the system pays full input price on every call, that's money walking out the door.
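One way to run that audit is to compute the hit rate from logged usage records. A sketch, assuming each record is a dictionary with the field names Anthropic's API uses in its usage object (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`); other providers report cached tokens under different names, so treat the keys as an assumption to adapt:

```python
from typing import Dict, Iterable


def cache_hit_rate(usage_records: Iterable[Dict[str, int]]) -> float:
    """Return the fraction of all input tokens that were served from cache."""
    cached = 0
    total = 0
    for usage in usage_records:
        read = usage.get("cache_read_input_tokens", 0)         # billed at the cached rate
        written = usage.get("cache_creation_input_tokens", 0)  # billed at the write rate
        fresh = usage.get("input_tokens", 0)                   # billed at the standard rate
        cached += read
        total += read + written + fresh
    return cached / total if total else 0.0


if __name__ == "__main__":
    records = [
        {"input_tokens": 200, "cache_creation_input_tokens": 5_000},  # first call writes
        {"input_tokens": 200, "cache_read_input_tokens": 5_000},      # second call hits
    ]
    print(f"cache hit rate: {cache_hit_rate(records):.1%}")
```

A rate near zero on a workload with a large fixed prefix usually means something in that prefix changes between calls.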
| Workload | Model | Note |
|---|---|---|
| Chat | Haiku 4.5 | $1.00/1M in |
| Coding | Opus 4.7 | Pay for correctness |
| RAG | Sonnet 4.6 | Sweet spot at scale |
| Agents | Opus 4.7 | Cap your budget |
| Classification | Phi-4 mini | Local, ~$0 per task |
| Summarization | Gemini 3 Flash | $0.30/$2.50 per 1M |
Chat, RAG, agent, classification, or summarization? Production code = pay for Opus. Email triage = small model.
Above 100K req/day, every cent matters. Cache aggressively.
Set a spending limit before scaling. Always.
The rough per-call cost worth keeping in your head
| Workload (input / output tokens) | Opus 4.7 | Sonnet 4.6 | GPT-5 Mini |
|---|---|---|---|
| Simple chat turn (2k / 500) | $0.022 | $0.014 | $0.0015 |
| RAG query (8k / 800) | $0.06 | $0.036 | $0.0036 |
| Document summary (50k / 2k) | $0.30 | $0.18 | $0.017 |
| Long-doc analysis (200k / 4k) | $1.10 | $0.66 | $0.058 |
| Agent session (variable) | $0.50 – $50+ | $0.10 – $10+ | $0.01 – $1+ |
The pattern the community has seen work most often in 2026: estimate the workload shape before building, then pick the model that fits that shape rather than the one with the best marketing. Cache every workload that re-sends a fixed prefix. Put hard caps on every agent loop. Run batch workloads through the batch tier when latency permits. None of this is exotic; it's the basic hygiene most teams skip.
For solo founders or small teams: start production workloads on Claude Sonnet 4.6 or GPT-5 Mini, and save Opus and GPT-5 for the calls that genuinely need frontier capability. Watch the bill weekly. The cost dynamics shift quickly, and once traffic moves, last month's bill tells you very little about next month's.
Turn on the spending limits today. The four-figure surprise invoice is a lesson you can skip.