Comparison·May 2026

The price-per-use-case table

What you pay for AI in 2026, by workload. Real numbers, with the math.

Updated May 25, 2026 · View changelog

Use cases compared: 6, from chat to agents

Cheapest input: $0.10 per 1M tokens (Llama 4 Scout)

Most expensive output: $25 per 1M tokens (Claude Opus 4.7)

Output multiplier: 5–8× input across the five commercial models

"Expensive model" and "cheap model" are useless categories.

A four-figure surprise invoice from a frontier-model API is a lesson most founders learn once. The model wasn't the problem. The same model that fit a back-of-envelope estimate can produce that invoice once a real production workload hits it, because the cost of a frontier model is really a property of the workload it runs. Below is a working table of what each workload shape costs, with the math behind every line.

All prices below are the published per-million-token rates from each lab's pricing sheet, verified against Anthropic's pricing page, OpenAI's API pricing, and Google's Gemini API models page. The exact numbers change often. The trade-offs themselves hold steady, and that pattern is the part worth internalizing.

The five commercial models

Commercial frontier pricing, May 2026, per provider docs
Model	Input ($/M)	Output ($/M)	Cache write ($/M)	Cache read ($/M)
Claude Opus 4.7	$5.00	$25.00	$6.25	$0.50
Claude Sonnet 4.6	$3.00	$15.00	$3.75	$0.30
GPT-5	$1.25	$10.00	n/a	$0.125 (auto)
GPT-5 Mini	$0.25	$2.00	n/a	$0.025 (auto)
Gemini 3.1 Pro Preview	$2.00	$12.00	$2.50	$0.20

Two structural facts shape everything that follows. The output multiplier above is one. The other is prompt caching: where it exists, it runs at roughly a tenth of the standard input cost, so any workload that repeatedly sends the same long context should have caching turned on. On a high-volume system the difference runs into real money.
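
Both facts can be read straight off the pricing table. The following is a minimal sketch rather than anything provider-supplied: the `PRICES` dictionary and the `ratios` helper are illustrative names, and the rates are copied from the table above.

```python
# Rates copied from the pricing table above, in dollars per million tokens.
# cache_read is the discounted rate for re-reading a cached prefix.
PRICES = {
    "Claude Opus 4.7": {"input": 5.00, "output": 25.00, "cache_read": 0.50},
    "Claude Sonnet 4.6": {"input": 3.00, "output": 15.00, "cache_read": 0.30},
    "GPT-5": {"input": 1.25, "output": 10.00, "cache_read": 0.125},
    "GPT-5 Mini": {"input": 0.25, "output": 2.00, "cache_read": 0.025},
    "Gemini 3.1 Pro Preview": {"input": 2.00, "output": 12.00, "cache_read": 0.20},
}


def ratios(model: str) -> tuple:
    """Return (output/input multiplier, cache-read/input fraction)."""
    p = PRICES[model]
    return p["output"] / p["input"], p["cache_read"] / p["input"]


for name in PRICES:
    out_mult, cache_frac = ratios(name)
    print(f"{name}: output = {out_mult:.1f}x input, "
          f"cache read = {cache_frac:.0%} of input")
```

Re-running this whenever the table changes is a quick way to check that the two structural facts still hold at current rates.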

The numbers below assume well-formed prompts without conversational history bloat. In production chat, the biggest hidden cost driver is usually the unbounded context that grows turn over turn rather than the choice of model itself. Most teams with surprise bills traced them back to context growth.

Chat-style workloads are cheap

A typical conversational exchange (user asks a question, model responds, conversation history grows across a few turns) averages roughly 2,000 input tokens and 500 output tokens per turn. On Claude Sonnet 4.6, that's $0.014 per turn. On Claude Opus, $0.022. On GPT-5 Mini, $0.0015. On Gemini 3.1 Pro Preview, $0.010.

For a chat product with 10,000 daily active users averaging five turns each per day, that works out to between $75 and $1,100 daily, depending on model choice. Annualized: roughly $27,000 to $400,000. At this scale your model choice matters more than your user count.
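
The daily and annual figures follow from the per-turn cost. Here is a small sketch of that arithmetic, assuming the 2,000 input / 500 output turn shape described above; `turn_cost` and `daily_cost` are illustrative helper names, not part of any SDK.

```python
# Per-million-token rates (input, output) from the pricing table, in dollars.
RATES = {
    "Claude Opus 4.7": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "GPT-5 Mini": (0.25, 2.00),
    "Gemini 3.1 Pro Preview": (2.00, 12.00),
}


def turn_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the model's list rates."""
    rate_in, rate_out = RATES[model]
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000


def daily_cost(model: str, users: int, turns_per_user: int) -> float:
    """Daily spend for a chat product, assuming 2,000 in / 500 out per turn."""
    return users * turns_per_user * turn_cost(model, 2_000, 500)


for model in RATES:
    per_day = daily_cost(model, users=10_000, turns_per_user=5)
    print(f"{model}: ${per_day:,.0f}/day, ${per_day * 365:,.0f}/year")
```

Unrounded, the Opus turn costs $0.0225, which gives about $1,125 a day. The $1,100 figure above comes from the rounded $0.022 per turn.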

The hidden cost trap is unbounded conversation history. By turn 20 of a long conversation, the context can have grown to 10,000+ tokens, and the per-turn cost has tripled even though the marginal value to the user hasn't changed. Summarizing or truncating history aggressively fixes it, though most chat products never get around to it.
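
To make the growth concrete, the sketch below compares a 2,000-token turn with a 10,000-token turn on Sonnet, then shows one simple way to truncate history. It is an illustration under stated assumptions: the four-characters-per-token estimate is a rough heuristic, not a real tokenizer, and `truncate_history` is a hypothetical helper that simply keeps the most recent messages that fit a budget.

```python
SONNET_IN, SONNET_OUT = 3.00, 15.00  # $/M tokens, from the pricing table


def per_turn_cost(context_tokens: int, output_tokens: int = 500) -> float:
    """Cost of one turn when the whole context is re-sent as input."""
    return (context_tokens * SONNET_IN + output_tokens * SONNET_OUT) / 1_000_000


def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about four characters per token."""
    return max(1, len(text) // 4)


def truncate_history(messages: list, budget_tokens: int) -> list:
    """Keep the most recent messages whose combined estimate fits the budget."""
    kept = []
    used = 0
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > budget_tokens:
            break  # everything older than this point is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


early, late = per_turn_cost(2_000), per_turn_cost(10_000)
print(f"turn at 2k context: ${early:.4f}, at 10k context: ${late:.4f}, "
      f"ratio {late / early:.1f}x")
```

Truncation is the bluntest option. Summarizing older turns keeps more of the conversation's meaning at the price of an extra model call.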

RAG gets expensive faster than you'd expect

A standard retrieval-augmented generation pattern (embed the query, retrieve five chunks from a vector store, stuff them into the prompt, generate a response) averages roughly 8,000 input tokens and 800 output tokens per query. On Opus, $0.06 per query. On Sonnet, $0.036. On GPT-5 Mini, $0.0036. For the broader argument on when to use RAG vs. long context, see RAG vs fine-tuning.

At 1,000 queries per day, the daily cost ranges from $4 to $60. At 100,000 queries per day (a scale a popular product can reach faster than you expect) the daily cost ranges from $360 to $6,000.

Two specific things drive up the bill. First, RAG queries often pass through a re-ranking step that adds a model call. That's another input-token pass over the candidate chunks. Second, many production RAG systems retrieve more chunks than they need on the theory that more context can't hurt. It can. Every extra chunk is more input tokens on every single query.
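
The effect of over-retrieval is easy to estimate. The sketch below assumes a split of the 8,000-token prompt that is not given in the article: 1,000 tokens of system prompt plus query, and 1,400 tokens per retrieved chunk. Both constants are illustrative and should be replaced with measured values.

```python
SONNET_IN, SONNET_OUT = 3.00, 15.00  # $/M tokens, from the pricing table

# Assumed split of the 8,000-token RAG prompt (not from the article):
OVERHEAD_TOKENS = 1_000   # system prompt plus user query
TOKENS_PER_CHUNK = 1_400  # five chunks give 1,000 + 5 * 1,400 = 8,000


def rag_query_cost(chunks: int, output_tokens: int = 800) -> float:
    """Dollar cost of one RAG query on Sonnet for a given chunk count."""
    input_tokens = OVERHEAD_TOKENS + chunks * TOKENS_PER_CHUNK
    return (input_tokens * SONNET_IN + output_tokens * SONNET_OUT) / 1_000_000


for chunks in (3, 5, 10):
    per_query = rag_query_cost(chunks)
    print(f"{chunks} chunks: ${per_query:.4f}/query, "
          f"${per_query * 100_000:,.0f}/day at 100,000 queries")
```

Under these assumptions, doubling retrieval from five chunks to ten raises the per-query cost from $0.036 to $0.057 without changing the output at all.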

Almost everyone underestimates the output side of the bill, which is doubly unfortunate, because output length is the easiest part of the whole request to rein in.

Cost per typical task across 6 use cases

Dollars per request, Claude Sonnet 4.6. Lower is better.
Use case	Cost per request
Classification	$0.0006
Simple chat turn	$0.014
RAG query	$0.036
Document summary	$0.18
Long-doc analysis	$0.66
Agent session	$0.10–$10+

Output tokens cost 5× input on both Claude models, and 6–8× on Gemini 3.1 Pro Preview, GPT-5, and GPT-5 Mini.

Batch document processing

The pattern: take a corpus of N documents, send each through the model, get back a structured response. Classification, extraction, and summarization, all at scale. The cost is linear in N: for each document, input tokens times the input rate plus output tokens times the output rate, summed over the corpus.

Where this gets surprising is when retries and partial failures get forgotten. A naive batch pipeline that re-runs every failed call is fine when the failure rate is 1%. It's a budget catastrophe when the failure rate is 8%, which isn't unusual when running against a noisy upstream service. The cost of the retries on a bad timeout policy can match or exceed the cost of the primary run.

The fix is to use the batch API tiers most providers offer at roughly 50% off the standard price, with 24-hour turnaround. For any workload that isn't latency-sensitive, that discount is close to free, and most teams never claim it. If the workload is small enough to fit on a local 4B model, see small language models for an even cheaper option.
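
A rough sketch of the batch arithmetic follows, using the document-summary shape (50,000 input / 2,000 output tokens) on Sonnet. The `batch_cost` helper and its parameters are illustrative: `discount=0.5` stands in for the roughly 50% batch tier mentioned above, and `avg_attempts` is an assumed average number of billed calls per document once retries are counted.

```python
def batch_cost(n_docs: int, input_tokens: int, output_tokens: int,
               rate_in: float, rate_out: float,
               discount: float = 0.0, avg_attempts: float = 1.0) -> float:
    """Total dollars for a batch run.

    discount:     fraction off list price (0.5 models a half-price batch tier).
    avg_attempts: assumed billed calls per document, retries included.
    """
    per_doc = (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000
    return n_docs * per_doc * avg_attempts * (1.0 - discount)


# 10,000 documents at 50k in / 2k out on Sonnet ($3 / $15 per million).
standard = batch_cost(10_000, 50_000, 2_000, 3.00, 15.00)
batched = batch_cost(10_000, 50_000, 2_000, 3.00, 15.00, discount=0.5)
doubled = batch_cost(10_000, 50_000, 2_000, 3.00, 15.00, avg_attempts=2.0)
print(f"standard ${standard:,.0f}, batch tier ${batched:,.0f}, "
      f"retries doubling calls ${doubled:,.0f}")
```

Tracking `avg_attempts` from real logs is what reveals whether the retry policy is quietly costing as much as the primary run.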

(A side-note: the $50+ agent session figure isn't hypothetical. The community has seen concrete instances over the past year of agent loops running for hours before someone noticed. A per-day spending cap catches some of them; others run until the API key hits its monthly limit. Set hard caps before you ship the loop.)

Agent loops are the genuine danger

Agent workflows are where the surprising bills come from, because the cost shape is hostile to estimation. An agent that makes three tool calls and returns is cheap; one that gets stuck in a loop and makes 200 tool calls before timing out is two orders of magnitude more expensive. Both happen routinely. The cheap path is the one you designed; the expensive one shows up when something upstream breaks, the agent doesn't notice, and it keeps retrying with subtly different prompts.

The schema-change incident is common enough to recognize. A database schema changes, and the tool the agent depends on starts returning errors. The agent keeps calling the tool, getting errors, asking the model what to do, and trying again. The model says try again with a different approach, and the loop runs for hours before a human notices.

The defenses are well-known and worth implementing: hard token caps per session, hard wall-clock timeouts, automatic alerts when per-session cost crosses a threshold, and a per-day spending cap at the provider level. Turn on Anthropic's spending limits and OpenAI's usage limits even when the system feels overcautious. Especially then. For more on agent failure modes, see AI agents, eighteen months in.
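
The first three defenses can live in a small guard object that every model call reports to. This is a minimal sketch under stated assumptions, not a drop-in for any agent framework: `SessionBudget`, `BudgetExceeded`, and the threshold values are all hypothetical, and the alert is a `print` where a real system would page someone. Provider-level spending caps still have to be set in each provider's console.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a session crosses one of its hard limits."""


@dataclass
class SessionBudget:
    max_tokens: int
    max_seconds: float
    max_cost_usd: float
    alert_cost_usd: float
    tokens_used: int = 0
    cost_usd: float = 0.0
    alerted: bool = False
    started_at: float = field(default_factory=time.monotonic)

    def record(self, input_tokens: int, output_tokens: int,
               rate_in: float, rate_out: float) -> None:
        """Account for one model call, then enforce every limit."""
        self.tokens_used += input_tokens + output_tokens
        self.cost_usd += (input_tokens * rate_in
                          + output_tokens * rate_out) / 1_000_000
        if not self.alerted and self.cost_usd >= self.alert_cost_usd:
            self.alerted = True
            print(f"ALERT: session cost ${self.cost_usd:.2f} crossed threshold")
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded("token cap reached")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded("cost cap reached")
        if time.monotonic() - self.started_at > self.max_seconds:
            raise BudgetExceeded("wall-clock timeout reached")


# Simulate a stuck agent: 200 identical calls at Opus rates.
budget = SessionBudget(max_tokens=100_000, max_seconds=60,
                       max_cost_usd=5.00, alert_cost_usd=0.10)
completed = 0
try:
    for _ in range(200):
        budget.record(8_000, 500, rate_in=5.00, rate_out=25.00)
        completed += 1
except BudgetExceeded as stop:
    print(f"stopped after {completed} calls: {stop}")
```

The point of the design is that the loop cannot continue without passing through `record`, so a stuck agent stops on whichever limit it hits first rather than on the monthly API ceiling.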

(One related trap: most teams the community has seen don't verify their cache-hit rate. They turn on caching in the SDK config, ship, and assume it's working. The real rate often runs lower than expected, because a prompt that drifts token-by-token across sessions, a timestamp here, a session ID there, quietly breaks the cache. Audit the bill against the cache-hit metric. If the system pays full price on every call, the discount isn't landing.)
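
Two checks cover most of this audit: measure the hit rate from usage logs, and confirm the cached prefix is byte-identical across calls. The sketch below is illustrative only. The record keys `cached_tokens` and `uncached_tokens` are hypothetical (each provider reports cache usage under its own field names), and `prefix_chars` is an assumed length for the portion of the prompt meant to be cached.

```python
import hashlib


def cache_hit_rate(usage_records: list) -> float:
    """Fraction of prompt tokens served from cache across a set of calls."""
    cached = sum(r["cached_tokens"] for r in usage_records)
    uncached = sum(r["uncached_tokens"] for r in usage_records)
    total = cached + uncached
    return cached / total if total else 0.0


def prefix_fingerprint(prompt: str, prefix_chars: int) -> str:
    """Short hash of the part of the prompt that is supposed to be cached."""
    digest = hashlib.sha256(prompt[:prefix_chars].encode("utf-8"))
    return digest.hexdigest()[:12]


def distinct_prefixes(prompts: list, prefix_chars: int) -> int:
    """More than one distinct fingerprint means the prefix is drifting."""
    return len({prefix_fingerprint(p, prefix_chars) for p in prompts})


SYSTEM = "SYSTEM RULES ... "
stable = [SYSTEM + f"time={t}" for t in range(3)]     # variable data last
drifting = [f"time={t} " + SYSTEM for t in range(3)]  # variable data first
print(distinct_prefixes(stable, len(SYSTEM)),
      distinct_prefixes(drifting, len(SYSTEM)))
```

Keeping timestamps, session IDs, and other per-call values after the fixed prefix is what lets the prefix stay identical from one call to the next.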

Caching changes the math a lot

Prompt caching is the biggest cost-saver introduced in the past two years, and somehow still the least talked about. If your workload sends a large fixed context with each request (a long system prompt, say, or a fixed set of reference documents), caching that prefix drops the input cost on later calls to roughly 10% of standard. For a RAG system with a 5,000-token system prompt firing 100,000 times a day, that prefix is 500 million input tokens daily. At the Opus rates in the table, a cached read saves $4.50 per million tokens ($5.00 less $0.50), so the savings come to about $2,250 per day, or roughly $820,000 per year, before the small cost of cache writes.
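
The savings figure depends heavily on which rates, call volume, and hit rate are plugged in, so it is worth recomputing against the current pricing table. A minimal sketch, using the Opus 4.7 rates from the table ($5.00 input, $0.50 cache read) and ignoring cache-write charges; `daily_caching_savings` and its `hit_rate` parameter are illustrative.

```python
def daily_caching_savings(prefix_tokens: int, calls_per_day: int,
                          rate_in: float, rate_cache_read: float,
                          hit_rate: float = 1.0) -> float:
    """Dollars saved per day by reading a fixed prefix from cache.

    Rates are in dollars per million tokens. hit_rate is the assumed
    fraction of calls that actually hit the cache; cache-write charges
    are ignored in this sketch.
    """
    cached_tokens = prefix_tokens * calls_per_day * hit_rate
    return cached_tokens * (rate_in - rate_cache_read) / 1_000_000


full = daily_caching_savings(5_000, 100_000, rate_in=5.00, rate_cache_read=0.50)
partial = daily_caching_savings(5_000, 100_000, 5.00, 0.50, hit_rate=0.6)
print(f"100% hit rate: ${full:,.0f}/day, 60% hit rate: ${partial:,.0f}/day")
```

The `hit_rate` parameter is the reason to measure rather than assume: the saving scales directly with the share of calls that actually land on the cache.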

This gets implemented far less often than it should. Teams either don't know caching exists, aren't sure which prefix to cache, or have wired it up wrong and never see the discount. Audit your bill against your cache-hit rate. If the system pays full input price on every call, that's money walking out the door.

Workload	Model	Note
Chat	Haiku 4.5	$1.00/1M in
Coding	Opus 4.7	Pay for correctness
RAG	Sonnet 4.6	Sweet spot at scale
Agents	Opus 4.7	Cap your budget
Classification	Phi-4 mini	Local, ~$0 per task
Summarization	Gemini 3 Flash	$0.30/$2.50 per 1M

1. What's the workload? Chat, RAG, agent, classification, summarization?

2. How sensitive to wrong answers? Production code = pay for Opus. Email triage = small model.

3. Volume? Above 100K req/day, every cent matters. Cache aggressively.

4. Pick model + cap budget. Set a spending limit before scaling. Always.

The rough per-call cost worth keeping in your head

Per-call cost by workload shape, May 2026 pricing
Workload	Opus	Sonnet	GPT-5 Mini
Simple chat turn (2k / 500)	$0.022	$0.014	$0.0015
RAG query (8k / 800)	$0.06	$0.036	$0.0036
Document summary (50k / 2k)	$0.30	$0.18	$0.017
Long-doc analysis (200k / 4k)	$1.10	$0.66	$0.058
Agent session (variable)	$0.50 – $50+	$0.10 – $10+	$0.01 – $1+

The pattern the community has seen work most often in 2026: estimate the workload shape before building, then pick the model that fits that shape rather than the one with the best marketing. Cache every workload that re-sends a fixed prefix. Put hard caps on every agent loop. Run batch workloads through the batch tier when latency permits. None of this is exotic; it's the basic hygiene most teams skip.

For solo founders or small teams: start production workloads on Claude Sonnet 4.6 or GPT-5 Mini, and save Opus and GPT-5 for the calls that genuinely need frontier capability. Watch the bill weekly. The cost dynamics shift quickly, and once traffic moves, last month's bill tells you very little about next month's.

Turn on the spending limits today. The four-figure surprise invoice is a lesson you can skip.

Frequently asked

How much does it cost to run an AI chat product?

On Claude Sonnet 4.6, a typical chat turn (~2K input + 500 output tokens) costs $0.014. For 10,000 daily active users averaging 5 turns each, daily cost runs about $700, or $250K/year before any optimization.

Why do output tokens cost more than input?

Generation is harder than reading: more compute per token. The five commercial models in the pricing table price output at 5 to 8 times input. Most teams budget around the input side and underestimate the output side, which is where the money goes.

What's the cheapest AI for high-volume work?

For frontier-class quality, GPT-5 Mini at $0.25 input / $2 output per million tokens. For open-weight on a hosted endpoint, Llama 4 Scout at $0.10 / $0.40. For self-hosted, Phi-4 mini at zero marginal cost after hardware.

How much does prompt caching save?

Cached prefix tokens run at ~10% of the standard input rate on Anthropic and Google. For a RAG system with a 5K-token system prompt firing 100,000 times daily, caching saves about $2,250/day on Opus alone at the rates in the pricing table (500M prefix tokens at $4.50 saved per million).

How do I cap an AI agent's runaway spending?

Hard token caps per session, wall-clock timeouts, automatic alerts when per-session cost crosses a threshold, and per-day spending limits at the provider level. Turn on Anthropic and OpenAI's usage limits even when they feel overcautious.

Changelog

May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
May 8, 2026 — Originally published.

References

Anthropic, "Pricing," anthropic.com/pricing, accessed May 2026.
OpenAI, "API Pricing," openai.com/api/pricing, accessed May 2026.
Google, "Gemini API models," ai.google.dev/gemini-api/docs/models, accessed May 2026.
Google Cloud, "Vertex AI generative AI pricing," cloud.google.com/vertex-ai/generative-ai/pricing, accessed May 2026.