benchr Issue No. 07

The price-per-use-case table

What you actually pay for AI in 2026, by workload. Real numbers, with the math.

· View changelog

Use cases compared 6 Chat to agents
Cheapest input $0.10 Per 1M tokens — Llama Scout
Most expensive $75 Per 1M output — Claude Opus
Output multiplier 3–5× vs input on every model

"Expensive model" and "cheap model" are useless categories.

A four-figure surprise invoice from a frontier-model API is a lesson most founders learn once. The lesson isn't that the model was expensive. The same model that fit in a back-of-envelope estimate can produce a four-figure invoice once a real workload hits it. The cost of a frontier model is a property of the workload, not the model. What follows is a working table of what each workload shape actually costs, with the math behind every line.

All prices below are the published per-million-token rates from each lab's pricing sheet — verified against Anthropic's pricing page, OpenAI's API pricing, and Google's Gemini API models page. Prices change. The structure of the trade-offs is more stable than the prices themselves, and the structure is what you should internalize.

The five commercial models

Commercial frontier pricing, January 2026, per provider docs
ModelInput ($/M)Output ($/M)Cache writeCache read
Claude Opus 4.7$5.00$25.00$6.25$0.50
Claude Sonnet 4.7$3.00$15.00$3.75$0.30
GPT-5$10.00$50.00n/a$5.00 (auto)
GPT-5 Mini$0.50$4.00n/a$0.25 (auto)
Gemini 3.1 Pro Preview$5.00$40.00$6.25$0.50

Two structural observations worth noting before the workloads. Output tokens cost three to five times what input tokens cost on every model. Prompt caching, where it exists, runs at roughly a tenth of the standard input cost. Any workload that repeatedly sends the same long context should be running with caching on. The savings aren't small.

Worth flagging up front: the numbers below assume well-formed prompts without conversational history bloat. The biggest hidden cost driver in production chat isn't the model, it's the unbounded context that grows turn over turn. Most teams I've seen in my own work with surprise bills had this problem more than a model-choice problem.

Chat-style workloads are cheap

A typical conversational exchange — user asks a question, model responds, conversation history grows across a few turns — averages roughly 2,000 input tokens and 500 output tokens per turn. On Claude Sonnet 4.7, that's $0.014 per turn. On Claude Opus, $0.067. On GPT-5 Mini, $0.003. On Gemini 3.1 Pro Preview, $0.030.

For a chat product with 10,000 daily active users averaging five turns each per day, that works out to between $150 and $3,350 daily, depending on model choice. Annualized: roughly $55,000 to $1.22M. At this scale your model choice matters more than your user count.

The hidden cost trap is unbounded conversation history. By turn 20 of a long conversation, the context can have grown to 10,000+ tokens, and the per-turn cost has tripled even though the marginal value to the user hasn't changed. The fix is to summarize or truncate history aggressively. Most chat products don't.

RAG gets expensive faster than you'd expect

A standard retrieval-augmented generation pattern — embed the query, retrieve five chunks from a vector store, stuff them into the prompt, generate a response — averages roughly 8,000 input tokens and 800 output tokens per query. On Opus, $0.18 per query. On Sonnet, $0.036. On GPT-5 Mini, $0.007. For the broader argument on when to use RAG vs. long context, see RAG vs fine-tuning.

At 1,000 queries per day, the daily cost ranges from $7 to $180. At 100,000 queries per day — a scale a popular product can reach faster than you expect — the daily cost ranges from $700 to $18,000.

Two specific things drive up the bill. First, RAG queries often pass through a re-ranking step that adds a model call. That's another input-token pass over the candidate chunks. Second, many production RAG systems retrieve more chunks than they need on the theory that more context can't hurt. It can. Every extra chunk is more input tokens on every single query.

Every team underestimates the output side of the bill. The output is where the money goes, and the output is the easiest part of the prompt to constrain.

Cost per typical task across 6 use cases

Cents per request, Claude Sonnet 4.7. Lower is better.

Classification
$0.0006
Simple chat turn
$0.014
RAG query
$0.036
Document summary
$0.18
Long-doc analysis
$0.66
Agent session
$0.10–10+
Output tokens cost 5× input on most frontier models

Batch document processing

The pattern: take a corpus of N documents, send each through the model, get back a structured response. Classification, extraction, summarization at scale. The cost is linear in input tokens times document length plus output tokens times answer length, multiplied by N.

Where this gets surprising is when retries and partial failures get forgotten. A naive batch pipeline that re-runs every failed call is fine when the failure rate is 1%. It's a budget catastrophe when the failure rate is 8%, which isn't unusual when running against a noisy upstream service. The cost of the retries on a bad timeout policy can match or exceed the cost of the primary run.

The fix is to use the batch API tiers most providers offer at roughly 50% off the standard price, with 24-hour turnaround. For any workload that isn't latency-sensitive, this is free money. Most teams don't use it. If the workload is small enough to fit on a local 4B model, see small language models for an even cheaper option.

(A side-note: the $50+ agent session figure isn't hypothetical. I've seen three real instances over the past year — two from teams I've consulted with, one from my own testing — where an agent loop ran for hours before someone noticed. Two were caught by per-day spending caps. One ran until the API key hit the monthly limit.)

Agent loops are the real danger

Agent workflows are where the surprising bills come from. The cost shape is hostile to estimation. An agent that makes three tool calls and returns is cheap. An agent that gets stuck in a loop and makes 200 tool calls before timing out is two orders of magnitude more expensive. Both happen. The first happens by design. The second happens because something upstream broke, the agent didn't notice, and it kept retrying with subtly different prompts.

The schema-change incident is common enough to recognize. A database schema changes. The tool the agent depends on starts returning errors. The agent keeps calling the tool, getting errors, asking the model what to do, and trying again. The model says try again with a different approach. The loop runs for hours before a human notices.

The defenses are well-known and worth implementing. Hard token caps per session. Hard wall-clock timeouts. Automatic alerts when per-session cost crosses a threshold. A per-day spending cap at the provider level. Anthropic's spending limits, OpenAI's usage limits — turn these on, even when the system feels overcautious. Especially when it feels overcautious. For more on agent failure modes, see AI agents, eighteen months in.

(A side note that didn't fit anywhere clean: most teams I've seen don't actually verify their cache-hit rate. They turn on caching in the SDK config, ship, and assume it's working. The actual cache-hit rate often runs lower than expected because prompts drift token-by-token across sessions — a timestamp, a session ID, a small variation that breaks the cache. Audit the bill against the cache-hit metric. If the system pays full price on every call, the discount isn't actually landing.)

Caching changes the math a lot

Prompt caching is the biggest cost-saver introduced in the past two years and the least-talked-about. If your workload sends a large fixed context with each request — a long system prompt, a fixed set of documents, a reference table — caching that prefix drops the input cost on later calls to roughly 10% of standard. For a RAG system with a 5,000-token system prompt firing 100,000 times a day, the savings on Opus alone come to about $675 per day, or roughly $245,000 per year.

This isn't implemented as often as it should be. Most teams either don't know caching exists, don't know which prefix to cache, or have set it up incorrectly and aren't getting the discount. Audit your bill against your cache-hit rate. If the system pays full input price on every call, real money is being left on the table.

Chat

Haiku 4.5 $0.80/1M in

Coding

Opus 4.7 Pay for correctness

RAG

Sonnet 4.6 Sweet spot at scale

Agents

Opus 4.7 Cap your budget

Classification

Phi-4 mini Local, ~$0 per task

Summarization

Gemini 3 Flash $0.30/$2.50 per 1M
1. What's the workload?

Chat, RAG, agent, classification, summarization?

2. How sensitive to wrong answers?

Production code = pay for Opus. Email triage = small model.

3. Volume?

Above 100K req/day, every cent matters. Cache aggressively.

4. Pick model + cap budget

Set a spending limit before scaling. Always.

The rough per-call cost worth keeping in your head

Per-call cost by workload shape, January 2026 pricing
WorkloadOpusSonnetGPT-5 Mini
Simple chat turn (2k / 500)$0.022$0.014$0.003
RAG query (8k / 800)$0.06$0.036$0.007
Document summary (50k / 2k)$0.30$0.18$0.033
Long-doc analysis (200k / 4k)$1.10$0.66$0.116
Agent session (variable)$0.50 – $50+$0.10 – $10+$0.02 – $2+

The pattern I've seen work most often in 2026: in my experience, estimate the workload shape before building. Estimate the workload shape before building. Pick the model that matches the shape, not the model with the best marketing. Use caching on every workload that re-sends a fixed prefix. Put hard caps on every agent loop. Run batch workloads through the batch tier when latency permits. These aren't exotic optimizations. They're table stakes most teams skip.

For solo founders or small teams: start production workloads on Claude Sonnet 4.7 or GPT-5 Mini. Save Opus and GPT-5 for the calls that actually need frontier capability. Watch the bill weekly. The cost dynamics shift quickly. Last month's bill isn't next month's bill once traffic moves.

Turn on the spending limits. Today. The four-figure surprise invoice isn't a lesson worth learning the slow way.

Bottom line

Run the cheapest model that wins each workload. Phi-4 mini for classification and routing. Sonnet 4.6 or Gemini Flash for routine generation. Opus or GPT-5 only for the calls that justify the spend. Cache aggressively (10% rate on cached prefixes). Constrain output tokens — output costs 3-5× input on every model. Cap your agent loops before they cap your budget.

Frequently asked

How much does it cost to run an AI chat product?

On Claude Sonnet 4.6, a typical chat turn (~2K input + 500 output tokens) costs $0.014. For 10,000 daily active users averaging 5 turns each, daily cost runs about $700, or $250K/year before any optimization.

Why do output tokens cost more than input?

Generation is harder than reading — more compute per token. Every model prices output at 3-5× input. Most teams budget around the input side and underestimate the output side, which is where the money actually goes.

What's the cheapest AI for high-volume work?

For frontier-class quality, GPT-5 Mini at $0.50 input / $4 output per million tokens. For open-weight on a hosted endpoint, Llama 4 Scout at $0.10 / $0.40. For self-hosted, Phi-4 mini at zero marginal cost after hardware.

How much does prompt caching save?

Cached prefix tokens run at ~10% of the standard input rate on Anthropic and Google. For a RAG system with a 5K-token system prompt firing 100,000 times daily, caching saves about $675/day on Opus alone.

How do I cap an AI agent's runaway spending?

Hard token caps per session, wall-clock timeouts, automatic alerts when per-session cost crosses a threshold, and per-day spending limits at the provider level. Turn on Anthropic and OpenAI's usage limits even when they feel overcautious.

Changelog

  • May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
  • January 22, 2026 — Updated all pricing to reflect January 2026 rates including Sonnet 4.6 cache adjustment.
  • May 8, 2026 — Originally published.

References

  1. Anthropic, "Pricing," anthropic.com/pricing, accessed May 2026.
  2. OpenAI, "API Pricing," openai.com/api/pricing, accessed May 2026.
  3. Google, "Gemini API models," ai.google.dev/gemini-api/docs/models, accessed May 2026.
  4. Google Cloud, "Vertex AI generative AI pricing," cloud.google.com/vertex-ai/generative-ai/pricing, accessed May 2026.