benchr Issue No. 07

AI costs in 2026: a guide

What AI actually costs by workload, why long-context bills are surprising, and the math on every dollar you'll spend.

What this guide covers

Three articles, one buying decision. The price-per-use-case table breaks down what each workload actually costs across the major commercial models. The context-windows piece explains why the advertised window size isn't the size you can actually use. The million-token marketing piece argues that most long-context spend is wasted compared with a properly built retrieval system.

Pricing by workload

  • Analysis · Apr 2026

    The price-per-use-case table

    Six workloads, three frontier models, the cheapest pick for each. Chat costs $0.014 per turn on Sonnet. RAG queries run $0.036. Document summaries climb to $0.18. Agent sessions can hit $50+ if you don't cap them.
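
To turn those per-unit figures into a monthly budget line, a minimal sketch follows. The unit costs are the ones quoted above; the monthly volumes are hypothetical placeholders, not numbers from the article.

```python
# Per-unit costs in USD, as quoted in the teaser above.
UNIT_COST_USD = {
    "chat_turn": 0.014,
    "rag_query": 0.036,
    "document_summary": 0.18,
}


def monthly_cost(volumes):
    """Return the estimated monthly spend per workload, in USD."""
    return {name: count * UNIT_COST_USD[name] for name, count in volumes.items()}


# Hypothetical volumes, for illustration only.
example_volumes = {
    "chat_turn": 100_000,
    "rag_query": 20_000,
    "document_summary": 1_000,
}

costs = monthly_cost(example_volumes)
total = sum(costs.values())

for name, usd in costs.items():
    print(f"{name}: ${usd:,.2f}")
print(f"total: ${total:,.2f}")
```

Swap in your own volumes. Agent sessions are left out because their cost depends on the cap you set, not on a fixed per-unit price.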

Context-window economics

  • Analysis · Feb 2026

    Context windows compared, across four frontier models

    Advertised window vs effective retrieval zone. Claude says 1M, retrieves reliably to ~600K. Gemini says 2M, holds to ~800K. The gap matters: most teams price for the advertised window and pay for the effective one.

  • Essay · May 2026

    The million-token context was always a marketing number

    200K tokens on Claude Opus costs about $1 per query. The same answer via RAG costs $0.06. That's a 17× difference per question. At meaningful volume, the cost structure forces the architecture. Build for retrieval first.
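
The 17× figure is just the ratio of the two per-query costs quoted above. A small sketch of that arithmetic, scaled to a daily volume, might look like this. The per-query costs come from the essay summary; the 10,000 queries per day is an assumed volume for illustration.

```python
# Per-query costs in USD, as quoted in the essay summary.
LONG_CONTEXT_COST = 1.00  # ~200K tokens of context on Claude Opus
RAG_COST = 0.06           # same answer via retrieval


def cost_ratio(long_context, rag):
    """How many times more a long-context query costs than a RAG query."""
    return long_context / rag


def daily_costs(queries_per_day):
    """Return (long-context spend, RAG spend) per day in USD."""
    return queries_per_day * LONG_CONTEXT_COST, queries_per_day * RAG_COST


ratio = cost_ratio(LONG_CONTEXT_COST, RAG_COST)
long_ctx_daily, rag_daily = daily_costs(10_000)  # hypothetical volume

print(f"ratio: about {ratio:.0f}x")
print(f"long context: ${long_ctx_daily:,.0f}/day, RAG: ${rag_daily:,.0f}/day")
```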

When to skip the frontier entirely

  • Essay · Apr 2026

    RAG vs fine-tuning, with the math

    RAG wins almost every time. The three exceptions where fine-tuning earns its place, the math behind each, and the cost breakdown across approaches.

  • Review · Feb 2026

    Small language models, in working use

    Phi-4 mini hits 94% classification accuracy at $0 marginal cost. The 2-point gap to Sonnet 4.6 isn't worth $16 a day in API spend at that volume.

The cost discipline that actually works

Three rules from a year of watching production AI bills run away.

One: constrain output. Cap max-tokens. Force structured formats. Instruct the model to skip the preamble, and trim wherever you can. Output tokens are where the money goes. See the prompt-engineering piece for the techniques.

Two: cache the prefix. Anthropic, Google, and OpenAI all support prompt caching at ~10% of the standard input rate. If your system prompt is the same on every call and you aren't caching it, you're paying about 10× too much for those prefix tokens.
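
A rough sketch of what caching does to the input side of a single call. It assumes cached prefix tokens are billed at 10% of the standard input rate, as stated above, and ignores any cache-write surcharge a provider may apply. The token counts and the $3 per million input tokens are hypothetical.

```python
def input_cost(prefix_tokens, dynamic_tokens, price_per_mtok,
               cached=False, cache_rate=0.10):
    """Input cost of one call in USD.

    prefix_tokens: tokens in the shared system prompt (cacheable).
    dynamic_tokens: tokens that change on every call (never cached).
    price_per_mtok: standard input price per million tokens (assumed).
    cache_rate: fraction of the standard rate charged for cached tokens.
    """
    prefix_price = price_per_mtok * (cache_rate if cached else 1.0)
    return (prefix_tokens * prefix_price + dynamic_tokens * price_per_mtok) / 1_000_000


# Hypothetical call: 4,000-token system prompt, 500-token user message.
uncached = input_cost(4_000, 500, price_per_mtok=3.0)
cached = input_cost(4_000, 500, price_per_mtok=3.0, cached=True)

print(f"uncached: ${uncached:.4f}  cached: ${cached:.4f}")
print(f"overall input saving: {uncached / cached:.1f}x")
```

With these example numbers the prefix itself gets 10× cheaper, while the whole input bill drops by about 5×, because the per-call tokens still pay full price. The larger the shared prefix relative to the rest, the closer the overall saving gets to 10×.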

Three: route by workload. Use the small local model for classification. Use Sonnet or Flash for routine generation. Save Opus and GPT-5 for the calls that actually justify the spend. The comparison tool helps you scope which model fits which workload.
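
The routing rule can be sketched as a lookup table with an explicit escalation flag. The workload labels, the `justifies_frontier` flag, and the choice to default unknown workloads to the mid tier are illustrative assumptions, not part of the comparison tool.

```python
from enum import Enum


class Tier(Enum):
    LOCAL_SMALL = "small local model"
    MID = "Sonnet or Flash"
    FRONTIER = "Opus or GPT-5"


# Hypothetical workload labels mapped to the tiers named above.
ROUTES = {
    "classification": Tier.LOCAL_SMALL,
    "routine_generation": Tier.MID,
}


def route(workload, justifies_frontier=False):
    """Pick a model tier for a workload.

    The frontier tier is used only when the caller explicitly opts in.
    Unknown workloads fall back to the mid tier (an assumed default).
    """
    if justifies_frontier:
        return Tier.FRONTIER
    return ROUTES.get(workload, Tier.MID)


print(route("classification").value)
print(route("routine_generation").value)
print(route("contract_review", justifies_frontier=True).value)
```

Making escalation an explicit opt-in keeps frontier spend visible in code review, rather than letting it become the default.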