Context window vs cost: when a 2M window beats a 128K one

2026-04-22 · Choppy Toast

Long-context APIs are priced per token: a bigger window doesn't mean you pay for empty tokens, only for what you actually send. So when does window size matter?

Case A: One-shot long doc

You have a 500K-token contract to review. Options:

- Gemini 2.5 Pro (2M context): one call, 500K × $1.25/M = $0.63 input
- Claude Sonnet 4.6 (200K context): can't fit; needs chunking and stitching

The 2M window wins outright when the doc exceeds the competitor's context.
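
A quick sketch of the arithmetic in Python. The per-million-token input prices are the ones quoted in this post; the 10% chunk-overlap figure is an assumption for illustration, not a vendor number.

```python
# Case A: one-shot long document. Prices are the input rates quoted
# above; the 10% chunk overlap is an assumed figure, not a vendor one.

def one_call_cost(tokens: int, price_per_m: float) -> float:
    """Cost of sending the whole document in a single request."""
    return tokens / 1e6 * price_per_m

def chunked_cost(tokens: int, price_per_m: float, overlap: float = 0.10) -> float:
    """Cost when the document must be split to fit a smaller window.
    Each chunk repeats `overlap` of its neighbor to preserve context,
    so you pay for slightly more than the raw token count."""
    return tokens * (1 + overlap) / 1e6 * price_per_m

print(one_call_cost(500_000, 1.25))  # Gemini 2.5 Pro, one call: $0.625
print(chunked_cost(500_000, 3.00))   # Sonnet 4.6, chunked: $1.65, plus the stitching calls
```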

Case B: Repeated long context

A coding agent that re-reads a 100K-token repo 1000 times/month:

- Without cache: 100K × 1000 × input price. On Sonnet 4.6 (200K context, $3/M input): $300. On Gemini Pro ($1.25/M): $125.
- With cache: Sonnet at a 60% hit rate → ~$132. Gemini at an 80% hit rate → ~$50.

Caching often matters more than raw window size once you're inside the smaller model's limit.
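
Here's that calculation as a parameterized sketch. The cache-read multiplier (what a cached token costs relative to a fresh one) is vendor-specific; the 0.1 and 0.25 values below are assumptions that land near, though not exactly on, the figures above.

```python
# Case B: repeated long context. cache_read_mult is the fraction of the
# normal input price a cached token costs; the 0.1 and 0.25 values used
# below are illustrative assumptions, not official vendor rates.

def monthly_input_cost(tokens: int, calls: int, price_per_m: float,
                       hit_rate: float = 0.0,
                       cache_read_mult: float = 0.1) -> float:
    """Monthly input cost, with cached tokens billed at a discount."""
    total = tokens * calls
    cached = total * hit_rate
    fresh = total - cached
    return (fresh + cached * cache_read_mult) / 1e6 * price_per_m

print(monthly_input_cost(100_000, 1000, 3.00))             # Sonnet, no cache: $300
print(monthly_input_cost(100_000, 1000, 1.25))             # Gemini, no cache: $125
print(monthly_input_cost(100_000, 1000, 3.00, 0.6, 0.1))   # Sonnet, 60% hits: ~$138
print(monthly_input_cost(100_000, 1000, 1.25, 0.8, 0.25))  # Gemini, 80% hits: $50
```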

Case C: Small, many requests

Customer chatbot: 1K tokens in, 200 out. The 2M context is dead weight; even Haiku 4.5's 200K window is 200× more than you need. Per-token price wins.
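
The per-request arithmetic is trivial, which is the point. A sketch with assumed Haiku-class rates ($1/M in, $5/M out; placeholders, not figures from this post):

```python
# Case C: small prompts at volume. The $1/M input and $5/M output rates
# are assumed placeholders for a Haiku-class model; check the rate card.
IN_PRICE, OUT_PRICE = 1.00, 5.00  # $ per million tokens (assumed)

cost_per_request = 1_000 / 1e6 * IN_PRICE + 200 / 1e6 * OUT_PRICE
print(f"${cost_per_request:.4f} per request")  # $0.0020; window size never enters
```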

Rule of thumb

- Single prompt exceeds 128K → Gemini Pro or GPT-4.1 (1M)
- Single prompt under 128K but repeated often → prompt caching on any capable model
- Single prompt under 16K → ignore context, optimize on price + latency
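
Encoded as routing logic, a hypothetical sketch; the thresholds and model suggestions come straight from the list above, everything else is illustrative:

```python
# Hypothetical router for the rule of thumb. Treats the under-16K case
# as overriding caching, i.e. Case C wins for tiny repeated prompts.

def pick_strategy(prompt_tokens: int, repeated_often: bool) -> str:
    if prompt_tokens > 128_000:
        return "long-window model (Gemini Pro or GPT-4.1, 1M)"
    if prompt_tokens < 16_000:
        return "ignore context size; optimize on price + latency"
    if repeated_often:
        return "prompt caching on any capable model"
    return "window size isn't the constraint; compare per-token prices"

print(pick_strategy(500_000, False))  # long-window model
print(pick_strategy(100_000, True))   # prompt caching
print(pick_strategy(1_200, True))     # price + latency
```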

Bigger isn't always cheaper. It's cheaper when your prompt actually uses it.