Context window vs cost: when a 2M window beats a 128K one
2026-04-22 · Choppy Toast
Long-context APIs are priced per token: a bigger window doesn't mean you pay for empty tokens, only for what you actually send. So when does the bigger window matter?
Case A: One-shot long doc
You have a 500K-token contract to review. Options:

- Gemini 2.5 Pro (2M context): one call, 500K × $1.25/M = $0.63 input
- Claude Sonnet 4.6 (200K context): can't fit → need chunking + stitching
The 2M window wins outright when the doc exceeds the competitor's context.
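To see what chunking actually costs, here's a minimal Python sketch. The $3/M Sonnet input price is implied by Case B's numbers below; the chunk-overlap ratio and stitching-pass size are illustrative assumptions, not measured figures:

```python
# Sketch: one big call vs. chunk-and-stitch for a long document.
# The overlap ratio and stitch pass size are illustrative assumptions.

def one_call_cost(doc_tokens: int, price_per_m: float) -> float:
    """Cost of sending the whole document in a single request."""
    return doc_tokens / 1e6 * price_per_m

def chunked_cost(doc_tokens: int, price_per_m: float,
                 overlap: float = 0.10, stitch_tokens: int = 20_000) -> float:
    """Cost of splitting the document across window-sized calls:
    each chunk boundary re-sends some context (overlap), and a final
    pass merges the per-chunk outputs (stitch_tokens)."""
    resent = int(doc_tokens * (1 + overlap))
    return (resent + stitch_tokens) / 1e6 * price_per_m

print(f"Gemini 2.5 Pro, one call: ${one_call_cost(500_000, 1.25):.2f}")
print(f"Sonnet 4.6, chunked:      ${chunked_cost(500_000, 3.00):.2f}")
```

Even before any quality loss from stitching, the chunked path costs more per document and needs multiple round trips.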
Case B: Repeated long context
A coding agent that re-reads a 100K-token repo 1000 times/month sends 100K × 1000 = 100M input tokens:

- Without cache: on Sonnet 4.6 (200K context, $3/M input): $300. On Gemini Pro ($1.25/M): $125.
- With cache: Sonnet at a 60% hit rate → ~$138. Gemini at an 80% hit rate → ~$50.
Caching often matters more than raw window size once you're inside the smaller model's limit.
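The blended cost is easy to compute. A sketch, assuming cache reads bill at 0.1× the base input price on Anthropic and 0.25× on Gemini, and ignoring cache-write surcharges and storage fees:

```python
def cached_monthly_cost(tokens_per_call: int, calls: int, price_per_m: float,
                        hit_rate: float, read_multiplier: float) -> float:
    """Blend cached reads (billed at a fraction of the input price)
    with misses (billed at full price). Ignores cache-write surcharges
    and storage fees -- a simplifying assumption."""
    total_tokens = tokens_per_call * calls
    blended_rate = hit_rate * read_multiplier + (1 - hit_rate)
    return total_tokens / 1e6 * price_per_m * blended_rate

# 100K-token repo, 1000 calls/month:
print(cached_monthly_cost(100_000, 1000, 3.00, 0.6, 0.10))  # Sonnet: ~$138
print(cached_monthly_cost(100_000, 1000, 1.25, 0.8, 0.25))  # Gemini: ~$50
```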
Case C: Small, many requests
Customer chatbot: 1K tokens in, 200 out per request. The 2M context is dead weight; even Haiku 4.5's 200K window is 200× more than you need. Per-token price wins.
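At this scale the bill is pure per-token arithmetic. A quick sketch; the per-million prices here are placeholder assumptions, not quoted rates:

```python
def request_cost(tokens_in: int, tokens_out: int,
                 price_in: float, price_out: float) -> float:
    """One request: input and output tokens billed separately, per million."""
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

# 1K in, 200 out, 1M requests/month; prices per million tokens are
# placeholder assumptions for illustration.
for name, p_in, p_out in [("Haiku 4.5", 1.00, 5.00),
                          ("Gemini 2.5 Pro", 1.25, 10.00)]:
    print(f"{name}: ${request_cost(1_000, 200, p_in, p_out) * 1_000_000:,.0f}/month")
```

The window never enters the formula; only the per-token prices do.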
Rule of thumb
- Single prompt exceeds 128K → Gemini Pro or GPT-4.1 (1M)
- Single prompt under 128K but repeated often → prompt caching on any capable model
- Single prompt under 16K → ignore context, optimize on price + latency
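The same rule as a routing function, a minimal sketch: the thresholds are the ones above, and the return strings are just shorthand for each branch.

```python
def pick_strategy(prompt_tokens: int, repeats_per_month: int = 1) -> str:
    """Route a workload using the thresholds above (a sketch, not a policy engine)."""
    if prompt_tokens > 128_000:
        return "long-context model (Gemini Pro or GPT-4.1)"
    if repeats_per_month > 1:
        return "prompt caching on any capable model"
    if prompt_tokens < 16_000:
        return "ignore context; optimize on price + latency"
    return "any capable model; the window isn't the constraint"

print(pick_strategy(500_000))        # long contract -> big window
print(pick_strategy(100_000, 1000))  # repo re-read  -> caching
print(pick_strategy(1_000))          # chatbot turn  -> price + latency
```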
Bigger isn't always cheaper. It's cheaper when your prompt actually uses it.