Updated April 2026 · 20 models · 8 providers

LLM API Pricing Calculator

Compare real API costs for Claude Opus 4.7, GPT-5, Gemini 2.5 Pro, Llama, DeepSeek and more. Cache-aware, preset workloads, no signup.

#1 cheapest · 🗺️ Llama 4 Scout · $0.11/$0.34 per 1M
#2 cheapest · 💨 Gemini 2.5 Flash-Lite · $0.10/$0.40 per 1M
#3 cheapest · 💠 GPT-4o mini · $0.15/$0.60 per 1M
💨 Gemini 2.5 Flash-Lite (Cheapest) · Google · $0.10/$0.40 per 1M tokens · 1.0M context
$1.13 monthly · $0.00011 per request · Input:Output 29%:71% · saves $190.27

🗺️ Llama 4 Scout (Open) · Meta (via Groq) · $0.11/$0.34 per 1M tokens · 10.0M context
$1.34 monthly · $0.00013 per request · Input:Output 49%:51% · +19% over cheapest

💠 GPT-4o mini · OpenAI · $0.15/$0.60 per 1M tokens · 128K context
$1.83 monthly · $0.00018 per request · Input:Output 34%:66% · +62% over cheapest

🐋 DeepSeek V3.1 (Open) · DeepSeek · $0.27/$1.10 per 1M tokens · 128K context
$3.10 monthly · $0.00031 per request · Input:Output 29%:71% · +174% over cheapest

GPT-5 mini · OpenAI · $0.25/$2.00 per 1M tokens · 400K context
$4.71 monthly · $0.00047 per request · Input:Output 15%:85% · +317% over cheapest

🦙 Llama 3.3 70B (Open) · Meta (via Groq) · $0.59/$0.79 per 1M tokens · 128K context
$5.12 monthly · $0.00051 per request · Input:Output 69%:31% · +353% over cheapest

Gemini 2.5 Flash · Google · $0.30/$2.50 per 1M tokens · 1.0M context
$5.99 monthly · $0.00060 per request · Input:Output 17%:83% · +430% over cheapest

🀄 Qwen3 72B (Open) · Alibaba · $0.50/$1.50 per 1M tokens · 128K context
$6.00 monthly · $0.00060 per request · Input:Output 50%:50% · +431% over cheapest

🧮 DeepSeek R1 (Open) · DeepSeek · $0.55/$2.19 per 1M tokens · 128K context
$6.20 monthly · $0.00062 per request · Input:Output 29%:71% · +449% over cheapest

🎯 o4-mini · OpenAI · $1.10/$4.40 per 1M tokens · 200K context
$12.43 monthly · $0.00124 per request · Input:Output 29%:71% · +1000% over cheapest

🌸 Claude Haiku 4.5 · Anthropic · $1.00/$5.00 per 1M tokens · 200K context
$12.76 monthly · $0.00128 per request · Input:Output 22%:78% · +1029% over cheapest

🚀 GPT-4.1 · OpenAI · $2.00/$8.00 per 1M tokens · 1.0M context
$22.60 monthly · $0.00226 per request · Input:Output 29%:71% · +1900% over cheapest

🌬️ Mistral Large 2 (Open) · Mistral · $2.00/$6.00 per 1M tokens · 128K context
$24.00 monthly · $0.00240 per request · Input:Output 50%:50% · +2024% over cheapest

🔷 Gemini 2.5 Pro · Google · $1.25/$10.00 per 1M tokens · 2.0M context
$24.12 monthly · $0.00241 per request · Input:Output 17%:83% · +2034% over cheapest

🎨 GPT-4o · OpenAI · $2.50/$10.00 per 1M tokens · 128K context
$30.50 monthly · $0.00305 per request · Input:Output 34%:66% · +2599% over cheapest

🎻 Claude Sonnet 4.6 · Anthropic · $3.00/$15.00 per 1M tokens · 200K context
$38.28 monthly · $0.00383 per request · Input:Output 22%:78% · +3288% over cheapest

🦅 Grok 4 · xAI · $5.00/$15.00 per 1M tokens · 256K context
$46.50 monthly · $0.00465 per request · Input:Output 35%:65% · +4015% over cheapest

🌌 GPT-5 · OpenAI · $10.00/$30.00 per 1M tokens · 400K context
$93.00 monthly · $0.00930 per request · Input:Output 35%:65% · +8130% over cheapest

🧩 o3 · OpenAI · $10.00/$40.00 per 1M tokens · 200K context
$113 monthly · $0.011 per request · Input:Output 29%:71% · +9900% over cheapest

🧠 Claude Opus 4.7 · Anthropic · $15.00/$75.00 per 1M tokens · 200K context
$191 monthly · $0.019 per request · Input:Output 22%:78% · +16838% over cheapest

What does an LLM API cost per token?

As of April 2026, production LLM APIs range from $0.10 to $15 per million input tokens, and $0.40 to $75 per million output tokens. Cached input is typically billed at 10-50% of the regular input rate. A typical customer chatbot with 600 input / 200 output tokens per request can run on Google Gemini 2.5 Flash-Lite for under $2 per 10,000 requests, while the same workload on Claude Opus 4.7 without caching would cost about $240. Caching, batching, and model routing are the three biggest levers for controlling spend.

Cheapest production LLM
💨 Gemini 2.5 Flash-Lite · $0.10 / $0.40
Cheapest frontier
🔷 Gemini 2.5 Pro · $1.25 / $10
Most capable coding
🧠 Claude Opus 4.7 · $15 / $75

Frequently Asked Questions

Which LLM API is cheapest in April 2026?

For usable general-purpose LLMs, Google's Gemini 2.5 Flash-Lite is the cheapest at $0.10 per million input tokens and $0.40 per million output tokens. GPT-4o mini is close behind at $0.15/$0.60. DeepSeek V3.1 is the cheapest in the 'smart enough for production' tier at $0.27/$1.10. Among frontier models, Gemini 2.5 Pro dominates at $1.25/$10, undercutting GPT-5 ($10/$30) and Claude Opus 4.7 ($15/$75) for most workloads.

How do I calculate LLM API cost?

Multiply input tokens per request × request count × input price per million ÷ 1,000,000, then do the same for output, then add them. Always use the cached input rate for any portion you expect to cache. A rough shorthand: English text is ~4 characters per token, so 1 word ≈ 1.3 tokens. Reasoning models (o3, Claude extended thinking) may bill for invisible thinking tokens that multiply output cost by 5-20x.
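The formula above can be sketched in a few lines of Python. The prices used are the Gemini 2.5 Flash-Lite rates quoted on this page; substitute your provider's current rates:

```python
def request_cost(input_tokens, output_tokens, in_price, out_price,
                 cached_tokens=0, cached_price=0.0):
    """Dollar cost of one request; prices are $ per 1M tokens."""
    fresh = input_tokens - cached_tokens
    return (fresh * in_price
            + cached_tokens * cached_price
            + output_tokens * out_price) / 1_000_000

# The chatbot workload from the page: 600 in / 200 out on
# Gemini 2.5 Flash-Lite ($0.10 in / $0.40 out), no caching.
per_request = request_cost(600, 200, 0.10, 0.40)
print(f"${per_request * 10_000:.2f} per 10,000 requests")  # → $1.40 per 10,000 requests
```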

What is the Claude Opus 4.7 API price?

Claude Opus 4.7 costs $15 per million input tokens, $75 per million output tokens, and $1.50 per million cached input tokens. It also carries a 25% write surcharge on the first cache-write call. The 10% cached rate is the most aggressive of any frontier model, which matters for workloads that reuse prompt prefixes.
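A hedged sketch of how those three Opus rates interact for a reused prompt prefix; the 4K-token prefix and 1,000 calls are hypothetical workload numbers, not anything measured:

```python
def opus_prefix_cost(prefix_tokens, calls):
    """Input cost for a reused prompt prefix on Claude Opus 4.7, using
    the rates above: $1.50/M cached reads, plus a 25% write surcharge
    ($18.75/M) on the first cache-write call."""
    first_write = prefix_tokens * 18.75 / 1_000_000
    cached_reads = prefix_tokens * (calls - 1) * 1.50 / 1_000_000
    return first_write + cached_reads

cached = opus_prefix_cost(4_000, 1_000)      # ≈ $6.07
uncached = 4_000 * 1_000 * 15 / 1_000_000    # $60.00 for the same prefix, no cache
```

Even with the write surcharge, the cached path is roughly 90% cheaper once the prefix is hit a few hundred times.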

What does GPT-5 cost per token?

GPT-5 costs $10 per million input tokens, $30 per million output tokens, and $2.50 per million cached input tokens as of April 2026. GPT-5 mini is dramatically cheaper at $0.25 in and $2 out. OpenAI automatically caches identical prompt prefixes, so the cached discount requires no manual configuration, unlike Anthropic's explicit cache_control flag.

Is Gemini 2.5 Pro really cheaper than Claude Sonnet 4.6?

Yes, for most workloads. Gemini 2.5 Pro is $1.25 in / $10 out, while Sonnet 4.6 is $3 in / $15 out. Gemini also offers 2M context vs Sonnet's 200K, and cached input is roughly a wash ($0.31 vs $0.30 per million). Sonnet still wins on tool-use reliability, agentic coding, and instruction following, so for coding agents and complex workflows the premium is often justified.

What is prompt caching and how much does it save?

Prompt caching stores a prompt prefix on the provider's servers so repeated requests reuse it at a steep discount. Anthropic charges 10% of regular input for cached tokens. OpenAI charges 25-50%. Google charges 25% plus a storage fee. For production chatbots with stable 2-5K-token system prompts hit thousands of times, caching typically cuts total input cost by 60-80%.
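The blended input price is a one-line weighted average. The 10% multiplier below is Anthropic's rate from above; the 80%-cached split is a hypothetical workload:

```python
def blended_input_price(base, cached_fraction, cached_multiplier):
    """Effective $/1M input when a fraction of input tokens hit the cache."""
    return base * ((1 - cached_fraction) + cached_fraction * cached_multiplier)

# A 4K cached system prompt out of 5K total input (80% cached)
# on a $3/M-input model with a 10% cached rate:
effective = blended_input_price(3.00, 0.80, 0.10)  # → $0.84/M, a 72% cut
```

That 72% reduction lands squarely in the 60-80% range quoted above.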

What are reasoning tokens and how do they affect cost?

Reasoning tokens (OpenAI o-series) and extended thinking tokens (Anthropic) are output tokens the model generates privately to work through a problem before showing an answer. The API doesn't return them but bills for them as output tokens. On o3, a short 200-word visible answer can hide 5,000-50,000 reasoning tokens, multiplying effective output cost by 5-20x. Always set max_tokens limits.

How do I cut my LLM API bill?

Five highest-impact moves: (1) Turn on prompt caching for your system prompt and tool definitions. (2) Trim tool schemas — drop unused tools per request. (3) Cap max_tokens strictly — don't trust the model to stop. (4) Add a cheap classifier in front to route easy requests to a smaller model. (5) Use Batch API for anything non-user-facing (50% discount on Anthropic, OpenAI, Google).

Is DeepSeek cheaper than OpenAI?

Only sometimes. DeepSeek V3.1 at $0.27/$1.10 actually costs more than GPT-4o mini ($0.15/$0.60) on both input and output. For reasoning workloads, though, DeepSeek R1 ($0.55/$2.19) is roughly half the price of o4-mini ($1.10/$4.40) and a fraction of o3. DeepSeek's open weights also let you self-host, which nothing from OpenAI allows.

Which LLM has the largest context window?

Gemini 2.5 Pro has 2 million tokens (Enterprise tier), with 1 million standard. GPT-4.1 has 1 million. GPT-5 has 400K. Claude models top out at 200K. Llama 4 Scout has 10 million on paper but recall drops off past 1 million in practice. For documents over 200K tokens, Gemini 2.5 Pro or GPT-4.1 are the only serious choices.

Is open-source LLM hosting cheaper than using APIs?

Usually not for cost alone. Hosted open-weights like Llama 3.3 70B on Groq ($0.59/$0.79) don't undercut Gemini Flash-Lite ($0.10/$0.40) or GPT-4o mini ($0.15/$0.60). Self-hosting on an H100 ($2/hr rental) breaks even around 100% GPU utilization, which almost nobody achieves. Open-weights win for data residency, fine-tuning, or ultra-low-latency inference — not pure cost.
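The break-even arithmetic can be sketched as follows. The ~700 output tokens/sec aggregate throughput for a 70B model on one H100 is an assumed round number for illustration, not a benchmark:

```python
def breakeven_utilization(gpu_dollars_per_hr, tokens_per_sec, api_price_per_m):
    """Fraction of each hour the GPU must be saturated for self-hosting
    to match the API price (values near or above 1.0 mean it can't win)."""
    tokens_per_hr = tokens_per_sec * 3600
    api_cost_per_hr = tokens_per_hr * api_price_per_m / 1_000_000
    return gpu_dollars_per_hr / api_cost_per_hr

# H100 rental at $2/hr vs Groq's $0.79/M output rate for Llama 3.3 70B,
# assuming ~700 tokens/sec sustained aggregate throughput:
u = breakeven_utilization(2.00, 700, 0.79)  # ≈ 1.0, i.e. ~100% utilization
```

Under those assumptions the GPU has to be saturated essentially around the clock just to tie the hosted price, which is the "almost nobody achieves" point above.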

What is the Anthropic Batch API discount?

Anthropic Batch API offers 50% off both input and output prices, with results delivered within 24 hours. It's available on all Claude models. OpenAI and Google offer the same 50% batch discount. Batch is perfect for evaluations, offline classification, content generation for blogs, and data enrichment — anything that isn't user-facing.

How does context window affect pricing?

Context window is a ceiling, not a price driver — you pay only for tokens you actually send, not for the unused space. A 2M-context model doesn't cost more for a 1K-token prompt than a 128K-context model. The tradeoff is that bigger windows sometimes come with higher per-token rates, but Gemini 2.5 Pro proves this isn't a rule — it has the biggest window and one of the lowest prices.

Why does my LLM bill keep going up?

The usual suspects: (1) Tool definitions bloating input tokens on every call — a 15-tool agent adds 3-5K tokens per request. (2) Chat history replay on each turn — a 20-turn conversation is ~55x the tokens of the first turn. (3) Unlimited max_tokens letting the model ramble. (4) Retries on JSON-mode failures. (5) Image inputs at 1500-3000 tokens each. Caching fixes most of these.
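The history-replay effect is quadratic, which is why it dominates bills. A sketch with hypothetical message sizes (1,000-token first turn, ~200 tokens added per exchange):

```python
def replayed_input_tokens(turns, first_turn, growth_per_turn):
    """Total input tokens billed across a conversation when every turn
    replays the full history with no caching: growth is quadratic."""
    return sum(first_turn + k * growth_per_turn for k in range(turns))

total = replayed_input_tokens(20, 1_000, 200)
ratio = total / 1_000   # 58x the first turn's input tokens
```

The exact multiple depends on message sizes, but with these assumed numbers a 20-turn chat bills 58x the first turn's input, in line with the ~55x figure above.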

What's the cheapest way to run a chatbot?

Gemini 2.5 Flash-Lite with aggressive prompt caching. For 10,000 conversations per month at 600 input / 200 output per turn with a 4K cached system prompt, you're looking at under $10/month. GPT-4o mini is close behind at ~$15/month. DeepSeek V3.1 is $20-40. Any frontier model starts at 10-30x that cost, which is rarely worth it for simple customer chatbots.
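The Flash-Lite figure above can be reproduced with back-of-envelope arithmetic, assuming one billed turn per conversation and Google's 25% cached-input rate (context-cache storage fees excluded):

```python
CONVS = 10_000                                 # conversations per month
IN_FRESH, OUT, SYS_CACHED = 600, 200, 4_000    # tokens per conversation

monthly = (CONVS * IN_FRESH * 0.10             # fresh input at $0.10/M
           + CONVS * SYS_CACHED * 0.025        # cached prompt at 25% of $0.10
           + CONVS * OUT * 0.40) / 1_000_000   # output at $0.40/M
print(f"${monthly:.2f}/month")                 # → $2.40/month
```

Multi-turn conversations and cache storage fees push the real number higher, which is why the answer above hedges to "under $10/month".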

Pricing data reflects publicly listed API rates as of April 2026 and is provided for comparison only. Always confirm current prices on each provider's pricing page before signing agreements. Claude, GPT, Gemini, Llama, DeepSeek, Grok, Mistral, and Qwen are trademarks of their respective owners. This site is not affiliated with any LLM provider.