Open-weights models: when are they actually cheaper?

2026-04-22 · Choppy Toast

Open-weights doesn't mean free. You either rent inference from a hosted provider (Groq, Together, Fireworks, DeepInfra) or you run GPUs yourself.

Hosted open-weights pricing (April 2026, $ per million tokens):

- Llama 3.3 70B on Groq: $0.59 in / $0.79 out
- Llama 4 Scout on Groq: $0.11 in / $0.34 out
- DeepSeek V3.1 (official): $0.27 in / $1.10 out
- Qwen3 72B: $0.50 in / $1.50 out

Comparison anchors, closed models ($ per million tokens):

- GPT-4o mini: $0.15 in / $0.60 out
- Gemini 2.5 Flash-Lite: $0.10 in / $0.40 out
- Claude Haiku 4.5: $1.00 in / $5.00 out

At these price points, hosted open-weights models don't dramatically undercut the closed leaders. Gemini 2.5 Flash-Lite is still the cheapest usable option.
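To make that concrete, here's a minimal sketch (Python) that blends the input and output prices above into a per-request cost. The 3,000-in / 1,000-out token mix is a hypothetical workload, not a benchmark figure.

```python
# Blended cost across the models listed above. Prices are $ per million
# tokens; the per-request token mix is an assumed 3:1 input:output split.
PRICES = {
    "Llama 3.3 70B (Groq)":  (0.59, 0.79),
    "Llama 4 Scout (Groq)":  (0.11, 0.34),
    "DeepSeek V3.1":         (0.27, 1.10),
    "Qwen3 72B":             (0.50, 1.50),
    "GPT-4o mini":           (0.15, 0.60),
    "Gemini 2.5 Flash-Lite": (0.10, 0.40),
    "Claude Haiku 4.5":      (1.00, 5.00),
}

IN_TOK, OUT_TOK = 3_000, 1_000  # assumed tokens per request

def per_1k_requests(p_in: float, p_out: float) -> float:
    """Dollar cost of 1,000 requests at the assumed token mix."""
    return (p_in * IN_TOK + p_out * OUT_TOK) / 1_000_000 * 1_000

for name, (p_in, p_out) in sorted(PRICES.items(),
                                  key=lambda kv: per_1k_requests(*kv[1])):
    print(f"{name:<24} ${per_1k_requests(p_in, p_out):.2f} / 1K requests")
```

At this mix, Llama 4 Scout (~$0.67 per 1K requests) and Flash-Lite (~$0.70) land within pennies of each other, which is exactly the point: hosted open weights sit in the same price band as the cheap closed tier rather than under it.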

When open-weights wins:

1. Data residency / on-prem. The customer demands that no data leave your VPC. Open models let you self-host; closed APIs don't.
2. Fine-tuning. You can LoRA-train on your own task (a sketch follows this list). Closed APIs mostly don't allow it, or charge roughly 10x for inference on the tuned model.
3. Latency. Groq's LPU inference does 800+ tokens/sec on Llama. Nothing closed matches that today.
4. Reasoning at scale. DeepSeek R1 is the cheapest way to run chain-of-thought at scale ($0.55 in / $2.19 out).
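On point 2, here's what a minimal LoRA setup looks like with Hugging Face's peft library. The hyperparameters (rank, target modules) are illustrative assumptions, not a recommended recipe, and the 70B base model needs multiple 80GB GPUs or quantization just to load.

```python
# Minimal LoRA fine-tuning setup with peft. Hyperparameters are
# illustrative; tune rank and target modules for your task.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "meta-llama/Llama-3.3-70B-Instruct"  # gated repo; requires access
                                            # and far more than one GPU at fp16

model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(BASE)

lora = LoraConfig(
    r=16,                                 # adapter rank: keeps trainable params small
    lora_alpha=32,                        # scaling applied to adapter output
    target_modules=["q_proj", "v_proj"],  # attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of base weights
```

The output is a small adapter you can carry between providers, which is the portability closed APIs don't give you.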

Self-hosting break-even

An H100 rented at $2/hr running Llama 3.3 70B at ~40 tokens/sec single-stream produces ~144K output tokens/hour → ~$13.90/M output tokens, even with the GPU 100% busy. That's roughly 18x Groq's $0.79/M. Closing the gap takes batched serving, where dozens of concurrent requests share the GPU and aggregate throughput climbs into the high hundreds of tokens/sec, and then keeping that batch full around the clock, which almost nobody achieves for long.
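Here's the same arithmetic as a sketch (Python); the throughput and utilization numbers are illustrative assumptions, not measurements.

```python
# Self-hosted cost per million output tokens, given GPU rental price,
# aggregate decode throughput, and utilization. All inputs are assumptions.
def cost_per_m_tokens(dollars_per_hour: float,
                      agg_tokens_per_sec: float,
                      utilization: float) -> float:
    tokens_per_hour = agg_tokens_per_sec * 3600 * utilization
    return dollars_per_hour / tokens_per_hour * 1_000_000

print(cost_per_m_tokens(2.00, 40, 1.0))     # single stream, always busy: ~$13.89/M
print(cost_per_m_tokens(2.00, 1_000, 0.5))  # batched, half utilized: ~$1.11/M

# Aggregate throughput needed to break even with Groq at $0.79/M,
# assuming 100% utilization:
print(2.00 / 0.79 * 1_000_000 / 3600)       # ~703 tokens/sec, sustained
```

Sustaining ~700 tokens/sec aggregate, 24/7, is what "100% busy" actually means here.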

The honest answer

Use hosted open-weights for the four specific reasons above. For cost alone, Gemini 2.5 Flash-Lite and GPT-4o mini are hard to beat.