Three LLM routing strategies that actually cut costs

2026-04-22 · Choppy Toast

1. Classifier-first routing

Spend ~$0.10 per million tokens asking a tiny model "is this request simple or hard?", then send 80% of the traffic to a cheap model and 20% to a smart one.
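A minimal sketch of what that looks like in code. The model names, the one-word classifier prompt, and the `client.call()` helper are placeholders for your own stack, not any real SDK:

```python
# Classifier-first routing sketch. CLASSIFIER, CHEAP_MODEL, SMART_MODEL
# and client.call() are illustrative placeholders, not a real API.

CLASSIFIER = "flash-lite"
CHEAP_MODEL = "haiku-4.5"
SMART_MODEL = "opus-4.7"

CLASSIFIER_PROMPT = (
    "Answer with exactly one word, EASY or HARD.\n"
    "Is the following request simple (lookup, rewrite, short summary) "
    "or hard (multi-step reasoning, math, code)?\n\nRequest:\n{request}"
)

def route(client, request: str) -> str:
    """Ask the tiny classifier first, then dispatch to cheap or smart."""
    verdict = client.call(CLASSIFIER, CLASSIFIER_PROMPT.format(request=request))
    model = SMART_MODEL if "HARD" in verdict.upper() else CHEAP_MODEL
    return client.call(model, request)
```

Forcing a one-word answer keeps the classifier's output tokens (the expensive part of its bill) near zero.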

Math for a 100K-req/month app averaging 1K input / 500 output tokens:

- All Opus 4.7: $4,250
- All Haiku 4.5: $370
- Classifier (Flash-Lite) + 80% Haiku / 20% Opus: $110 classifier + $296 Haiku + $850 Opus = $1,256
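The blended figure follows directly from the two all-in totals above, so you can recompute it without knowing per-token rates:

```python
# Recomputing the routed cost from the post's own monthly totals
# (not from per-token rates, which aren't stated).
ALL_OPUS = 4250.0    # $/month, all traffic on Opus 4.7
ALL_HAIKU = 370.0    # $/month, all traffic on Haiku 4.5
CLASSIFIER = 110.0   # $/month, Flash-Lite pass over every request
EASY_SHARE = 0.80    # fraction routed to the cheap model

routed = CLASSIFIER + EASY_SHARE * ALL_HAIKU + (1 - EASY_SHARE) * ALL_OPUS
savings = 1 - routed / ALL_OPUS

print(f"${routed:,.0f}/month, {savings:.0%} saved vs all-Opus")
# 110 + 296 + 850 = 1,256; savings ≈ 70%
```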

Saves 70% vs all-Opus with a barely measurable quality drop, provided the classifier is accurate enough.

2. Cascading

Try the cheap model first. If its confidence is low (log-probs, self-evaluation, or an explicit "did that answer work?" pass), retry on a smarter model.
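A sketch of a two-tier cascade using the explicit self-evaluation pass. As before, the model names and `client.call()` are hypothetical stand-ins:

```python
# Two-tier cascade sketch: draft on the cheap model, self-check,
# escalate on doubt. Model names and client.call() are placeholders.

CHEAP_MODEL = "haiku-4.5"
SMART_MODEL = "opus-4.7"

CHECK_PROMPT = (
    "Question:\n{q}\n\nDraft answer:\n{a}\n\n"
    "Does the draft fully and correctly answer the question? "
    "Reply with exactly YES or NO."
)

def cascade(client, request: str) -> tuple[str, str]:
    """Return (answer, model_used); escalate when the draft fails its own check."""
    draft = client.call(CHEAP_MODEL, request)
    verdict = client.call(CHEAP_MODEL, CHECK_PROMPT.format(q=request, a=draft))
    if verdict.strip().upper().startswith("YES"):
        return draft, CHEAP_MODEL
    # Low confidence: one extra call, this time to the smart model.
    return client.call(SMART_MODEL, request), SMART_MODEL
```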

Works great when ~70-85% of traffic is easy. Overhead per escalation is ~1 extra model call, so with easy fraction p the per-request cost is cheap + (1 - p) × expensive; that beats sending everything to the expensive model whenever the price ratio exceeds 1/p, roughly 1.2-1.4x here, which every provider pair clears by a wide margin.
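A quick worked example of the cascade economics. The 5:1 price ratio is an illustrative assumption, not any provider's quote:

```python
# Cascade economics under an assumed 5:1 price ratio (arbitrary units).
CHEAP = 1.0
EXPENSIVE = 5.0

def cascade_cost(easy_fraction: float) -> float:
    """Per-request cost: everyone pays the cheap tier; escalations also pay the expensive one."""
    return CHEAP + (1 - easy_fraction) * EXPENSIVE

for p in (0.70, 0.80, 0.85):
    c = cascade_cost(p)
    print(f"easy={p:.0%}: cascade={c:.2f} vs all-expensive={EXPENSIVE:.2f} "
          f"({1 - c / EXPENSIVE:.0%} saved)")
```

At 80% easy traffic this works out to 2.0 vs 5.0 per request, a 60% saving even after paying for every wasted cheap draft.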

3. Confidence-based

Use the cheap model's own estimate of how sure it is. For classification and extraction tasks, log-prob thresholds give clean escalation signals. Harder to calibrate for open generation.
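For the classification case, the gate can be as small as converting the top label token's log-prob to a probability and comparing it to a threshold. The 0.90 cutoff and the flat-float input are assumptions; real provider responses nest log-probs in a per-token structure:

```python
import math

# Log-prob gating sketch for a classification task. THRESHOLD is an
# assumed cutoff; tune it against a labeled sample of your traffic.
THRESHOLD = 0.90  # escalate when the top label's probability is below this

def should_escalate(top_token_logprob: float) -> bool:
    """Convert the cheap model's top-token log-prob to a probability and gate on it."""
    return math.exp(top_token_logprob) < THRESHOLD

# A near-certain label (p ≈ 0.99) stays on the cheap model...
assert not should_escalate(math.log(0.99))
# ...while a shaky one (p ≈ 0.60) gets retried on the smart model.
assert should_escalate(math.log(0.60))
```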

What usually doesn't work

Rule-based routing on prompt length. Token count is not a great proxy for difficulty: a 200-token math problem is harder than a 10K-token "summarize this meeting."

The easy win

Most teams haven't tried even #1. A Friday-afternoon classifier in front of your API cuts bills more than any model switch will.