How to pick an LLM for RAG search
In RAG, retrieval quality matters more than generation quality. If your retriever returns the right chunks, any decent model can write the answer. If it doesn't, no model can fix it.
Three tiers that work for RAG:
- Budget: Gemini 2.5 Flash-Lite ($0.10 / $0.40). Handles 1M context if you want to skip chunking. Good at citation-style answers.
- Standard: GPT-4o mini or Haiku 4.5. Slight quality upgrade on synthesis and edge cases.
- When you need it: Sonnet 4.6 or Gemini 2.5 Pro, for multi-hop questions that require reasoning across 3+ retrieved chunks.
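The tiering above can be wired into a simple router: default to the budget model and escalate only when the query actually needs cross-chunk reasoning. The model IDs and the 3-chunk threshold here are illustrative placeholders, not official API identifiers.

```python
def pick_model(num_chunks: int, multi_hop: bool) -> str:
    """Route a RAG query to a generation tier based on retrieval complexity.

    Model names are placeholders; swap in your provider's real model IDs.
    """
    if multi_hop and num_chunks >= 3:
        # Reasoning across several retrieved chunks: pay for the big model.
        return "sonnet-4.6"
    # Everything else: the budget tier is usually indistinguishable.
    return "flash-lite"
```

Escalating per-query instead of per-deployment keeps the average bill near the budget tier while preserving quality on the hard tail.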
What to actually optimize:
1. Chunk quality. Spend a week on chunking strategy before upgrading the LLM.
2. Re-ranking. A $0.01 re-rank pass on top-50 → top-5 beats any model upgrade.
3. Caching the system prompt. Your instructions don't change per query.
4. Structured output. JSON mode forces focused answers, saving output tokens.
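The re-rank step (point 2) has a simple shape: score each of the top-50 candidates against the query, keep the best 5. In production the scorer is a cross-encoder or a hosted re-rank API; the lexical-overlap scorer below is a stand-in so the sketch runs without any dependencies.

```python
def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Keep the top_k chunks most relevant to the query.

    Scoring here is naive word overlap -- a placeholder for a real
    cross-encoder or re-rank API, which scores query/chunk pairs jointly.
    """
    q_terms = set(query.lower().split())

    def score(chunk: str) -> int:
        # Count query terms appearing in the chunk.
        return len(q_terms & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)[:top_k]
```

The interface is the part that matters: the retriever stays recall-oriented (top-50), and the re-ranker handles precision, so neither component has to do both jobs.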
Example bill (10K queries/month, avg 4K retrieved context + 500-token question, 400-token answer):
- Flash-Lite: $19
- GPT-4o mini: $29
- Haiku 4.5 (cache): $48
- Sonnet 4.6 (cache): $138
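A back-of-envelope estimator for bills like these. The parameter names are assumptions for illustration; a real invoice also includes cache writes and reads, embedding calls, and re-rank fees, so token math alone will undershoot it.

```python
def monthly_cost(queries: int, in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Estimate monthly generation spend in USD.

    in_price / out_price are USD per 1M tokens. Ignores prompt caching,
    embeddings, and re-ranking, which a real bill would add on top.
    """
    total_in = queries * in_tokens    # total input tokens per month
    total_out = queries * out_tokens  # total output tokens per month
    return (total_in * in_price + total_out * out_price) / 1_000_000
```

Running this per candidate model turns "which tier can we afford?" into a one-line calculation instead of a guess.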
A $19/month RAG backbone leaves room to spend on a real re-ranker or better embeddings — which almost always moves quality more than a bigger generator.