RAG vs fine-tuning: how to decide for a real product
RAG is cheaper to get started and updates instantly. Fine-tuning costs more up front and takes days to train. Neither is obviously better. The answer depends on your data freshness tolerance, latency budget, and how much domain knowledge your model actually needs to internalize.
Every AI product team asks the same question: should we fine-tune a model or build RAG? The literature gives you definitions. RAG retrieves documents. Fine-tuning updates weights. But that does not tell you which to build first. This post works through five questions that separate good decisions from expensive ones.
What RAG actually does
RAG means Retrieval-Augmented Generation. You take a user query. You search your document collection with an embedding model (OpenAI's `text-embedding-3-large` or open source alternatives). You find the 5-10 most similar documents. You pass those documents to Claude or GPT-4 as context. The model generates a response grounded in those documents.
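To make the pipeline concrete, here is a minimal sketch. It assumes the OpenAI SDK for embeddings, the Anthropic SDK for generation, and a hypothetical `vector_store.search()` helper standing in for whatever vector database you use. Model names and the top-k value are illustrative, not prescriptive.

```python
# Minimal RAG sketch: embed the query, retrieve similar documents, generate a grounded answer.
# Assumes: OpenAI SDK for embeddings, Anthropic SDK for generation, and a hypothetical
# `vector_store` object standing in for your vector database. Model names are illustrative.
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()
anthropic_client = Anthropic()

def answer(query: str, vector_store, top_k: int = 8) -> str:
    # 1. Embed the user query.
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-large",
        input=query,
    ).data[0].embedding

    # 2. Retrieve the most similar documents (hypothetical search helper).
    docs = vector_store.search(embedding, top_k=top_k)

    # 3. Ask the model to answer using only the retrieved context.
    context = "\n\n".join(d.text for d in docs)
    response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-latest",  # pick whatever model is current
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n{context}\n\nQuestion: {query}",
        }],
    )
    return response.content[0].text
```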
RAG is fast. Search takes 50-200 milliseconds. Inference takes 1-5 seconds depending on response length, so total latency is roughly 1-5 seconds per query. It is also cheap: you pay for embeddings (on the order of cents per million tokens embedded) and for LLM tokens at inference time. There is no training bill.
RAG updates instantly. Index a new document at 5pm. Search finds it at 5:01pm. You do not wait for the next training run. The trade-off is hallucination. The model will sometimes make up facts that are not in your documents. The vector search must retrieve the right documents. If you index poorly, the model cannot help you.
A vector database is the enabling technology. Pinecone is easiest to start. Weaviate and Qdrant are open source. PostgreSQL with pgvector works if you already use Postgres. The choice matters at scale (100M+ vectors) but not at the start.
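If you already run Postgres, pgvector is the lowest-friction way to try this. A minimal sketch, assuming a `documents` table with `content` and `embedding` columns; the table name, schema, and embedding dimension are placeholders.

```python
# Nearest-neighbour lookup with Postgres + pgvector. Assumes a table like:
#   CREATE EXTENSION vector;
#   CREATE TABLE documents (id serial PRIMARY KEY, content text, embedding vector(1536));
# Table name, schema, and dimension (1536 fits text-embedding-3-small) are placeholders.
import psycopg

def top_documents(conn: psycopg.Connection, query_embedding: list[float], k: int = 8) -> list[str]:
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT content
            FROM documents
            ORDER BY embedding <=> %s::vector  -- cosine distance, closest first
            LIMIT %s
            """,
            (vector_literal, k),
        )
        return [row[0] for row in cur.fetchall()]
```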
What fine-tuning actually does
Fine-tuning updates the model weights using your own data. You provide 100-10,000 examples of input and output. You pay OpenAI or Anthropic. They run training for 12-72 hours. The model learns to mimic the pattern in your data.
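For hosted fine-tuning through the OpenAI API, the flow is roughly: upload a JSONL file of chat examples, then start a job. A minimal sketch; the file path and base model name are placeholders, and other providers have their own paths.

```python
# Hosted fine-tuning via the OpenAI API: upload training examples, then start a job.
# File path and base model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

# Each line of train.jsonl is one example, e.g.:
# {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # base model; pick whatever is current
)

print(job.id, job.status)  # poll client.fine_tuning.jobs.retrieve(job.id) until it finishes
```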
Fine-tuning is slow. You cannot iterate quickly: generate new training data, wait for a run, evaluate. That cycle takes days. You also cannot fine-tune continuously; in practice, retraining more than about once a week is rarely worth the operational overhead.
Fine-tuning is expensive. A full in-house fine-tune of a 7B model, once you count GPU time, data labeling, and engineering effort, can run $50-200k. Hosted fine-tuning through a provider API is far cheaper per run: training rates are on the order of a few dollars per million tokens, so a training set of 100k examples of 500 tokens each (about 50M tokens) costs roughly $150-300. At inference time, fine-tuned models cost 10-30 percent more per token than base models.
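The token count is what drives the bill, so the arithmetic is worth writing down. A quick sanity check, with the training rate treated as an illustrative assumption:

```python
# Back-of-the-envelope fine-tuning training cost. The rate is an assumption, not a quote.
examples = 100_000
tokens_per_example = 500
price_per_million_tokens = 3.00  # assumed training rate in USD

training_tokens = examples * tokens_per_example            # 50,000,000 tokens
cost = training_tokens / 1_000_000 * price_per_million_tokens
print(f"{training_tokens:,} training tokens -> ${cost:,.0f}")  # ~$150 at this rate
```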
Fine-tuning encodes knowledge permanently. The model learns to understand domain terminology. It learns your style. It learns when to be concise versus verbose. RAG cannot do this as well. RAG passes knowledge as context. Fine-tuning makes it part of the model.
The decision framework: five questions
Start with question 1. If you answer no, stop. You probably do not need fine-tuning.
1. Is your data stable for 6+ months? Fine-tuning assumes your knowledge domain does not change dramatically. If your training data will be obsolete in 2 months, fine-tuning does not work. RAG handles this. New documents get indexed. Search finds them. If your product is a customer support agent and your company ships a new feature every 2 weeks, you need RAG, not fine-tuning.
2. Do you need sub-2-second latency? RAG adds 1.5-2 seconds to response time. If your UI cannot tolerate that, you need fine-tuning or a smaller model. Most AI products can afford 2-5 seconds. If you cannot, RAG is still viable with aggressive caching and streaming tokens to the user while the LLM thinks.
3. Is your knowledge primarily in documents or primarily in learned style? A document retrieval system (legal search, technical docs) is RAG. A coding assistant needs to know Python semantics and best practices. That is learned style. That is fine-tuning. A customer support agent that needs to know your product but also needs documents is a hybrid.
4. What is your cost ceiling per query? RAG costs $0.005-0.02 per query (search plus tokens), so 100k queries per month runs $500-2,000. A fine-tuned model is cheaper per query once trained, because the prompts are shorter, but the training cost is paid up front and at low volume that overhead dominates. Break-even is around 1M queries per month; a back-of-the-envelope calculation follows this framework.
5. How much will the model hallucinate if grounded only in documents? Some tasks need the model to reason and synthesize. RAG can handle this. Other tasks need ground truth and zero hallucination. A medical diagnosis tool needs RAG plus validation. A financial advisor needs RAG plus expert review. Do not rely on fine-tuning to eliminate hallucination. It does not.
This framework does not tell you fine-tuning is better. It tells you when to consider it. Most teams should start with RAG.
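To put numbers behind question 4, here is a toy break-even calculator. Every figure in it is an assumption: the per-query costs echo the ranges above, and the training figure uses the full-project view rather than the API-only cost. Replace them with your own measurements.

```python
# Toy monthly cost comparison: RAG vs an amortized fine-tune. All figures are assumptions.
rag_cost_per_query = 0.008        # search + extra context tokens
ft_cost_per_query = 0.0015        # shorter prompts, modest per-token premium
ft_training_cost = 50_000.0       # full-project training spend (GPUs, data, engineering)
amortize_months = 12              # how long you expect the fine-tune to stay useful

for monthly_queries in (10_000, 100_000, 1_000_000):
    rag = monthly_queries * rag_cost_per_query
    ft = monthly_queries * ft_cost_per_query + ft_training_cost / amortize_months
    winner = "fine-tuning" if ft < rag else "RAG"
    print(f"{monthly_queries:>9,} queries/month: RAG ${rag:,.0f} vs FT ${ft:,.0f} -> {winner}")
```

With these assumptions, RAG wins at 10k and 100k queries per month and fine-tuning wins around the 1M mark; shift any of the inputs and the crossover moves with it.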
Real numbers from production
| Metric | RAG (200k docs) | Fine-tuning (7B model) | Winner |
|---|---|---|---|
| Cost per 1k queries | $4-8 | $1.50 (after amortized training) | Fine-tuning at scale |
| Latency (p95) | 3-4 seconds | 2-3 seconds | Fine-tuning |
| Time to production | 2-3 days | 2-3 days (training) plus setup | RAG |
| Update frequency | Real-time (within 1 hour) | Weekly or monthly | RAG |
| Answer accuracy | 88-92 percent (retrieval-grounded) | 92-96 percent (learned) | Fine-tuning |
| Hallucination rate | 2-5 percent | 1-3 percent | Fine-tuning |
The latency advantage of fine-tuning is overstated. Most of the speed gain comes from shorter prompts, not from the model itself. RAG has to pass retrieved documents in the prompt, so prompts are long. A fine-tuned model carries that context in its weights and can work from a short prompt.
The accuracy gap is real but smaller than it looks. Fine-tuning learns patterns. RAG finds documents. On narrow tasks where the answer is a learned pattern, fine-tuning is better. On reasoning tasks, RAG plus a good model is competitive.
The hybrid pattern: when it actually pays
The best approach for many teams is neither RAG nor fine-tuning alone. It is both.
Use RAG for document retrieval and current information. The documents ground the response. Use fine-tuning on a small 3B or 7B model for style and format. The fine-tuned model learns to write in your domain's voice. It learns which fields matter for your customers. It learns to structure output in a specific way. It does not learn retrieval. Retrieval is a search problem. Style is a generation problem.
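A minimal sketch of the hybrid shape, reusing a retrieval step like the one earlier. For brevity it calls a hosted fine-tuned model ID; a self-hosted 3B or 7B model slots into the same place. The `ft:...` model name and the retrieval helper are placeholders.

```python
# Hybrid pattern sketch: RAG supplies the facts, a fine-tuned model supplies the voice.
# The fine-tuned model ID and the retrieval helper are placeholders.
from openai import OpenAI

client = OpenAI()
FINE_TUNED_MODEL = "ft:gpt-4o-mini-2024-07-18:acme::placeholder"  # from your training job

def hybrid_answer(query: str, vector_store) -> str:
    # Retrieval stays a search problem: ground the answer in current documents.
    docs = vector_store.search_text(query, top_k=8)  # hypothetical helper returning strings
    context = "\n\n".join(docs)

    # Generation goes to the fine-tuned model, which has learned your style and format.
    response = client.chat.completions.create(
        model=FINE_TUNED_MODEL,
        messages=[
            {"role": "system", "content": "Answer in our house style, using only the context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```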
This hybrid pattern costs more than RAG alone. Training a 7B model on 10k examples costs $100-300. Inference costs 2-3x more than base models. But you get instant updates (via RAG), consistent style (via fine-tuning), and lower hallucination (documents ground the response).
The hybrid pattern is worth the cost if:
- You have 10k+ labeled examples of desired output.
- Your domain has specific terminology your model needs to internalize.
- Your users care about consistency and style, not just accuracy.
- You are willing to maintain two systems (vector DB plus fine-tuned model).
Most teams start with RAG. A year in, they fine-tune. Two years in, they realize fine-tuning was not worth it and return to RAG with better prompts.
Where this lives at Empyreal
We build AI products that use RAG, fine-tuning, or both. Every decision depends on your specific constraints. Our experience: RAG is the right default. Fine-tuning is the right second choice if retrieval is too slow or your training data is stable enough. If you are building an AI product and want to avoid expensive mistakes, start with our LangChain service. We handle RAG architecture, vector database selection, and retrieval optimization. For Claude integration specifically, see our Claude API service. And if you want to explore fine-tuning, see OpenAI integration. Read more about RAG in our glossary, and about vector databases.