Retrieval-augmented generation (RAG) is the most practical AI pattern for the majority of applications. Instead of fine-tuning a model on your data (expensive, slow, and often unnecessary), you retrieve relevant documents at query time and feed them to the model as context. Here is how to implement it properly.
What RAG Actually Does
RAG solves a simple problem: LLMs know a lot about the world but nothing about your specific data. When a user asks "what is our refund policy?" the model cannot answer from its training data. RAG retrieves your actual refund policy document and gives the model the context it needs.
The pipeline: user query goes in, gets converted to an embedding vector, nearest neighbors are found in your vector database, those documents get stuffed into the LLM prompt as context, and the model generates an answer grounded in your actual data.
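The whole pipeline fits in a few functions. Here is a minimal sketch with a toy bag-of-words "embedding" and brute-force cosine similarity standing in for a real embedding model and vector database (every name here is illustrative):

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Stand-in for a vector-database nearest-neighbor search.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Our refund policy: refunds are issued within 14 days of purchase.",
    "Our office is open Monday to Friday.",
    "To cancel your plan, visit account settings.",
]
question = "what is our refund policy"
context = retrieve(question, docs, k=1)
prompt = f"Answer using this context:\n{context[0]}\n\nQuestion: {question}"
```

In production, `embed` is an API call or local model, `retrieve` is a vector-database query, and `prompt` goes to the LLM; the shape of the pipeline stays the same.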
Choosing an Embedding Model
Your embedding model converts text into vectors. The quality of these embeddings directly determines retrieval quality. Options we have used in production:
OpenAI text-embedding-3-small: best balance of quality and cost at $0.02 per million tokens. 1536 dimensions. Good enough for 90% of use cases.
OpenAI text-embedding-3-large: better quality, 3072 dimensions, $0.13 per million tokens. Use when retrieval precision is critical (legal, medical).
Open-source models (BGE, E5): free, self-hosted. Quality is 5-10% behind OpenAI but improving fast. Use when data cannot leave your infrastructure.
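Using the per-million-token prices quoted above (verify current pricing before budgeting), a quick back-of-the-envelope comparison for embedding a corpus:

```python
# Prices as quoted above, USD per million tokens.
PRICE_PER_M_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def embedding_cost(model: str, tokens: int) -> float:
    """Cost in USD to embed `tokens` tokens with `model`."""
    return tokens / 1_000_000 * PRICE_PER_M_TOKENS[model]

# Embedding a 50M-token corpus:
small = embedding_cost("text-embedding-3-small", 50_000_000)  # ~$1.00
large = embedding_cost("text-embedding-3-large", 50_000_000)  # ~$6.50
```

Embedding cost is almost never the bottleneck; at these prices, re-embedding your entire corpus after a chunking change is cheap enough to do routinely.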
Vector Database Selection
You need somewhere to store and query your embeddings. Our recommendations:
pgvector (PostgreSQL extension): if you already use PostgreSQL, start here. No new infrastructure, good enough for up to 1M documents, and you can join vector search results with your relational data. This is what we use on most production applications.
Pinecone: managed service, scales well, good developer experience. $70/month for the starter tier. Choose this if you need to scale beyond what pgvector handles comfortably.
Qdrant or Weaviate: self-hosted alternatives with more features than pgvector and lower cost than Pinecone at scale.
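To make the pgvector option concrete, here is roughly what a nearest-neighbor query looks like (the `chunks` table and its columns are hypothetical; `<=>` is pgvector's cosine-distance operator, and the `%s` placeholders are for a driver like psycopg):

```python
# Hypothetical schema: chunks(id, content, metadata jsonb, embedding vector(1536))
def nearest_chunks_sql(k: int = 5) -> str:
    # The query embedding is passed as a parameter at execution time;
    # <=> is pgvector's cosine-distance operator.
    return f"""
        SELECT id, content, metadata,
               embedding <=> %s::vector AS distance
        FROM chunks
        ORDER BY embedding <=> %s::vector
        LIMIT {k};
    """
```

Because this is plain SQL, you can add `WHERE metadata->>'doc_type' = 'faq'` or join against your users table in the same query, which is exactly the advantage over a standalone vector store.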
Chunking Strategy Matters More Than You Think
How you split documents into chunks is the single biggest lever for retrieval quality. Bad chunking means the right information exists in your database but never gets retrieved.
Chunk size: 200-500 tokens is the sweet spot for most content. Too small and you lose context. Too large and you dilute relevance. We typically use 300 tokens with 50-token overlap between chunks.
Chunk boundaries: Split on semantic boundaries (paragraphs, sections), not arbitrary character counts. A chunk that starts mid-sentence retrieves poorly.
Metadata enrichment: Attach source document title, section heading, and document type to each chunk. This enables filtered retrieval: search only product docs, only FAQs, or only content for a specific product.
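A simplified chunker following those three rules: split on paragraph boundaries, pack paragraphs into a ~300-token budget with 50 tokens of overlap, and attach metadata to every chunk. Word count stands in for a real tokenizer here, and all field names are illustrative:

```python
def chunk_document(text: str, title: str,
                   max_tokens: int = 300, overlap: int = 50) -> list[dict]:
    """Pack paragraphs into ~max_tokens chunks, carrying `overlap`
    tokens between consecutive chunks so no boundary loses context."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        words = para.split()  # crude token proxy; swap in a real tokenizer
        if current and len(current) + len(words) > max_tokens:
            chunks.append(current)
            current = current[-overlap:]  # overlap carried into next chunk
        current.extend(words)
    if current:
        chunks.append(current)
    # Metadata on every chunk enables filtered retrieval later.
    return [
        {"text": " ".join(words), "source_title": title, "doc_type": "product_doc"}
        for words in chunks
    ]
```

Note that chunks never split a paragraph internally; a paragraph longer than the budget would need a sentence-level fallback, which we have omitted for brevity.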
Retrieval Quality Optimization
The retrieval step is where most RAG systems fail. Fixes we have applied in production:
Hybrid search. Combine vector similarity with keyword search (BM25). Vector search is great for semantic similarity but misses exact matches. Keyword search handles product names, error codes, and specific terms. A 70/30 blend of vector and keyword typically outperforms either alone.
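One way to implement the blend: min-max normalize each ranker's scores onto [0, 1] so they are comparable, then combine with a 0.7/0.3 weighting. The normalization choice is ours; the weights follow the split above:

```python
def normalize(scores: dict[str, float]) -> dict[str, float]:
    # Min-max normalize so vector and BM25 scores share a [0, 1] scale.
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_scores(vector: dict[str, float], bm25: dict[str, float],
                  alpha: float = 0.7) -> dict[str, float]:
    v, b = normalize(vector), normalize(bm25)
    docs = set(v) | set(b)
    # A document missing from one ranker contributes 0 for that component.
    return {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in docs}

vec = {"doc_a": 0.82, "doc_b": 0.78, "doc_c": 0.40}  # cosine similarities
kw = {"doc_b": 12.0, "doc_c": 3.0, "doc_a": 1.0}     # raw BM25 scores
ranked = sorted(hybrid_scores(vec, kw).items(), key=lambda x: -x[1])
```

Here `doc_b` wins: it is a close second on vector similarity but dominates on keywords, which is exactly the exact-match case pure vector search misses. Reciprocal rank fusion is a common alternative to score blending.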
Reranking. Retrieve 20 candidates, then use a cross-encoder model to rerank them down to the top 5. This adds 200-300ms of latency but significantly improves relevance. Cohere's reranker API costs $1 per 1000 queries.
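The rerank stage is just a second-pass sort over a wider candidate set. In this sketch, `cross_encoder_score` is a toy stand-in for a real cross-encoder model or a reranker API call:

```python
def cross_encoder_score(query: str, doc: str) -> float:
    # Stand-in for a real cross-encoder, which scores the
    # (query, document) pair jointly. Here: Jaccard word overlap.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if (q or d) else 0.0

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # Retrieve wide (e.g. 20 candidates from the vector store),
    # keep narrow (only top_n chunks reach the prompt).
    scored = sorted(candidates,
                    key=lambda doc: cross_encoder_score(query, doc),
                    reverse=True)
    return scored[:top_n]
```

The pattern matters more than the scorer: the first-stage retriever is cheap and broad, the second-stage scorer is expensive and precise, so you pay the 200-300ms only on a small candidate set.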
Query expansion. Rephrase the user's question 2-3 ways and search with all variants. A user asking "how do I cancel?" and one asking "unsubscribe from my plan" should find the same document.
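Once the variants exist (typically generated by a cheap LLM call), searching with all of them and merging by each document's best rank is straightforward. Here `search` is a stand-in for your retriever:

```python
def expanded_search(variants: list[str], search, k: int = 5) -> list[str]:
    """Run every query variant, merge results, keep each doc's best rank."""
    best_rank: dict[str, int] = {}
    for query in variants:
        for rank, doc_id in enumerate(search(query)):
            if doc_id not in best_rank or rank < best_rank[doc_id]:
                best_rank[doc_id] = rank
    return sorted(best_rank, key=best_rank.get)[:k]

# Stub retriever mapping each variant to its result list, for illustration:
results = {
    "how do I cancel?": ["doc_cancel", "doc_billing"],
    "unsubscribe from my plan": ["doc_cancel", "doc_plans"],
}
merged = expanded_search(list(results), results.get, k=3)
```

A document retrieved by multiple variants keeps its best rank, so the canonical answer tends to surface first regardless of how the question was phrased.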
When RAG Beats Fine-Tuning
RAG wins for: frequently changing data, factual Q&A, multiple data sources, and when you need citations. Fine-tuning wins for: changing the model's style or behavior, domain-specific terminology, and tasks where response format matters more than factual grounding.
In practice, 90% of AI integration projects we build use RAG. Fine-tuning is reserved for narrow, specialized use cases. If you are building AI features for a SaaS product, RAG is almost certainly where you should start.
Evaluating RAG Quality
The biggest mistake teams make with RAG is deploying without an evaluation pipeline. A chunking change that improves one category of questions might break another. An embedding model upgrade that scores better on benchmarks might perform worse on your specific domain. You cannot improve what you cannot measure, so build an evaluation pipeline from day one:
Ground truth dataset. Create 100-200 question-answer pairs from your actual data. Include easy questions (directly stated in one document), hard questions (require combining information from multiple sources), and unanswerable questions (the answer is not in your data, and the system should say so).
Retrieval metrics. Measure recall at K: for each question, do the top K retrieved chunks contain the relevant information? Target 85%+ recall at K=5. If retrieval is poor, better prompting will not save you. Fix the chunking and embedding first.
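Recall at K is simple to compute once each ground-truth question is labeled with the chunk IDs that contain its answer. The evaluation-set schema here is illustrative:

```python
def recall_at_k(eval_set: list[dict], retrieve, k: int = 5) -> float:
    """Fraction of questions whose relevant chunk appears in the top-k results.

    eval_set items: {"question": str, "relevant_ids": set of chunk IDs}
    retrieve(question, k) -> ordered list of chunk IDs
    """
    hits = 0
    for item in eval_set:
        top_k = retrieve(item["question"], k)
        if item["relevant_ids"] & set(top_k):
            hits += 1
    return hits / len(eval_set)

# Stub retriever for illustration: always returns the same chunks.
eval_set = [
    {"question": "refund window?", "relevant_ids": {"c1"}},
    {"question": "cancel plan?", "relevant_ids": {"c7"}},
]
fake_retrieve = lambda q, k: ["c1", "c2", "c3"][:k]
score = recall_at_k(eval_set, fake_retrieve, k=5)  # 0.5: only the first question hits
```

Run this per question category (easy, hard, unanswerable) rather than as one aggregate number, so a chunking change that trades one category against another is visible.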
Answer quality. Use an LLM-as-judge approach: have GPT-4o grade generated answers against ground truth for factual accuracy and completeness, and flag hallucinations. Automate this as a CI check that runs on every change to your retrieval pipeline.
Latency tracking. Measure end-to-end latency: embedding generation (20-50ms), vector search (10-30ms), reranking (200-300ms if used), and LLM generation (500-2000ms). Users expect answers in under 3 seconds for interactive search.
Real Costs
For a production RAG system processing 10K queries/day: embedding API costs run $5-15/month, vector database hosting $70-200/month, LLM API costs $300-800/month (the largest expense by far). Total: $400-1000/month. Compare that to the cost of manually answering 10K questions.
Our AI integration team builds RAG systems that go from concept to production in 4-6 weeks. We handle the hard parts: chunking optimization, hybrid search tuning, and evaluation pipelines that measure actual retrieval quality.
Need AI-powered search in your app? Tell us about your data.