Why Most RAG Pipelines Fail in Production
RAG (Retrieval-Augmented Generation) is conceptually simple: retrieve relevant documents, stuff them into the LLM's context, and generate an answer. The demo takes 30 minutes to build. The production system takes months.
The gap between demo and production is where most teams struggle. We have built RAG systems that serve thousands of users daily, and the lessons are consistent: the hard problems are not about the LLM — they are about the retrieval.
Step 1: Document Processing & Chunking
Chunking strategy is the most underrated decision in RAG pipeline design. Get it wrong and no amount of embedding optimization will save you.
Chunking Strategies That Work
- Semantic chunking — split on topic boundaries, not arbitrary token limits. Use heading structure, paragraph breaks, and semantic similarity to find natural split points
- Recursive character splitting — LangChain's RecursiveCharacterTextSplitter works well as a baseline. Split on paragraphs first, then sentences, then characters
- Parent-child chunking — store small chunks for retrieval (better precision) but return the parent chunk for context (better comprehension). This is the single most impactful RAG optimization we have deployed
Chunk Size Guidelines
- 256-512 tokens — best for precise factual retrieval (Q&A, fact lookup)
- 512-1024 tokens — best for general knowledge retrieval (research, analysis)
- 1024-2048 tokens — best for complex topics that need more context
Always test with your actual data. The "right" chunk size depends on document structure, query patterns, and your embedding model's training distribution.
Step 2: Embedding Model Selection
Your embedding model determines the ceiling of your retrieval quality. No amount of post-retrieval processing can fix bad embeddings.
Current Best Options (2026)
- OpenAI text-embedding-3-large — best general-purpose embedding. Good performance across domains, reasonable cost. Our default recommendation.
- Cohere embed-v3 — excellent for multilingual and cross-lingual retrieval. Slightly better than OpenAI on domain-specific benchmarks.
- Open-source (e5-large-v2, bge-large) — good performance, self-hostable. Best for teams with data privacy requirements or high-volume workloads where API costs matter.
Embedding Best Practices
- Match your embedding model's output dimensionality to your vector store's index limits (some index types cap the number of dimensions they support, or degrade at very high dimension counts)
- Use the same embedding model for documents and queries — mixing models destroys retrieval quality
- Consider instruction-tuned embeddings that take a task prefix ("search_document:" vs "search_query:") for better retrieval alignment
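A thin wrapper keeps the prefixing consistent on both sides of the pipeline. The exact prefix strings vary by model family (the `search_document:` / `search_query:` style above is one convention; others use `passage:` / `query:`), and `embed` here is a placeholder for your model's actual API call:

```python
# Prefix strings are model-specific; check your embedding model's docs.
DOC_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "

def embed_documents(texts, embed):
    # Index time: tag every passage with the document-side prefix.
    return [embed(DOC_PREFIX + t) for t in texts]

def embed_query(text, embed):
    # Query time: use the query-side prefix, with the SAME model that
    # indexed the documents -- mixing models destroys retrieval quality.
    return embed(QUERY_PREFIX + text)
```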
Step 3: Vector Store Architecture
Choosing a Vector Store
| Vector Store | Best For | Scale |
|---|---|---|
| pgvector | Startups, existing Postgres infra | Up to ~5M vectors |
| Pinecone | Managed, serverless, production scale | Billions of vectors |
| Weaviate | Hybrid search (vector + keyword), self-hosted | Hundreds of millions |
| Chroma | Prototyping, local development | Up to ~1M vectors |
| Qdrant | High-performance, filtering, self-hosted | Hundreds of millions |
Our default recommendation for most startups: start with pgvector. You already have Postgres, it is easy to maintain, and it handles millions of documents without issues. Move to a dedicated vector store when (not before) you hit scale limitations.
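To make the pgvector path concrete, here is a sketch of the nearest-neighbor query it runs. The `chunks` table and column names are hypothetical; `<=>` is pgvector's cosine-distance operator, and you would execute this with a driver like psycopg, passing the query embedding as the parameter:

```python
def pgvector_search_sql(table: str = "chunks", top_k: int = 5) -> str:
    """Build a parameterized cosine-distance search query for pgvector.
    Table/column names are illustrative, not a fixed schema."""
    return (
        f"SELECT id, text, embedding <=> %(query_vec)s AS distance "
        f"FROM {table} "
        f"ORDER BY embedding <=> %(query_vec)s "
        f"LIMIT {top_k}"
    )
```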
Step 4: Retrieval Optimization
Raw vector similarity search is rarely good enough for production. These techniques improve retrieval quality significantly:
Hybrid Search
Combine vector (semantic) search with keyword (BM25) search. Vector search finds semantically related content; keyword search catches exact matches that semantic search misses. Use Reciprocal Rank Fusion (RRF) to merge results from both methods.
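RRF itself is only a few lines. Each result list contributes `1 / (k + rank)` per document, so documents that rank well in both vector and keyword search float to the top; `k = 60` is the constant from the original RRF paper:

```python
def reciprocal_rank_fusion(result_lists, k: int = 60):
    """Merge ranked lists of document IDs (e.g. vector hits + BM25 hits)
    into one list, best fused score first."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, not raw scores, it sidesteps the problem of calibrating cosine similarities against BM25 scores.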
Re-ranking
After initial retrieval, re-rank results with a cross-encoder model. Cross-encoders are more accurate than bi-encoder embeddings but too slow for first-stage retrieval. The two-stage pipeline (fast retrieval → accurate re-ranking) is the industry standard:
- Retrieve top 20-50 candidates via vector + keyword search
- Re-rank with Cohere Rerank or a cross-encoder model
- Pass top 3-5 results to the LLM
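The three steps above reduce to a small function. `first_stage` and `score_pair` are placeholders for whatever you plug in: a hybrid search call for the former, a cross-encoder (e.g. a sentence-transformers `CrossEncoder` or the Cohere Rerank API) for the latter:

```python
def two_stage_retrieve(query, first_stage, score_pair,
                       n_candidates: int = 30, top_k: int = 5):
    """Stage 1: cheap retrieval of many candidates.
    Stage 2: score every (query, doc) pair with an accurate but slow
    cross-encoder, keep only the best few for the LLM."""
    candidates = first_stage(query, n_candidates)
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```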
Query Decomposition
Complex queries often need to be broken into sub-queries for better retrieval. "Compare our Q1 and Q2 revenue and explain the difference" should become two retrieval queries: one for Q1 data, one for Q2 data.
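One common implementation is to ask the LLM itself to do the splitting. This sketch assumes an `llm` callable that takes a prompt and returns text; the prompt wording is illustrative:

```python
DECOMPOSE_PROMPT = """Break the user's question into standalone search queries.
Return one query per line, nothing else.

Question: {question}
"""

def decompose(question: str, llm) -> list[str]:
    """Ask an LLM (placeholder callable) to split a complex question
    into independent sub-queries; run retrieval once per sub-query and
    pool the results before generation."""
    raw = llm(DECOMPOSE_PROMPT.format(question=question))
    return [line.strip() for line in raw.splitlines() if line.strip()]
```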
Step 5: Evaluation Framework
You cannot improve what you cannot measure. Build evaluation into your RAG pipeline from day one.
Key Metrics
- Retrieval precision — what fraction of retrieved documents are relevant?
- Retrieval recall — what fraction of relevant documents were retrieved?
- Answer faithfulness — does the generated answer stay grounded in the retrieved context?
- Answer relevancy — does the answer actually address the user's question?
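The two retrieval metrics are straightforward set arithmetic over document IDs (faithfulness and relevancy usually need an LLM or human judge, so they are omitted here):

```python
def retrieval_precision(retrieved, relevant) -> float:
    """Fraction of retrieved documents that are actually relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def retrieval_recall(retrieved, relevant) -> float:
    """Fraction of the relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0
```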
Building an Eval Dataset
Create a dataset of 50-100 question-answer pairs with ground truth source documents. Use this dataset to benchmark every pipeline change. Automated evaluations are useful for regression testing, but periodic human evaluation is essential for catching subtle quality issues.
Common Failure Modes
1. The "Wrong Document" Problem
The retriever finds documents that are topically related but do not contain the answer. Fix: improve chunking strategy, add metadata filtering, use hybrid search.
2. The "Lost in the Middle" Problem
LLMs pay less attention to information in the middle of the context window. Fix: put the most relevant documents first and last. Limit context to 3-5 highly relevant chunks rather than 15-20 marginally relevant ones.
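The "first and last" placement can be automated with a simple interleave: odd-ranked chunks fill the context from the front, even-ranked chunks from the back, so the weakest chunks end up in the middle. A sketch, assuming chunks arrive best-first:

```python
def reorder_for_attention(chunks_best_first):
    """Place the strongest chunks at the edges of the context window,
    where LLMs attend most, and the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```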
3. The "Hallucination" Problem
The LLM generates plausible-sounding information that is not in the retrieved context. Fix: add citation requirements to the system prompt, implement factual consistency checks, and use lower temperature settings.
4. The "Stale Data" Problem
The vector store contains outdated information. Fix: implement incremental indexing with document lifecycle management. Track document versions and automatically re-index when sources change.
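Content hashing is the usual backbone of incremental indexing: hash each source document, compare against what the index last saw, and re-embed only the diff. A minimal sketch (real pipelines also track versions and timestamps):

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(sources: dict[str, str], indexed: dict[str, str]):
    """Compare current sources (doc_id -> text) against the index's
    state (doc_id -> content hash). Returns which docs to (re-)embed
    and which stale docs to delete from the vector store."""
    to_index = {doc_id for doc_id, text in sources.items()
                if indexed.get(doc_id) != content_hash(text)}
    to_delete = set(indexed) - set(sources)
    return to_index, to_delete
```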
Production Checklist
Before deploying a RAG pipeline to production, verify:
- Evaluation dataset with 50+ question-answer pairs and ground truth
- Retrieval precision above 80% on your eval dataset
- Hallucination rate below 5% on factual queries
- P95 latency under 3 seconds for end-to-end query response
- Incremental indexing pipeline for keeping data current
- Monitoring and alerting on retrieval quality metrics
- Fallback behavior when retrieval returns no relevant results
- Cost tracking per query (embedding + LLM inference)
Building a production RAG pipeline is an engineering challenge, not a research challenge. The techniques are well-understood. The hard part is rigorous implementation, thorough evaluation, and operational discipline.