Why Most RAG Pipelines Fail in Production
RAG (Retrieval-Augmented Generation) is conceptually simple: retrieve relevant documents, stuff them into the LLM's context, and generate an answer. The demo takes 30 minutes to build. The production system takes months.
The gap between demo and production is where most teams struggle. We have built RAG systems that serve thousands of users daily, and the lessons are consistent: the hard problems are not about the LLM — they are about the retrieval.
Step 1: Document Processing & Chunking
Chunking strategy is the most underrated decision in RAG pipeline design. Get it wrong and no amount of embedding optimization will save you.
Chunking Strategies That Work
- Semantic chunking — split on topic boundaries, not arbitrary token limits. Use heading structure, paragraph breaks, and semantic similarity to find natural split points
- Recursive character splitting — LangChain's RecursiveCharacterTextSplitter works well as a baseline. Split on paragraphs first, then sentences, then characters
- Parent-child chunking — store small chunks for retrieval (better precision) but return the parent chunk for context (better comprehension). This is the single most impactful RAG optimization we have deployed
Chunk Size Guidelines
- 256-512 tokens — best for precise factual retrieval (Q&A, fact lookup)
- 512-1024 tokens — best for general knowledge retrieval (research, analysis)
- 1024-2048 tokens — best for complex topics that need more context
Always test with your actual data. The "right" chunk size depends on document structure, query patterns, and your embedding model's training distribution.
Step 2: Embedding Model Selection
Your embedding model determines the ceiling of your retrieval quality. No amount of post-retrieval processing can fix bad embeddings.
Current Best Options (2026)
- OpenAI text-embedding-3-large — best general-purpose embedding. Good performance across domains, reasonable cost. Our default recommendation.
- Cohere embed-v3 — excellent for multilingual and cross-lingual retrieval. Slightly better than OpenAI on domain-specific benchmarks.
- Open-source (e5-large-v2, bge-large) — good performance, self-hostable. Best for teams with data privacy requirements or high-volume workloads where API costs matter.
Embedding Best Practices
- Match your embedding model's output dimensionality to your vector store's index limits (some index types cap the number of dimensions they support, or degrade at very high dimension counts)
- Use the same embedding model for documents and queries — mixing models destroys retrieval quality
- Consider instruction-tuned embeddings that take a task prefix ("search_document:" vs "search_query:") for better retrieval alignment
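A thin wrapper keeps the prefixing consistent on both sides of the pipeline. The exact prefix strings vary by model family (the `search_document:` / `search_query:` style above is one convention; others use `passage:` / `query:`), and `embed` here is a placeholder for your model's actual API call:

```python
# Prefix strings are model-specific; check your embedding model's docs.
DOC_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "

def embed_documents(texts, embed):
    # Index time: tag every passage with the document-side prefix.
    return [embed(DOC_PREFIX + t) for t in texts]

def embed_query(text, embed):
    # Query time: use the query-side prefix, with the SAME model that
    # indexed the documents -- mixing models destroys retrieval quality.
    return embed(QUERY_PREFIX + text)
```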
Step 3: Vector Store Architecture
Choosing a Vector Store
| Vector Store | Best For | Scale |
|---|---|---|
| pgvector | Startups, existing Postgres infra | Up to ~5M vectors |
| Pinecone | Managed, serverless, production scale | Billions of vectors |
| Weaviate | Hybrid search (vector + keyword), self-hosted | Hundreds of millions |
| Chroma | Prototyping, local development | Up to ~1M vectors |
| Qdrant | High-performance, filtering, self-hosted | Hundreds of millions |
Our default recommendation for most startups: start with pgvector. You already have Postgres, it is easy to maintain, and it handles millions of documents without issues. Move to a dedicated vector store when (not before) you hit scale limitations.
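To make the pgvector path concrete, here is a sketch of the nearest-neighbor query it runs. The `chunks` table and column names are hypothetical; `<=>` is pgvector's cosine-distance operator, and you would execute this with a driver like psycopg, passing the query embedding as the parameter:

```python
def pgvector_search_sql(table: str = "chunks", top_k: int = 5) -> str:
    """Build a parameterized cosine-distance search query for pgvector.
    Table/column names are illustrative, not a fixed schema."""
    return (
        f"SELECT id, text, embedding <=> %(query_vec)s AS distance "
        f"FROM {table} "
        f"ORDER BY embedding <=> %(query_vec)s "
        f"LIMIT {top_k}"
    )
```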
Step 4: Retrieval Optimization
Raw vector similarity search is rarely good enough for production. These techniques improve retrieval quality significantly:
Hybrid Search
Combine vector (semantic) search with keyword (BM25) search. Vector search finds semantically related content; keyword search catches exact matches that semantic search misses. Use Reciprocal Rank Fusion (RRF) to merge results from both methods.
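RRF itself is only a few lines. Each result list contributes `1 / (k + rank)` per document, so documents that rank well in both vector and keyword search float to the top; `k = 60` is the constant from the original RRF paper:

```python
def reciprocal_rank_fusion(result_lists, k: int = 60):
    """Merge ranked lists of document IDs (e.g. vector hits + BM25 hits)
    into one list, best fused score first."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, not raw scores, it sidesteps the problem of calibrating cosine similarities against BM25 scores.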
Re-ranking
After initial retrieval, re-rank results with a cross-encoder model. Cross-encoders are more accurate than bi-encoder embeddings but too slow for first-stage retrieval. The two-stage pipeline (fast retrieval → accurate re-ranking) is the industry standard:
- Retrieve top 20-50 candidates via vector + keyword search
- Re-rank with Cohere Rerank or a cross-encoder model
- Pass top 3-5 results to the LLM
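The three steps above reduce to a small function. `first_stage` and `score_pair` are placeholders for whatever you plug in: a hybrid search call for the former, a cross-encoder (e.g. a sentence-transformers `CrossEncoder` or the Cohere Rerank API) for the latter:

```python
def two_stage_retrieve(query, first_stage, score_pair,
                       n_candidates: int = 30, top_k: int = 5):
    """Stage 1: cheap retrieval of many candidates.
    Stage 2: score every (query, doc) pair with an accurate but slow
    cross-encoder, keep only the best few for the LLM."""
    candidates = first_stage(query, n_candidates)
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```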
Query Decomposition
Complex queries often need to be broken into sub-queries for better retrieval. "Compare our Q1 and Q2 revenue and explain the difference" should become two retrieval queries: one for Q1 data, one for Q2 data.
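One common implementation is to ask the LLM itself to do the splitting. This sketch assumes an `llm` callable that takes a prompt and returns text; the prompt wording is illustrative:

```python
DECOMPOSE_PROMPT = """Break the user's question into standalone search queries.
Return one query per line, nothing else.

Question: {question}
"""

def decompose(question: str, llm) -> list[str]:
    """Ask an LLM (placeholder callable) to split a complex question
    into independent sub-queries; run retrieval once per sub-query and
    pool the results before generation."""
    raw = llm(DECOMPOSE_PROMPT.format(question=question))
    return [line.strip() for line in raw.splitlines() if line.strip()]
```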
Step 5: Evaluation Framework
You cannot improve what you cannot measure. Build evaluation into your RAG pipeline from day one.
Key Metrics
- Retrieval precision — what fraction of retrieved documents are relevant?
- Retrieval recall — what fraction of relevant documents were retrieved?
- Answer faithfulness — does the generated answer stay grounded in the retrieved context?
- Answer relevancy — does the answer actually address the user's question?
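The two retrieval metrics are straightforward set arithmetic over document IDs (faithfulness and relevancy usually need an LLM or human judge, so they are omitted here):

```python
def retrieval_precision(retrieved, relevant) -> float:
    """Fraction of retrieved documents that are actually relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def retrieval_recall(retrieved, relevant) -> float:
    """Fraction of the relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0
```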
Building an Eval Dataset
Create a dataset of 50-100 question-answer pairs with ground truth source documents. Use this dataset to benchmark every pipeline change. Automated evaluations are useful for regression testing, but periodic human evaluation is essential for catching subtle quality issues.
Common Failure Modes
1. The "Wrong Document" Problem
The retriever finds documents that are topically related but do not contain the answer. Fix: improve chunking strategy, add metadata filtering, use hybrid search.
2. The "Lost in the Middle" Problem
LLMs pay less attention to information in the middle of the context window. Fix: put the most relevant documents first and last. Limit context to 3-5 highly relevant chunks rather than 15-20 marginally relevant ones.
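The "first and last" placement can be automated with a simple interleave: odd-ranked chunks fill the context from the front, even-ranked chunks from the back, so the weakest chunks end up in the middle. A sketch, assuming chunks arrive best-first:

```python
def reorder_for_attention(chunks_best_first):
    """Place the strongest chunks at the edges of the context window,
    where LLMs attend most, and the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```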
3. The "Hallucination" Problem
The LLM generates plausible-sounding information that is not in the retrieved context. Fix: add citation requirements to the system prompt, implement factual consistency checks, and use lower temperature settings.
4. The "Stale Data" Problem
The vector store contains outdated information. Fix: implement incremental indexing with document lifecycle management. Track document versions and automatically re-index when sources change.
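Content hashing is the usual backbone of incremental indexing: hash each source document, compare against what the index last saw, and re-embed only the diff. A minimal sketch (real pipelines also track versions and timestamps):

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(sources: dict[str, str], indexed: dict[str, str]):
    """Compare current sources (doc_id -> text) against the index's
    state (doc_id -> content hash). Returns which docs to (re-)embed
    and which stale docs to delete from the vector store."""
    to_index = {doc_id for doc_id, text in sources.items()
                if indexed.get(doc_id) != content_hash(text)}
    to_delete = set(indexed) - set(sources)
    return to_index, to_delete
```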
Production Checklist
Before deploying a RAG pipeline to production, verify:
- Evaluation dataset with 50+ question-answer pairs and ground truth
- Retrieval precision above 80% on your eval dataset
- Hallucination rate below 5% on factual queries
- P95 latency under 3 seconds for end-to-end query response
- Incremental indexing pipeline for keeping data current
- Monitoring and alerting on retrieval quality metrics
- Fallback behavior when retrieval returns no relevant results
- Cost tracking per query (embedding + LLM inference)
Building a production RAG pipeline is an engineering challenge, not a research challenge. The techniques are well-understood. The hard part is rigorous implementation, thorough evaluation, and operational discipline.