Every week, someone asks us to build "a chatbot that knows about our company." Nine times out of ten, what they actually need is a RAG system. And nine times out of ten, the first version they tried (usually a weekend hackathon project) gave terrible answers because the retrieval was poorly implemented.
We've built RAG systems for knowledge bases, customer support, legal document search, financial analysis, and internal company assistants. The pattern is the same every time — the difference between a RAG demo and a production RAG system is in the details. This guide covers those details.
What is RAG and Why Does It Matter?
Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with LLM text generation. Instead of relying solely on what the model learned during training, a RAG system fetches relevant documents from your data and includes them in the LLM's context, so it can generate answers grounded in your actual information.
Why not just fine-tune the model on your data? Three reasons:
- Cost: Fine-tuning is expensive ($500-50,000+ depending on data volume and model). RAG adds little beyond embedding calls, vector storage, and the extra prompt tokens for retrieved context.
- Freshness: Fine-tuning creates a point-in-time snapshot. RAG retrieves whatever is currently in the index. When your docs change, answers update as soon as the changed content is re-ingested, with no retraining needed.
- Traceability: RAG can cite its sources. "Based on your policy document updated March 2026, section 4.2..." Fine-tuned models can't tell you where their answer came from.
For most business applications, RAG is the right choice. Fine-tuning only makes sense when you need to change the model's behavior or writing style fundamentally, not just give it access to new information.
The RAG Architecture
A production RAG system has two main phases:
Phase 1: Ingestion Pipeline (Offline)
- Load documents — PDFs, web pages, databases, APIs, Confluence pages, Slack messages — whatever holds your knowledge
- Chunk documents — Split into meaningful segments (this is where most RAG systems fail)
- Generate embeddings — Convert each chunk into a vector (a list of numbers that captures meaning)
- Store in vector database — Pinecone, Weaviate, pgvector, or Milvus
Phase 2: Query Pipeline (Real-time)
- Receive user question — "What's our refund policy for enterprise customers?"
- Generate query embedding — Convert the question to a vector
- Retrieve relevant chunks — Find the most similar document chunks by vector similarity
- Augment the prompt — Inject retrieved chunks into the LLM's context
- Generate response — LLM answers based on the retrieved information
Simple in concept. The devil is in every single step.
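To make the retrieval step of the query pipeline concrete, here is a minimal sketch. It assumes a toy bag-of-words `embed` function standing in for a real embedding model, and the sample chunks are made up for illustration. The names `embed`, `cosine`, and `retrieve` are ours, not part of any library.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Embed the question, then return the top_k most similar chunks."""
    query_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]

# Made-up sample chunks, for illustration only.
chunks = [
    "Enterprise customers can request a refund within 60 days.",
    "The office closes on public holidays.",
    "Refund requests are processed by the billing team.",
]
top_chunks = retrieve("What's our refund policy for enterprise customers?", chunks)
```

In a real system the count vectors would be dense vectors from an embedding model and the sort would be a vector database query, but the shape of the step is the same.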
Document Chunking: Where Most RAG Systems Fail
If your chunking is bad, your retrieval is bad, and your answers are bad. No amount of prompt engineering can fix retrieving the wrong documents. We learned this the hard way on our first major RAG deployment — the answers were plausible-sounding but wrong because the chunks split information across unnatural boundaries.
Chunking Strategies
Fixed-size chunking (512-1024 tokens per chunk): Simple but dumb. It splits mid-sentence, separates questions from answers, and breaks tables. Use this only for homogeneous text with no structure.
Recursive character splitting: Better — splits on paragraph boundaries, then sentence boundaries, then word boundaries. Most LangChain tutorials use this. It's decent for general-purpose text but still ignores document structure.
Semantic chunking: Our preferred approach. Use an embedding model to detect topic shifts and split at natural semantic boundaries. Chunks represent coherent ideas rather than arbitrary character counts. More expensive to compute but dramatically improves retrieval quality.
Document-structure-aware chunking: Best for structured documents (contracts, technical docs, knowledge bases). Parse the document's heading structure and chunk by section. A "Refund Policy" section becomes one chunk, regardless of length. This preserves the author's organizational intent.
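Of the strategies above, recursive splitting is the easiest to show in a few lines. The following is a simplified sketch of the idea, not LangChain's implementation: it measures length in characters, and the function name and separator list are our own assumptions.

```python
def recursive_split(text: str, max_chars: int,
                    separators: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse into pieces that are still too long."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: hard cut as a last resort.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) > max_chars:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = ("First paragraph about refunds. It has two sentences."
          "\n\nSecond paragraph about shipping.")
parts = recursive_split(sample, max_chars=40)
```

Note that the separator at each split point is dropped in this sketch, so a sentence that ends a chunk loses its trailing period. A production splitter would usually keep it.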
Chunk Size: The Trade-off
Smaller chunks (256-512 tokens) give more precise retrieval, but the LLM may miss surrounding context. Larger chunks (1024-2048 tokens) carry more context per retrieved chunk, but retrieval is less precise and the context window fills up faster.
Our default: 512 tokens with 50-token overlap for general text. 1024 tokens for technical documentation where sections are self-contained. Always test with your actual data — the right size depends on your content.
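The 512/50 default can be sketched as a sliding window. This assumes the text has already been tokenized into a list; the function name is ours, and a real pipeline would use the tokenizer that matches its embedding model.

```python
def chunk_with_overlap(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping by size - overlap each time."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Placeholder tokens standing in for a 1,200-token document.
tokens = [f"t{i}" for i in range(1200)]
windows = chunk_with_overlap(tokens)
```

With these numbers each window starts 462 tokens after the previous one, so the last 50 tokens of one chunk are repeated at the start of the next.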
Metadata is Non-Negotiable
Every chunk should carry metadata: source document, page number, section heading, last-modified date, document type, and any relevant tags. This metadata enables filtering ("only search in HR policies") and citation ("Source: Employee Handbook, Section 3.2, updated January 2026").
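One way to carry this metadata is a small record type stored alongside each vector. The sketch below is illustrative: the `Chunk` class, its field names, and the sample values are assumptions, not a required schema.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Chunk:
    text: str
    source: str           # source document name
    section: str          # section heading or number
    page: int
    last_modified: date
    doc_type: str         # e.g. "hr_policy", used for filtering
    tags: list[str] = field(default_factory=list)

    def citation(self) -> str:
        """Format a human-readable citation from the metadata."""
        return (f"Source: {self.source}, Section {self.section}, "
                f"updated {self.last_modified:%B %Y}")

# Made-up example values.
chunk = Chunk(
    text="Employees accrue PTO monthly.",
    source="Employee Handbook",
    section="3.2",
    page=14,
    last_modified=date(2026, 1, 15),
    doc_type="hr_policy",
)
```

Calling `chunk.citation()` produces the kind of source line shown above, and `doc_type` is the field a filter such as "only search in HR policies" would match on.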
Embeddings & Vector Storage
Choosing an Embedding Model
The embedding model converts text into vectors. Better embeddings = better retrieval. Our recommendations:
- OpenAI text-embedding-3-large: Best overall quality. $0.13 per million tokens. Our default for production.
- Cohere embed-v3: Excellent for multilingual content. Slightly cheaper than OpenAI.
- Open-source (BGE, E5): Free to run locally. 90-95% of commercial model quality. Good for sensitive data that can't leave your infrastructure.
One critical rule: use the same embedding model for ingestion and queries. Mixing models (e.g., embedding docs with OpenAI but querying with Cohere) produces terrible results because the vector spaces don't align.
Vector Database Options
| Database | Best For | Cost |
|---|---|---|
| pgvector | Teams already on PostgreSQL. Simple setup, no new infrastructure. | Free (uses existing DB) |
| Pinecone | Managed service, fast scaling. Best for teams that don't want to manage infrastructure. | $70+/month |
| Weaviate | Hybrid search (vector + keyword). Strong filtering. Open-source option. | Free (self-hosted) or managed |
| Milvus | Very large datasets (100M+ vectors). Best raw performance. | Free (self-hosted) |
For most projects, we start with pgvector. If you're already running PostgreSQL (and most companies are), adding vector search takes 30 minutes and zero new infrastructure. We migrate to Pinecone or Weaviate only when we need dedicated vector search features or the dataset exceeds what pgvector handles efficiently (roughly 5M+ vectors).
Retrieval Strategies That Actually Work
Basic vector similarity search gets you 70% of the way. These techniques get you to 95%:
Hybrid Search (Vector + Keyword)
Pure vector search misses exact keyword matches (product names, error codes, specific terms). Hybrid search combines vector similarity with BM25 keyword search and blends the two scores with a fixed weight. We use a 70/30 vector/keyword split by default; adjust it based on your content type.
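A weighted blend like this can be sketched as follows. The min-max normalization is our assumption (vector and BM25 scores live on different scales, so they need some common scale before weighting); the function names and the sample scores are illustrative.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so vector and keyword scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def hybrid_scores(vector: dict[str, float], keyword: dict[str, float],
                  vector_weight: float = 0.7) -> dict[str, float]:
    """Blend normalized vector and keyword scores; a missing score counts as 0."""
    v, k = min_max(vector), min_max(keyword)
    return {
        doc_id: vector_weight * v.get(doc_id, 0.0) + (1 - vector_weight) * k.get(doc_id, 0.0)
        for doc_id in set(v) | set(k)
    }

# Made-up scores: cosine similarities and BM25 scores for the same query.
vector_hits = {"faq": 0.82, "error-codes": 0.78, "pricing": 0.40}
keyword_hits = {"error-codes": 12.0, "faq": 3.0}
combined = hybrid_scores(vector_hits, keyword_hits)
best = max(combined, key=combined.get)
```

Here "faq" wins on vector similarity alone, but the exact keyword match pushes "error-codes" to the top once the scores are blended.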
Query Expansion
Before searching, use an LLM to rephrase the user's question into 2-3 alternative formulations. "How do I reset my password?" becomes ["password reset process", "change account password", "forgot password steps"]. Search with all variants and merge results. This catches documents that describe the answer differently than how the user asked.
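The search-and-merge part can be sketched like this. Both `expand` (the LLM rephrasing call) and `search` (the retriever) are passed in as plain functions and stubbed here with made-up data. Keeping each document's best score across variants is one simple merge rule among several.

```python
from typing import Callable

def expanded_search(
    question: str,
    expand: Callable[[str], list[str]],
    search: Callable[[str], list[tuple[str, float]]],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Search with the original question plus its rephrasings, keeping each doc's best score."""
    best: dict[str, float] = {}
    for query in [question, *expand(question)]:
        for doc_id, score in search(query):
            best[doc_id] = max(score, best.get(doc_id, float("-inf")))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:top_k]

# Stubs standing in for an LLM call and a vector search (hypothetical data).
def fake_expand(question: str) -> list[str]:
    return ["password reset process", "forgot password steps"]

fake_index = {
    "How do I reset my password?": [("kb-12", 0.61)],
    "password reset process": [("kb-12", 0.74), ("kb-40", 0.58)],
    "forgot password steps": [("kb-7", 0.69)],
}

def fake_search(query: str) -> list[tuple[str, float]]:
    return fake_index.get(query, [])

results = expanded_search("How do I reset my password?", fake_expand, fake_search)
```

The original question alone would have found only "kb-12"; the variants surface two more candidates.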
Re-ranking
Retrieve the top 20 results with vector search, then use a cross-encoder model (like Cohere Rerank) to re-score them based on deeper semantic understanding. This is slower but dramatically more accurate than vector similarity alone. We use it for every production system.
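The two-stage shape looks roughly like this. The `score_pair` argument is where a cross-encoder would plug in; the word-overlap scorer below is only a toy stand-in so the sketch runs without a model, and all names are illustrative.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], keep: int = 5) -> list[str]:
    """Re-score each (query, candidate) pair and keep the highest-scoring candidates."""
    scored = [(score_pair(query, candidate), candidate) for candidate in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:keep]]

def overlap_score(query: str, passage: str) -> float:
    """Toy scorer: fraction of query words found in the passage (not a real cross-encoder)."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0

# Pretend these came back from the first-stage vector search.
first_stage = [
    "shipping times for europe",
    "how to reset a password",
    "password policy for contractors",
]
reranked = rerank("reset password", first_stage, overlap_score, keep=2)
```

The first stage stays cheap and broad; the expensive scorer only ever sees the short candidate list.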
Metadata Filtering
Before vector search, filter by metadata: document type, date range, department, access level. "What's the current PTO policy?" should only search HR documents from the last year, not engineering docs from 2023.
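As a sketch of the pre-filter, assume each chunk is a dict with `doc_type` and `last_modified` keys (our naming, with made-up sample data). In practice this would be a `WHERE` clause or a filter parameter on the vector database query rather than a Python loop.

```python
from datetime import date, timedelta

def prefilter(chunks: list[dict], doc_type: str, max_age_days: int, today: date) -> list[dict]:
    """Keep only chunks of the given type that were modified within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [
        c for c in chunks
        if c["doc_type"] == doc_type and c["last_modified"] >= cutoff
    ]

# Made-up chunks for illustration.
all_chunks = [
    {"id": "pto-2026", "doc_type": "hr", "last_modified": date(2026, 1, 10)},
    {"id": "deploy-guide", "doc_type": "engineering", "last_modified": date(2023, 5, 2)},
    {"id": "pto-2022", "doc_type": "hr", "last_modified": date(2022, 3, 1)},
]
candidates = prefilter(all_chunks, doc_type="hr", max_age_days=365, today=date(2026, 3, 1))
```

Only the surviving candidates are passed on to similarity search.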
Response Generation
The Prompt Template
A good RAG prompt has four parts: role and grounding instructions, the retrieved context, the user's question, and an answer cue:
You are an assistant for {company_name}.
Answer questions based ONLY on the provided context.
If the context doesn't contain the answer, say so honestly.
Always cite your sources with document name and section.
Context:
{retrieved_documents}
Question: {user_question}
Answer:
Two instructions are critical: "based ONLY on the provided context" (which curbs hallucination) and "if the context doesn't contain the answer, say so" (which curbs confident wrong answers). Without these, the LLM will happily make up answers that sound plausible but are wrong.
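Filling the template is plain string formatting. The sketch below assumes each retrieved chunk is a dict with `source`, `section`, and `text` keys; those key names, the company name, and the sample chunk are made up for illustration.

```python
PROMPT_TEMPLATE = """You are an assistant for {company_name}.
Answer questions based ONLY on the provided context.
If the context doesn't contain the answer, say so honestly.
Always cite your sources with document name and section.

Context:
{retrieved_documents}

Question: {user_question}
Answer:"""

def build_prompt(company_name: str, chunks: list[dict], user_question: str) -> str:
    """Label each chunk with its source so the model has something to cite."""
    context = "\n\n".join(
        f"[{c['source']}, {c['section']}]\n{c['text']}" for c in chunks
    )
    return PROMPT_TEMPLATE.format(
        company_name=company_name,
        retrieved_documents=context,
        user_question=user_question,
    )

# Hypothetical company name and chunk.
prompt = build_prompt(
    "Acme Corp",
    [{"source": "Employee Handbook", "section": "Section 3.2",
      "text": "Employees accrue PTO monthly."}],
    "How does PTO accrue?",
)
```

Prefixing every chunk with its source and section is what makes the "always cite your sources" instruction something the model can actually follow.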
Choosing the Generation Model
For RAG, you don't always need the biggest model. Claude Haiku or GPT-4o-mini handle straightforward Q&A well at roughly a tenth of the cost of Claude Opus. Save the premium models for complex reasoning tasks such as multi-step analysis, comparing across documents, and synthesizing contradictory information.
Production Considerations
Evaluation: How Do You Know It's Working?
Build a test set of 50-100 question-answer pairs from your actual users. Run them through your RAG pipeline and measure:
- Retrieval accuracy: Did the correct documents appear in the top 5 results?
- Answer correctness: Is the generated answer factually correct based on the source documents?
- Faithfulness: Did the answer stay within the retrieved context (no hallucination)?
- Latency: End-to-end response time under 3 seconds for good UX.
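Of these metrics, retrieval accuracy is the simplest to automate. A minimal sketch follows, assuming each test case lists the document IDs that should be retrieved. The `relevant_docs` key, the stub retriever, and the sample cases are our own.

```python
from typing import Callable

def hit_rate_at_k(test_set: list[dict], retrieve: Callable[[str], list[str]], k: int = 5) -> float:
    """Fraction of questions with at least one relevant doc in the top k results."""
    if not test_set:
        return 0.0
    hits = 0
    for case in test_set:
        top = retrieve(case["question"])[:k]
        if any(doc_id in case["relevant_docs"] for doc_id in top):
            hits += 1
    return hits / len(test_set)

# Made-up test cases and a stub retriever returning ranked document IDs.
test_set = [
    {"question": "What is the refund window?", "relevant_docs": {"refund-policy"}},
    {"question": "How do I rotate API keys?", "relevant_docs": {"security-guide"}},
]
stub_results = {
    "What is the refund window?": ["refund-policy", "pricing"],
    "How do I rotate API keys?": ["onboarding", "faq"],
}
score = hit_rate_at_k(test_set, lambda q: stub_results[q], k=5)
```

Answer correctness and faithfulness need a human or an LLM judge, but a retrieval number like this can run on every change to chunking or search settings.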
Common Failures and Fixes
- "I don't have that information" when the answer exists in your docs: Usually a chunking or embedding issue. The relevant chunk wasn't retrieved. Try smaller chunks, hybrid search, or query expansion.
- Hallucinated answers: The LLM is generating beyond the retrieved context. Strengthen the system prompt, reduce temperature to 0, or add a verification step that checks the answer against the source chunks.
- Outdated answers: Your ingestion pipeline isn't refreshing. Set up incremental ingestion that detects document changes and re-embeds only modified content.
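For the outdated-answers fix, change detection can be as simple as comparing content hashes. This sketch assumes documents arrive as an ID-to-text mapping and that previously seen hashes are kept in a dict; a real pipeline would persist them. The function name and sample documents are illustrative.

```python
import hashlib

def changed_documents(documents: dict[str, str], known_hashes: dict[str, str]) -> list[str]:
    """Return IDs of new or modified documents and record their current hashes."""
    changed = []
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if known_hashes.get(doc_id) != digest:
            changed.append(doc_id)
            known_hashes[doc_id] = digest
    return changed

# Made-up documents: the second run changes only the handbook.
seen: dict[str, str] = {}
docs_v1 = {"handbook": "PTO is 15 days.", "faq": "Reset via settings."}
first_run = changed_documents(docs_v1, seen)

docs_v2 = {**docs_v1, "handbook": "PTO is 20 days."}
second_run = changed_documents(docs_v2, seen)
```

Only the IDs returned by each run need to be re-chunked and re-embedded. Deleted documents are not handled here and would need a separate pass.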
Cost Optimization
RAG costs come from three places: embedding generation, vector storage, and LLM generation. For cost optimization:
- Cache frequent queries — if 20% of questions are identical, cache the answers
- Use tiered models — Haiku for simple lookups, Sonnet for complex reasoning
- Batch embedding updates — re-embed changed documents nightly, not in real-time
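Query caching from the list above can be sketched with a dict keyed on a normalized form of the question. The normalization here (lowercasing and collapsing whitespace) is an assumption; it only catches near-identical wording. The class name and stub pipeline are ours.

```python
from typing import Callable

class AnswerCache:
    """In-memory cache keyed on a normalized form of the question."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(question: str) -> str:
        # Lowercase and collapse whitespace so trivial variations share a key.
        return " ".join(question.lower().split())

    def get_or_compute(self, question: str, compute: Callable[[str], str]) -> str:
        key = self._key(question)
        if key not in self._store:
            self._store[key] = compute(question)
        return self._store[key]

# Stub pipeline that records how often it is actually called.
calls: list[str] = []

def fake_rag_pipeline(question: str) -> str:
    calls.append(question)
    return "answer text"

cache = AnswerCache()
first = cache.get_or_compute("What is the PTO policy?", fake_rag_pipeline)
second = cache.get_or_compute("  what is the PTO   policy? ", fake_rag_pipeline)
```

Cached answers go stale when the underlying documents change, so entries should be cleared or expired whenever ingestion re-embeds content.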
Frequently Asked Questions
How many documents can a RAG system handle?
Practically unlimited. We've built RAG systems with 500K+ documents. The vector database handles the scale — query latency stays under 100ms even with millions of vectors on systems like Pinecone or Milvus. The bottleneck is usually the initial embedding of all documents, which can take hours for very large corpora.
RAG vs. fine-tuning: when should I fine-tune instead?
Fine-tune when you need to change the model's behavior (writing style, domain-specific reasoning patterns, output format). Use RAG when you need to give the model access to information. Most business use cases need RAG, not fine-tuning. Some advanced cases benefit from both — a fine-tuned model that also uses RAG for current information.
Can RAG work with images and tables, not just text?
Yes, with extra work. For images: use multimodal embedding models or describe images as text before embedding. For tables: extract table data as structured text (CSV-like format) and embed that. PDF tables are the hardest — we use specialized parsers (like Unstructured.io) to extract clean table data before embedding.
How do I handle documents that contradict each other?
Metadata is key. Attach dates, version numbers, and authority levels to chunks. In the prompt, instruct the LLM to prefer the most recent version and flag contradictions: "The 2026 policy states X, though the 2024 version said Y." Users appreciate transparency about conflicting information rather than the AI silently picking one.