RAG (Retrieval-Augmented Generation): Complete Implementation Guide

In this article

What is RAG?
RAG Architecture
Document Chunking
Embeddings & Vector Storage
Retrieval Strategies
Response Generation
Production Considerations
FAQ

Every week, someone asks us to build "a chatbot that knows about our company." Nine times out of ten, what they actually need is a RAG system. And nine times out of ten, the first version they tried (usually a weekend hackathon project) gave terrible answers because the retrieval was poorly implemented.

We've built RAG systems for knowledge bases, customer support, legal document search, financial analysis, and internal company assistants. The pattern is the same every time — the difference between a RAG demo and a production RAG system is in the details. This guide covers those details.

What is RAG and Why Does It Matter?

Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with LLM text generation. Instead of relying solely on what the model learned during training, a RAG system fetches relevant documents from your data and includes them in the LLM's context, so it can generate answers grounded in your actual information.

Why not just fine-tune the model on your data? Three reasons:

Cost: Fine-tuning is expensive ($500-50,000+ depending on data volume and model). RAG adds little beyond vector storage, embedding calls, and the extra prompt tokens for the retrieved context.
Freshness: Fine-tuning creates a point-in-time snapshot. RAG retrieves whatever is currently in the index, so once an updated doc is re-ingested, answers reflect it with no retraining needed.
Traceability: RAG can cite its sources. "Based on your policy document updated March 2026, section 4.2..." Fine-tuned models can't tell you where their answer came from.

For most business applications, RAG is the right choice. Fine-tuning only makes sense when you need to change the model's behavior or writing style fundamentally, not just give it access to new information.

The RAG Architecture

A production RAG system has two main phases:

Phase 1: Ingestion Pipeline (Offline)

Load documents — PDFs, web pages, databases, APIs, Confluence pages, Slack messages — whatever holds your knowledge
Chunk documents — Split into meaningful segments (this is where most RAG systems fail)
Generate embeddings — Convert each chunk into a vector (a list of numbers that captures meaning)
Store in vector database — Pinecone, Weaviate, pgvector, or Milvus

Phase 2: Query Pipeline (Real-time)

Receive user question — "What's our refund policy for enterprise customers?"
Generate query embedding — Convert the question to a vector
Retrieve relevant chunks — Find the most similar document chunks by vector similarity
Augment the prompt — Inject retrieved chunks into the LLM's context
Generate response — LLM answers based on the retrieved information

Simple in concept. The devil is in every single step.
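To make the query pipeline concrete, here is a minimal sketch of the embed, retrieve, and rank steps. It is a toy under stated assumptions, not a production design: `embed` is a hypothetical stand-in that builds a bag-of-words count vector instead of calling a real embedding model, and the in-memory `index` list stands in for a vector database.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Return the k chunks whose vectors are most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


chunks = [
    "Enterprise customers can request a refund within 60 days.",
    "Offices close on public holidays.",
    "Refund requests are processed by the billing team.",
]
index = [(c, embed(c)) for c in chunks]  # ingestion: embed once, store
top = retrieve("What's our refund policy for enterprise customers?", index, k=2)
```

In a real system the same `embed` call would be made against an embedding API at both ingestion and query time, and `retrieve` would be a vector database query.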

Document Chunking: Where Most RAG Systems Fail

If your chunking is bad, your retrieval is bad, and your answers are bad. No amount of prompt engineering can fix retrieving the wrong documents. We learned this the hard way on our first major RAG deployment — the answers were plausible-sounding but wrong because the chunks split information across unnatural boundaries.

Chunking Strategies

Fixed-size chunking (512-1024 tokens per chunk): Simple but dumb. It splits mid-sentence, separates questions from answers, and breaks tables. Use this only for homogeneous text with no structure.

Recursive character splitting: Better — splits on paragraph boundaries, then sentence boundaries, then word boundaries. Most LangChain tutorials use this. It's decent for general-purpose text but still ignores document structure.
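A minimal sketch of the recursive idea, written from scratch rather than copied from any library: try the coarsest separator first, recurse into pieces that are still too long, then greedily merge neighbours that fit. The function name, the separator list, and the character-based `max_chars` limit are all assumptions for illustration.

```python
def recursive_split(
    text: str,
    max_chars: int,
    separators: tuple[str, ...] = ("\n\n", ". ", " "),
) -> list[str]:
    """Split on paragraphs, then sentences, then words, until pieces fit."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    pieces: list[str] = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, max_chars, finer))

    # Greedily merge adjacent pieces back together while they still fit.
    # Note: the separator is dropped wherever a chunk boundary ends up.
    merged: list[str] = []
    for piece in pieces:
        if merged and len(merged[-1]) + len(sep) + len(piece) <= max_chars:
            merged[-1] += sep + piece
        else:
            merged.append(piece)
    return merged


sample = (
    "First paragraph is short.\n\n"
    "Second paragraph has two sentences. Here is the second one."
)
split_chunks = recursive_split(sample, max_chars=40)
```

With a 40-character limit, the short paragraph stays whole and the long one is split at its sentence boundary.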

Semantic chunking: Our preferred approach. Use an embedding model to detect topic shifts and split at natural semantic boundaries. Chunks represent coherent ideas rather than arbitrary character counts. More expensive to compute but dramatically improves retrieval quality.
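One simple way to detect a topic shift is to compare each sentence with the one before it and start a new chunk when similarity drops below a threshold. The sketch below assumes that approach. The bag-of-words `embed` function is a hypothetical stand-in for a real embedding model, and the `threshold` value is arbitrary and would need tuning on real data.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.1) -> list[str]:
    """Group consecutive sentences; break where adjacent similarity drops."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            groups.append([cur])  # topic shift: start a new chunk
        else:
            groups[-1].append(cur)
    return [" ".join(group) for group in groups]


sentences = [
    "Refunds are issued within 30 days.",
    "Refunds require a receipt.",
    "Parking is free for visitors.",
    "Visitors must register for parking.",
]
topic_chunks = semantic_chunks(sentences)
```

Here the refund sentences and the parking sentences end up in separate chunks because the second and third sentences share no terms.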

Document-structure-aware chunking: Best for structured documents (contracts, technical docs, knowledge bases). Parse the document's heading structure and chunk by section. A "Refund Policy" section becomes one chunk, regardless of length. This preserves the author's organizational intent.
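As a rough illustration of chunking by section, the sketch below assumes the document has already been converted to text where headings start with `#` (a Markdown-style convention chosen for the example; real contracts or PDFs need a parser that recovers headings first). The function name and the `"Preamble"` label are hypothetical.

```python
def chunk_by_heading(doc: str) -> list[dict]:
    """Emit one chunk per heading, keeping the heading as metadata."""
    chunks: list[dict] = []
    heading, body = "Preamble", []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading": heading, "text": text})

    for line in doc.splitlines():
        if line.startswith("#"):
            flush()  # close the previous section
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()  # close the final section
    return chunks


doc = (
    "# Refund Policy\n"
    "Refunds are issued within 30 days.\n"
    "A receipt is required.\n"
    "\n"
    "# Shipping\n"
    "Orders ship in 2 days.\n"
)
sections = chunk_by_heading(doc)
```

Each section stays intact regardless of length, and the heading travels with the chunk so it can be used for filtering and citation.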

Chunk Size: The Trade-off

Smaller chunks (256-512 tokens) give more precise retrieval, but the LLM may miss surrounding context. Larger chunks (1024-2048 tokens) carry more context per retrieval, but precision drops and the context window fills up faster.

Our default: 512 tokens with 50-token overlap for general text. 1024 tokens for technical documentation where sections are self-contained. Always test with your actual data — the right size depends on your content.
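To show what "512 tokens with 50-token overlap" means mechanically, here is a small sliding-window sketch. It operates on a list of tokens and leaves tokenization to the caller; the function name is hypothetical, and if you approximate tokens by splitting on whitespace, the counts will differ from what a real tokenizer reports.

```python
def chunk_with_overlap(tokens: list, size: int = 512, overlap: int = 50) -> list[list]:
    """Slide a window of `size` tokens, stepping by `size - overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not tokens:
        return []
    step = size - overlap
    # Stop before a window that would contain only already-covered overlap.
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + size] for i in range(0, last_start, step)]


tokens = list(range(1000))  # stand-in for 1000 real tokens
windows = chunk_with_overlap(tokens, size=512, overlap=50)
```

For 1000 tokens this yields windows starting at positions 0, 462, and 924, with each window repeating the last 50 tokens of the previous one.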

Metadata is Non-Negotiable

Every chunk should carry metadata: source document, page number, section heading, last-modified date, document type, and any relevant tags. This metadata enables filtering ("only search in HR policies") and citation ("Source: Employee Handbook, Section 3.2, updated January 2026").
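One possible shape for a chunk record is sketched below. The field names are assumptions chosen to mirror the list above, not a schema required by any vector database, and `citation` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Chunk:
    text: str
    source: str            # source document name
    section: str           # section heading or number
    page: int
    last_modified: date
    doc_type: str          # e.g. "hr_policy", used for filtering
    tags: list[str] = field(default_factory=list)

    def citation(self) -> str:
        """Format a human-readable source line for the answer."""
        return (
            f"Source: {self.source}, Section {self.section}, "
            f"updated {self.last_modified:%B %Y}"
        )


chunk = Chunk(
    text="Employees accrue PTO monthly.",
    source="Employee Handbook",
    section="3.2",
    page=14,
    last_modified=date(2026, 1, 10),
    doc_type="hr_policy",
    tags=["pto"],
)
citation = chunk.citation()
```

Most vector databases let you store a record like this as a metadata payload alongside the vector, so the same fields serve both filtering and citation.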

Embeddings & Vector Storage

Choosing an Embedding Model

The embedding model converts text into vectors. Better embeddings = better retrieval. Our recommendations:

OpenAI text-embedding-3-large: Best overall quality. $0.13 per million tokens. Our default for production.
Cohere embed-v3: Excellent for multilingual content. Slightly cheaper than OpenAI.
Open-source (BGE, E5): Free to run locally. 90-95% of commercial model quality. Good for sensitive data that can't leave your infrastructure.

One critical rule: use the same embedding model for ingestion and queries. Mixing models (e.g., embedding docs with OpenAI but querying with Cohere) produces terrible results because the vector spaces don't align.

Vector Database Options

Database	Best For	Cost
pgvector	Teams already on PostgreSQL. Simple setup, no new infrastructure.	Free (uses existing DB)
Pinecone	Managed service, fast scaling. Best for teams that don't want to manage infrastructure.	$70+/month
Weaviate	Hybrid search (vector + keyword). Strong filtering. Open-source option.	Free (self-hosted) or managed
Milvus	Very large datasets (100M+ vectors). Best raw performance.	Free (self-hosted)

For most projects, we start with pgvector. If you're already running PostgreSQL (and most companies are), adding vector search takes 30 minutes and zero new infrastructure. We migrate to Pinecone or Weaviate only when we need dedicated vector search features or the dataset exceeds what pgvector handles efficiently (roughly 5M+ vectors).

Retrieval Strategies That Actually Work

Basic vector similarity search gets you 70% of the way. These techniques get you to 95%:

Hybrid Search (Vector + Keyword)

Pure vector search misses exact keyword matches (product names, error codes, specific terms). Hybrid search combines vector similarity with BM25 keyword search, with a tunable weight on each score. We use a 70/30 vector/keyword split by default; adjust it based on your content type.
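A sketch of how a weighted split can be applied, assuming you already have per-document scores from a vector search and a BM25 search. The two score scales are not comparable, so this example min-max normalizes each before weighting. That normalization step is an assumption of the sketch, and other fusion methods exist. Function names are hypothetical.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
    vector_weight: float = 0.7,
) -> list[tuple[str, float]]:
    """Combine normalized vector and keyword scores with a fixed weight."""
    v, k = min_max(vector_scores), min_max(keyword_scores)
    combined = {
        doc: vector_weight * v.get(doc, 0.0) + (1 - vector_weight) * k.get(doc, 0.0)
        for doc in v.keys() | k.keys()
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


vector_scores = {"a": 0.82, "b": 0.80, "c": 0.60}   # cosine similarities
keyword_scores = {"b": 12.0, "d": 9.0, "c": 3.0}    # BM25 scores
ranking = [doc for doc, _ in hybrid_rank(vector_scores, keyword_scores)]
```

Document "b" wins because it scores well on both signals, while "d", which only the keyword search found, still makes the list.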

Query Expansion

Before searching, use an LLM to rephrase the user's question into 2-3 alternative formulations. "How do I reset my password?" becomes ["password reset process", "change account password", "forgot password steps"]. Search with all variants and merge results. This catches documents that describe the answer differently than how the user asked.
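The merge step can be sketched as below. The LLM call that produces the variants is left out, and `search` is a hypothetical callable returning `(doc_id, score)` pairs. Keeping each document's best score across variants is one simple merge rule assumed here, not the only option.

```python
from typing import Callable

SearchFn = Callable[[str], list[tuple[str, float]]]


def search_with_expansion(
    variants: list[str], search: SearchFn, k: int = 5
) -> list[tuple[str, float]]:
    """Search every query variant and keep each document's best score."""
    best: dict[str, float] = {}
    for query in variants:
        for doc_id, score in search(query):
            best[doc_id] = max(score, best.get(doc_id, float("-inf")))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:k]


# Canned results standing in for a real vector search (made-up scores).
fake_results = {
    "password reset process": [("doc-reset", 0.91), ("doc-login", 0.55)],
    "change account password": [("doc-settings", 0.80), ("doc-reset", 0.62)],
    "forgot password steps": [("doc-reset", 0.88), ("doc-faq", 0.70)],
}
merged = search_with_expansion(list(fake_results), fake_results.__getitem__, k=3)
```

The document found by all three variants appears once, with its highest score, and documents surfaced by only one variant still reach the merged list.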

Re-ranking

Retrieve the top 20 results with vector search, then use a cross-encoder model (like Cohere Rerank) to re-score them based on deeper semantic understanding. This is slower but dramatically more accurate than vector similarity alone. We use it for every production system.
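The two-stage pattern looks roughly like this. The sketch keeps the re-scoring function pluggable: `overlap_score` is a deliberately crude, hypothetical placeholder for a cross-encoder, used only so the example runs without a model or API key.

```python
import re
from typing import Callable


def overlap_score(query: str, passage: str) -> float:
    """Placeholder scorer (assumption): share of query terms in the passage."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    p = set(re.findall(r"[a-z0-9]+", passage.lower()))
    return len(q & p) / len(q) if q else 0.0


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Re-score first-stage candidates with a slower, more precise scorer."""
    scored = [(score_pair(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [passage for _, passage in scored[:top_n]]


# Pretend these came back from the first-stage vector search.
candidates = [
    "Passwords must be 12 characters long.",
    "To reset your password, open Settings and choose Reset Password.",
    "Our office hours are 9 to 5.",
]
best = rerank("How do I reset my password?", candidates, overlap_score, top_n=2)
```

In practice `score_pair` would call a cross-encoder or a hosted rerank endpoint, which reads the query and passage together instead of comparing two precomputed vectors.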

Metadata Filtering

Before vector search, filter by metadata: document type, date range, department, access level. "What's the current PTO policy?" should only search HR documents from the last year, not engineering docs from 2023.
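A minimal sketch of pre-filtering, assuming chunks are stored as dictionaries with `doc_type` and `last_modified` keys (hypothetical field names). In a real deployment this filter would usually be pushed down into the vector database query rather than run in Python.

```python
from datetime import date


def filter_chunks(chunks: list[dict], doc_type: str, newer_than: date) -> list[dict]:
    """Keep only chunks of the given type modified on or after the cutoff."""
    return [
        c for c in chunks
        if c["doc_type"] == doc_type and c["last_modified"] >= newer_than
    ]


all_chunks = [
    {"text": "PTO accrues monthly.", "doc_type": "hr_policy",
     "last_modified": date(2026, 1, 15)},
    {"text": "Old PTO rules.", "doc_type": "hr_policy",
     "last_modified": date(2023, 3, 1)},
    {"text": "Deploy with CI.", "doc_type": "engineering",
     "last_modified": date(2026, 2, 1)},
]
candidates = filter_chunks(all_chunks, "hr_policy", newer_than=date(2025, 3, 1))
```

Only the current HR chunk survives, so the vector search that follows cannot return the stale policy or the engineering doc.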

Response Generation

The Prompt Template

A good RAG prompt has four parts: a role, grounding instructions, the retrieved context, and the user's question:

You are an assistant for {company_name}.
Answer questions based ONLY on the provided context.
If the context doesn't contain the answer, say so honestly.
Always cite your sources with document name and section.

Context:
{retrieved_documents}

Question: {user_question}

Answer:

Two instructions are critical: "based ONLY on the provided context" (reduces hallucination) and "if the context doesn't contain the answer, say so" (reduces confident wrong answers). Without these, the LLM will happily make up answers that sound plausible but are wrong.
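Filling the template is plain string formatting. The sketch below assumes each retrieved chunk is a dictionary with `source`, `section`, and `text` keys (hypothetical names) and labels each chunk with its source so the model has something to cite.

```python
PROMPT_TEMPLATE = """You are an assistant for {company_name}.
Answer questions based ONLY on the provided context.
If the context doesn't contain the answer, say so honestly.
Always cite your sources with document name and section.

Context:
{retrieved_documents}

Question: {user_question}

Answer:"""


def build_prompt(company_name: str, chunks: list[dict], user_question: str) -> str:
    """Insert labeled chunks and the question into the template."""
    context = "\n\n".join(
        f"[{c['source']}, {c['section']}]\n{c['text']}" for c in chunks
    )
    return PROMPT_TEMPLATE.format(
        company_name=company_name,
        retrieved_documents=context,
        user_question=user_question,
    )


prompt = build_prompt(
    "Acme Corp",
    [{"source": "Employee Handbook", "section": "3.2",
      "text": "Employees accrue 1.5 PTO days per month."}],
    "How much PTO do I get?",
)
```

The company name and handbook text are made-up sample values. The resulting string is what gets sent to the generation model.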

Choosing the Generation Model

For RAG, you don't always need the biggest model. Claude Haiku or GPT-4o-mini handle straightforward Q&A well at roughly 1/10th the cost of Claude Opus. Save the premium models for complex reasoning tasks: multi-step analysis, comparing across documents, synthesizing contradictory information.

Production Considerations

Evaluation: How Do You Know It's Working?

Build a test set of 50-100 question-answer pairs from your actual users. Run them through your RAG pipeline and measure:

Retrieval accuracy: Did the correct documents appear in the top 5 results?
Answer correctness: Is the generated answer factually correct based on the source documents?
Faithfulness: Did the answer stay within the retrieved context (no hallucination)?
Latency: Is end-to-end response time under 3 seconds, a reasonable bar for good UX?
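Retrieval accuracy is the easiest of these to automate. The sketch below computes a hit rate at k, assuming each test case pairs a question with the ID of the document that should be retrieved, and that `retrieve` is a hypothetical callable returning ranked document IDs.

```python
from typing import Callable


def hit_rate_at_k(
    test_set: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected document is in the top k."""
    if not test_set:
        raise ValueError("test_set is empty")
    hits = sum(1 for question, expected in test_set if expected in retrieve(question)[:k])
    return hits / len(test_set)


# Canned retrieval results standing in for a real pipeline.
fake_retrieval = {
    "What is the refund window?": ["refund-policy", "billing-faq"],
    "How do I reset my password?": ["login-help", "password-help"],
    "Who approves PTO?": ["org-chart", "benefits-overview"],
}
test_set = [
    ("What is the refund window?", "refund-policy"),
    ("How do I reset my password?", "password-help"),
    ("Who approves PTO?", "pto-policy"),
]
rate = hit_rate_at_k(test_set, fake_retrieval.__getitem__, k=5)
```

Two of the three expected documents are retrieved here, so the hit rate is two thirds. Answer correctness and faithfulness need human or LLM-based grading and are not covered by this sketch.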

Common Failures and Fixes

"I don't have that information" when it should: Usually a chunking or embedding issue. The answer exists in your docs but the relevant chunk wasn't retrieved. Try smaller chunks, hybrid search, or query expansion.
Hallucinated answers: The LLM is generating beyond the retrieved context. Strengthen the system prompt, reduce temperature to 0, or add a verification step that checks the answer against the source chunks.
Outdated answers: Your ingestion pipeline isn't refreshing. Set up incremental ingestion that detects document changes and re-embeds only modified content.
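For the outdated-answers case, one common way to detect changed documents is to keep a content hash per document and re-embed only when the hash differs. The sketch below assumes that approach. `seen_hashes` is a hypothetical in-memory store that a real pipeline would persist between runs.

```python
import hashlib


def changed_documents(docs: dict[str, str], seen_hashes: dict[str, str]) -> list[str]:
    """Return IDs of new or modified documents and update the hash store."""
    changed = []
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if seen_hashes.get(doc_id) != digest:
            changed.append(doc_id)
            seen_hashes[doc_id] = digest
    return changed


seen: dict[str, str] = {}
docs = {"handbook": "PTO is 15 days.", "faq": "Support is open 9 to 5."}
first_run = changed_documents(docs, seen)    # everything is new

docs["handbook"] = "PTO is 20 days."         # one document is edited
second_run = changed_documents(docs, seen)   # only the edit is picked up
```

Only the IDs returned by `changed_documents` need to be re-chunked and re-embedded. Deleted documents are not handled in this sketch and would need a separate pass.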

Cost Optimization

RAG costs come from three places: embedding generation, vector storage, and LLM generation. For cost optimization:

Cache frequent queries — if 20% of questions are identical, cache the answers
Use tiered models — Haiku for simple lookups, Sonnet for complex reasoning
Batch embedding updates — re-embed changed documents nightly, not in real-time
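A query cache can be as small as a dictionary keyed on a normalized question. The sketch below assumes exact-match caching after lowercasing and collapsing whitespace. `answer_fn` is a hypothetical stand-in for the full RAG pipeline, and cache invalidation when documents change is left out.

```python
from typing import Callable


class AnswerCache:
    """Cache answers for questions that are identical after normalization."""

    def __init__(self, answer_fn: Callable[[str], str]) -> None:
        self._answer_fn = answer_fn
        self._cache: dict[str, str] = {}
        self.misses = 0

    @staticmethod
    def _key(question: str) -> str:
        return " ".join(question.lower().split())

    def ask(self, question: str) -> str:
        key = self._key(question)
        if key not in self._cache:
            self.misses += 1  # only cache misses reach the pipeline
            self._cache[key] = self._answer_fn(question)
        return self._cache[key]


calls: list[str] = []


def fake_pipeline(question: str) -> str:
    calls.append(question)
    return "Refunds are available within 30 days."


cache = AnswerCache(fake_pipeline)
first = cache.ask("What is the refund policy?")
second = cache.ask("  what is the REFUND policy? ")
```

The second question differs only in case and spacing, so it is served from the cache and the pipeline runs once.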

Our AI consulting team has built RAG systems across healthcare, legal, finance, and SaaS. Book a consultation if you'd like help designing a RAG architecture for your specific use case.

Frequently Asked Questions

How many documents can a RAG system handle?

Practically unlimited. We've built RAG systems with 500K+ documents. The vector database handles the scale — query latency stays under 100ms even with millions of vectors on systems like Pinecone or Milvus. The bottleneck is usually the initial embedding of all documents, which can take hours for very large corpora.

RAG vs. fine-tuning: when should I fine-tune instead?

Fine-tune when you need to change the model's behavior (writing style, domain-specific reasoning patterns, output format). Use RAG when you need to give the model access to information. Most business use cases need RAG, not fine-tuning. Some advanced cases benefit from both — a fine-tuned model that also uses RAG for current information.

Can RAG work with images and tables, not just text?

Yes, with extra work. For images: use multimodal embedding models or describe images as text before embedding. For tables: extract table data as structured text (CSV-like format) and embed that. PDF tables are the hardest — we use specialized parsers (like Unstructured.io) to extract clean table data before embedding.

How do I handle documents that contradict each other?

Metadata is key. Attach dates, version numbers, and authority levels to chunks. In the prompt, instruct the LLM to prefer the most recent version and flag contradictions: "The 2026 policy states X, though the 2024 version said Y." Users appreciate transparency about conflicting information rather than the AI silently picking one.

Need a RAG System Built for Your Data?

We've built RAG systems for knowledge bases, legal search, customer support, and financial analysis. Let us design one for your specific use case.
