Build a RAG System with Claude: Python Implementation Guide
RAG (Retrieval-Augmented Generation) lets Claude answer questions using your documents — product manuals, codebases, knowledge bases, or any text corpus. The pattern: embed documents as vectors, store them, retrieve the most relevant chunks when a question is asked, inject them into Claude's context, and get a grounded answer. This guide covers a complete working implementation from document ingestion to query response.
Why RAG instead of just stuffing documents into context
With Claude's 200K context window, you could theoretically put your entire knowledge base in every request. For small corpora (under 100 documents), this works fine. For larger corpora, RAG wins:
- Cost: 100,000 tokens per request × 1,000 requests/day = 100M input tokens/day, roughly $300/day at Sonnet's $3 per million input tokens (quick arithmetic below). RAG retrieves 5–10 relevant chunks (~5,000 tokens) instead.
- Quality: focused context produces better answers than overwhelming Claude with irrelevant content.
- Scale: a 10,000-document knowledge base doesn't fit in context; a retrieval system handles any size.
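For a quick sanity check on those numbers, here is the arithmetic as a tiny script (a sketch; the $3-per-million-input-tokens Sonnet price and the 5,000-token RAG request are the assumptions from the cost bullet above):

# Back-of-the-envelope: full-context prompting vs. RAG, input tokens only
PRICE_PER_M_INPUT = 3.00        # USD per million input tokens (Sonnet, assumed)
REQUESTS_PER_DAY = 1_000

full_context_tokens = 100_000   # whole knowledge base in every request
rag_tokens = 5_000              # 5-10 retrieved chunks plus the question

full_cost = full_context_tokens * REQUESTS_PER_DAY / 1_000_000 * PRICE_PER_M_INPUT
rag_cost = rag_tokens * REQUESTS_PER_DAY / 1_000_000 * PRICE_PER_M_INPUT
print(f"Full context: ${full_cost:.0f}/day vs RAG: ${rag_cost:.0f}/day")  # ~$300 vs ~$15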
Architecture overview
Documents → Chunking → Embedding → Vector Store
                                        ↓
User Query → Embedding → Vector Search → Top-K chunks → Claude → Answer
Step 1: Install dependencies
pip install anthropic openai pgvector psycopg2-binary python-dotenv
We use:
- anthropic — Claude for answer generation
- openai — embeddings (text-embedding-3-small, cheapest + good quality)
- pgvector — vector similarity search in PostgreSQL
- Alternatively: replace with chromadb or pinecone for in-process or managed vector storage
Step 2: Document chunking
from typing import List
import re

def chunk_text(
    text: str,
    chunk_size: int = 1000,
    overlap: int = 200,
) -> List[str]:
    """
    Split text into overlapping chunks.

    chunk_size: target characters per chunk (~250 tokens)
    overlap: characters to repeat between chunks (preserves context at boundaries)
    """
    # Clean whitespace
    text = re.sub(r'\n{3,}', '\n\n', text.strip())

    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size

        # Try to end at a sentence or paragraph boundary
        if end < len(text):
            # Look for a boundary near the target
            sentence_end = text.rfind('. ', start, end)
            paragraph_end = text.rfind('\n\n', start, end)
            boundary = max(sentence_end, paragraph_end)
            if boundary > start + chunk_size // 2:
                end = boundary + 1

        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)

        # Stop once the end of the text is reached (avoids a duplicate tail chunk)
        if end >= len(text):
            break

        # Move forward with overlap
        start = end - overlap

    return chunks

# Usage
with open("product_manual.txt") as f:
    text = f.read()

chunks = chunk_text(text, chunk_size=1000, overlap=200)
print(f"Created {len(chunks)} chunks")
Step 3: Generate embeddings
from openai import OpenAI
from typing import List
from dotenv import load_dotenv

load_dotenv()  # optional: pulls OPENAI_API_KEY (and DATABASE_URL) from a local .env file

openai_client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def embed_texts(texts: List[str], batch_size: int = 100) -> List[List[float]]:
    """
    Generate embeddings for a list of texts.
    Batches requests to avoid rate limits.
    """
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
        print(f"Embedded {min(i + batch_size, len(texts))}/{len(texts)} texts")
    return all_embeddings
Alternative: use a free, local embedding model via sentence-transformers to avoid OpenAI costs:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # Free, runs locally

def embed_texts_local(texts: List[str]) -> List[List[float]]:
    # Note: this model produces 384-dimensional vectors, so change the
    # vector(1536) column in Step 4 to vector(384) if you use it
    embeddings = model.encode(texts, show_progress_bar=True)
    return embeddings.tolist()
Step 4: Store in a vector database
Option A: PostgreSQL with pgvector (recommended for production)
import os
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

# Requires: PostgreSQL with the pgvector extension
# Neon (neon.tech) supports pgvector on their free tier
conn = psycopg2.connect(os.environ["DATABASE_URL"])

def setup_vector_table(conn):
    """Create the documents table with vector support."""
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS documents (
                id SERIAL PRIMARY KEY,
                source TEXT,
                chunk_index INTEGER,
                content TEXT,
                embedding vector(1536),  -- 1536 for text-embedding-3-small
                metadata JSONB DEFAULT '{}'
            )
        """)
        cur.execute("""
            CREATE INDEX IF NOT EXISTS idx_documents_embedding
            ON documents USING ivfflat (embedding vector_cosine_ops)
            WITH (lists = 100)
        """)
    conn.commit()
    # Teach psycopg2 how to send/read pgvector values (numpy array <-> vector)
    register_vector(conn)
def store_documents(conn, source: str, chunks: List[str], embeddings: List[List[float]]):
    """Store document chunks with their embeddings."""
    with conn.cursor() as cur:
        for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
            cur.execute(
                """INSERT INTO documents (source, chunk_index, content, embedding)
                   VALUES (%s, %s, %s, %s)""",
                # np.array + register_vector lets psycopg2 adapt the embedding to a vector value
                (source, i, chunk, np.array(embedding))
            )
    conn.commit()
    print(f"Stored {len(chunks)} chunks from {source}")
def retrieve_relevant_chunks(
    conn,
    query_embedding: List[float],
    top_k: int = 5,
    similarity_threshold: float = 0.7,
) -> List[dict]:
    """Find chunks most similar to the query."""
    # Note: the useful threshold depends on the corpus and embedding model;
    # cosine similarities from text-embedding-3-small often fall below 0.7,
    # so lower this value if retrieval comes back empty
    query_vec = np.array(query_embedding)
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT content, source, chunk_index,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            WHERE 1 - (embedding <=> %s::vector) > %s
            ORDER BY similarity DESC
            LIMIT %s
            """,
            (query_vec, query_vec, similarity_threshold, top_k)
        )
        return [
            {
                "content": row[0],
                "source": row[1],
                "chunk_index": row[2],
                "similarity": row[3],
            }
            for row in cur.fetchall()
        ]
Option B: ChromaDB (in-process, great for development)
import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.create_collection("documents")

def store_documents_chroma(chunks: List[str], embeddings: List[List[float]], source: str):
    collection.add(
        embeddings=embeddings,
        documents=chunks,
        ids=[f"{source}-{i}" for i in range(len(chunks))],
        metadatas=[{"source": source} for _ in chunks],
    )

def retrieve_relevant_chunks_chroma(query_embedding: List[float], top_k: int = 5):
    results = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    return [
        {"content": doc, "source": meta["source"]}
        for doc, meta in zip(results["documents"][0], results["metadatas"][0])
    ]
Step 5: The RAG query function
import anthropic

claude_client = anthropic.Anthropic()

def rag_query(
    question: str,
    conn,
    top_k: int = 5,
    model: str = "claude-sonnet-4-5",
) -> str:
    """
    Answer a question using retrieved document context.
    """
    # Embed the question
    query_embedding = embed_texts([question])[0]

    # Retrieve relevant chunks
    chunks = retrieve_relevant_chunks(conn, query_embedding, top_k=top_k)
    if not chunks:
        return "I couldn't find relevant information in the knowledge base to answer this question."

    # Build context
    context = "\n\n---\n\n".join([
        f"[{chunk['source']}]\n{chunk['content']}"
        for chunk in chunks
    ])

    # Ask Claude with the retrieved context
    response = claude_client.messages.create(
        model=model,
        max_tokens=1024,
        system="""You are a helpful assistant that answers questions based on provided documentation.

Rules:
- Answer ONLY from the provided context
- If the context doesn't contain the answer, say "The documentation doesn't cover this"
- Cite the source document for each key claim: [source_name]
- Be concise and direct""",
        messages=[{
            "role": "user",
            "content": f"""Context:

{context}

Question: {question}"""
        }]
    )
    return response.content[0].text

# Usage
answer = rag_query(
    question="How do I configure rate limiting for the API?",
    conn=conn,
)
print(answer)
Complete pipeline: ingest + query
def ingest_document(filepath: str, conn):
    """Full pipeline: file → chunks → embeddings → stored."""
    with open(filepath) as f:
        text = f.read()
    source = os.path.basename(filepath)
    chunks = chunk_text(text)
    embeddings = embed_texts(chunks)
    store_documents(conn, source, chunks, embeddings)
    print(f"Ingested {source}: {len(chunks)} chunks")

# Setup
setup_vector_table(conn)

# Ingest documents
for doc_path in ["manual.txt", "faq.txt", "api-reference.txt"]:
    ingest_document(doc_path, conn)

# Query
answer = rag_query("What are the API rate limits?", conn)
print(answer)
Frequently asked questions
What's the best chunk size for RAG? 500–1,000 characters (roughly 125–250 tokens) works well for most documents. Too small: individual chunks lack context. Too large: retrieved chunks include irrelevant content. For structured documents (FAQs, numbered lists), chunk at the natural section boundaries rather than by character count.
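If your documents already have clear section breaks, a minimal splitter along those boundaries might look like this (a sketch assuming sections are separated by blank lines; chunk_by_sections is an illustrative helper, not part of the pipeline above):

def chunk_by_sections(text: str, max_chars: int = 1000) -> list[str]:
    """Group blank-line-separated sections into chunks of roughly max_chars."""
    sections = [s.strip() for s in text.split("\n\n") if s.strip()]
    chunks, current = [], ""
    for section in sections:
        # Start a new chunk when adding this section would exceed the target size
        if current and len(current) + len(section) > max_chars:
            chunks.append(current)
            current = section
        else:
            current = f"{current}\n\n{section}" if current else section
    if current:
        chunks.append(current)
    return chunks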
What embedding model should I use with Claude?
OpenAI's text-embedding-3-small ($0.02/M tokens) is a good default — cheap, fast, high quality. For fully free/local embeddings, all-MiniLM-L6-v2 from sentence-transformers is excellent for most use cases. Claude doesn't provide its own embedding API.
How many chunks should I retrieve (top_k)? Start with 5. Too few: answer may be missing key information. Too many: context becomes noisy and Claude's answer quality degrades. For technical documentation where multiple sections cover a topic, 8–10 may work better.
What vector database should I use? For development: ChromaDB (zero config, in-process). For production with existing PostgreSQL: pgvector (Neon supports it free). For large-scale production (1M+ documents): Pinecone or Weaviate. Start simple and upgrade when needed.
Does RAG work with PDFs?
Yes — extract text from PDF first, then chunk and embed. Use pypdf or pdfplumber for text extraction. For scanned PDFs, you'll need OCR (pytesseract or a cloud OCR service).
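A minimal sketch of that flow, assuming the pypdf package and reusing chunk_text, embed_texts, and store_documents from above (ingest_pdf and manual.pdf are illustrative names):

from pypdf import PdfReader

def ingest_pdf(filepath: str, conn):
    reader = PdfReader(filepath)
    # extract_text() returns None for image-only (scanned) pages, hence the `or ""`
    text = "\n\n".join(page.extract_text() or "" for page in reader.pages)
    chunks = chunk_text(text)
    embeddings = embed_texts(chunks)
    store_documents(conn, os.path.basename(filepath), chunks, embeddings)

ingest_pdf("manual.pdf", conn)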
Related guides
- How to Reduce Claude Hallucinations: Practical Techniques — RAG is the primary structural solution
- Memory and State in Claude Agents: Patterns That Scale — semantic memory pattern using vectors
Take It Further
Claude Agent SDK Cookbook: 40 Production Patterns — Pattern 15 covers the complete Production RAG Architecture: chunking strategies, hybrid search (vector + BM25), reranking, evaluation metrics, and the monitoring setup for tracking retrieval quality in production.
→ Get the Agent SDK Cookbook — $49
30-day money-back guarantee. Instant download.