
Build a RAG System with Claude: Python Implementation Guide

How to build a Retrieval-Augmented Generation (RAG) system with Claude and Python — embedding documents, vector search, context injection, and a complete working pipeline from ingestion to answer.


RAG (Retrieval-Augmented Generation) lets Claude answer questions using your documents — product manuals, codebases, knowledge bases, or any text corpus. The pattern: embed documents as vectors, store them, retrieve the most relevant chunks when a question is asked, inject them into Claude's context, and get a grounded answer. This guide covers a complete working implementation from document ingestion to query response.


Why RAG instead of just stuffing documents into context

With Claude's 200K context window, you could theoretically put your entire knowledge base in every request. For small corpora (under 100 documents), this works fine. For larger corpora, RAG wins:

- Cost: you pay for input tokens on every request, so resending the whole corpus gets expensive fast (a rough comparison is sketched below)
- Latency: smaller prompts mean faster responses
- Relevance: Claude answers more accurately when the context contains only the passages that matter, not thousands of unrelated paragraphs
- Scale: once the corpus outgrows 200K tokens, you can't send it all anyway; retrieval becomes the only option
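
To make the cost point concrete, here is a rough back-of-envelope comparison. The prices and corpus size are assumptions for illustration (roughly Claude Sonnet's input rate at the time of writing; check current pricing):

# Back-of-envelope cost comparison (assumed figures; check current Anthropic pricing)
INPUT_PRICE_PER_MTOK = 3.00   # USD per million input tokens (assumption)
CORPUS_TOKENS = 150_000       # hypothetical knowledge base that still fits in context
CHUNK_TOKENS = 250            # ~1,000 characters per chunk (see Step 2)
TOP_K = 5                     # chunks retrieved per query

full_context_cost = CORPUS_TOKENS * INPUT_PRICE_PER_MTOK / 1_000_000
rag_cost = TOP_K * CHUNK_TOKENS * INPUT_PRICE_PER_MTOK / 1_000_000

print(f"Full-context cost per query: ${full_context_cost:.3f}")  # ~$0.45
print(f"RAG cost per query:          ${rag_cost:.4f}")           # ~$0.004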


Architecture overview

Documents → Chunking → Embedding → Vector Store
                                          ↓
User Query → Embedding → Vector Search → Top-K chunks → Claude → Answer
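
In code, the two rows of the diagram map onto the functions built in the following steps. This is only an orientation sketch using the names defined later in this guide:

# Ingest time (once per document)
chunks = chunk_text(document_text)                        # Step 2 (document_text: your raw text)
embeddings = embed_texts(chunks)                          # Step 3
store_documents(conn, "manual.txt", chunks, embeddings)   # Step 4

# Query time (per question)
answer = rag_query("How do I configure rate limiting?", conn)  # Step 5: embed query, retrieve, call Claude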

Step 1: Install dependencies

pip install anthropic openai pgvector psycopg2-binary python-dotenv

We use:

- anthropic for the Claude Messages API (the generation step)
- openai for embeddings with text-embedding-3-small
- pgvector and psycopg2-binary for storing and searching vectors in PostgreSQL
- python-dotenv for loading API keys from a .env file

The optional paths in later steps (ChromaDB in Step 4, local embeddings in Step 3) additionally need pip install chromadb sentence-transformers.
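
Put your keys in a .env file and load it once at startup. The anthropic and openai SDKs read ANTHROPIC_API_KEY and OPENAI_API_KEY from the environment automatically; DATABASE_URL is read explicitly in Step 4:

# .env
# ANTHROPIC_API_KEY=sk-ant-...
# OPENAI_API_KEY=sk-...
# DATABASE_URL=postgresql://user:password@host/dbname

from dotenv import load_dotenv

load_dotenv()  # makes the variables above available via os.environ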


Step 2: Document chunking

from typing import List
import re

def chunk_text(
    text: str,
    chunk_size: int = 1000,
    overlap: int = 200,
) -> List[str]:
    """
    Split text into overlapping chunks.
    
    chunk_size: target characters per chunk (~250 tokens)
    overlap: characters to repeat between chunks (preserves context at boundaries)
    """
    # Clean whitespace
    text = re.sub(r'\n{3,}', '\n\n', text.strip())
    
    chunks = []
    start = 0
    
    while start < len(text):
        end = start + chunk_size
        
        # Try to end at a sentence boundary
        if end < len(text):
            # Look for sentence end near the target
            sentence_end = text.rfind('. ', start, end)
            paragraph_end = text.rfind('\n\n', start, end)
            
            boundary = max(sentence_end, paragraph_end)
            if boundary > start + chunk_size // 2:
                end = boundary + 1
        
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        
        # Move forward with overlap
        start = end - overlap
    
    return chunks

# Usage
with open("product_manual.txt") as f:
    text = f.read()

chunks = chunk_text(text, chunk_size=1000, overlap=200)
print(f"Created {len(chunks)} chunks")

Step 3: Generate embeddings

from openai import OpenAI
from typing import List

openai_client = OpenAI()  # Reads OPENAI_API_KEY from environment

def embed_texts(texts: List[str], batch_size: int = 100) -> List[List[float]]:
    """
    Generate embeddings for a list of texts.
    Batches requests to avoid rate limits.
    """
    all_embeddings = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
        print(f"Embedded {min(i + batch_size, len(texts))}/{len(texts)} texts")
    
    return all_embeddings

Alternative: use a free, local embedding model via sentence-transformers to avoid OpenAI costs. Note that all-MiniLM-L6-v2 produces 384-dimensional vectors, so if you go this route, change vector(1536) to vector(384) in the table definition in Step 4:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # Free, runs locally

def embed_texts_local(texts: List[str]) -> List[List[float]]:
    embeddings = model.encode(texts, show_progress_bar=True)
    return embeddings.tolist()

Step 4: Store in a vector database

Option A: PostgreSQL with pgvector (recommended for production)

import os
import psycopg2

# Requires: PostgreSQL with the pgvector extension
# Neon (neon.tech) supports pgvector on their free tier

conn = psycopg2.connect(os.environ["DATABASE_URL"])

def setup_vector_table(conn):
    """Create the documents table with vector support."""
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS documents (
                id SERIAL PRIMARY KEY,
                source TEXT,
                chunk_index INTEGER,
                content TEXT,
                embedding vector(1536),  -- 1536 for text-embedding-3-small
                metadata JSONB DEFAULT '{}'
            )
        """)
        cur.execute("""
            CREATE INDEX IF NOT EXISTS idx_documents_embedding 
            ON documents USING ivfflat (embedding vector_cosine_ops)
            WITH (lists = 100)
        """)
    conn.commit()

def store_documents(conn, source: str, chunks: List[str], embeddings: List[List[float]]):
    """Store document chunks with their embeddings."""
    with conn.cursor() as cur:
        for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
            cur.execute(
                # psycopg2 adapts the Python list to a SQL array; ::vector casts it to the pgvector type
                """INSERT INTO documents (source, chunk_index, content, embedding)
                   VALUES (%s, %s, %s, %s::vector)""",
                (source, i, chunk, embedding)
            )
    conn.commit()
    print(f"Stored {len(chunks)} chunks from {source}")

def retrieve_relevant_chunks(
    conn,
    query_embedding: List[float],
    top_k: int = 5,
    similarity_threshold: float = 0.7,
) -> List[dict]:
    """Find chunks most similar to the query."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT content, source, chunk_index,
                   1 - (embedding <=> %s::vector) as similarity
            FROM documents
            WHERE 1 - (embedding <=> %s::vector) > %s
            ORDER BY similarity DESC
            LIMIT %s
            """,
            (query_embedding, query_embedding, similarity_threshold, top_k)
        )
        return [
            {
                "content": row[0],
                "source": row[1],
                "chunk_index": row[2],
                "similarity": row[3],
            }
            for row in cur.fetchall()
        ]

Option B: ChromaDB (in-process, great for development)

import chromadb

chroma_client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path="./chroma") to persist
collection = chroma_client.get_or_create_collection("documents")

def store_documents_chroma(chunks: List[str], embeddings: List[List[float]], source: str):
    collection.add(
        embeddings=embeddings,
        documents=chunks,
        ids=[f"{source}-{i}" for i in range(len(chunks))],
        metadatas=[{"source": source} for _ in chunks],
    )

def retrieve_relevant_chunks_chroma(query_embedding: List[float], top_k: int = 5):
    results = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    return [
        {"content": doc, "source": meta["source"]}
        for doc, meta in zip(results["documents"][0], results["metadatas"][0])
    ]

Step 5: The RAG query function

import anthropic

claude_client = anthropic.Anthropic()

def rag_query(
    question: str,
    conn,
    top_k: int = 5,
    model: str = "claude-sonnet-4-5",
) -> str:
    """
    Answer a question using retrieved document context.
    """
    # Embed the question
    query_embedding = embed_texts([question])[0]
    
    # Retrieve relevant chunks
    chunks = retrieve_relevant_chunks(conn, query_embedding, top_k=top_k)
    
    if not chunks:
        return "I couldn't find relevant information in the knowledge base to answer this question."
    
    # Build context
    context = "\n\n---\n\n".join([
        f"[{chunk['source']}]\n{chunk['content']}"
        for chunk in chunks
    ])
    
    # Ask Claude with the retrieved context
    response = claude_client.messages.create(
        model=model,
        max_tokens=1024,
        system="""You are a helpful assistant that answers questions based on provided documentation.

Rules:
- Answer ONLY from the provided context
- If the context doesn't contain the answer, say "The documentation doesn't cover this"
- Cite the source document for each key claim: [source_name]
- Be concise and direct""",
        messages=[{
            "role": "user",
            "content": f"""Context:
{context}

Question: {question}"""
        }]
    )
    
    return response.content[0].text

# Usage
answer = rag_query(
    question="How do I configure rate limiting for the API?",
    conn=conn,
)
print(answer)

Complete pipeline: ingest + query

def ingest_document(filepath: str, conn):
    """Full pipeline: file → chunks → embeddings → stored."""
    with open(filepath) as f:
        text = f.read()
    
    source = os.path.basename(filepath)
    chunks = chunk_text(text)
    embeddings = embed_texts(chunks)
    store_documents(conn, source, chunks, embeddings)
    print(f"Ingested {source}: {len(chunks)} chunks")

# Setup
setup_vector_table(conn)

# Ingest documents
for doc_path in ["manual.txt", "faq.txt", "api-reference.txt"]:
    ingest_document(doc_path, conn)

# Query
answer = rag_query("What are the API rate limits?", conn)
print(answer)

Frequently asked questions

What's the best chunk size for RAG? 500–1,000 characters (roughly 125–250 tokens) works well for most documents. Too small: individual chunks lack context. Too large: retrieved chunks include irrelevant content. For structured documents (FAQs, numbered lists), chunk at the natural section boundaries rather than by character count.
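
As a sketch of that section-based approach, here is a minimal splitter for Markdown-style documents. It assumes # headings mark section boundaries and is not part of the pipeline above:

import re
from typing import List

def chunk_by_sections(text: str, max_chars: int = 2000) -> List[str]:
    """Split on Markdown headings, falling back to chunk_text for oversized sections."""
    sections = re.split(r'\n(?=#{1,3} )', text)  # keep each heading with its section
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            chunks.extend(chunk_text(section))  # reuse the character-based chunker from Step 2
    return chunks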

What embedding model should I use with Claude? OpenAI's text-embedding-3-small ($0.02/M tokens) is a good default — cheap, fast, high quality. For fully free/local embeddings, all-MiniLM-L6-v2 from sentence-transformers is excellent for most use cases. Claude doesn't provide its own embedding API.

How many chunks should I retrieve (top_k)? Start with 5. Too few: answer may be missing key information. Too many: context becomes noisy and Claude's answer quality degrades. For technical documentation where multiple sections cover a topic, 8–10 may work better.

What vector database should I use? For development: ChromaDB (zero config, in-process). For production with existing PostgreSQL: pgvector (Neon supports it free). For large-scale production (1M+ documents): Pinecone or Weaviate. Start simple and upgrade when needed.

Does RAG work with PDFs? Yes — extract text from PDF first, then chunk and embed. Use pypdf or pdfplumber for text extraction. For scanned PDFs, you'll need OCR (pytesseract or a cloud OCR service).
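
A minimal sketch of that PDF path with pypdf, feeding the extracted text into the same ingest pipeline (the filename is just an example):

from pypdf import PdfReader

def extract_pdf_text(filepath: str) -> str:
    """Concatenate text from every page of a (non-scanned) PDF."""
    reader = PdfReader(filepath)
    return "\n\n".join(page.extract_text() or "" for page in reader.pages)

text = extract_pdf_text("user_guide.pdf")
chunks = chunk_text(text)
embeddings = embed_texts(chunks)
store_documents(conn, "user_guide.pdf", chunks, embeddings)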


Take It Further

Claude Agent SDK Cookbook: 40 Production Patterns — Pattern 15 covers the complete Production RAG Architecture: chunking strategies, hybrid search (vector + BM25), reranking, evaluation metrics, and the monitoring setup for tracking retrieval quality in production.

→ Get the Agent SDK Cookbook — $49

30-day money-back guarantee. Instant download.

AI Disclosure: Drafted with Claude Code; all RAG patterns verified with Python 3.12 and Anthropic SDK as of April 2026.

Tools and references