Build a RAG System with Claude: Python Implementation Guide
RAG (Retrieval-Augmented Generation) lets Claude answer questions using your documents — product manuals, codebases, knowledge bases, or any text corpus. The pattern: embed documents as vectors, store them, retrieve the most relevant chunks when a question is asked, inject them into Claude's context, and get a grounded answer. This guide covers a complete working implementation from document ingestion to query response.
Why RAG instead of just stuffing documents into context
With Claude's 200K context window, you could theoretically put your entire knowledge base in every request. For small corpora (under 100 documents), this works fine. For larger corpora, RAG wins:
- Cost: 100,000 tokens per request × 1,000 requests/day = 100M input tokens/day, roughly $300/day at Sonnet's $3 per million input tokens (quick arithmetic below). RAG retrieves 5–10 relevant chunks (~5,000 tokens) instead.
- Quality: focused context produces better answers than overwhelming Claude with irrelevant content.
- Scale: a 10,000-document knowledge base doesn't fit in context; a retrieval system handles any size.
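For a quick sanity check on those numbers, here is the arithmetic as a tiny script (a sketch; the $3-per-million-input-tokens Sonnet price and the 5,000-token RAG request are the assumptions from the cost bullet above):

# Back-of-the-envelope: full-context prompting vs. RAG, input tokens only
PRICE_PER_M_INPUT = 3.00        # USD per million input tokens (Sonnet, assumed)
REQUESTS_PER_DAY = 1_000

full_context_tokens = 100_000   # whole knowledge base in every request
rag_tokens = 5_000              # 5-10 retrieved chunks plus the question

full_cost = full_context_tokens * REQUESTS_PER_DAY / 1_000_000 * PRICE_PER_M_INPUT
rag_cost = rag_tokens * REQUESTS_PER_DAY / 1_000_000 * PRICE_PER_M_INPUT
print(f"Full context: ${full_cost:.0f}/day vs RAG: ${rag_cost:.0f}/day")  # ~$300 vs ~$15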
Architecture overview
Documents → Chunking → Embedding → Vector Store
                                        ↓
User Query → Embedding → Vector Search → Top-K chunks → Claude → Answer
Step 1: Install dependencies
pip install anthropic openai pgvector psycopg2-binary python-dotenv
We use:
- anthropic — Claude for answer generation
- openai — embeddings (text-embedding-3-small, cheapest + good quality)
- pgvector — vector similarity search in PostgreSQL
- Alternatively: replace with chromadb or pinecone for in-process or managed vector storage
Step 2: Document chunking
from typing import List
import re

def chunk_text(
    text: str,
    chunk_size: int = 1000,
    overlap: int = 200,
) -> List[str]:
    """
    Split text into overlapping chunks.

    chunk_size: target characters per chunk (~250 tokens)
    overlap: characters to repeat between chunks (preserves context at boundaries)
    """
    # Clean whitespace
    text = re.sub(r'\n{3,}', '\n\n', text.strip())

    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size

        # Try to end at a sentence or paragraph boundary
        if end < len(text):
            # Look for a boundary near the target
            sentence_end = text.rfind('. ', start, end)
            paragraph_end = text.rfind('\n\n', start, end)
            boundary = max(sentence_end, paragraph_end)
            if boundary > start + chunk_size // 2:
                end = boundary + 1

        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)

        # Stop once the end of the text is reached (avoids a duplicate tail chunk)
        if end >= len(text):
            break

        # Move forward with overlap
        start = end - overlap

    return chunks

# Usage
with open("product_manual.txt") as f:
    text = f.read()

chunks = chunk_text(text, chunk_size=1000, overlap=200)
print(f"Created {len(chunks)} chunks")
Step 3: Generate embeddings
from openai import OpenAI
from typing import List
from dotenv import load_dotenv

load_dotenv()  # optional: pulls OPENAI_API_KEY (and DATABASE_URL) from a local .env file

openai_client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def embed_texts(texts: List[str], batch_size: int = 100) -> List[List[float]]:
    """
    Generate embeddings for a list of texts.
    Batches requests to avoid rate limits.
    """
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
        print(f"Embedded {min(i + batch_size, len(texts))}/{len(texts)} texts")
    return all_embeddings
Alternative: use a free, local embedding model via sentence-transformers to avoid OpenAI costs:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # Free, runs locally

def embed_texts_local(texts: List[str]) -> List[List[float]]:
    # Note: this model produces 384-dimensional vectors, so change the
    # vector(1536) column in Step 4 to vector(384) if you use it
    embeddings = model.encode(texts, show_progress_bar=True)
    return embeddings.tolist()
Step 4: Store in a vector database
Option A: PostgreSQL with pgvector (recommended for production)
import os
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

# Requires: PostgreSQL with the pgvector extension
# Neon (neon.tech) supports pgvector on their free tier
conn = psycopg2.connect(os.environ["DATABASE_URL"])

def setup_vector_table(conn):
    """Create the documents table with vector support."""
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS documents (
                id SERIAL PRIMARY KEY,
                source TEXT,
                chunk_index INTEGER,
                content TEXT,
                embedding vector(1536),  -- 1536 for text-embedding-3-small
                metadata JSONB DEFAULT '{}'
            )
        """)
        cur.execute("""
            CREATE INDEX IF NOT EXISTS idx_documents_embedding
            ON documents USING ivfflat (embedding vector_cosine_ops)
            WITH (lists = 100)
        """)
    conn.commit()
    # Teach psycopg2 how to send/read pgvector values (numpy array <-> vector)
    register_vector(conn)
def store_documents(conn, source: str, chunks: List[str], embeddings: List[List[float]]):
    """Store document chunks with their embeddings."""
    with conn.cursor() as cur:
        for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
            cur.execute(
                """INSERT INTO documents (source, chunk_index, content, embedding)
                   VALUES (%s, %s, %s, %s)""",
                # np.array + register_vector lets psycopg2 adapt the embedding to a vector value
                (source, i, chunk, np.array(embedding))
            )
    conn.commit()
    print(f"Stored {len(chunks)} chunks from {source}")
def retrieve_relevant_chunks(
    conn,
    query_embedding: List[float],
    top_k: int = 5,
    similarity_threshold: float = 0.7,
) -> List[dict]:
    """Find chunks most similar to the query."""
    # Note: the useful threshold depends on the corpus and embedding model;
    # cosine similarities from text-embedding-3-small often fall below 0.7,
    # so lower this value if retrieval comes back empty
    query_vec = np.array(query_embedding)
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT content, source, chunk_index,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            WHERE 1 - (embedding <=> %s::vector) > %s
            ORDER BY similarity DESC
            LIMIT %s
            """,
            (query_vec, query_vec, similarity_threshold, top_k)
        )
        return [
            {
                "content": row[0],
                "source": row[1],
                "chunk_index": row[2],
                "similarity": row[3],
            }
            for row in cur.fetchall()
        ]
Option B: ChromaDB (in-process, great for development)
import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.create_collection("documents")

def store_documents_chroma(chunks: List[str], embeddings: List[List[float]], source: str):
    collection.add(
        embeddings=embeddings,
        documents=chunks,
        ids=[f"{source}-{i}" for i in range(len(chunks))],
        metadatas=[{"source": source} for _ in chunks],
    )

def retrieve_relevant_chunks_chroma(query_embedding: List[float], top_k: int = 5):
    results = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    return [
        {"content": doc, "source": meta["source"]}
        for doc, meta in zip(results["documents"][0], results["metadatas"][0])
    ]
Step 5: The RAG query function
import anthropic

claude_client = anthropic.Anthropic()

def rag_query(
    question: str,
    conn,
    top_k: int = 5,
    model: str = "claude-sonnet-4-5",
) -> str:
    """
    Answer a question using retrieved document context.
    """
    # Embed the question
    query_embedding = embed_texts([question])[0]

    # Retrieve relevant chunks
    chunks = retrieve_relevant_chunks(conn, query_embedding, top_k=top_k)
    if not chunks:
        return "I couldn't find relevant information in the knowledge base to answer this question."

    # Build context
    context = "\n\n---\n\n".join([
        f"[{chunk['source']}]\n{chunk['content']}"
        for chunk in chunks
    ])

    # Ask Claude with the retrieved context
    response = claude_client.messages.create(
        model=model,
        max_tokens=1024,
        system="""You are a helpful assistant that answers questions based on provided documentation.

Rules:
- Answer ONLY from the provided context
- If the context doesn't contain the answer, say "The documentation doesn't cover this"
- Cite the source document for each key claim: [source_name]
- Be concise and direct""",
        messages=[{
            "role": "user",
            "content": f"""Context:

{context}

Question: {question}"""
        }]
    )
    return response.content[0].text

# Usage
answer = rag_query(
    question="How do I configure rate limiting for the API?",
    conn=conn,
)
print(answer)
Complete pipeline: ingest + query
def ingest_document(filepath: str, conn):
    """Full pipeline: file → chunks → embeddings → stored."""
    with open(filepath) as f:
        text = f.read()
    source = os.path.basename(filepath)
    chunks = chunk_text(text)
    embeddings = embed_texts(chunks)
    store_documents(conn, source, chunks, embeddings)
    print(f"Ingested {source}: {len(chunks)} chunks")

# Setup
setup_vector_table(conn)

# Ingest documents
for doc_path in ["manual.txt", "faq.txt", "api-reference.txt"]:
    ingest_document(doc_path, conn)

# Query
answer = rag_query("What are the API rate limits?", conn)
print(answer)
Frequently asked questions
What's the best chunk size for RAG? 500–1,000 characters (roughly 125–250 tokens) works well for most documents. Too small: individual chunks lack context. Too large: retrieved chunks include irrelevant content. For structured documents (FAQs, numbered lists), chunk at the natural section boundaries rather than by character count.
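If your documents already have clear section breaks, a minimal splitter along those boundaries might look like this (a sketch assuming sections are separated by blank lines; chunk_by_sections is an illustrative helper, not part of the pipeline above):

def chunk_by_sections(text: str, max_chars: int = 1000) -> list[str]:
    """Group blank-line-separated sections into chunks of roughly max_chars."""
    sections = [s.strip() for s in text.split("\n\n") if s.strip()]
    chunks, current = [], ""
    for section in sections:
        # Start a new chunk when adding this section would exceed the target size
        if current and len(current) + len(section) > max_chars:
            chunks.append(current)
            current = section
        else:
            current = f"{current}\n\n{section}" if current else section
    if current:
        chunks.append(current)
    return chunks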
What embedding model should I use with Claude?
OpenAI's text-embedding-3-small ($0.02/M tokens) is a good default — cheap, fast, high quality. For fully free/local embeddings, all-MiniLM-L6-v2 from sentence-transformers is excellent for most use cases. Claude doesn't provide its own embedding API.
How many chunks should I retrieve (top_k)? Start with 5. Too few: answer may be missing key information. Too many: context becomes noisy and Claude's answer quality degrades. For technical documentation where multiple sections cover a topic, 8–10 may work better.
What vector database should I use? For development: ChromaDB (zero config, in-process). For production with existing PostgreSQL: pgvector (Neon supports it free). For large-scale production (1M+ documents): Pinecone or Weaviate. Start simple and upgrade when needed.
Does RAG work with PDFs?
Yes — extract text from PDF first, then chunk and embed. Use pypdf or pdfplumber for text extraction. For scanned PDFs, you'll need OCR (pytesseract or a cloud OCR service).
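A minimal sketch of that flow, assuming the pypdf package and reusing chunk_text, embed_texts, and store_documents from above (ingest_pdf and manual.pdf are illustrative names):

from pypdf import PdfReader

def ingest_pdf(filepath: str, conn):
    reader = PdfReader(filepath)
    # extract_text() returns None for image-only (scanned) pages, hence the `or ""`
    text = "\n\n".join(page.extract_text() or "" for page in reader.pages)
    chunks = chunk_text(text)
    embeddings = embed_texts(chunks)
    store_documents(conn, os.path.basename(filepath), chunks, embeddings)

ingest_pdf("manual.pdf", conn)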
Related guides
- How to Reduce Claude Hallucinations: Practical Techniques — RAG is the primary structural solution
- Memory and State in Claude Agents: Patterns That Scale — semantic memory pattern using vectors
Take It Further
Claude Agent SDK Cookbook: 40 Production Patterns — Pattern 15 covers the complete Production RAG Architecture: chunking strategies, hybrid search (vector + BM25), reranking, evaluation metrics, and the monitoring setup for tracking retrieval quality in production.
→ Get the Agent SDK Cookbook — $49
30-day money-back guarantee. Instant download.