Claude API + Pinecone Vector DB: Production RAG Pipeline (2026)
A production RAG pipeline with Claude + Pinecone is 5 steps: chunk documents into 500-token segments, embed with Voyage AI (cheaper than OpenAI embeddings for production), store in Pinecone serverless, retrieve top-K with reranking, and pass to Claude with <untrusted_documents> wrapper. Total cost per query: ~$0.001 (1,000 queries = $1) at 50K-document scale. This guide is end-to-end: chunking strategy, embedding choice, index setup, retrieval, prompt construction, and the 5 mistakes that kill quality.
For Claude API basics see Build a RAG System with Claude + Python. For security defenses see Claude Prompt Injection Defense.
Why Pinecone (and Why Voyage AI Embeddings)
| Vector DB | Pricing | Best for |
|---|---|---|
| Pinecone Serverless | $0.33/M write + $0.18/M read | Production, 1M-100M vectors |
| pgvector (Postgres) | DB cost only | Small scale (<1M vectors), already on Postgres |
| Qdrant | $25/mo cluster | Self-host preferred |
| Weaviate | $25/mo cluster | Hybrid search (vector + keyword) |
Pinecone wins for production: zero ops, serverless billing, <50ms p99 retrieval at scale.
| Embedding | Cost/M tokens | Quality (MTEB) |
|---|---|---|
| Voyage AI voyage-3 | $0.06 | 65.4 |
| OpenAI text-embedding-3-large | $0.13 | 64.6 |
| OpenAI text-embedding-3-small | $0.02 | 62.3 |
| Cohere embed-v3 | $0.10 | 64.5 |
Voyage AI is Anthropic's recommended embedding partner: best quality-per-dollar for Claude RAG.
Step 1: Chunking Strategy
Documents must be split into chunks that fit Claude's context window AND give clean retrieval boundaries.
def chunk_document(text: str, chunk_tokens=500, overlap_tokens=50) -> list[dict]:
"""Split text into ~500-token chunks with 50-token overlap."""
# Rough estimate: 1 token โ 4 chars English, 2 chars Korean
chunk_chars = chunk_tokens * 4
overlap_chars = overlap_tokens * 4
chunks = []
start = 0
while start < len(text):
end = min(start + chunk_chars, len(text))
# Try to break at sentence boundary
if end < len(text):
sentence_end = text.rfind(". ", start, end)
if sentence_end > start + chunk_chars // 2:
end = sentence_end + 2
chunks.append({
"text": text[start:end],
"start": start,
"end": end
})
start = end - overlap_chars
return chunks
Chunk size rule of thumb:
- 500 tokens: most cases (recall + precision balance)
- 200 tokens: FAQ-style (high precision)
- 1000+ tokens: technical docs (preserves context)
Step 2: Embed with Voyage AI
import voyageai
vo = voyageai.Client() # uses VOYAGE_API_KEY env var
def embed_batch(texts: list[str]) -> list[list[float]]:
result = vo.embed(texts, model="voyage-3", input_type="document")
return result.embeddings
# Embed all chunks in batches of 128 (Voyage limit)
all_embeddings = []
for i in range(0, len(chunks), 128):
batch = [c["text"] for c in chunks[i:i+128]]
all_embeddings.extend(embed_batch(batch))
Cost at 50K chunks ร 500 tokens = 25M tokens: 25 ร $0.06 = $1.50 one-time. Negligible.
Step 3: Store in Pinecone
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
# Create index (one-time)
pc.create_index(
name="claude-rag",
dimension=1024, # voyage-3 outputs 1024-dim
metric="cosine",
spec=ServerlessSpec(cloud="aws", region="us-east-1")
)
index = pc.Index("claude-rag")
# Upsert chunks
vectors = [
{
"id": f"doc-{doc_id}-chunk-{i}",
"values": emb,
"metadata": {
"text": chunk["text"],
"doc_id": doc_id,
"source": doc_source
}
}
for i, (chunk, emb) in enumerate(zip(chunks, all_embeddings))
]
# Batch upsert (max 100 per call)
for i in range(0, len(vectors), 100):
index.upsert(vectors[i:i+100])
Step 4: Retrieve + Rerank
Two-stage retrieval is the production standard: cheap vector search retrieves 20, expensive reranker picks top 5.
def retrieve(query: str, top_k_initial=20, top_k_final=5) -> list[dict]:
# Stage 1: vector search
query_emb = vo.embed([query], model="voyage-3",
input_type="query").embeddings[0]
results = index.query(vector=query_emb, top_k=top_k_initial,
include_metadata=True)
# Stage 2: rerank with Voyage AI reranker
docs = [r["metadata"]["text"] for r in results["matches"]]
reranked = vo.rerank(query=query, documents=docs,
model="rerank-2", top_k=top_k_final)
# Map back to original metadata
final = []
for r in reranked.results:
original = results["matches"][r.index]
final.append({
"text": r.document,
"score": r.relevance_score,
"source": original["metadata"]["source"]
})
return final
Reranker cost: $0.05/M query+doc tokens. For 20 docs ร 500 tokens ร 1 query = 10K tokens = $0.0005.
Step 5: Generate Answer with Claude
def answer_with_claude(query: str, retrieved: list[dict]) -> str:
# Wrap retrieved chunks with untrusted markers (security)
context = "\n\n".join([
f"<untrusted_document source=\"{c['source']}\">\n{c['text']}\n</untrusted_document>"
for c in retrieved
])
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
system=[{
"type": "text",
"text": "You are a helpful assistant. Answer using ONLY the provided documents. If the answer isn't in the documents, say so. Cite sources by their source name.",
"cache_control": {"type": "ephemeral"} # cache system prompt
}],
messages=[{
"role": "user",
"content": f"Documents:\n\n{context}\n\nQuestion: {query}"
}]
)
return response.content[0].text
The cache_control saves 90% on the system prompt cost across repeated queries.
Per-Query Cost Math
For a typical query against 50K-chunk index:
| Step | Cost |
|---|---|
| Query embedding (1 ร 50 tokens) | $0.000003 |
| Pinecone read (20 vectors) | $0.0000036 |
| Reranker (10K tokens) | $0.0005 |
| Claude Sonnet (3K input + 200 output) | $0.012 + $0.003 = $0.015 |
| Total | ~$0.016/query |
With prompt caching on system prompt (90% off after first query): ~$0.001/query.
At 1,000 queries/day = $1/day = $30/month.
The 5 Mistakes That Kill RAG Quality
1. Chunks too small or too big
200-token chunks lose context. 2000-token chunks dilute relevance. Stick to 500 ยฑ 200 unless you have a specific reason.
2. No reranking
Vector search alone has ~70% top-5 accuracy. Reranking pushes it to ~90%+. The $0.0005 reranker cost is worth it.
3. Untrusted content not wrapped
Retrieved docs can contain prompt injection. Always wrap with <untrusted_document> markers. See Prompt Injection Defense Pattern 1.
4. No "I don't know" instruction
Without explicit "say so if not in docs", Claude will hallucinate from its training. Always include this instruction in system prompt.
5. Forgetting to cache the system prompt
The system prompt is identical across queries. Without cache_control, you pay full price every time โ 80%+ wasted spend.
Frequently Asked Questions
Why not use OpenAI embeddings?
Voyage AI voyage-3 outperforms OpenAI text-embedding-3-large on MTEB benchmark (65.4 vs 64.6) at ~50% the cost. Anthropic recommends Voyage for Claude RAG. Switch is a one-line model name change.
Pinecone vs pgvector for small scale?
Under 1M vectors, pgvector on existing Postgres is simpler and cheaper. Above 1M, Pinecone serverless wins on ops + latency. Migration is straightforward โ both speak vector search.
Can I use Claude for embeddings?
No. Claude is a generation model only. Use a dedicated embedding model (Voyage AI, OpenAI, or open-source like BGE) for the embedding step.
How do I handle multi-language docs?
Voyage AI voyage-3 is multilingual. Embed Korean and English docs together; queries retrieve relevant content regardless of language. For Korean-only RAG, see Korean Prompt Engineering.
What's the latency budget?
End-to-end p95: embedding 50ms + Pinecone query 50ms + rerank 200ms + Claude streaming start 800ms = ~1.1s to first token. Most users see <2s for full answer.
Master Production Claude API Architecture
Claude Agent SDK Cookbook ($79) โ 40 production agent recipes including RAG pipelines with Pinecone, pgvector, Qdrant. Cost optimization, evaluation, and observability included.