Claude PDF Analysis: How to Extract Information from Documents
Claude can analyse PDFs in two ways: send the PDF directly as a base64-encoded file (up to 32MB, supports tables and images) or extract the text first and send it as plain text (better cost control, required for PDFs over 32MB). For most document analysis tasks — contract review, invoice extraction, report summarisation — the direct PDF approach is simpler and handles formatting better. For very large document sets, text extraction + optional RAG is more cost-efficient.
Method 1: Send PDF directly to Claude (simplest)
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def analyse_pdf(pdf_path: str, question: str) -> str:
    """
    Send a PDF directly to Claude and ask a question about it.
    Supports PDFs up to 32MB.
    """
    pdf_data = Path(pdf_path).read_bytes()
    pdf_base64 = base64.standard_b64encode(pdf_data).decode("utf-8")
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "document",
                        "source": {
                            "type": "base64",
                            "media_type": "application/pdf",
                            "data": pdf_base64,
                        },
                    },
                    {
                        "type": "text",
                        "text": question,
                    },
                ],
            }
        ],
    )
    return response.content[0].text


# Usage
result = analyse_pdf(
    "contract.pdf",
    "What are the payment terms and the penalty for late payment?",
)
print(result)
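Supported document types (via media_type):
- application/pdf: PDF files
- text/plain: plain text
- text/html: HTML files
- text/csv: CSV files
- text/xml: XML files
- application/msword: Word documents (.doc)
Only application/pdf is exercised in this guide. Confirm the other types against the current API documentation before relying on them.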
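Before sending a PDF directly, it can help to check that it will fit under the 32MB limit. The sketch below is one way to do that. It assumes the limit applies to the base64-encoded payload (base64 turns every 3 bytes into 4 characters), which is the conservative reading. The names MAX_REQUEST_BYTES, encoded_size and fits_direct_upload are illustrative and not part of the Anthropic SDK.

```python
import math

# Assumption: the 32MB limit from this guide, applied to the encoded payload.
MAX_REQUEST_BYTES = 32 * 1024 * 1024


def encoded_size(raw_bytes: int) -> int:
    """Size in bytes of the padded base64 encoding of a raw_bytes-byte file."""
    return 4 * math.ceil(raw_bytes / 3)


def fits_direct_upload(raw_bytes: int, limit: int = MAX_REQUEST_BYTES) -> bool:
    """Return True if the file stays within the limit once base64-encoded."""
    return encoded_size(raw_bytes) <= limit
```

A typical call would pass Path("contract.pdf").stat().st_size. Files that fail the check are candidates for Method 2.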
Method 2: Text extraction first (for large files / cost control)
For PDFs over 32MB, or when you want to minimise token costs:
import anthropic
from pypdf import PdfReader  # pip install pypdf

client = anthropic.Anthropic()


def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract all text from a PDF file."""
    reader = PdfReader(pdf_path)
    pages = []
    for page_num, page in enumerate(reader.pages):
        text = page.extract_text()
        if text.strip():  # skip pages with no extractable text
            pages.append(f"[Page {page_num + 1}]\n{text}")
    return "\n\n".join(pages)


def analyse_pdf_text(pdf_path: str, question: str) -> str:
    """Extract PDF text and send to Claude."""
    text = extract_text_from_pdf(pdf_path)
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"Document:\n\n{text}\n\nQuestion: {question}",
        }],
    )
    return response.content[0].text
When to use text extraction:
- PDFs over 32MB
- Batch processing many PDFs (extract once, query many times)
- When you want precise token count control
- When table formatting in the original doesn't matter
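For the token-control case, a rough budget check on the extracted text might look like the sketch below. It uses the rule of thumb of about 4 characters per token, which is only an approximation, and the helper names estimate_tokens and pages_within_budget are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; 4 characters per token is a rule of thumb, not an exact count."""
    return int(len(text) / chars_per_token)


def pages_within_budget(pages: list[str], max_tokens: int) -> list[str]:
    """Keep leading pages until the estimated token budget would be exceeded."""
    kept: list[str] = []
    used = 0
    for page in pages:
        cost = estimate_tokens(page)
        if used + cost > max_tokens:
            break
        kept.append(page)
        used += cost
    return kept
```

This keeps whole pages rather than cutting mid-page, so the text sent to Claude always ends on a page boundary.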
Structured data extraction from PDFs
For extracting specific fields (invoices, contracts, forms):
import json
import os

# Reuses client, base64 and Path from Method 1.

INVOICE_SYSTEM_PROMPT = """Extract invoice data and return ONLY valid JSON. No other text.
Schema:
{
  "invoice_number": "string",
  "date": "YYYY-MM-DD",
  "vendor": {"name": "string", "address": "string or null"},
  "bill_to": {"name": "string", "address": "string or null"},
  "line_items": [{"description": "string", "quantity": number, "unit_price": number, "total": number}],
  "subtotal": number,
  "tax": number,
  "total": number,
  "payment_due": "YYYY-MM-DD or null",
  "currency": "string (e.g. USD)"
}
Use null for missing optional fields. All monetary values as numbers."""


def extract_invoice_data(pdf_path: str) -> dict:
    """Extract structured invoice data from a PDF."""
    pdf_data = base64.standard_b64encode(Path(pdf_path).read_bytes()).decode()
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=INVOICE_SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_data},
                },
                {"type": "text", "text": "Extract the invoice data."},
            ],
        }],
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # Return the start of the raw reply so the failure can be inspected
        return {"error": "Could not parse invoice", "raw": response.content[0].text[:200]}


# Process multiple invoices
for filename in os.listdir("invoices/"):
    if filename.endswith(".pdf"):
        data = extract_invoice_data(f"invoices/{filename}")
        print(f"{filename}: ${data.get('total', 'error')}")
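Extracted figures are worth checking before they reach an accounting system. The sketch below shows one possible check against the schema above. It assumes a tax-exclusive invoice (subtotal plus tax equals total) and that all numeric fields are present. parse_json_reply and check_invoice_totals are hypothetical helper names, and the tolerance is an arbitrary choice.

```python
import json


def parse_json_reply(reply: str) -> dict:
    """Parse the outermost JSON object in a reply, ignoring any text around it."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start:end + 1])


def check_invoice_totals(invoice: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of arithmetic inconsistencies; an empty list means the totals agree."""
    problems = []
    items = invoice.get("line_items") or []
    for i, item in enumerate(items, start=1):
        expected = item["quantity"] * item["unit_price"]
        if abs(expected - item["total"]) > tolerance:
            problems.append(f"line {i}: quantity x unit_price does not match total")
    if abs(sum(item["total"] for item in items) - invoice["subtotal"]) > tolerance:
        problems.append("line items do not sum to subtotal")
    # Assumption: tax is added on top of the subtotal
    if abs(invoice["subtotal"] + invoice["tax"] - invoice["total"]) > tolerance:
        problems.append("subtotal + tax does not match total")
    return problems
```

An invoice that returns a non-empty list is a good candidate for manual review rather than automatic rejection, since the source document itself may contain the discrepancy.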
Batch PDF processing with the Batches API
For processing many PDFs at 50% cost:
# Reuses client, base64 and Path from Method 1.


def create_pdf_batch(pdf_paths: list[str], question: str) -> str:
    """Create a batch job for PDF analysis."""
    requests = []
    for i, path in enumerate(pdf_paths):
        pdf_data = base64.standard_b64encode(Path(path).read_bytes()).decode()
        requests.append({
            # The index keeps each custom_id unique within the batch
            "custom_id": f"pdf-{i}-{Path(path).stem}",
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 1024,
                "messages": [{
                    "role": "user",
                    "content": [
                        {
                            "type": "document",
                            "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_data},
                        },
                        {"type": "text", "text": question},
                    ],
                }],
            },
        })
    batch = client.messages.batches.create(requests=requests)
    return batch.id


# Process 100 invoices at 50% cost
invoice_files = sorted(str(p) for p in Path("invoices").glob("*.pdf"))
batch_id = create_pdf_batch(
    pdf_paths=invoice_files,
    question="Extract the total amount and payment due date as JSON.",
)
print(f"Batch submitted: {batch_id}")
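Submitting the batch is only half the job, because results arrive asynchronously. The sketch below shows one way to poll for completion and collect the answers by custom_id. It relies on the batches.retrieve and batches.results calls and the processing_status field as I understand the Python SDK, so check them against your installed version. wait_for_batch is a hypothetical helper, and the client is passed in so the function can be exercised with a stub.

```python
import time
from typing import Any, Callable


def wait_for_batch(
    client: Any,
    batch_id: str,
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, str]:
    """Poll until the batch ends, then map each custom_id to its reply text."""
    while client.messages.batches.retrieve(batch_id).processing_status != "ended":
        sleep(poll_seconds)
    answers: dict[str, str] = {}
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type == "succeeded":
            answers[entry.custom_id] = entry.result.message.content[0].text
        else:
            # Non-success types (for example errored or expired) are kept as markers
            answers[entry.custom_id] = f"[{entry.result.type}]"
    return answers
```

Keeping failed entries in the result, rather than dropping them, makes it easy to see which PDFs need to be resubmitted.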
Handling scanned PDFs
pypdf extracts text only from text-based PDFs. Scanned PDFs have no text layer, so they need OCR or a vision-capable model:
# For scanned PDFs, send directly to Claude (uses vision to read)
# Claude can read text from scanned documents
# "scanned.pdf" is a placeholder path; client, base64 and Path come from Method 1
pdf_base64 = base64.standard_b64encode(Path("scanned.pdf").read_bytes()).decode("utf-8")
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_base64},
            },
            {
                "type": "text",
                "text": "Read and transcribe all text visible in this scanned document.",
            },
        ],
    }],
)
print(response.content[0].text)
Claude's vision capability reads scanned text from images within PDFs. For very low-quality scans, dedicated OCR (Google Cloud Vision, AWS Textract) may be more accurate.
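When a folder mixes scanned and text-based PDFs, a simple heuristic can decide which route each file takes. The sketch below flags a document as probably scanned when most of its pages yield little or no extractable text. looks_scanned is a hypothetical helper and both thresholds are arbitrary starting points, not tuned values.

```python
def looks_scanned(
    page_texts: list[str],
    min_chars_per_page: int = 50,
    max_sparse_ratio: float = 0.5,
) -> bool:
    """Guess whether a PDF is scanned: most pages yield little or no extractable text."""
    if not page_texts:
        return True  # nothing extractable at all
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return sparse / len(page_texts) > max_sparse_ratio
```

The page texts can come from pypdf, for example [page.extract_text() or "" for page in PdfReader(path).pages]. Files flagged as scanned go to Claude directly, and the rest can use text extraction.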
Large document strategy (>200K tokens)
For very large documents that exceed Claude's context window:
def analyse_large_document(pdf_path: str, question: str) -> str:
    """
    For documents too large for Claude's context window.
    Strategy: extract text → chunk → retrieve relevant chunks → answer.
    """
    # 1. Extract text
    text = extract_text_from_pdf(pdf_path)

    # 2. Size check (rough estimate: 4 chars ≈ 1 token, so 400K chars ≈ 100K tokens)
    if len(text) < 400_000:  # ~100K tokens
        # Small enough for single context window
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=2048,
            messages=[{"role": "user", "content": f"Document:\n{text}\n\nQuestion: {question}"}],
        )
        return response.content[0].text

    # 3. Too large: use RAG (see build-rag-system guide)
    # rag_utils is the helper module from that guide, not an installable package
    from rag_utils import chunk_text, embed_texts, retrieve_relevant_chunks
    chunks = chunk_text(text)
    # ... continue with RAG pipeline
    return "Document processed via RAG"
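The chunking step is the simplest part of that pipeline to illustrate. The sketch below is a stand-in for the chunk_text helper imported above, whose real implementation is not shown in this guide. It assumes fixed-size character windows with overlap, and the default sizes are arbitrary.

```python
def chunk_text(text: str, chunk_chars: int = 4000, overlap_chars: int = 400) -> list[str]:
    """Split text into fixed-size character windows that overlap by overlap_chars."""
    if chunk_chars <= overlap_chars:
        raise ValueError("chunk_chars must be larger than overlap_chars")
    step = chunk_chars - overlap_chars
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_chars])
        if start + chunk_chars >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of some duplicated text.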
Frequently asked questions
What's the maximum PDF size Claude can process? 32MB when sending as a base64-encoded document. For larger files, extract the text first using a PDF library and send as plain text.
Does Claude handle tables in PDFs? Yes, but with varying reliability. Native PDF tables (text-based) are read well. Complex multi-column layouts may be misinterpreted. For critical financial data, verify extracted table values.
Can Claude process password-protected PDFs? No. Remove password protection before sending to the API (with the PDF owner's permission).
Is it cheaper to send the PDF directly or extract text first? Text extraction is usually cheaper, especially if you're processing the same document multiple times (extract once, reuse the text). Direct PDF sending is simpler for one-time analysis, but it typically uses more input tokens because each page is processed as an image as well as text.
Does Claude retain my PDF content for training? Per Anthropic's privacy policy, API requests are not used to train models by default. If you're processing sensitive documents, confirm the current policy at anthropic.com/privacy.
Related guides
- Build a RAG System with Claude: Python Implementation Guide — for document corpora too large for direct processing
- Claude JSON Structured Output: Getting Reliable JSON Every Time — structured extraction from documents