Claude API PDF & Document Parsing Guide
To parse PDFs with the Claude API, encode the file as base64 and pass it as a document content block in your message. Claude reads the entire document natively — no external OCR step required. For a 10-page contract, this takes under three seconds with claude-haiku-4-5. For structured extraction (tables, form fields, key-value pairs), include an explicit JSON schema in your prompt. Claude returns machine-readable output in one API call, skipping the multi-tool OCR pipelines that traditionally add latency, cost, and failure points.
Using Claude's Native PDF Support (Base64 Upload)
Claude accepts PDF files directly via the document content type. Encode the file bytes as base64 and attach it alongside your text prompt.
```python
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def parse_pdf(pdf_path: str, prompt: str) -> str:
    """Send a PDF to Claude and return the model's response."""
    pdf_bytes = Path(pdf_path).read_bytes()
    pdf_b64 = base64.standard_b64encode(pdf_bytes).decode("utf-8")
    response = client.messages.create(
        model="claude-haiku-4-5",  # Haiku: fast and cheap for most docs
        max_tokens=2048,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "document",
                        "source": {
                            "type": "base64",
                            "media_type": "application/pdf",
                            "data": pdf_b64,
                        },
                    },
                    {
                        "type": "text",
                        "text": prompt,
                    },
                ],
            }
        ],
    )
    return response.content[0].text

# Usage
summary = parse_pdf(
    "invoice.pdf",
    "Extract the invoice number, vendor name, total amount, and due date. Return JSON.",
)
print(summary)
```
File size limit: Claude accepts PDFs up to 32 MB per request. Documents larger than this should be split before sending. For vision-heavy documents (scanned pages, diagrams), see the Claude Vision and Multimodal Guide.
Prompt caching tip: If you query the same document multiple times (e.g., asking several questions in sequence), add cache_control to the document block so later requests reuse the server-side cached prompt instead of re-processing the full document at the standard rate. Cache reads cost 10% of the normal input token price.
```python
# Cache the document for repeated queries
document_block = {
    "type": "document",
    "source": {
        "type": "base64",
        "media_type": "application/pdf",
        "data": pdf_b64,
    },
    "cache_control": {"type": "ephemeral"},  # 5-minute TTL, refreshed on each cache hit
}
```
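For cache hits, the document block must be byte-for-byte identical on every request. A small helper (the function name is illustrative) keeps follow-up questions reusing the same block:

```python
def cached_question_content(pdf_b64: str, question: str) -> list[dict]:
    """Build a message content list that reuses the same cached document
    block on every call, so follow-up questions hit the prompt cache."""
    return [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": pdf_b64,
            },
            "cache_control": {"type": "ephemeral"},
        },
        {"type": "text", "text": question},
    ]
```

Pass the result as `content` in each `client.messages.create` call; only the trailing text block changes between requests.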
Multi-Page Document Strategies
For documents longer than ~50 pages, a single API call may return truncated or shallow responses. Use these strategies to maintain accuracy.
Strategy 1 — Chunked processing: Split the PDF by page range, process each chunk, then merge results.
```python
import anthropic
import base64
from pypdf import PdfReader, PdfWriter
from io import BytesIO

client = anthropic.Anthropic()

def pdf_to_chunks(pdf_path: str, pages_per_chunk: int = 20) -> list[bytes]:
    """Split a PDF into fixed-size page chunks."""
    reader = PdfReader(pdf_path)
    total = len(reader.pages)
    chunks = []
    for start in range(0, total, pages_per_chunk):
        writer = PdfWriter()
        for i in range(start, min(start + pages_per_chunk, total)):
            writer.add_page(reader.pages[i])
        buf = BytesIO()
        writer.write(buf)
        chunks.append(buf.getvalue())
    return chunks

def extract_from_large_pdf(pdf_path: str, prompt: str) -> list[str]:
    """Process a large PDF in chunks and return per-chunk responses."""
    chunks = pdf_to_chunks(pdf_path, pages_per_chunk=20)
    results = []
    for i, chunk_bytes in enumerate(chunks):
        b64 = base64.standard_b64encode(chunk_bytes).decode()
        resp = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=1024,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "document",
                            "source": {
                                "type": "base64",
                                "media_type": "application/pdf",
                                "data": b64,
                            },
                        },
                        {"type": "text", "text": f"[Chunk {i+1}/{len(chunks)}] {prompt}"},
                    ],
                }
            ],
        )
        results.append(resp.content[0].text)
    return results
```
Strategy 2 — Hierarchical summarization: Use Haiku to summarize each chunk, then feed the summaries to Sonnet for a final synthesis pass. This keeps costs low while preserving accuracy across long documents.
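The two-stage pass can be sketched with the model calls abstracted behind callables (function names here are illustrative; `summarize` would wrap a Haiku call per chunk and `synthesize` a single Sonnet call on the combined prompt):

```python
from typing import Callable

def synthesis_prompt(summaries: list[str]) -> str:
    """Combine per-chunk summaries into one synthesis prompt."""
    numbered = "\n\n".join(
        f"[Section {i + 1}]\n{s}" for i, s in enumerate(summaries)
    )
    return (
        "Below are per-section summaries of one long document. "
        "Write a single coherent summary that covers all sections.\n\n"
        + numbered
    )

def hierarchical_summary(
    chunks: list[str],
    summarize: Callable[[str], str],   # e.g. one Haiku call per chunk
    synthesize: Callable[[str], str],  # e.g. one Sonnet call on the combined prompt
) -> str:
    summaries = [summarize(c) for c in chunks]
    return synthesize(synthesis_prompt(summaries))
```

Because the per-chunk summaries are short, the final Sonnet call stays well under context limits even for very long source documents.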
For model selection decisions, see Claude Haiku vs Sonnet vs Opus — Which Model to Use.
Extracting Structured Data (Tables, Forms, Key-Value Pairs)
Prompt Claude with an explicit JSON schema to produce reliably machine-readable output. Include field names, types, and constraints directly in the prompt.
```python
import json
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

INVOICE_SCHEMA = """
Extract the following fields as valid JSON. Use null for missing fields.
{
  "invoice_number": "string",
  "vendor_name": "string",
  "vendor_address": "string",
  "invoice_date": "YYYY-MM-DD",
  "due_date": "YYYY-MM-DD",
  "line_items": [
    {
      "description": "string",
      "quantity": number,
      "unit_price": number,
      "total": number
    }
  ],
  "subtotal": number,
  "tax_rate": number,
  "tax_amount": number,
  "total_amount": number,
  "currency": "string (ISO 4217)"
}
Return only the JSON object, no markdown fences.
"""

def extract_invoice(pdf_path: str) -> dict:
    pdf_b64 = base64.standard_b64encode(Path(pdf_path).read_bytes()).decode()
    resp = client.messages.create(
        model="claude-sonnet-4-6",  # Sonnet for complex multi-field extraction
        max_tokens=2048,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "document",
                        "source": {
                            "type": "base64",
                            "media_type": "application/pdf",
                            "data": pdf_b64,
                        },
                    },
                    {"type": "text", "text": INVOICE_SCHEMA},
                ],
            }
        ],
    )
    return json.loads(resp.content[0].text)
```
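Models occasionally wrap output in fences despite the instruction, which makes a bare json.loads raise. A small defensive parser (hypothetical helper) avoids that failure mode:

```python
import json
import re

def parse_json_response(text: str) -> dict:
    """Parse a JSON reply, tolerating markdown fences the model may
    add despite the 'no fences' instruction."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Strip an opening fence (with optional language tag) and the closing fence.
        cleaned = re.sub(r"^```[a-zA-Z]*\s*", "", cleaned)
        cleaned = re.sub(r"\s*```$", "", cleaned)
    return json.loads(cleaned)
```

Swap it in for the bare json.loads call when a malformed reply would otherwise crash the pipeline.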
Table extraction tip: For documents with complex merged cells, ask Claude to flatten the table into an array of row objects. Specify whether to include headers as keys or as a separate row.
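One way to phrase that, as an illustrative prompt fragment:

```python
# Illustrative flattening instruction for tables with merged cells.
TABLE_PROMPT = (
    "Flatten the table into a JSON array of row objects. "
    "Use the header row as keys. For merged cells, repeat the "
    "value in every row the merge spans. Return only the JSON array."
)
```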
Upgrade Your Document Pipelines
Build production document-processing agents: multi-step PDF pipelines, structured extraction chains, Batch API integration, and cost-optimized routing between Haiku/Sonnet/Opus. 100+ copy-paste Python recipes.
→ Get the Agent SDK Cookbook — $49
Instant download. 30-day money-back guarantee.
Comparing Document Parsing Approaches
| Approach | Setup complexity | Accuracy | Latency | Cost per page | Best for |
|---|---|---|---|---|---|
| Claude native (base64) | Minimal — 1 API call | High (digital PDFs), Medium (scanned) | 1–3s for 10 pages | ~$0.0004 with Haiku | Digital PDFs, fast prototyping |
| Tesseract + Claude | Medium — run OCR first, pass text | High (scanned), depends on OCR quality | 5–15s (OCR adds latency) | ~$0.0001 + OCR infra | Scanned docs at scale, offline OCR |
| Amazon Textract + Claude | High — AWS setup, IAM, S3 | Very high (tables, forms, signatures) | 10–30s async | ~$0.015 per page (Textract) + Claude | Complex forms, regulated industries |
Recommendation: Start with Claude native. Add Textract only when you need its specialized form/signature detection at regulated accuracy levels (healthcare, legal). Tesseract is a cost-effective middle path if you already run on-prem infrastructure.
For semantic search over extracted content, see Claude API Semantic Search.
Batch Document Processing
For processing hundreds of PDFs, use the Anthropic Batch API to reduce cost by 50% and bypass per-minute rate limits.
```python
import anthropic
import base64
import time
from pathlib import Path

client = anthropic.Anthropic()

def build_batch_requests(pdf_paths: list[str], prompt: str) -> list[dict]:
    """Build a list of batch request objects from a list of PDF paths."""
    requests = []
    for path in pdf_paths:
        b64 = base64.standard_b64encode(Path(path).read_bytes()).decode()
        requests.append(
            {
                "custom_id": Path(path).stem,
                "params": {
                    "model": "claude-haiku-4-5",
                    "max_tokens": 1024,
                    "messages": [
                        {
                            "role": "user",
                            "content": [
                                {
                                    "type": "document",
                                    "source": {
                                        "type": "base64",
                                        "media_type": "application/pdf",
                                        "data": b64,
                                    },
                                },
                                {"type": "text", "text": prompt},
                            ],
                        }
                    ],
                },
            }
        )
    return requests

def run_batch(pdf_paths: list[str], prompt: str) -> str:
    """Submit a batch job and return the batch ID for polling."""
    requests = build_batch_requests(pdf_paths, prompt)
    batch = client.beta.messages.batches.create(requests=requests)
    print(f"Batch submitted: {batch.id} — {len(requests)} documents")
    return batch.id

def collect_batch_results(batch_id: str) -> dict[str, str]:
    """Poll until complete and return {custom_id: response_text}."""
    while True:
        batch = client.beta.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        print(f"Status: {batch.processing_status} — waiting 10s")
        time.sleep(10)
    results = {}
    for result in client.beta.messages.batches.results(batch_id):
        if result.result.type == "succeeded":
            results[result.custom_id] = result.result.message.content[0].text
    return results
```
Cost math: Processing 1,000 invoices with Haiku at the standard rate ≈ $0.40. With the Batch API (50% discount) ≈ $0.20. For 10,000 invoices per month, the Batch API saves ~$2/month without any architecture change — and the savings scale linearly with volume and document length.
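The arithmetic behind those numbers, as a one-line estimator (a hypothetical helper; the per-document cost is the assumption to tune for your own documents):

```python
def batch_savings(n_docs: int, cost_per_doc: float, discount: float = 0.5) -> float:
    """Dollars saved per run by the Batch API discount."""
    return n_docs * cost_per_doc * discount
```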
Agent SDK Cookbook — Document Processing Recipes
The cookbook includes complete document-processing agent blueprints: PDF ingestion pipelines, multi-step extraction chains, error-recovery patterns for malformed documents, and integration examples with PostgreSQL and S3. All recipes use the Anthropic Python SDK with prompt caching enabled by default.
→ Get the Agent SDK Cookbook — $49
Instant download. 30-day money-back guarantee.
Frequently Asked Questions
Does Claude support scanned (image-based) PDFs?
Yes. Claude's vision capabilities apply to scanned PDFs. The model reads each page as an image and extracts text, tables, and layout information. Accuracy is slightly lower than on digital PDFs with embedded text, especially for low-resolution scans below 150 DPI. For high-volume scanned document workloads where accuracy is critical, pre-process with Tesseract or Amazon Textract to produce clean text, then pass the text to Claude for semantic extraction.
What is the maximum PDF size Claude can accept?
The current limit is 32 MB per file and up to 100 pages per document block. Documents exceeding these limits should be split before sending. Use the chunked processing pattern shown above — split into 20-page segments, process in parallel, and aggregate results.
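The "process in parallel" step can be sketched with a thread pool; the `process` callable stands in for the per-chunk messages.create call shown earlier:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")

def process_chunks_parallel(
    chunks: list[T], process: Callable[[T], R], max_workers: int = 4
) -> list[R]:
    """Run a per-chunk callable concurrently; results keep chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process, chunks))
```

Keep max_workers modest so concurrent requests stay within your account's rate limits.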
How do I extract data from a PDF form with checkboxes and signatures?
Use claude-sonnet-4-6 with a prompt that explicitly lists every form field including checkboxes and signature lines. Ask Claude to return a JSON object with boolean values for checkboxes (true/false) and a string status for signatures ("signed", "unsigned", or "initials only"). For legally binding signature verification, combine with Amazon Textract's signature detection, which provides a confidence score.
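An illustrative schema prompt for such a form (the field names here are hypothetical; list your form's actual fields):

```python
# Hypothetical form-field schema following the invoice-schema pattern above.
FORM_SCHEMA = """
Extract the following form fields as valid JSON. Use null for missing fields.
{
  "applicant_name": "string",
  "consent_checkbox": boolean,
  "newsletter_checkbox": boolean,
  "signature_status": "one of: signed | unsigned | initials only",
  "signature_date": "YYYY-MM-DD or null"
}
Return only the JSON object, no markdown fences.
"""
```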
How much does PDF parsing cost with the Claude API?
With claude-haiku-4-5, a 10-page digital PDF typically costs $0.0003–$0.0005 per document in input tokens. With the Batch API (50% discount), you can process 10,000 documents for roughly $1.50–$2.50. Sonnet costs about 5x more per token but delivers better accuracy on complex tables and multi-column layouts. Enable prompt caching on the document block if you query the same PDF more than once — cache reads cost 10% of the normal input price.
Can I parse Word documents (.docx) or Excel files (.xlsx) with Claude?
Claude's native document type supports PDF and plain text. For Word and Excel files, convert to PDF first (using python-docx + reportlab, or LibreOffice headless), then send the PDF. Alternatively, extract text from .docx using python-docx and pass as a text content block. For spreadsheets, serialize to CSV and include as text — Claude handles CSV table parsing well without needing the binary format.
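For the spreadsheet path, the serialization step is plain stdlib once you have the rows (for example, from openpyxl's sheet.iter_rows(values_only=True)); this small sketch assumes the rows are already in memory:

```python
import csv
import io

def rows_to_csv_text(rows: list[list]) -> str:
    """Serialize spreadsheet rows to CSV text for a text content block."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()
```

Include the returned string in a `{"type": "text", ...}` block alongside your extraction prompt.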
Sources
- Anthropic — PDF support documentation — April 2026
- Anthropic — Message Batches API — April 2026
- Anthropic — Vision capabilities — April 2026