Claude PDF Analysis: How to Extract Information from Documents

Claude can analyse PDFs in two ways: send the PDF directly as a base64-encoded file (up to 32MB, with tables and images preserved), or extract the text first and send it as plain text (better cost control, and required for PDFs over 32MB). For most document analysis tasks, such as contract review, invoice extraction and report summarisation, the direct PDF approach is simpler and handles formatting better. For very large document sets, text extraction with optional RAG is more cost-efficient.

Method 1: Send PDF directly to Claude (simplest)

import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def analyse_pdf(pdf_path: str, question: str) -> str:
    """
    Send a PDF directly to Claude and ask a question about it.
    Supports PDFs up to 32MB.
    """
    pdf_data = Path(pdf_path).read_bytes()
    pdf_base64 = base64.standard_b64encode(pdf_data).decode("utf-8")
    
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "document",
                        "source": {
                            "type": "base64",
                            "media_type": "application/pdf",
                            "data": pdf_base64,
                        },
                    },
                    {
                        "type": "text",
                        "text": question,
                    },
                ],
            }
        ],
    )
    
    return response.content[0].text

# Usage
result = analyse_pdf(
    "contract.pdf",
    "What are the payment terms and the penalty for late payment?"
)
print(result)
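A quick pre-flight check can avoid a failed call on an oversized file. The sketch below assumes the 32MB limit is measured on the encoded request payload; base64 encoding grows data by roughly a third, so the helper measures the encoded size rather than the size on disk. `MAX_REQUEST_BYTES`, `base64_size` and `fits_direct_upload` are illustrative names, not part of the SDK.

```python
import base64
from pathlib import Path

# Assumption: the 32MB limit applies to the base64-encoded payload.
MAX_REQUEST_BYTES = 32 * 1024 * 1024

def base64_size(raw_bytes: int) -> int:
    """Length in bytes of the padded base64 encoding of `raw_bytes` bytes."""
    return 4 * ((raw_bytes + 2) // 3)

def fits_direct_upload(pdf_path: str, limit: int = MAX_REQUEST_BYTES) -> bool:
    """Return True if the base64-encoded PDF stays within `limit` bytes."""
    raw_size = Path(pdf_path).stat().st_size
    return base64_size(raw_size) <= limit
```

If `fits_direct_upload("contract.pdf")` returns False, fall back to the text extraction route in Method 2.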

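Supported document types (via media_type):

application/pdf: PDF files, sent with a base64 source as shown above
text/plain: plain text, sent with a text source (no base64 encoding needed)

Other formats such as HTML, CSV, XML and Word documents are not accepted as document media types at the time of writing. Convert them to plain text or PDF before sending.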
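Plain text goes into a document block with a `text` source instead of a `base64` source. The helper below is a minimal sketch of building such a block, assuming the source shape and the optional `title` field match the current Messages API; `text_document_block` is an illustrative name.

```python
from typing import Optional

def text_document_block(text: str, title: Optional[str] = None) -> dict:
    """Build a document content block for plain text (no base64 encoding)."""
    block = {
        "type": "document",
        "source": {"type": "text", "media_type": "text/plain", "data": text},
    }
    if title:
        # Optional label for the document; assumed to be accepted by the API.
        block["title"] = title
    return block
```

The returned dict can take the place of the PDF document block in the `content` list from Method 1.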

Method 2: Text extraction first (for large files / cost control)

For PDFs over 32MB, or when you want to minimise token costs:

import anthropic
from pypdf import PdfReader  # pip install pypdf

client = anthropic.Anthropic()

def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract all text from a PDF file."""
    reader = PdfReader(pdf_path)
    pages = []
    
    for page_num, page in enumerate(reader.pages):
        text = page.extract_text()
        if text.strip():
            pages.append(f"[Page {page_num + 1}]\n{text}")
    
    return "\n\n".join(pages)

def analyse_pdf_text(pdf_path: str, question: str) -> str:
    """Extract PDF text and send to Claude."""
    text = extract_text_from_pdf(pdf_path)
    
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"Document:\n\n{text}\n\nQuestion: {question}"
        }]
    )
    
    return response.content[0].text

When to use text extraction:

PDFs over 32MB
Batch processing many PDFs (extract once, query many times)
When you want precise token count control
When table formatting in the original doesn't matter
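The "extract once, query many times" point can be sketched as a small on-disk cache. This is one possible approach, not the only one: `cached_text` and the `.pdf_text_cache` directory are hypothetical names, and the cache key assumes that file name, size and modification time are enough to detect a changed PDF.

```python
from pathlib import Path
from typing import Callable

def cached_text(
    pdf_path: str,
    extract: Callable[[str], str],
    cache_dir: str = ".pdf_text_cache",
) -> str:
    """Return extracted text, reusing a cached copy if the PDF is unchanged."""
    pdf = Path(pdf_path)
    stat = pdf.stat()
    # Key on name, size and modification time so an edited PDF is re-extracted.
    key = f"{pdf.stem}-{stat.st_size}-{stat.st_mtime_ns}.txt"
    cache_file = Path(cache_dir) / key
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = extract(pdf_path)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text
```

Usage: `text = cached_text("contract.pdf", extract_text_from_pdf)`, then build the prompt from `text` for each question.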

Structured data extraction from PDFs

For extracting specific fields (invoices, contracts, forms):

import base64
import json
from pathlib import Path

import anthropic

client = anthropic.Anthropic()

def extract_invoice_data(pdf_path: str) -> dict:
    """Extract structured invoice data from a PDF."""
    pdf_data = base64.standard_b64encode(Path(pdf_path).read_bytes()).decode()
    
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system="""Extract invoice data and return ONLY valid JSON. No other text.

Schema:
{
  "invoice_number": "string",
  "date": "YYYY-MM-DD",
  "vendor": {"name": "string", "address": "string or null"},
  "bill_to": {"name": "string", "address": "string or null"},
  "line_items": [{"description": "string", "quantity": number, "unit_price": number, "total": number}],
  "subtotal": number,
  "tax": number,
  "total": number,
  "payment_due": "YYYY-MM-DD or null",
  "currency": "string (e.g. USD)"
}

Use null for missing optional fields. All monetary values as numbers.""",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_data},
                },
                {"type": "text", "text": "Extract the invoice data."},
            ],
        }]
    )
    
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return {"error": "Could not parse invoice", "raw": response.content[0].text[:200]}

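# Process multiple invoices
for pdf_file in sorted(Path("invoices").glob("*.pdf")):
    data = extract_invoice_data(str(pdf_file))
    if "error" in data:
        # Parsing failed: report the error instead of printing a bogus amount
        print(f"{pdf_file.name}: {data['error']}")
    else:
        print(f"{pdf_file.name}: {data.get('total')} {data.get('currency', '')}")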
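Extracted numbers are worth a sanity check before they reach an accounting system. The sketch below checks the arithmetic of a parsed invoice against the schema above. It assumes the numeric fields are present and non-null, and `check_invoice_totals` is a hypothetical helper. Invoices with discounts or shipping lines will be flagged too, so treat the output as items for review rather than proof of an extraction error.

```python
import math

def check_invoice_totals(data: dict, tolerance: float = 0.01) -> list:
    """Return a list of arithmetic inconsistencies in extracted invoice data."""
    problems = []
    items = data.get("line_items") or []
    for i, item in enumerate(items):
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(expected, item["total"], abs_tol=tolerance):
            problems.append(f"line {i + 1}: quantity x unit_price != line total")
    # Only compare against the subtotal when line items were extracted.
    if items:
        items_sum = sum(item["total"] for item in items)
        if not math.isclose(items_sum, data["subtotal"], abs_tol=tolerance):
            problems.append("sum of line totals != subtotal")
    if not math.isclose(data["subtotal"] + data["tax"], data["total"], abs_tol=tolerance):
        problems.append("subtotal + tax != total")
    return problems
```

An empty list means the figures are internally consistent; it does not prove they match the source document.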

Batch PDF processing with the Batches API

For processing many PDFs at 50% cost:

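import re

def create_pdf_batch(pdf_paths: list[str], question: str) -> str:
    """Create a batch job for PDF analysis."""
    requests = []
    
    for i, path in enumerate(pdf_paths):
        pdf_data = base64.standard_b64encode(Path(path).read_bytes()).decode()
        # custom_id is assumed to allow only letters, digits, "_" and "-"
        # (max 64 chars), so sanitise the file stem before using it
        safe_stem = re.sub(r"[^a-zA-Z0-9_-]", "_", Path(path).stem)[:48]
        
        requests.append({
            "custom_id": f"pdf-{i}-{safe_stem}",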
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 1024,
                "messages": [{
                    "role": "user",
                    "content": [
                        {
                            "type": "document",
                            "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_data},
                        },
                        {"type": "text", "text": question},
                    ]
                }]
            }
        })
    
    batch = client.messages.batches.create(requests=requests)
    return batch.id

# Process a folder of invoices at 50% cost
invoice_files = [str(p) for p in sorted(Path("invoices").glob("*.pdf"))]
batch_id = create_pdf_batch(
    pdf_paths=invoice_files,
    question="Extract the total amount and payment due date as JSON."
)
print(f"Batch submitted: {batch_id}")
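Batches run asynchronously, so submitting one is only half the job. The sketch below shows one way to collect the answers once the batch has finished. It assumes the SDK's `batches.retrieve` and `batches.results` calls and the `processing_status` and `result.type` fields behave as described in the Message Batches documentation; `collect_batch_results` is an illustrative name.

```python
from typing import Optional

def collect_batch_results(client, batch_id: str) -> Optional[dict]:
    """Return {custom_id: text or status marker}, or None if still running."""
    batch = client.messages.batches.retrieve(batch_id)
    if batch.processing_status != "ended":
        return None  # not finished yet; call again later

    results = {}
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type == "succeeded":
            results[entry.custom_id] = entry.result.message.content[0].text
        else:
            # e.g. "errored", "canceled" or "expired": keep the status instead
            results[entry.custom_id] = f"[{entry.result.type}]"
    return results
```

Call it periodically with the `batch_id` returned by `create_pdf_batch` until it returns a dict instead of None.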

Handling scanned PDFs

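pypdf extracts text only from text-based PDFs. A scanned PDF is a stack of page images, so pypdf returns little or no text for it. Send scanned PDFs directly to Claude, which reads the page images:

# For scanned PDFs, send the file directly; Claude uses vision to read the pages
# "scanned.pdf" is a placeholder path
pdf_base64 = base64.standard_b64encode(Path("scanned.pdf").read_bytes()).decode("utf-8")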
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_base64},
            },
            {
                "type": "text",
                "text": "Read and transcribe all text visible in this scanned document.",
            },
        ],
    }]
)

Claude's vision capability reads scanned text from images within PDFs. For very low-quality scans, dedicated OCR (Google Cloud Vision, AWS Textract) may be more accurate.
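To decide automatically which route a file should take, one heuristic is to look at how much text extraction returned per page. This is a rough sketch, not a reliable classifier: `looks_scanned` is a hypothetical helper, and the 50-character threshold and "more than half the pages" rule are assumptions to tune on your own documents.

```python
def looks_scanned(page_texts: list, min_chars_per_page: int = 50) -> bool:
    """Heuristic: True if most pages yielded almost no extractable text."""
    if not page_texts:
        return True  # nothing extracted at all
    sparse = sum(
        1 for text in page_texts
        if len((text or "").strip()) < min_chars_per_page
    )
    return sparse / len(page_texts) > 0.5
```

Usage with pypdf: `looks_scanned([page.extract_text() for page in PdfReader(path).pages])`. When it returns True, send the PDF directly instead of using `extract_text_from_pdf`.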

Large document strategy (>200K tokens)

For very large documents that exceed Claude's context window:

def analyse_large_document(pdf_path: str, question: str) -> str:
    """
    For documents too large for Claude's context window.
    Strategy: extract text → chunk → retrieve relevant chunks → answer.
    """
    # 1. Extract text
    text = extract_text_from_pdf(pdf_path)
    
    # 2. Size check (rough estimate: 4 chars ≈ 1 token, so 400K chars ≈ 100K tokens)
    if len(text) < 400_000:  # conservative threshold that leaves room for the answer
        # Small enough for a single context window
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=2048,
            messages=[{"role": "user", "content": f"Document:\n{text}\n\nQuestion: {question}"}]
        )
        return response.content[0].text
    
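    # 3. Too large: use RAG (see build-rag-system guide)
    # rag_utils is assumed to be a local helper module from that guide,
    # not a published package
    from rag_utils import chunk_text, embed_texts, retrieve_relevant_chunks
    chunks = chunk_text(text)
    # ... continue with RAG pipeline
    # Fail loudly rather than return a placeholder string as if it were an answer
    raise NotImplementedError("RAG pipeline not shown here; see the RAG guide")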
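The function above hands chunking to a `chunk_text` helper that is not shown. As a rough idea of what such a helper might do, here is a minimal sketch that splits text into overlapping character windows. The name `chunk_text_sketch` and the default sizes are assumptions, and the RAG guide's own helper may work differently (for example, by splitting on paragraphs or tokens).

```python
def chunk_text_sketch(text: str, chunk_chars: int = 8000, overlap: int = 400) -> list:
    """Split text into character windows that overlap by `overlap` chars."""
    if chunk_chars <= overlap:
        raise ValueError("chunk_chars must be larger than overlap")
    step = chunk_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_chars])
        if start + chunk_chars >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary intact in at least one chunk, at the cost of some duplicated text.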

Frequently asked questions

What's the maximum PDF size Claude can process? 32MB when sending as a base64-encoded document. For larger files, extract the text first using a PDF library and send as plain text.

Does Claude handle tables in PDFs? Yes, but with varying reliability. Native PDF tables (text-based) are read well. Complex multi-column layouts may be misinterpreted. For critical financial data, verify extracted table values.

Can Claude process password-protected PDFs? No. Remove password protection before sending to the API (with the PDF owner's permission).

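Is it cheaper to send the PDF directly or extract text first? Text extraction is usually cheaper. A PDF sent directly is processed page by page as both extracted text and a page image, so it typically uses more input tokens than the text alone. Extraction pays off most when you query the same document many times (extract once, reuse the text). For one-time analysis where layout, tables or images matter, direct PDF sending is simpler and the extra tokens are usually worth it.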

Does Claude retain my PDF content for training? Per Anthropic's privacy policy, API requests are not used to train models by default. If you're processing sensitive documents, confirm the current policy at anthropic.com/privacy.

Related guides

Build a RAG System with Claude: Python Implementation Guide — for document corpora too large for direct processing
Claude JSON Structured Output: Getting Reliable JSON Every Time — structured extraction from documents

