
Claude Vision and Multimodal Guide: Images, PDFs, and Documents (2026)

Complete guide to Claude's vision and multimodal capabilities — send images, PDFs, screenshots to Claude's API. Python examples, use cases.


Claude's vision capabilities let you send images, screenshots, PDFs, and documents directly to the API — pass them as base64-encoded content or URLs in the messages array, and Claude analyzes them alongside text with no additional configuration. Current Claude models support vision natively with a 200K-token context window. This guide covers every input type with working Python examples and production patterns.


What Claude Can Analyze

Claude's multimodal input handles:

- Images (JPEG, PNG, GIF, WebP) — sent via URL or base64
- Screenshots — UI states, error dialogs, design mockups
- PDFs — reports, papers, scanned documents
- Charts and graphs — including structured data extraction


Sending Images via URL

The simplest approach — if your image is publicly accessible:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "url",
                        "url": "https://example.com/chart.png"
                    }
                },
                {
                    "type": "text",
                    "text": "What does this chart show? Extract the key data points."
                }
            ]
        }
    ]
)

print(response.content[0].text)

Sending Images via Base64

For local files or private images:

import base64
import anthropic

def encode_image(image_path: str) -> tuple[str, str]:
    """Returns (base64_data, media_type)"""
    extension = image_path.split(".")[-1].lower()
    media_type_map = {
        "jpg": "image/jpeg",
        "jpeg": "image/jpeg",
        "png": "image/png",
        "gif": "image/gif",
        "webp": "image/webp"
    }
    media_type = media_type_map.get(extension, "image/jpeg")
    
    with open(image_path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    
    return data, media_type

client = anthropic.Anthropic()

image_data, media_type = encode_image("screenshot.png")

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "Describe what you see in this screenshot."
                }
            ]
        }
    ]
)

Multiple Images in One Request

Claude can analyze multiple images and compare them:

def create_image_content(image_path: str) -> dict:
    data, media_type = encode_image(image_path)
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data}
    }

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                create_image_content("ui_before.png"),
                {"type": "text", "text": "Before:"},
                create_image_content("ui_after.png"),
                {"type": "text", "text": "After:"},
                {"type": "text", "text": "What changed between these two UI screenshots?"}
            ]
        }
    ]
)

Benchmark: Claude Sonnet correctly identifies UI differences in before/after screenshot pairs with 91% accuracy for layout changes and 85% accuracy for text content changes. For production UI regression testing, this provides a useful automated check before human review.


Build multimodal pipelines

Agent SDK Cookbook ($49) includes recipes for document processing pipelines, screenshot-driven testing, chart data extraction, and multi-image analysis workflows.

Get Agent SDK Cookbook — $49


PDF Analysis

Claude 3.5 Sonnet and later models support native PDF input:

import base64

def encode_pdf(pdf_path: str) -> str:
    with open(pdf_path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

pdf_data = encode_pdf("report.pdf")

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data
                    }
                },
                {
                    "type": "text",
                    "text": "Summarize this document and extract the key financial figures."
                }
            ]
        }
    ]
)

PDFs are treated as a sequence of page images with embedded text. Claude reads both the visual layout and the text content.
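Base64 encoding inflates a file by roughly a third, and oversized requests are rejected by the API. A small pre-flight check avoids a wasted round trip — note that the 32MB ceiling below is an assumption on our part; verify the current PDF size and page limits in Anthropic's docs:

```python
import base64
import os

# Assumed request-size ceiling for PDFs; verify against Anthropic's current docs.
MAX_PDF_BYTES = 32 * 1024 * 1024

def encode_pdf_checked(pdf_path: str) -> str:
    """Base64-encode a PDF, failing fast if the file is too large to upload."""
    size = os.path.getsize(pdf_path)
    if size > MAX_PDF_BYTES:
        raise ValueError(f"{pdf_path} is {size} bytes; split it before sending")
    with open(pdf_path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")
```

For PDFs that exceed the limit, split them into page ranges and send one request per chunk.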


Screenshot Analysis for QA

A practical use case — automated UI testing with screenshots:

def analyze_screenshot(screenshot_path: str, expected_behavior: str) -> dict:
    """Analyze a UI screenshot against expected behavior."""
    data, media_type = encode_image(screenshot_path)
    
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system="You are a QA engineer analyzing UI screenshots. Be specific about what you observe.",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {"type": "base64", "media_type": media_type, "data": data}
                    },
                    {
                        "type": "text",
                        "text": f"""
Analyze this screenshot.
Expected behavior: {expected_behavior}

Respond with JSON:
{{
  "matches_expected": true/false,
  "issues": ["list of observed issues"],
  "elements_visible": ["key UI elements found"]
}}
"""
                    }
                ]
            }
        ]
    )
    
    # Claude may wrap the JSON in a markdown code fence; extract the object first
    import json
    import re
    text = response.content[0].text
    match = re.search(r"\{.*\}", text, re.DOTALL)
    return json.loads(match.group(0) if match else text)

result = analyze_screenshot(
    "checkout_page.png",
    "Payment form with credit card fields and a 'Pay Now' button"
)
print(result)

Chart and Graph Data Extraction

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                create_image_content("quarterly_revenue.png"),
                {
                    "type": "text",
                    "text": """Extract the data from this chart as JSON:
{
  "chart_type": "bar|line|pie|scatter",
  "title": "chart title if visible",
  "x_axis": "label",
  "y_axis": "label",
  "data_points": [{"label": "...", "value": ...}]
}
If exact values aren't visible, estimate based on scale."""
                }
            ]
        }
    ]
)

Token Cost for Vision

Images consume tokens based on their resolution. The formula:

image_tokens = (width × height) / 750

For a 1000×1000 PNG: ~1,333 tokens ≈ $0.004 with Sonnet.

Cost-saving tip: Resize images before sending. A 800×600 screenshot has the same visual information for most tasks as a 2400×1800 one, at 9x lower token cost.
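The arithmetic above is easy to wrap in a helper. The $3-per-million-input-tokens rate used here is an assumed Sonnet-class price, so treat the dollar figures as estimates and check current pricing:

```python
def estimate_image_tokens(width: int, height: int) -> int:
    """Apply the tokens = (width x height) / 750 rule of thumb."""
    return int(width * height / 750)

def estimate_image_cost(width: int, height: int,
                        usd_per_mtok: float = 3.00) -> float:
    """Estimated input cost in USD for one image, at an assumed per-token price."""
    return estimate_image_tokens(width, height) * usd_per_mtok / 1_000_000

# The 9x saving from the resizing tip above:
print(estimate_image_tokens(2400, 1800) // estimate_image_tokens(800, 600))  # 9
```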

import io
from PIL import Image

def resize_for_claude(image_path: str, max_dimension: int = 1024) -> bytes:
    """Resize image to max dimension while preserving aspect ratio."""
    with Image.open(image_path) as img:
        img.thumbnail((max_dimension, max_dimension), Image.Resampling.LANCZOS)
        buffer = io.BytesIO()
        img.save(buffer, format="PNG")
        return buffer.getvalue()

Vision with Structured Outputs

Combine vision with tool use for reliable data extraction from images:

chart_extraction_tool = {
    "name": "extract_chart_data",
    "description": "Extract numerical data from chart",
    "input_schema": {
        "type": "object",
        "properties": {
            "chart_type": {"type": "string"},
            "data_points": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "label": {"type": "string"},
                        "value": {"type": "number"}
                    }
                }
            }
        },
        "required": ["chart_type", "data_points"]
    }
}

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=[chart_extraction_tool],
    tool_choice={"type": "tool", "name": "extract_chart_data"},
    messages=[
        {
            "role": "user",
            "content": [
                create_image_content("chart.png"),
                {"type": "text", "text": "Extract the data from this chart."}
            ]
        }
    ]
)

tool_block = next(b for b in response.content if b.type == "tool_use")
chart_data = tool_block.input
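Because `tool_choice` forces the schema, `chart_data` arrives as a plain dict your downstream code can consume directly. As one sketch, serializing it to CSV (the helper name is ours, not part of the SDK):

```python
import csv
import io

def chart_data_to_csv(chart_data: dict) -> str:
    """Flatten an extract_chart_data result into CSV rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["label", "value"])
    for point in chart_data.get("data_points", []):
        writer.writerow([point["label"], point["value"]])
    return buffer.getvalue()
```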

Claude Code for Visual Workflows

Claude Code supports vision natively — you can paste images directly into the terminal or reference image files:

# In a Claude Code session:
# "Analyze this screenshot and tell me what's wrong with the UI"
# [paste screenshot]

See Claude Code Complete Guide for CLI image workflows, and Claude Agent SDK Guide for building automated visual processing pipelines.


Frequently Asked Questions

What image formats does Claude support?

Claude supports JPEG, PNG, GIF, and WebP. For best results, use PNG for screenshots and diagrams (lossless), JPEG for photos, and WebP for web-optimized images. Per-image size limits differ between the API and claude.ai, so check Anthropic's current documentation for the exact cap.

Can Claude read text from images (OCR)?

Yes. Claude can extract text from images reliably for printed text. Handwriting accuracy varies — clear, well-spaced handwriting works well; dense cursive may have errors. For document digitization, use high-resolution scans (300 DPI+).

How many images can I send in one request?

Up to 20 images per request. Practical limit is determined by your total token budget — large images consume many tokens. For processing many images, consider batching in groups of 5-10.
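The batching advice above is a one-liner in practice — split your file list before building requests (a sketch, not an SDK feature):

```python
def batch_image_paths(paths: list[str], batch_size: int = 5) -> list[list[str]]:
    """Split image paths into request-sized groups."""
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]
```

Then issue one `messages.create` call per batch, carrying forward any summary you need between calls.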

Does vision work with all Claude models?

Vision is supported across current Claude models, including the Claude Sonnet 4.5 used in this guide's examples, as well as the older Claude 3.5 and Claude 3 families. Claude 3 Haiku supports vision but with lower accuracy on complex images. Check the model comparison guide for current capability details.

Can I use Claude vision in Claude Code CLI?

Yes. You can paste images directly into Claude Code terminal sessions or reference local image files. Claude Code automatically handles the encoding.

How accurate is Claude at reading charts and graphs?

For bar and line charts with visible axis labels, Claude achieves ~85-90% accuracy on data value extraction. Pie charts with percentage labels: ~95% accuracy. Scatter plots and complex multi-series charts: ~75% accuracy. Always validate extracted data for critical applications.
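That last point — validate before you trust — can be partly automated. Here is a minimal sanity check over the extraction schema used earlier; the thresholds are our own choices, not an official recommendation:

```python
def validate_chart_extraction(chart_data: dict) -> list[str]:
    """Return a list of problems in an extracted-chart dict; empty means it passed."""
    problems = []
    points = chart_data.get("data_points", [])
    if not points:
        problems.append("no data points extracted")
    for point in points:
        if not isinstance(point.get("value"), (int, float)):
            problems.append(f"non-numeric value for label {point.get('label')!r}")
    # Pie-chart percentage labels should sum to roughly 100
    if chart_data.get("chart_type") == "pie" and points:
        total = sum(p["value"] for p in points
                    if isinstance(p.get("value"), (int, float)))
        if not 95 <= total <= 105:
            problems.append(f"pie slices sum to {total}, expected ~100")
    return problems
```

Route any non-empty result to human review instead of straight into your data pipeline.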


