
Prompt Caching: The 90% Discount Most Claude Developers Miss

Complete guide to Claude prompt caching — how it works, the exact cache write/read pricing, when to use it.


Prompt caching reduces cache-read token costs by up to 90% compared to standard input pricing. It works by storing a prefix of your prompt on Anthropic's servers so that subsequent requests reuse the cached version instead of re-processing the full text. For any application that sends the same system prompt or context repeatedly — agents, chatbots, document Q&A systems — caching is the single highest-ROI optimization available. For the full cost break-even analysis, see Claude API Cost and Prompt Caching Break-Even.

Most developers don't use it, even though enabling it requires only a one-line change to their API call. This guide shows exactly what to add.


How prompt caching works

When you include "cache_control": {"type": "ephemeral"} in a message block, Anthropic stores that block's content in a temporary cache on their edge servers. The cache entry:

- persists for 5 minutes, with the TTL refreshing on every cache hit
- matches only on the exact content of the cached prefix — any change is a miss
- is scoped to your organization, not shared across accounts

On the first request (cache write), you pay a premium: 1.25× the standard input price. On every subsequent request that hits the cache (cache read), you pay only 0.1× the standard input price — a 90% discount.


Pricing breakdown (April 2026)

For claude-sonnet-4-6:

Token type       Price per 1M tokens
Standard input   $3.00
Cache write      $3.75 (1.25×)
Cache read       $0.30 (0.1×)
Output           $15.00

Break-even analysis: the 0.25× write premium is recovered on the very first cache read, since each read saves 0.9× of the standard input price. Any application that processes the same content more than once per 5-minute window is leaving money on the table.

For longer documents like a 50,000-token codebase or legal contract, the math becomes dramatic. Reading the document 100 times without caching means re-processing 5 million input tokens; with caching you pay the write premium once and the cache-read rate 99 times:

Without caching (100 reads): 50,000 tokens × 100 × $3.00/M = $15.00
With caching (1 write + 99 reads): (50,000 × $3.75/M) + (50,000 × 99 × $0.30/M) = $0.1875 + $1.485 = $1.67

That's an 89% reduction for high-frequency document access.
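
The same arithmetic as a quick sanity check in Python — a minimal sketch using the Sonnet rates from the table above (the constants are ours; adjust them for other models):

DOC_TOKENS = 50_000        # document size in tokens
READS = 100                # total requests against the same document
INPUT, WRITE, READ = 3.00, 3.75, 0.30  # $ per 1M tokens (Sonnet)

uncached = DOC_TOKENS * READS * INPUT / 1e6
cached = (DOC_TOKENS * WRITE + DOC_TOKENS * (READS - 1) * READ) / 1e6
print(f"Uncached: ${uncached:.2f}, cached: ${cached:.2f}, "
      f"savings: {1 - cached / uncached:.0%}")
# Uncached: $15.00, cached: $1.67, savings: 89%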


When to use prompt caching

Prompt caching is most valuable when:

Scenario                                  Cache target        Expected savings
Chatbot with long system prompt           System message      60–80%
Document Q&A (same doc, many questions)   Document content    80–90%
Agent with fixed tool definitions         Tool schemas        40–60%
Code review bot (same codebase)           Repository context  85–95%
Multi-turn conversations                  Growing history     50–70%

It has minimal value for:

- one-off requests that never reuse the same prefix
- prompts below the minimum cacheable length (see "Common mistakes" below)
- content that changes on every request, such as timestamps or per-user data
- workloads where requests arrive more than 5 minutes apart, so the cache expires between calls


Implementation: basic example

Add cache_control to any content block you want cached. Here's a chatbot with a large system prompt:

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a senior Python engineer...
[... 5,000 tokens of instructions, examples, and context ...]"""

def chat(user_message: str, conversation_history: list) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # ← this one line
            }
        ],
        messages=conversation_history + [
            {"role": "user", "content": user_message}
        ],
    )
    return response.content[0].text

The first call pays 1.25× for the system prompt. Every call within the next 5 minutes pays 0.1×. In a busy support bot processing 100 tickets/hour, this alone can reduce costs by 70%.
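
A quick usage sketch (the questions are made up for illustration): the first call writes the system prompt to cache, and every follow-up within the TTL reads it back at 0.1×:

history = []
question = "How do I fix a circular import?"
answer = chat(question, history)                      # cache write
history += [
    {"role": "user", "content": question},
    {"role": "assistant", "content": answer},
]
followup = chat("Can you show an example?", history)  # cache read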


Implementation: document Q&A

For document-heavy workloads, cache the document content in the user message:

def answer_question(document: str, question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": document,
                        "cache_control": {"type": "ephemeral"},  # cache the doc
                    },
                    {
                        "type": "text",
                        "text": f"\n\nQuestion: {question}",
                        # no cache_control on the question — it changes each time
                    },
                ],
            }
        ],
    )
    return response.content[0].text

The document is cached for 5 minutes. Ask 10 questions about the same contract in rapid succession and you pay cache-read price for questions 2–10.
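
For example (the file name and questions are illustrative), a loop over questions pays the write premium exactly once:

contract = open("contract.txt").read()  # assume it exceeds the minimum cacheable length
questions = [
    "Who are the contracting parties?",
    "What is the termination clause?",
    "What is the governing law?",
]
for q in questions:
    print(answer_question(contract, q))  # question 1 writes the cache; 2-3 read it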


Implementation: multi-turn conversation caching

For long conversations, cache the accumulated history. The trick is to place a single cache breakpoint at the end of the conversation prefix — the cache covers everything up to that block — and leave only the latest exchange uncached. Don't mark every turn: the API allows at most 4 cache_control breakpoints per request.

def continue_conversation(messages: list[dict], new_user_message: str) -> str:
    # Re-shape the final message of the history so its last content block
    # carries the cache breakpoint — everything before it is cached too
    cached_messages = [dict(msg) for msg in messages]
    if cached_messages:
        last = cached_messages[-1]
        text = (
            last["content"]
            if isinstance(last["content"], str)
            else last["content"][0]["text"]
        )
        last["content"] = [
            {
                "type": "text",
                "text": text,
                "cache_control": {"type": "ephemeral"},
            }
        ]

    # Add the new user message without caching — it changes every turn
    cached_messages.append({"role": "user", "content": new_user_message})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=cached_messages,
    )
    return response.content[0].text
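
A minimal driver for this function (the prompts are placeholders): the history grows each turn, and each request re-caches the longer prefix while reusing what was already cached:

history: list[dict] = []
for turn in ["Summarize our refund policy.", "Now draft a reply to the customer."]:
    reply = continue_conversation(history, turn)
    history += [
        {"role": "user", "content": turn},
        {"role": "assistant", "content": reply},
    ]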

Verifying cache usage in the response

Check the usage object to confirm caching is working:

response = client.messages.create(...)

print(f"Cache write tokens:  {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens:   {response.usage.cache_read_input_tokens}")
print(f"Uncached input:      {response.usage.input_tokens}")
print(f"Output tokens:       {response.usage.output_tokens}")

On a cache write: cache_creation_input_tokens > 0, cache_read_input_tokens == 0. On a cache hit: cache_creation_input_tokens == 0, cache_read_input_tokens > 0.

If you see cache_read_input_tokens == 0 every call, your cache isn't hitting — check that the cached content is identical across requests and above the minimum token threshold.
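
A small helper along these lines (the function name is ours, not from the SDK) makes the check explicit:

def cache_status(usage) -> str:
    # Classify one response's cache behavior from its usage counts
    if usage.cache_read_input_tokens > 0:
        return "cache hit"
    if usage.cache_creation_input_tokens > 0:
        return "cache write"
    return "no caching — check content stability and the minimum token threshold"

print(cache_status(response.usage))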


Common mistakes

Mistake 1: Caching content that changes every request
The cache key is the exact content of the block. If your "system prompt" includes a timestamp or user-specific data, it will never hit the cache. Strip dynamic content out of cached blocks.

Mistake 2: Caching content below the minimum threshold
Sonnet and Opus require 1,024 tokens minimum; Haiku models require 2,048 or more. A 500-token system prompt won't be cached — the request is simply processed at standard input rates with no caching benefit. Check token counts with client.messages.count_tokens(), as in the sketch after this list.

Mistake 3: Relying on caching for compliance isolation
Cache entries are scoped to your organization, not per-user. Don't cache user PII in shared prompts.

Mistake 4: Assuming the 5-minute TTL is wall-clock
The TTL resets on each cache hit. An active cache entry used every 2 minutes will persist indefinitely; one not accessed for 5 minutes expires.
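
A pre-flight check along these lines (assuming the SDK's client.messages.count_tokens endpoint; the threshold constant is ours):

MIN_CACHEABLE = 1_024  # Sonnet/Opus minimum; use 2,048+ for Haiku models

count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system=[{"type": "text", "text": SYSTEM_PROMPT}],
    messages=[{"role": "user", "content": "ping"}],  # endpoint requires at least one message
)
if count.input_tokens < MIN_CACHEABLE:
    print("Prompt too short to cache — it will be processed at standard rates")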


Prompt caching in the Agent SDK

When building agents with the Anthropic SDK, cache your tool definitions — they're typically large and identical across every step of the agent loop. Place the single cache_control breakpoint on the last tool in the list; the cache covers every tool before it. See Prompt Caching in Claude Agent SDK for a complete agent-loop implementation:

tools = [
    # ... earlier tools ...
    {
        "name": "search",
        "description": "Search the web...",
        "input_schema": {...},
        "cache_control": {"type": "ephemeral"},  # on the LAST tool: caches every tool above it
    },
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    tools=tools,
    messages=messages,
)

In a 10-step agent loop, the tool schemas are written to cache once and read back at cache-read price on the remaining ~9 steps instead of being re-processed at full input price. For a 20-tool schema set, this is significant.
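
A rough sketch of that loop (tool execution elided; the user prompt is a placeholder), assuming the tools list above:

messages = [{"role": "user", "content": "Research Claude pricing and summarize it."}]

for step in range(10):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        tools=tools,      # identical each step → cache hit after step 1
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break
    # ... execute the requested tool calls and append tool_result blocks ...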


Frequently Asked Questions

Does prompt caching work with all Claude models? Yes — Haiku, Sonnet, and Opus all support cache_control. Minimum token thresholds differ (1,024 for Sonnet/Opus, 2,048 for Haiku models).

Can I cache images or documents (PDFs)? Yes. You can add cache_control to image blocks and document blocks (base64-encoded PDFs) the same way as text. Useful for document analysis pipelines.

Is prompt caching available in the Batch API? Yes. Batch API requests support cache_control. Combined with the 50% Batch API discount, you can achieve very low cost for async document-processing pipelines.

Does streaming affect caching? No. Caching works identically with streaming responses. The usage object in the final message_delta event shows cache statistics.

What happens when the cache expires? The next request pays the cache-write premium again to re-populate the cache. You don't need to handle expiry explicitly — it's transparent.


Summary: the implementation checklist

1. Identify stable prompt prefixes — system prompts, documents, tool schemas — that exceed the minimum cacheable length.
2. Add "cache_control": {"type": "ephemeral"} to the last block of that stable prefix.
3. Keep dynamic content (timestamps, user-specific data) out of cached blocks.
4. Confirm cache hits via cache_read_input_tokens in the usage object.
5. Make sure repeat requests land within the 5-minute TTL (it refreshes on every hit).

This is a 15-minute implementation for a 60–90% reduction in input token costs on high-frequency workloads.


Take It Further

Claude API Cost Optimization Masterclass — 12 production deployments analyzed. Prompt caching is Module 1. Also covers model tiering, Batch API, and token compression. 120-page PDF + 6-sheet Excel cost calculator.

→ Get the Cost Optimization Masterclass — $59

30-day money-back guarantee. Instant download.

AI Disclosure: Drafted with Claude Code; all pricing from Anthropic official pricing page as of April 2026.
