Prompt Caching in Claude Agent SDK: 90% Cost Reduction Guide

How to implement prompt caching in Claude Agent SDK to cut costs by up to 90%. Cache types, TTL, implementation patterns, and real cost calculations.

Prompt caching lets you reuse frequently sent prompt prefixes at 10% of the normal input token price. In agent workflows where every turn re-sends a large system prompt or tool schema, caching typically cuts total input costs by 60–90%. A 10,000-token system prompt re-sent 100 times drops from $3.00 to roughly $0.33 (one cache write at 1.25×, then 99 cache reads at 0.10×), a saving of about $2.67 from a single cache_control block. For the full caching guide with non-agent use cases, see Prompt Caching: The 90% Discount Most Claude Developers Miss.


What Prompt Caching Actually Does

By default, every Claude API call processes the full prompt from scratch. Prompt caching stores designated prefix segments server-side for 5 minutes. Subsequent requests that match the same prefix pay the cache read price instead of the full input price.

The price split for Claude Sonnet 3.7 (as of April 2026):

Token type Price per million tokens
Normal input $3.00
Cache write $3.75 (1.25× — first time only)
Cache read $0.30 (0.10× — every hit after)
Output $15.00 (unchanged)

The 5-minute TTL resets on every cache hit, so an active agent loop that calls Claude every 30–60 seconds will keep the cache warm indefinitely.
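
The write premium pays for itself as soon as a prefix is reused once. Here is a minimal back-of-the-envelope sketch in Python; the prices come from the table above, and the function is illustrative, not part of any SDK:

# Sonnet prices per million tokens (from the table above)
INPUT_PRICE = 3.00
CACHE_WRITE_PRICE = 3.75   # 1.25x, paid once per cold cache
CACHE_READ_PRICE = 0.30    # 0.10x, paid on every subsequent hit

def prefix_input_cost(prefix_tokens: int, calls: int, cached: bool) -> float:
    """Dollar cost of re-sending the same prefix across `calls` requests."""
    if not cached:
        return prefix_tokens * calls * INPUT_PRICE / 1_000_000
    # One cache write on the first call, cache reads on the rest
    return prefix_tokens * (CACHE_WRITE_PRICE + (calls - 1) * CACHE_READ_PRICE) / 1_000_000

# The 10,000-token system prompt from the intro, re-sent 100 times
print(prefix_input_cost(10_000, 100, cached=False))  # 3.0
print(prefix_input_cost(10_000, 100, cached=True))   # ~0.33
# Break-even is the second call: 1.25x + 0.10x is already cheaper than 2 x 1.0x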


What Is Cacheable

Not everything can be cached. The content must appear at a prefix boundary: everything before the cache_control marker must be identical across requests. Cacheable content includes:

- System prompts
- Tool definitions
- Conversation history (the messages array up to the cache point)
- Large static documents loaded at the start of a session (e.g., a codebase or reference doc)

Dynamic per-turn content (the user's actual message, live data fetched per request) cannot be cached because the prefix won't match.


The cache_control Block

Caching is opt-in. You mark where the cacheable prefix ends by adding a cache_control object with type: "ephemeral" to the last content block in that prefix.

Caching a System Prompt

import anthropic

client = anthropic.Anthropic()

# System prompt cached on first call, read from cache on subsequent calls
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are a senior software engineer assistant with deep expertise 
            in Python, distributed systems, and API design. You follow these principles:
            
            1. Always explain your reasoning before writing code
            2. Prefer explicit over implicit
            3. Write tests alongside implementation
            4. Flag security concerns proactively
            
            [... 3,000 more tokens of system prompt ...]""",
            "cache_control": {"type": "ephemeral"}  # Mark end of cacheable prefix
        }
    ],
    messages=[
        {"role": "user", "content": "How do I implement rate limiting in FastAPI?"}
    ]
)

# Check if cache was used
print(response.usage)
# Usage(input_tokens=12, cache_creation_input_tokens=3105, cache_read_input_tokens=0, output_tokens=...)
# Second call: cache_read_input_tokens=3105, cache_creation_input_tokens=0

On the first call you see cache_creation_input_tokens — you paid 1.25× for the write. On every subsequent call within 5 minutes you see cache_read_input_tokens — you pay 0.10×.
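
It's worth logging these fields on every call while you're developing. A small helper along these lines (our own convenience function, not part of the SDK) makes a silent cache miss obvious:

def log_cache_usage(response) -> None:
    """Print whether a Messages API response hit, wrote, or skipped the cache."""
    usage = response.usage
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read:
        print(f"cache HIT: {read} tokens read, {usage.input_tokens} uncached input tokens")
    elif written:
        print(f"cache WRITE: {written} tokens written, {usage.input_tokens} uncached input tokens")
    else:
        print(f"no caching: {usage.input_tokens} input tokens at full price")

log_cache_usage(response)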

Caching Tool Definitions

Tool schemas are often 200–500 tokens each. With 8 tools that's 1,600–4,000 tokens re-sent per turn. Cache them:

tools = [
    {
        "name": "search_web",
        "description": "Search the web for current information...",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "num_results": {"type": "integer", "default": 5}
            },
            "required": ["query"]
        }
    },
    # ... 7 more tools ...
    {
        "name": "write_file",
        "description": "Write content to a file...",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"}
            },
            "required": ["path", "content"]
        },
        # Add cache_control to the LAST tool to cache all preceding tools too
        "cache_control": {"type": "ephemeral"}
    }
]

The cache_control on the last tool definition causes the entire tool list prefix (all 8 tools) to be cached as a unit.
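
In practice you'll usually combine this with the cached system prompt from the earlier example; each marker counts as one cache breakpoint. A sketch of the combined request, reusing the client and tools defined above (system_prompt_text stands in for your real long prompt):

system_prompt_text = "You are a research agent. ..."  # placeholder for a long, static prompt

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=tools,  # the last tool carries cache_control, caching the whole tool list
    system=[
        {
            "type": "text",
            "text": system_prompt_text,
            "cache_control": {"type": "ephemeral"}  # second breakpoint: tools + system cached
        }
    ],
    messages=[
        {"role": "user", "content": "Find recent FastAPI rate-limiting libraries."}
    ]
)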


Multi-Turn Conversation Caching Pattern

In an agentic loop, the conversation history grows with every turn. A naive implementation re-sends the entire history at full price each time. With caching, you can mark the last assistant message in the history as cacheable, moving the marker forward on each turn so you stay within the breakpoint limit, and the growing context stays cheap:

import anthropic

client = anthropic.Anthropic()

def run_agent_loop(system_prompt: str, initial_message: str, max_turns: int = 20):
    messages = []
    
    # Build initial message
    messages.append({
        "role": "user",
        "content": initial_message
    })
    
    for turn in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            system=[
                {
                    "type": "text",
                    "text": system_prompt,
                    "cache_control": {"type": "ephemeral"}  # System prompt cached
                }
            ],
            messages=messages
        )
        
        assistant_message = response.content

        # Move the history cache point forward: strip cache_control from blocks
        # added on earlier turns so the request stays within the breakpoint limit
        for prior in messages:
            if isinstance(prior["content"], list):
                for prior_block in prior["content"]:
                    if isinstance(prior_block, dict):
                        prior_block.pop("cache_control", None)

        # Add assistant response to history, marking its last block cacheable
        messages.append({
            "role": "assistant",
            "content": [
                {
                    **block.model_dump(),
                    # Cache the last block so the whole history prefix is cached
                    **({"cache_control": {"type": "ephemeral"}}
                       if i == len(assistant_message) - 1 else {})
                }
                for i, block in enumerate(assistant_message)
            ]
        })
        
        # Check for stop condition
        if response.stop_reason == "end_turn":
            break
            
        # Get next user input (tool results, user follow-up, etc.)
        next_input = handle_tool_calls_or_get_user_input(response)
        messages.append({"role": "user", "content": next_input})
    
    return messages, response


def handle_tool_calls_or_get_user_input(response):
    # Your tool execution logic here
    return "continue"

This pattern means each turn only pays full price for the new tokens added since the last cache point. In a 20-turn research loop, turns 2–20 each save the cost of re-reading turns 1–(n-1).


Cost Calculation: Before vs After

Here's a realistic agent scenario: a research agent with a 2,000-token system prompt, 6 tools (1,500 tokens total), running 50 turns in a session.

Without caching:

Per turn: (2,000 system + 1,500 tools + avg 2,000 history) = 5,500 input tokens
50 turns × 5,500 = 275,000 input tokens
Cost: 275,000 / 1,000,000 × $3.00 = $0.825 input cost

With caching (system + tools cached, history cached from turn 2):

Turn 1 (cache write):
  Cache write: 3,500 tokens × $3.75/M ≈ $0.013
  Non-cached: initial user message only (negligible)

Turns 2–50 (cache reads):
  Cache read (system + tools + cached history, avg 5,500 tokens): ≈ $0.0016 per turn
  New history tokens only: avg 200 new tokens × $3.00/M ≈ $0.0006 per turn
  49 turns × ≈ $0.0022 ≈ $0.11

Total input cost: $0.013 + $0.11 ≈ $0.12

Savings: $0.825 → ~$0.12, roughly an 85% reduction in input cost. Output token costs are unchanged, so total session savings depend on your output/input ratio, but for context-heavy agents the improvement is substantial.
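
The same arithmetic as a short script, so you can plug in your own prompt sizes and turn counts (the token figures are the assumptions above, not measurements):

INPUT_PRICE, CACHE_WRITE_PRICE, CACHE_READ_PRICE = 3.00, 3.75, 0.30  # $ per million tokens

static_prefix = 2_000 + 1_500   # system prompt + tool definitions
avg_history = 2_000             # average conversation history per turn
new_per_turn = 200              # new tokens added each turn
turns = 50

without_cache = turns * (static_prefix + avg_history) * INPUT_PRICE / 1e6
with_cache = (
    static_prefix * CACHE_WRITE_PRICE / 1e6                                  # turn 1 write
    + (turns - 1) * (static_prefix + avg_history) * CACHE_READ_PRICE / 1e6   # turns 2-50 reads
    + (turns - 1) * new_per_turn * INPUT_PRICE / 1e6                         # new tokens, full price
)
print(f"${without_cache:.3f} -> ${with_cache:.3f} ({1 - with_cache / without_cache:.0%} saved)")
# $0.825 -> $0.123 (85% saved)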


Common Mistakes

1. Placing cache_control too early. The marker must be at the end of the static prefix. If you put it in the middle of your system prompt and the second half is dynamic, you'll get cache misses because the full prefix hash won't match.

2. Varying content before the cache point. Injecting a timestamp, session ID, or any per-request variable before the cache_control marker breaks caching entirely. Keep everything dynamic after the marker (the sketch at the end of this list shows the difference).

3. Not checking usage in development. Always log response.usage during development. If cache_read_input_tokens is always 0, your cache is never hitting — debug before scaling.

4. Forgetting the 5-minute TTL. If your agent has long pauses (waiting for user input, overnight runs), the cache will expire. For nightly batch jobs, every run is effectively a cache write. Prompt caching pays off most in continuous, high-frequency loops.

5. Beta header confusion. Prompt caching is generally available for Claude 3.5 and later models. You no longer need the anthropic-beta: prompt-caching-2024-07-31 header for current models — check the docs for your specific model version.

6. Caching dynamic RAG context. If your retrieved documents change per query, you can't cache them. Only cache documents that are static for the session (e.g., a codebase you loaded at session start).
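
To make mistakes 2 and 6 concrete, here is a minimal before/after sketch. The names (STATIC_INSTRUCTIONS, user_question) are illustrative; the point is simply that anything that changes per request belongs after the last cache_control marker:

from datetime import datetime, timezone

STATIC_INSTRUCTIONS = "You are a research agent. ..."  # long, unchanging prompt
user_question = "Summarize today's changes."
now = datetime.now(timezone.utc).isoformat()

# BAD: the timestamp lives inside the cached prefix, so the prefix differs on
# every request and the cache never hits.
bad_system = [
    {
        "type": "text",
        "text": f"Current time: {now}\n\n{STATIC_INSTRUCTIONS}",
        "cache_control": {"type": "ephemeral"}
    }
]

# GOOD: the static instructions are cached; the timestamp travels in the user
# turn, after the cache point, where it is free to change.
good_system = [
    {
        "type": "text",
        "text": STATIC_INSTRUCTIONS,
        "cache_control": {"type": "ephemeral"}
    }
]
good_messages = [
    {"role": "user", "content": f"Current time: {now}\n\n{user_question}"}
]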


Minimum Token Threshold

Claude enforces a minimum cacheable segment size: 1,024 tokens on Sonnet and Opus models, and 2,048 tokens on Haiku models. Prompts below the minimum are simply not cached; the request still succeeds, you just pay the normal input price.

If your system prompt is under 1,024 tokens, you won't get caching — expand it with more detailed instructions, few-shot examples, or move to a different optimization strategy.
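
If you're not sure whether your prefix clears the bar, count it before relying on caching. A sketch, assuming the client.messages.count_tokens method available in recent versions of the Python SDK (verify the method on your installed version):

import anthropic

client = anthropic.Anthropic()
long_system_prompt = "..."  # your real, static system prompt text

count = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    system=[{"type": "text", "text": long_system_prompt}],
    messages=[{"role": "user", "content": "placeholder"}],
)
# input_tokens covers the whole request; if it barely clears 1,024, the
# cacheable prefix alone is probably still under the threshold.
print(count.input_tokens)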


Quick Implementation Checklist

- Add "cache_control": {"type": "ephemeral"} to the last block of your system prompt
- Add cache_control to the last tool definition so the whole tool list is cached as one prefix
- Keep anything dynamic (timestamps, session IDs, user messages, per-request data) after the cache points
- Make sure the cached prefix clears the minimum size (1,024 tokens on Sonnet and Opus)
- Log response.usage and confirm cache_read_input_tokens is non-zero from the second call onward
- Keep the loop active: the 5-minute TTL resets on every cache hit


FAQ

Does prompt caching work with streaming? Yes. Streaming and prompt caching are fully compatible. Usage statistics (including cache metrics) are returned in the final message_delta event with usage data.
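
A minimal sketch with the Python SDK's streaming helper, reusing the client from earlier (the prompt text is a placeholder):

cached_system = [
    {
        "type": "text",
        "text": "You are a concise senior engineer. ...",  # long, static prompt
        "cache_control": {"type": "ephemeral"}
    }
]

with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=cached_system,
    messages=[{"role": "user", "content": "How do I paginate a REST API?"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    final_message = stream.get_final_message()

print(final_message.usage)  # includes cache_creation_input_tokens and cache_read_input_tokens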

Can I cache across different users? Cache keys are scoped per API key and model. Two different users making requests with the same API key and identical prefixes will share the cache. This is safe for system prompts and tool definitions, but be careful not to accidentally cache user-specific data in the static prefix.

What happens when the cache expires? After 5 minutes of no hits, the cached segment is evicted. The next request pays the cache write price again (1.25×) to re-populate it. In practice this is barely noticeable — you pay one write cost per 5-minute window.

Does caching work with extended thinking (claude-3-7-sonnet)? Yes. Prompt caching and extended thinking can be used together. Cache your system prompt and tools as normal; the thinking tokens themselves are output tokens and are not cacheable.

Is there a limit to how many cache points I can have? You can have up to 4 cache breakpoints per request. Use them strategically: one for the system prompt, one for tool definitions, one for a large injected document, and one for conversation history. For the break-even analysis on when caching pays off, see Claude API Cost and Prompt Caching Break-Even.


→ Get Agent SDK Cookbook — $49

The cookbook includes 40+ production-ready agent patterns with prompt caching pre-wired, cost tracking built in, and worked examples for research agents, coding agents, and customer support pipelines.

AI Disclosure: Drafted with Claude Code; all pricing and feature details from official documentation as of April 2026.
