Prompt Caching in Claude Agent SDK: 90% Cost Reduction Guide
Prompt caching lets you reuse frequently sent prompt prefixes at 10% of the normal input-token price. In agent workflows where every turn re-sends a large system prompt or tool schema, caching typically cuts total input costs by 60–90%. A 10,000-token system prompt re-sent 100 times drops from $3.00 to roughly $0.33 (one cache write at 1.25×, then 99 reads at 0.10×, Sonnet 3.7 pricing), a saving of about $2.67 from a single cache_control block. For the full caching guide with non-agent use cases, see Prompt Caching: The 90% Discount Most Claude Developers Miss.
What Prompt Caching Actually Does
Every Claude API call processes your full prompt from scratch by default. Prompt caching stores designated prefix segments server-side for 5 minutes. Subsequent requests with an identical prefix pay the cache-read price instead of the full input price.
The price split for Claude Sonnet 3.7 (as of April 2026):
| Token type | Price per million tokens |
|---|---|
| Normal input | $3.00 |
| Cache write | $3.75 (1.25× — first time only) |
| Cache read | $0.30 (0.10× — every hit after) |
| Output | $15.00 (unchanged) |
The 5-minute TTL resets on every cache hit, so an active agent loop that calls Claude every 30–60 seconds will keep the cache warm indefinitely.
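The break-even arithmetic follows directly from the table above. A minimal, self-contained sketch (prices are the per-million-token Sonnet rates listed; `input_cost` is an illustrative helper, not an SDK function):

```python
# Per-million-token prices from the table above.
INPUT, CACHE_WRITE, CACHE_READ = 3.00, 3.75, 0.30

def input_cost(prefix_tokens: int, calls: int, cached: bool) -> float:
    """Dollar cost of sending the same prefix `calls` times within the TTL."""
    if not cached:
        return prefix_tokens * calls * INPUT / 1_000_000
    # First call pays the 1.25x write premium; every later call is a 0.10x read.
    write = prefix_tokens * CACHE_WRITE / 1_000_000
    reads = prefix_tokens * (calls - 1) * CACHE_READ / 1_000_000
    return write + reads

# The 10,000-token system prompt from the intro, sent 100 times:
print(input_cost(10_000, 100, cached=False))            # 3.0
print(round(input_cost(10_000, 100, cached=True), 2))   # 0.33
```

Note the cached total includes the one-time write premium, which is why the real figure lands slightly above a pure 90% discount.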
What Is Cacheable
Not everything can be cached. The content must appear at a prefix boundary — meaning everything before the cache_control marker must be identical across requests. Cacheable content includes:
- System prompts — the most common use case; often 500–5,000 tokens
- Tool definitions — tool schemas can be large, especially if you have 10+ tools with detailed descriptions
- Long documents or RAG context — if you inject a reference document that doesn't change between turns, cache it
- Few-shot examples — static examples in the user turn can be cached if they appear before the dynamic part
Dynamic per-turn content (the user's actual message, live data fetched per request) cannot be cached because the prefix won't match.
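For the few-shot case, ordering is everything: the static examples come first and carry the marker, and the dynamic question follows. A minimal sketch (`FEW_SHOT` and `build_user_content` are illustrative names, not SDK API):

```python
# Static few-shot examples shared by every request. Must be identical across
# calls (and long enough to clear the minimum token threshold) to cache.
FEW_SHOT = """Example 1: <input> -> <ideal answer>
Example 2: <input> -> <ideal answer>
[... more static examples ...]"""

def build_user_content(question: str) -> list[dict]:
    return [
        {
            "type": "text",
            "text": FEW_SHOT,
            "cache_control": {"type": "ephemeral"},  # end of cacheable prefix
        },
        # Dynamic content only ever appears after the marker.
        {"type": "text", "text": question},
    ]
```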
The cache_control Block
Caching is opt-in. You mark where the cacheable prefix ends by adding a cache_control object with type: "ephemeral" to the last content block in that prefix.
Caching a System Prompt
```python
import anthropic

client = anthropic.Anthropic()

# System prompt cached on first call, read from cache on subsequent calls
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are a senior software engineer assistant with deep expertise
in Python, distributed systems, and API design. You follow these principles:
1. Always explain your reasoning before writing code
2. Prefer explicit over implicit
3. Write tests alongside implementation
4. Flag security concerns proactively
[... 3,000 more tokens of system prompt ...]""",
            "cache_control": {"type": "ephemeral"},  # Mark end of cacheable prefix
        }
    ],
    messages=[
        {"role": "user", "content": "How do I implement rate limiting in FastAPI?"}
    ],
)

# Check if cache was used
print(response.usage)
# Usage(input_tokens=12, cache_creation_input_tokens=3105, cache_read_input_tokens=0, output_tokens=...)
# Second call: cache_read_input_tokens=3105, cache_creation_input_tokens=0
```
On the first call you see cache_creation_input_tokens — you paid 1.25× for the write. On every subsequent call within 5 minutes you see cache_read_input_tokens — you pay 0.10×.
Caching Tool Definitions
Tool schemas are often 200–500 tokens each. With 8 tools that's 1,600–4,000 tokens re-sent per turn. Cache them:
```python
tools = [
    {
        "name": "search_web",
        "description": "Search the web for current information...",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "num_results": {"type": "integer", "default": 5},
            },
            "required": ["query"],
        },
    },
    # ... 7 more tools ...
    {
        "name": "write_file",
        "description": "Write content to a file...",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["path", "content"],
        },
        # Add cache_control to the LAST tool to cache all preceding tools too
        "cache_control": {"type": "ephemeral"},
    },
]
```
The cache_control on the last tool definition causes the entire tool list prefix (all 8 tools) to be cached as a unit.
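One way to keep that invariant is a tiny helper that always pins the marker to the final tool (a sketch; `mark_tools_cacheable` is a hypothetical name, not an SDK function):

```python
# Ensure exactly one cache_control marker, on the final tool, so the whole
# tool-list prefix is cached as a single unit.
def mark_tools_cacheable(tools: list[dict]) -> list[dict]:
    tools = [dict(t) for t in tools]   # shallow-copy each tool dict
    for t in tools:
        t.pop("cache_control", None)   # clear any stale markers
    tools[-1]["cache_control"] = {"type": "ephemeral"}
    return tools
```

Call it once, right before `client.messages.create(...)`, so refactors that reorder the tool list can't silently move the marker off the end.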
Multi-Turn Conversation Caching Pattern
In an agentic loop, the conversation history grows with every turn. A naive implementation re-sends the entire history at full price each time. With caching, you can mark the last assistant message in the history as cacheable on each turn, moving the breakpoint forward as the conversation grows, so the accumulated context is read at the cache price:
```python
import anthropic

client = anthropic.Anthropic()

def run_agent_loop(system_prompt: str, initial_message: str, max_turns: int = 20):
    messages = [{"role": "user", "content": initial_message}]

    for turn in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            system=[
                {
                    "type": "text",
                    "text": system_prompt,
                    "cache_control": {"type": "ephemeral"},  # system prompt cached
                }
            ],
            messages=messages,
        )

        # Move the history breakpoint forward: clear markers from older
        # messages so the request never exceeds the 4-breakpoint limit.
        for msg in messages:
            if isinstance(msg["content"], list):
                for block in msg["content"]:
                    if isinstance(block, dict):
                        block.pop("cache_control", None)

        # Add the assistant response to history, marking its last block
        # cacheable so the whole history prefix is cached for the next turn.
        assistant_blocks = [block.model_dump() for block in response.content]
        assistant_blocks[-1]["cache_control"] = {"type": "ephemeral"}
        messages.append({"role": "assistant", "content": assistant_blocks})

        # Check for stop condition
        if response.stop_reason == "end_turn":
            break

        # Get next user input (tool results, user follow-up, etc.)
        next_input = handle_tool_calls_or_get_user_input(response)
        messages.append({"role": "user", "content": next_input})

    return messages, response

def handle_tool_calls_or_get_user_input(response):
    # Your tool execution logic here
    return "continue"
```
This pattern means each turn only pays full price for the new tokens added since the last cache point. In a 20-turn research loop, turns 2–20 each save the cost of re-reading turns 1–(n-1).
Cost Calculation: Before vs After
Here's a realistic agent scenario: a research agent with a 2,000-token system prompt, 6 tools (1,500 tokens total), running 50 turns in a session.
Without caching:

- Per turn: 2,000 system + 1,500 tools + ~2,000 avg history = 5,500 input tokens
- 50 turns × 5,500 = 275,000 input tokens
- Cost: 275,000 / 1,000,000 × $3.00 = $0.825 input cost

With caching (system + tools cached; history cached from turn 2):

- Turn 1 (cache write): 3,500 tokens × $3.75/M ≈ $0.013, no non-cached tokens
- Turns 2–50 (cache reads): 3,500 tokens × $0.30/M ≈ $0.001 per turn
- New history tokens only: ~200 new tokens × $3.00/M ≈ $0.0006 per turn
- Turns 2–50 total: 49 × ≈$0.0016 ≈ $0.078
- Total input cost: $0.013 + $0.078 ≈ $0.091
Savings: $0.825 → $0.091 = 89% reduction. Output token costs are unchanged, so total session savings depend on your output/input ratio, but for context-heavy agents the improvement is substantial.
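The same arithmetic, reproduced as runnable code. Prices are per million tokens; the token counts are this scenario's assumptions, and the small difference from the $0.091 above comes from rounding order:

```python
# Per-million-token prices and the scenario's token counts.
INPUT, CACHE_WRITE, CACHE_READ = 3.00, 3.75, 0.30
STATIC, AVG_HISTORY, NEW_PER_TURN, TURNS = 3_500, 2_000, 200, 50

uncached = TURNS * (STATIC + AVG_HISTORY) * INPUT / 1e6

cached = (
    STATIC * CACHE_WRITE / 1e6                  # turn 1: one-time write premium
    + (TURNS - 1) * STATIC * CACHE_READ / 1e6   # turns 2-50: prefix reads
    + (TURNS - 1) * NEW_PER_TURN * INPUT / 1e6  # turns 2-50: new tokens, full price
)

print(f"${uncached:.3f} -> ${cached:.3f} ({1 - cached / uncached:.0%} saved)")
# $0.825 -> $0.094 (89% saved)
```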
Common Mistakes
1. Placing cache_control too early. The marker must be at the end of the static prefix. If you put it in the middle of your system prompt and the second half is dynamic, you'll get cache misses because the full prefix hash won't match.
2. Varying content before the cache point. Injecting a timestamp, session ID, or any per-request variable before the cache_control marker breaks caching entirely. Keep everything dynamic after the marker.
3. Not checking usage in development. Always log response.usage during development. If cache_read_input_tokens is always 0, your cache is never hitting — debug before scaling.
4. Forgetting the 5-minute TTL. If your agent has long pauses (waiting for user input, overnight runs), the cache will expire. For nightly batch jobs, every run is effectively a cache write. Prompt caching pays off most in continuous, high-frequency loops.
5. Beta-header confusion. Prompt caching is generally available for Claude 3.5 and later models. You no longer need the anthropic-beta: prompt-caching-2024-07-31 header for current models; check the docs for your specific model version.
6. Caching dynamic RAG context. If your retrieved documents change per query, you can't cache them. Only cache documents that are static for the session (e.g., a codebase you loaded at session start).
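Mistake #2 is the one to internalize: cache lookup requires an exact prefix match. Simulating the key with a hash makes the failure mode visible (the hash is an illustration of the matching rule, not Anthropic's actual mechanism):

```python
import hashlib

# Stand-in for the server's prefix matching: identical text, identical key.
def prefix_key(static_text: str) -> str:
    return hashlib.sha256(static_text.encode()).hexdigest()

# A timestamp before the marker changes the prefix on every request:
bad_1 = prefix_key("Request at 2026-04-01T09:00:00Z\nYou are a helpful agent.")
bad_2 = prefix_key("Request at 2026-04-01T09:00:30Z\nYou are a helpful agent.")
assert bad_1 != bad_2  # every request is a cache miss

# Timestamp moved after the marker: the cached prefix stays stable.
good = prefix_key("You are a helpful agent.")
assert good == prefix_key("You are a helpful agent.")  # stable -> cache hits
```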
Minimum Token Threshold
Claude enforces a minimum cacheable segment size:
- Sonnet and Opus models: 1,024 tokens minimum
- Haiku models (Claude 3 Haiku, Claude 3.5 Haiku): 2,048 tokens minimum
If your system prompt is under 1,024 tokens, you won't get caching — expand it with more detailed instructions, few-shot examples, or move to a different optimization strategy.
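A crude pre-check catches this before you wire up caching. The 4-characters-per-token ratio is a rough heuristic for English text; for exact counts use the API's token-counting endpoint:

```python
# Rough estimate of whether a prompt clears the minimum cacheable size.
# ~4 characters per token is a heuristic, not an exact tokenizer.
def likely_cacheable(text: str, minimum_tokens: int = 1024) -> bool:
    return len(text) / 4 >= minimum_tokens
```

If this returns False for your system prompt, caching will silently do nothing: the request succeeds, but cache_creation_input_tokens stays 0.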
Quick Implementation Checklist
- Identify your largest static prefix (usually system prompt + tools)
- Confirm it's over 1,024 tokens
- Add `"cache_control": {"type": "ephemeral"}` to the last block of the static prefix
- Log `response.usage` and verify `cache_read_input_tokens > 0` on the second call
- For multi-turn agents, also cache the growing message history
- Monitor cache hit rate in production; target >80% for cost-heavy workloads
FAQ
Does prompt caching work with streaming?
Yes. Streaming and prompt caching are fully compatible. In a streamed response, the cache metrics (cache_creation_input_tokens, cache_read_input_tokens) arrive in the usage block of the initial message_start event; the final message_delta event carries the output token count.
Can I cache across different users? Cache keys are scoped to your organization and model; no other organization can ever read your cache. Requests from the same organization with identical prefixes share the cache, which is safe for system prompts and tool definitions, but be careful not to accidentally put user-specific data in the static prefix.
What happens when the cache expires? After 5 minutes of no hits, the cached segment is evicted. The next request pays the cache write price again (1.25×) to re-populate it. In practice this is barely noticeable — you pay one write cost per 5-minute window.
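The TTL rule from earlier makes this easy to reason about in code. A minimal sketch (`cache_is_warm` is an illustrative helper; the server tracks this for you):

```python
from datetime import datetime, timedelta

# The 5-minute TTL resets on every hit, so warmth depends only on the
# time elapsed since the last hit (or the initial write).
TTL = timedelta(minutes=5)

def cache_is_warm(last_hit: datetime, now: datetime) -> bool:
    return now - last_hit < TTL
```

If this would be False when your next call fires (long user pauses, nightly jobs), budget for another 1.25× write rather than a 0.10× read.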
Does caching work with extended thinking (claude-3-7-sonnet)? Yes. Prompt caching and extended thinking can be used together. Cache your system prompt and tools as normal; the thinking tokens themselves are output tokens and are not cacheable.
Is there a limit to how many cache points I can have? You can have up to 4 cache breakpoints per request. Use them strategically: one for the system prompt, one for tool definitions, one for a large injected document, and one for conversation history. For the break-even analysis on when caching pays off, see Claude API Cost and Prompt Caching Break-Even.
Sources
- Anthropic Prompt Caching Documentation
- Claude API Pricing — Anthropic
- Claude Agent SDK Reference
- Messages API Reference — cache_control
→ Get Agent SDK Cookbook — $49
The cookbook includes 40+ production-ready agent patterns with prompt caching pre-wired, cost tracking built in, and worked examples for research agents, coding agents, and customer support pipelines.