Prompt Caching: The 90% Discount Most Claude Developers Miss
Prompt caching cuts the price of repeated prompt content by up to 90%: cache reads are billed at one-tenth the standard input rate. It works by storing a prefix of your prompt on Anthropic's servers so that subsequent requests reuse the cached version instead of re-processing the full text. For any application that sends the same system prompt or context repeatedly — agents, chatbots, document Q&A systems — caching is the single highest-ROI optimisation available. For the full cost break-even analysis, see Claude API Cost and Prompt Caching Break-Even.
Most developers still don't use it, even though enabling it is often a one-line change to the API call. This guide shows exactly what to add.
How prompt caching works
When you include "cache_control": {"type": "ephemeral"} on a content block, Anthropic stores the prompt prefix up to and including that block in a temporary cache on their servers. The cache entry:
- Lives for 5 minutes (TTL resets each time it's hit)
- Is scoped to your API key — no other users share your cache
- Requires a minimum size of 1,024 tokens (Sonnet/Opus) or 2,048 tokens (Haiku); shorter blocks are simply not cached
On the first request (cache write), you pay a premium: 1.25× the standard input price. On every subsequent request that hits the cache (cache read), you pay only 0.1× the standard input price — a 90% discount.
Pricing breakdown (April 2026)
For claude-sonnet-4-6:
| Token type | Price per 1M tokens |
|---|---|
| Standard input | $3.00 |
| Cache write | $3.75 (1.25×) |
| Cache read | $0.30 (0.1×) |
| Output | $15.00 |
Break-even analysis: the cache-write premium is only 0.25× the standard input price, while each cache read saves 0.90×, so you come out ahead on the very first cache hit. Any application that sends the same content at least twice within a 5-minute window is leaving money on the table.
For longer documents like a 50,000-token codebase or legal contract, the math becomes dramatic:
Without caching (100 reads): 50,000 tokens × 100 × $3.00/M = $15.00
With caching (1 write + 99 reads): (50,000 × $3.75/M) + (50,000 × 99 × $0.30/M) = $0.1875 + $1.485 ≈ $1.67
That's an 89% reduction for high-frequency document access.
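If you want to reproduce these figures, a few lines of arithmetic cover it (prices hard-coded from the Sonnet table above; adjust for other models):

    # Per-million-token prices from the table above.
    STANDARD_INPUT = 3.00
    CACHE_WRITE = 3.75   # 1.25x standard input
    CACHE_READ = 0.30    # 0.1x standard input

    def cost_without_cache(tokens: int, reads: int) -> float:
        return tokens * reads * STANDARD_INPUT / 1_000_000

    def cost_with_cache(tokens: int, reads: int) -> float:
        # First request writes the cache; the remaining reads hit it within the TTL.
        return (tokens * CACHE_WRITE + tokens * (reads - 1) * CACHE_READ) / 1_000_000

    print(cost_without_cache(50_000, 100))   # 15.0
    print(cost_with_cache(50_000, 100))      # ~1.67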
When to use prompt caching
Prompt caching is most valuable when:
| Scenario | Cache target | Expected savings |
|---|---|---|
| Chatbot with long system prompt | System message | 60–80% |
| Document Q&A (same doc, many questions) | Document content | 80–90% |
| Agent with fixed tool definitions | Tool schemas | 40–60% |
| Code review bot (same codebase) | Repository context | 85–95% |
| Multi-turn conversations | Growing history | 50–70% |
It has minimal value for:
- Single-turn, one-off completions
- Short prompts under the minimum token threshold
- Requests more than 5 minutes apart
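As a rough rule of thumb, the decision comes down to block size and request spacing. Here's an illustrative check (the threshold and TTL values are the figures quoted above, not values fetched from the API):

    CACHE_TTL_SECONDS = 5 * 60

    def worth_caching(block_tokens: int, min_cacheable: int, seconds_between_requests: float) -> bool:
        # Worth a breakpoint only if the block can be cached at all
        # and the next request will arrive before the cache entry expires.
        return block_tokens >= min_cacheable and seconds_between_requests < CACHE_TTL_SECONDS

    worth_caching(5_000, 1_024, 30)    # True: large system prompt, hit every 30 seconds
    worth_caching(500, 1_024, 30)      # False: below the minimum cacheable size
    worth_caching(5_000, 1_024, 600)   # False: requests arrive after the entry expires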
Implementation: basic example
Add cache_control to a content block; everything up to and including that block is cached. Here's a chatbot with a large system prompt:
import anthropic
client = anthropic.Anthropic()
SYSTEM_PROMPT = """You are a senior Python engineer...
[... 5,000 tokens of instructions, examples, and context ...]"""
def chat(user_message: str, conversation_history: list) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # ← this one line
            }
        ],
        messages=conversation_history + [
            {"role": "user", "content": user_message}
        ],
    )
    return response.content[0].text
The first call pays 1.25× for the system prompt. Every call within the next 5 minutes pays 0.1×. In a busy support bot handling 100 tickets/hour, where the system prompt makes up most of the input, this alone can cut input-token costs by roughly 70%.
Implementation: document Q&A
For document-heavy workloads, cache the document content in the user message:
def answer_question(document: str, question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": document,
                        "cache_control": {"type": "ephemeral"},  # cache the doc
                    },
                    {
                        "type": "text",
                        "text": f"\n\nQuestion: {question}",
                        # no cache_control on the question — it changes each time
                    },
                ],
            }
        ],
    )
    return response.content[0].text
The document is cached for 5 minutes. Ask 10 questions about the same contract in rapid succession and you pay cache-read price for questions 2–10.
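For instance, asking several questions about the same document in one pass (a sketch; contract.txt is a placeholder for your roughly 50,000-token document, loaded once and reused):

    questions = [
        "Who are the parties to this agreement?",
        "What is the termination notice period?",
        "Is there an indemnification clause?",
    ]

    with open("contract.txt") as f:
        contract = f.read()

    for q in questions:
        # Question 1 pays the cache-write premium on the document;
        # questions 2+ pay cache-read price as long as they land within the TTL.
        print(answer_question(contract, q))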
Implementation: multi-turn conversation caching
For long conversations, cache the accumulated history. The trick is to re-send the full history each turn and place a single cache breakpoint on the last message of the existing prefix; everything up to that breakpoint is cached as one unit, while the newest exchange stays uncached:
def continue_conversation(messages: list[dict], new_user_message: str) -> str:
    # Re-send the full prior history with one cache breakpoint on its last message;
    # the whole prefix up to that point is cached as a single unit.
    request_messages = [dict(msg) for msg in messages]
    if request_messages:
        last = request_messages[-1]
        text = last["content"] if isinstance(last["content"], str) else last["content"][0]["text"]
        last["content"] = [
            {
                "type": "text",
                "text": text,
                "cache_control": {"type": "ephemeral"},
            }
        ]
    # Add the new user message without caching
    request_messages.append({"role": "user", "content": new_user_message})
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=request_messages,
    )
    return response.content[0].text
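Used across turns, it might look like this (a minimal sketch; note that nothing is actually cached until the accumulated prefix crosses the minimum token threshold):

    history: list[dict] = []
    for user_msg in [
        "Explain Python's GIL.",
        "How does that affect asyncio?",
        "Show me a concrete example.",
    ]:
        reply = continue_conversation(history, user_msg)
        # Grow the history; the next call will cache everything up to its last message.
        history.append({"role": "user", "content": user_msg})
        history.append({"role": "assistant", "content": reply})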
Verifying cache usage in the response
Check the usage object to confirm caching is working:
response = client.messages.create(...)
print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Uncached input: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
On a cache write: cache_creation_input_tokens > 0, cache_read_input_tokens == 0.
On a cache hit: cache_creation_input_tokens == 0, cache_read_input_tokens > 0.
If you see cache_read_input_tokens == 0 every call, your cache isn't hitting — check that the cached content is identical across requests and above the minimum token threshold.
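If you want to log this on every request, a small helper along these lines works (a sketch; the field names match the usage object shown above, and the hit-rate definition is ours):

    def log_cache_stats(usage) -> None:
        # Summarise how much of this request's input was served from cache.
        cached = usage.cache_read_input_tokens
        written = usage.cache_creation_input_tokens
        uncached = usage.input_tokens
        total_input = cached + written + uncached
        hit_rate = cached / total_input if total_input else 0.0
        print(f"input={total_input} cached={cached} ({hit_rate:.0%}) written={written} uncached={uncached}")

    log_cache_stats(response.usage)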
Common mistakes
Mistake 1: Caching content that changes every request
The cache key is the exact content of the block. If your "system prompt" includes a timestamp or user-specific data, it will never hit the cache. Strip dynamic content out of cached blocks.
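A minimal sketch of the fix: keep the stable instructions in the cached block and put anything dynamic in a separate, uncached block after the breakpoint (STABLE_INSTRUCTIONS and the timestamp are illustrative):

    from datetime import datetime, timezone

    system = [
        {
            "type": "text",
            "text": STABLE_INSTRUCTIONS,  # identical on every request, so it can hit the cache
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            # Changes every call, so it lives outside the cached prefix.
            "text": f"Current time: {datetime.now(timezone.utc).isoformat()}",
        },
    ]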
Mistake 2: Caching content below the minimum threshold
Sonnet and Opus require a minimum of 1,024 tokens; Haiku requires 2,048. A 500-token system prompt simply won't be cached: the cache_control marker is ignored and the block is billed at the standard input rate. Check token counts with client.messages.count_tokens().
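For example, a quick pre-flight check along these lines (a sketch; the placeholder message is only there because the token-counting endpoint requires at least one message):

    count = client.messages.count_tokens(
        model="claude-sonnet-4-6",
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": "placeholder"}],
    )
    if count.input_tokens < 1_024:  # Sonnet/Opus minimum; use 2,048 for Haiku
        print("System prompt is below the minimum; cache_control would be ignored.")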
Mistake 3: Relying on caching for compliance isolation
Cache entries are scoped to your API key, not per-user. Don't cache user PII in shared prompts.
Mistake 4: Assuming the 5-minute TTL is wall-clock
The TTL resets on each cache hit. An actively used cache entry hit every 2 minutes will persist indefinitely; an entry not accessed for 5 minutes expires.
Prompt caching in the Agent SDK
When building agents with the Anthropic SDK, cache your tool definitions — they're typically large and identical across every step of the agent loop. Put the breakpoint on the last tool so the entire tools array is covered. See Prompt Caching in Claude Agent SDK for a complete agent-loop implementation:
tools = [
    {
        "name": "search",
        "description": "Search the web...",
        "input_schema": {...},
    },
    # ... more tools
    {
        "name": "last_tool",  # whichever tool you define last
        "description": "...",
        "input_schema": {...},
        "cache_control": {"type": "ephemeral"},  # on the final tool: caches the entire tools array
    },
]
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=messages,
)
In a 10-step agent loop, the tool schemas are written to the cache on step 1 and billed at cache-read price on steps 2–10, instead of being re-billed at full input price every step. With 20 tools the schemas alone can run to thousands of tokens, so this adds up quickly.
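A stripped-down sketch of that loop, reusing the tools list defined above (execute_tool is a hypothetical dispatcher you'd supply; error handling omitted):

    def run_agent(messages: list[dict], max_steps: int = 10) -> str:
        for _ in range(max_steps):
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                tools=tools,      # identical every step, so the schemas are cache reads after step 1
                messages=messages,
            )
            if response.stop_reason != "tool_use":
                return response.content[0].text  # assume the final turn is plain text
            # Record the assistant turn, run the requested tools, and feed back the results.
            messages.append({"role": "assistant", "content": response.content})
            messages.append({
                "role": "user",
                "content": [
                    {
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": execute_tool(block.name, block.input),
                    }
                    for block in response.content
                    if block.type == "tool_use"
                ],
            })
        return "Stopped after max_steps."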
Frequently Asked Questions
Does prompt caching work with all Claude models?
Yes — Haiku, Sonnet, and Opus all support cache_control. Minimum token thresholds differ (1,024 for Sonnet/Opus, 2,048 for Haiku).
Can I cache images or documents (PDFs)?
Yes. You can add cache_control to image blocks and document blocks (base64-encoded PDFs) the same way as text. Useful for document analysis pipelines.
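For instance, a cached PDF block might look like this (a sketch; contract.pdf and the question are placeholders, and the document still has to clear the minimum token threshold):

    import base64

    with open("contract.pdf", "rb") as f:
        pdf_data = base64.standard_b64encode(f.read()).decode()

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_data},
                    "cache_control": {"type": "ephemeral"},  # cache the document
                },
                {"type": "text", "text": "What is the governing law clause?"},
            ],
        }],
    )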
Is prompt caching available in the Batch API?
Yes. Batch API requests support cache_control. Combined with the 50% Batch API discount, you can achieve very low cost for async document-processing pipelines.
Does streaming affect caching?
No. Caching works identically with streaming responses, and the same usage fields are reported for streamed messages.
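With the Python SDK's streaming helper, the easiest place to read them is the final accumulated message (a sketch, reusing the SYSTEM_PROMPT from earlier):

    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[{"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": "Summarise the key instructions back to me."}],
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
        usage = stream.get_final_message().usage
        print(f"\ncache_read_input_tokens={usage.cache_read_input_tokens}")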
What happens when the cache expires?
The next request pays the cache-write premium again to re-populate the cache. You don't need to handle expiry explicitly — it's transparent.
Summary: the implementation checklist
- Identify your largest, most-repeated prompt block (usually system prompt or document)
- Verify it exceeds the minimum token threshold (1,024 for Sonnet/Opus, 2,048 for Haiku)
- Add "cache_control": {"type": "ephemeral"} to that block
- Verify cache_read_input_tokens > 0 in the usage response after the second call
- Monitor cost savings in your logging/observability system
This is a 15-minute implementation for a 60–90% reduction in input token costs on high-frequency workloads.
Take It Further
Claude API Cost Optimization Masterclass — 12 production deployments analyzed. Prompt caching is Module 1. Also covers model tiering, Batch API, and token compression. 120-page PDF + 6-sheet Excel cost calculator.
→ Get the Cost Optimization Masterclass — $59
30-day money-back guarantee. Instant download.