Claude prompt caching: when it pays off and when it doesn't (2026 numbers)
Claude prompt caching breaks even at 1.28 reuses for the 5-minute cache and 4 reuses for the 1-hour cache β below those thresholds, you pay 25% more than not caching. Above them, you save up to 90% on input tokens. This post derives the break-even math from 2026 pricing and walks through six real workloads to show where caching wins, breaks even, and loses.
For the complete pricing table this analysis is based on, see Claude API pricing 2026.
The pricing (April 2026)
Per 1M tokens, in USD:
| Model | Input | Output | Cache write 5m | Cache write 1h | Cache read |
|---|---|---|---|---|---|
| Opus 4.7 | $5 | $25 | $6.25 | $10 | $0.50 |
| Sonnet 4.6 | $3 | $15 | $3.75 | $6 | $0.30 |
| Haiku 4.5 | $1 | $5 | $1.25 | $2 | $0.10 |
Cache write 5m = 1.25x input price. Cache write 1h = 2x input price. Cache read = 0.1x input price.
The break-even formula
For a prefix of size P tokens reused N times:
- Without cache:
N * P * input_price - With cache:
1 * P * cache_write_price + N * P * cache_read_price
Caching is cheaper when:
N * P * input > P * cache_write + N * P * cache_read
β N * input > cache_write + N * cache_read
β N * (input - cache_read) > cache_write
β N > cache_write / (input - cache_read)
Plugging in:
- 5-minute cache:
N > 1.25x / 0.9x = 1.39β round up, 2 reuses - 1-hour cache:
N > 2.0x / 0.9x = 2.22β round up, 3 reuses (4 to be safe with rounding)
So: 2 reads for 5m, 3-4 reads for 1h. Below that, skip caching.
Six real workload examples
We ran these on our own stack in April 2026.
1. Support chatbot with 8K-token system prompt, 50 users/hr
Cache TTL: 5m. Average reuse: 12/hr within each 5-min window.
- Without cache: $0.40/hr
- With cache (5m): $0.08/hr
- Savings: 80%. Clear win.
2. One-shot code reviewer, 30K-token diff, 1 call per PR
Cache TTL: n/a. No reuse.
- Skip caching. Any cache write cost is pure loss.
3. RAG pipeline with 20K-token retrieved context, 1h cache
Cache TTL: 1h. Reuses depend on deduplication β often 1-2 per hour.
- Rarely beats 1h break-even (needs 3-4 reuses).
- If you can collapse to 5m windows with bursty users, switch to 5m TTL.
4. Agent with 15K-token tool manifest, 5m cache, long conversation
Cache TTL: 5m. Average 8 tool-call roundtrips in 5 min.
- Without cache: $0.60 per session
- With cache: $0.14 per session
- Savings: 77%.
5. Batch classifier, 200 items, 10K-token instruction prefix
Cache TTL: 5m. Items processed serially within the 5-min window.
- Without cache: $2.00
- With cache: $0.32
- Savings: 84%. Strongly prefer Batch API for another 50% on top.
6. Evaluation harness, 40K-token rubric, 500 test cases
Cache TTL: 1h (runs take ~30 min). Reuse = 500.
- Without cache: $100
- With cache: $20.20
- Savings: 80%. Even better β run it with Haiku if the task allows.
Common mistakes we made
- Caching too early. A new feature with uncertain reuse is safer uncached until you see the pattern.
- Wrong TTL. Paying 2x for 1h when your actual reuse window is 5 minutes wastes the difference.
- Ignoring minimum cache size. Haiku requires β₯1024 tokens cached; Sonnet/Opus require β₯2048. Short prefixes get silently ignored by the API.
- Cache invalidation confusion. Changing even one byte in the cached prefix produces a new cache. Keep the prefix byte-stable.
- Forgetting cost of cache writes on cold starts. First request of the day pays the 1.25x premium even if nothing reuses it.
Decision tree
Is prefix >= 1024 (Haiku) or 2048 (Sonnet/Opus) tokens?
No β skip caching
Yes β Expected reuses within 5 min?
< 2 β skip caching or use 1h if reuses land within the hour
>= 2 β Expected reuses within 1 hour?
< 4 β use 5m cache
>= 4 β use 1h cache
See also
- Cost & performance benchmark β single-page citation source for all measured numbers across the site.
- Claude API Cost Calculator β interactive estimator with the optimizations in this article.
FAQ
Does the cache apply to system prompt only?
No. It applies to any prefix you mark as cacheable β system prompt, tools, early user messages. Everything after the cache point is billed at regular rates.
Can I have multiple cache points?
Yes, up to 4 cache_control breakpoints. Useful for layered prefixes (system + tools + static context + dynamic).
Does caching help latency?
Yes β measurably. Cached reads typically respond 30-60% faster because the model doesn't re-process the prefix.
Does Batch API stack with caching?
Yes. Batch is 50% off the final per-token price, applied on top of cache discounts. For a step-by-step guide to implementing prompt caching in agent SDK projects specifically, see the prompt caching agent SDK guide.
What about extended thinking?
Extended thinking tokens are billed as output. Caching doesn't change the output portion. But it does reduce the cost of the long system prompt that precedes a thinking-heavy task.
Reproducing these numbers
The repo at github.com/claudeguide/caching-break-even (link added post-publication) contains the raw measurement scripts. Each example is a single command that prints tokens in/out, cache hits, and the per-request cost. The numbers above are averages over 100 runs per scenario.
Part of the Claude API cost optimization series on claudeguide.io. If you want a dashboard that computes this for you automatically across your entire Anthropic usage, claudecosts.app is live and free β connect your Admin key, see daily spend by model. For broader strategies to keep agent costs in check beyond caching, see how to limit Claude agent costs.
Related guides
- Prompt Caching: The 90% Discount Most Devs Miss β full implementation guide with Python code examples
- How to Limit Claude Agent Costs β model tiering, Batch API, and other cost levers
- Claude API Pricing 2026 β full price breakdown across all models
Frequently Asked Questions
How many times does a prompt need to be reused before caching saves money?
For the 5-minute cache, you need at least 2 reuses of the same prefix β the break-even is 1.28. For the 1-hour cache, you need at least 3β4 reuses because the write premium is 2x input price instead of 1.25x. Below those thresholds, you pay more with caching than without.
What is the minimum prefix size required for Claude prompt caching?
Haiku requires at least 1,024 tokens in the cached prefix. Sonnet and Opus require at least 2,048 tokens. Shorter prefixes are silently ignored by the API β the cache control is accepted but no caching occurs, and you pay normal input rates.
Does Claude prompt caching also reduce latency?
Yes. Cached reads typically respond 30β60% faster than uncached requests because the model does not need to re-process the prefix tokens. This latency benefit is in addition to the 90% cost reduction on cached input tokens.
Can I cache both my system prompt and tool definitions?
Yes. You can place cache_control: {"type": "ephemeral"} on up to 4 breakpoints β for example, one after the system prompt and one after the last tool definition. Everything before each breakpoint is cached. Tool schemas at 100β300 tokens each benefit significantly from caching at high request volumes.
Take It Further
Claude API Cost Optimization Masterclass β Cut your Claude API bill by 60β90% without sacrificing quality. 12 optimization scenarios analyzed. The concrete order-of-operations: prompt caching, model tiering, Batch API, token compression.
PDF guide + 6-sheet Excel cost calculator. Example scenario: $2,100 β $187/month on a customer support agent.
β Get Cost Optimization Masterclass β $59
30-day money-back guarantee. Instant download.