Claude prompt caching: when it pays off and when it doesn't (2026 numbers)

Q: Can I have multiple cache points?

Yes, up to 4 cache_control breakpoints. Useful for layered prefixes (system + tools + static context + dynamic).

Claude prompt caching breaks even at 1.28 reuses for the 5-minute cache and 4 reuses for the 1-hour cache — below those thresholds, you pay 25% more than not caching. Above them, you save up to 90% on input tokens. This post derives the break-even math from 2026 pricing and walks through six real workloads to show where caching wins, breaks even, and loses.

For the complete pricing table this analysis is based on, see Claude API pricing 2026.

The pricing (April 2026)

Per 1M tokens, in USD:

Model	Input	Output	Cache write 5m	Cache write 1h	Cache read
Opus 4.7	$5	$25	$6.25	$10	$0.50
Sonnet 4.6	$3	$15	$3.75	$6	$0.30
Haiku 4.5	$1	$5	$1.25	$2	$0.10

Cache write 5m = 1.25x input price. Cache write 1h = 2x input price. Cache read = 0.1x input price.

The break-even formula

For a prefix of size P tokens reused N times:

Without cache: N * P * input_price
With cache: 1 * P * cache_write_price + N * P * cache_read_price

Caching is cheaper when:

N * P * input > P * cache_write + N * P * cache_read
⇔ N * input > cache_write + N * cache_read
⇔ N * (input - cache_read) > cache_write
⇔ N > cache_write / (input - cache_read)

Plugging in:

5-minute cache: N > 1.25x / 0.9x = 1.39 → round up, 2 reuses
1-hour cache: N > 2.0x / 0.9x = 2.22 → round up, 3 reuses (4 to be safe with rounding)

So: 2 reads for 5m, 3-4 reads for 1h. Below that, skip caching.

Six real workload examples

We ran these on our own stack in April 2026.

1. Support chatbot with 8K-token system prompt, 50 users/hr

Cache TTL: 5m. Average reuse: 12/hr within each 5-min window.

Without cache: $0.40/hr
With cache (5m): $0.08/hr
Savings: 80%. Clear win.

2. One-shot code reviewer, 30K-token diff, 1 call per PR

Cache TTL: n/a. No reuse.

Skip caching. Any cache write cost is pure loss.

3. RAG pipeline with 20K-token retrieved context, 1h cache

Cache TTL: 1h. Reuses depend on deduplication — often 1-2 per hour.

Rarely beats 1h break-even (needs 3-4 reuses).
If you can collapse to 5m windows with bursty users, switch to 5m TTL.

4. Agent with 15K-token tool manifest, 5m cache, long conversation

Cache TTL: 5m. Average 8 tool-call roundtrips in 5 min.

Without cache: $0.60 per session
With cache: $0.14 per session
Savings: 77%.

5. Batch classifier, 200 items, 10K-token instruction prefix

Cache TTL: 5m. Items processed serially within the 5-min window.

Without cache: $2.00
With cache: $0.32
Savings: 84%. Strongly prefer Batch API for another 50% on top.

6. Evaluation harness, 40K-token rubric, 500 test cases

Cache TTL: 1h (runs take ~30 min). Reuse = 500.

Without cache: $100
With cache: $20.20
Savings: 80%. Even better — run it with Haiku if the task allows.

Common mistakes we made

Caching too early. A new feature with uncertain reuse is safer uncached until you see the pattern.
Wrong TTL. Paying 2x for 1h when your actual reuse window is 5 minutes wastes the difference.
Ignoring minimum cache size. Haiku requires ≥1024 tokens cached; Sonnet/Opus require ≥2048. Short prefixes get silently ignored by the API.
Cache invalidation confusion. Changing even one byte in the cached prefix produces a new cache. Keep the prefix byte-stable.
Forgetting cost of cache writes on cold starts. First request of the day pays the 1.25x premium even if nothing reuses it.

Decision tree

Is prefix >= 1024 (Haiku) or 2048 (Sonnet/Opus) tokens?
  No  → skip caching
  Yes → Expected reuses within 5 min?
         < 2  → skip caching or use 1h if reuses land within the hour
         >= 2 → Expected reuses within 1 hour?
                 < 4  → use 5m cache
                 >= 4 → use 1h cache

FAQ

Does the cache apply to system prompt only?

No. It applies to any prefix you mark as cacheable — system prompt, tools, early user messages. Everything after the cache point is billed at regular rates.

Can I have multiple cache points?

Yes, up to 4 cache_control breakpoints. Useful for layered prefixes (system + tools + static context + dynamic).

Does caching help latency?

Yes — measurably. Cached reads typically respond 30-60% faster because the model doesn't re-process the prefix.

Does Batch API stack with caching?

Yes. Batch is 50% off the final per-token price, applied on top of cache discounts. For a step-by-step guide to implementing prompt caching in agent SDK projects specifically, see the prompt caching agent SDK guide.

What about extended thinking?

Extended thinking tokens are billed as output. Caching doesn't change the output portion. But it does reduce the cost of the long system prompt that precedes a thinking-heavy task.

Reproducing these numbers

The repo at github.com/claudeguide/caching-break-even (link added post-publication) contains the raw measurement scripts. Each example is a single command that prints tokens in/out, cache hits, and the per-request cost. The numbers above are averages over 100 runs per scenario.

Part of the Claude API cost optimization series on claudeguide.io. If you want a dashboard that computes this for you automatically across your entire Anthropic usage, claudecosts.app is live and free — connect your Admin key, see daily spend by model. For broader strategies to keep agent costs in check beyond caching, see how to limit Claude agent costs.

Related guides

Prompt Caching: The 90% Discount Most Devs Miss — full implementation guide with Python code examples
How to Limit Claude Agent Costs — model tiering, Batch API, and other cost levers
Claude API Pricing 2026 — full price breakdown across all models

Frequently Asked Questions

How many times does a prompt need to be reused before caching saves money?

For the 5-minute cache, you need at least 2 reuses of the same prefix — the break-even is 1.28. For the 1-hour cache, you need at least 3–4 reuses because the write premium is 2x input price instead of 1.25x. Below those thresholds, you pay more with caching than without.

What is the minimum prefix size required for Claude prompt caching?

Haiku requires at least 1,024 tokens in the cached prefix. Sonnet and Opus require at least 2,048 tokens. Shorter prefixes are silently ignored by the API — the cache control is accepted but no caching occurs, and you pay normal input rates.

Does Claude prompt caching also reduce latency?

Yes. Cached reads typically respond 30–60% faster than uncached requests because the model does not need to re-process the prefix tokens. This latency benefit is in addition to the 90% cost reduction on cached input tokens.

Can I cache both my system prompt and tool definitions?

Yes. You can place cache_control: {"type": "ephemeral"} on up to 4 breakpoints — for example, one after the system prompt and one after the last tool definition. Everything before each breakpoint is cached. Tool schemas at 100–300 tokens each benefit significantly from caching at high request volumes.

Take It Further

Claude API Cost Optimization Masterclass — Cut your Claude API bill by 60–90% without sacrificing quality. 12 optimization scenarios analyzed. The concrete order-of-operations: prompt caching, model tiering, Batch API, token compression.

PDF guide + 6-sheet Excel cost calculator. Example scenario: $2,100 → $187/month on a customer support agent.

→ Get Cost Optimization Masterclass — $59

30-day money-back guarantee. Instant download.

Claude Prompt Caching: When It Pays Off (2026 Break-Even)

Claude prompt caching: when it pays off and when it doesn't (2026 numbers)

The pricing (April 2026)

The break-even formula

Six real workload examples

1. Support chatbot with 8K-token system prompt, 50 users/hr

2. One-shot code reviewer, 30K-token diff, 1 call per PR

3. RAG pipeline with 20K-token retrieved context, 1h cache

4. Agent with 15K-token tool manifest, 5m cache, long conversation

5. Batch classifier, 200 items, 10K-token instruction prefix

6. Evaluation harness, 40K-token rubric, 500 test cases

Common mistakes we made

Decision tree

See also

FAQ

Does the cache apply to system prompt only?

Can I have multiple cache points?

Does caching help latency?

Does Batch API stack with caching?

What about extended thinking?

Reproducing these numbers

Related guides

Frequently Asked Questions

How many times does a prompt need to be reused before caching saves money?

What is the minimum prefix size required for Claude prompt caching?

Does Claude prompt caching also reduce latency?

Can I cache both my system prompt and tool definitions?

Take It Further

Tools and references

Claude prompt caching: when it pays off and when it doesn't (2026 numbers)

The pricing (April 2026)

The break-even formula

Six real workload examples

1. Support chatbot with 8K-token system prompt, 50 users/hr

2. One-shot code reviewer, 30K-token diff, 1 call per PR

3. RAG pipeline with 20K-token retrieved context, 1h cache

4. Agent with 15K-token tool manifest, 5m cache, long conversation

5. Batch classifier, 200 items, 10K-token instruction prefix

6. Evaluation harness, 40K-token rubric, 500 test cases

Common mistakes we made

Decision tree

See also

FAQ

Does the cache apply to system prompt only?

Can I have multiple cache points?

Does caching help latency?

Does Batch API stack with caching?

What about extended thinking?

Reproducing these numbers

Related guides

Frequently Asked Questions

How many times does a prompt need to be reused before caching saves money?

What is the minimum prefix size required for Claude prompt caching?

Does Claude prompt caching also reduce latency?

Can I cache both my system prompt and tool definitions?

Take It Further

Related guides

Prompt Caching: The 90% Discount Most Claude Developers Miss

From $800 to $120/month: A Claude API Cost Optimization Case Study

Claude Agents in Production: Cost, Latency & Observability (2026)

Prompt Caching: $487→$52 in Real Numbers (2026)

Tools and references