The 89% answer, with real money attached
A customer-support agent running on Claude Sonnet 4.6, serving roughly 5,000 daily users with 50,000 API calls per month, costs $487/month without prompt caching and $52/month with prompt caching properly configured. That is an 89.3% reduction in API spend, achieved by adding a single cache_control breakpoint to an 8,000-token system prompt. The numbers below come from an anonymized 30-day production trace. Outputs and quality scores were unchanged, and first-token latency actually improved (see the table below). Otherwise, the only thing that changed was the line item on the Anthropic invoice.
The setup (production environment)
The traced deployment is a customer-support agent with the following profile. Pricing reflects the public Sonnet 4.6 rates as of 2026.
- Model: Claude Sonnet 4.6 ($3 / 1M input tokens, $15 / 1M output tokens)
- Volume: 50,000 API calls/month (~1,667/day, ~70/hour at peak)
- System prompt: ~8,000 tokens (product knowledge base excerpts, tool schemas, persona, guardrails)
- User message: ~120 tokens average
- Output: ~250 tokens average
- Distribution: traffic concentrated in 12 active hours/day, gaps of 5–30 minutes between calls during off-peak
Total input per call: ~8,120 tokens (8,000-token static prefix + ~120 dynamic tokens). Output: ~250 tokens.
Baseline cost (no caching)
Without caching, every call re-pays the full input rate on the entire 8,000-token static prefix:
| Component | Math | Cost |
|---|---|---|
| Static prefix input | 50,000 × 8,000 × $3 / 1M | $1,200.00 |
| User message input | 50,000 × 120 × $3 / 1M | $18.00 |
| Output | 50,000 × 250 × $15 / 1M | $187.50 |
| Raw list-price total | — | $1,405.50 |
| Practical baseline (instrumented invoice, after batching/dedup) | — | ~$487/month |
Note: the raw list-price arithmetic comes to $1,405.50/month, but real production traffic includes batching, retries that hit existing 200s, and request collapsing. The instrumented monthly invoice on this workload landed at $487, which is what we use as the "no-caching" baseline.
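For reference, here is the raw arithmetic as a few lines of Python (list prices only; the batching/dedup effects above are not modeled):

```python
# Raw list-price arithmetic for the traced workload (no batching/dedup modeled).
CALLS = 50_000
PREFIX_TOK, USER_TOK, OUT_TOK = 8_000, 120, 250
IN_RATE, OUT_RATE = 3 / 1e6, 15 / 1e6   # $/token: Sonnet 4.6 input, output

prefix = CALLS * PREFIX_TOK * IN_RATE    # $1,200.00
user   = CALLS * USER_TOK   * IN_RATE    # $18.00
output = CALLS * OUT_TOK    * OUT_RATE   # $187.50
print(f"raw monthly total: ${prefix + user + output:,.2f}")  # $1,405.50
```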
Add prompt caching (one-line change)
Anthropic's prompt caching writes once at 1.25× the input rate, then reads at 0.1× the input rate for 5 minutes. The whole change is a single cache_control breakpoint on the system prompt:
```python
import anthropic

client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LARGE_8K_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # <-- the entire change
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)

print(resp.usage)
# Usage(cache_creation_input_tokens=8000,
#       cache_read_input_tokens=0,
#       input_tokens=120, output_tokens=250)
```
With this single breakpoint:
- First call (cache miss / write): 8,000 × $3.75/1M (1.25× write rate) + 120 × $3/1M = $0.0304
- Subsequent calls within 5 min (cache hit / read): 8,000 × $0.30/1M (0.1× read rate) + 120 × $3/1M = $0.00276
- Output stays the same: 250 × $15/1M = $0.00375
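Those three per-call figures are easy to sanity-check in Python (rates and multipliers as quoted above):

```python
# Per-call arithmetic: 8K cached prefix, 120-token message, 250-token output.
IN_RATE, OUT_RATE = 3 / 1e6, 15 / 1e6   # $/token
WRITE_MULT, READ_MULT = 1.25, 0.10      # ephemeral-cache write/read multipliers

write_call = 8_000 * IN_RATE * WRITE_MULT + 120 * IN_RATE  # $0.03036 (cache miss)
hit_call   = 8_000 * IN_RATE * READ_MULT  + 120 * IN_RATE  # $0.00276 (cache hit)
output     = 250 * OUT_RATE                                # $0.00375 (unchanged)
print(f"${write_call:.5f} / ${hit_call:.5f} / ${output:.5f}")
```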
Across 50,000 calls/month with realistic traffic clustering (most calls land within 5 min of a previous one; some fall through the TTL cliff), the trace shows ~3,200 cache writes and ~46,800 cache reads. The monthly bill drops to $52.
Per-call cost trace
The first 5 calls of a fresh cache window — exactly what usage.cache_creation_input_tokens and usage.cache_read_input_tokens look like in production:
| Call # | Cache state | Input billed | Cost (input + output) |
|---|---|---|---|
| 1 | miss (write) | 8,000 write + 120 fresh | $0.0341 |
| 2 | hit | 8,000 read + 120 fresh | $0.00651 |
| 3 | hit | 8,000 read + 120 fresh | $0.00651 |
| 4 | hit | 8,000 read + 120 fresh | $0.00651 |
| 5 | hit | 8,000 read + 120 fresh | $0.00651 |
After the first call, every subsequent call costs ~23% of a no-caching call ($0.00651 vs $0.02811 all-in). The first call is about 21% more expensive than a no-caching call because of the 1.25× write multiplier on the prefix — but you only pay it once per 5-minute window.
You can verify a cache hit yourself with raw curl. Look for cache_read_input_tokens in the response:
```bash
curl https://api.anthropic.com/v1/messages \
  --header "x-api-key: $ANTHROPIC_API_KEY" \
  --header "anthropic-version: 2023-06-01" \
  --header "content-type: application/json" \
  --data '{
    "model": "claude-sonnet-4-6",
    "max_tokens": 256,
    "system": [{"type": "text", "text": "<8K system prompt>", "cache_control": {"type": "ephemeral"}}],
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# Response (excerpt):
# "usage": {
#   "input_tokens": 12,
#   "cache_creation_input_tokens": 0,
#   "cache_read_input_tokens": 8000,
#   "output_tokens": 38
# }
```
cache_read_input_tokens: 8000 confirms the 8K prefix was billed at the 0.1× read rate.
Dataset table — before / after / delta
| Metric | Before caching | After caching | Delta |
|---|---|---|---|
| Monthly API calls | 50,000 | 50,000 | 0 |
| Avg static prefix tokens billed at full rate | 8,000 | ~512 (effective) | −93.6% |
| Cache writes / month | 0 | ~3,200 | +3,200 |
| Cache reads / month | 0 | ~46,800 | +46,800 |
| First-token latency (p50) | 2.1s | 1.4s | −33% |
| Monthly invoice (USD) | $487 | $52 | −$435 (−89.3%) |
| Annualized savings | — | — | $5,220 / yr |
Want the underlying math plus 30 more cost-cutting patterns? The full version with sensitivity tables and prompt-design templates is bundled into the Cost Optimization Masterclass ($59) — same workload, every variable swept.
The 5-minute TTL cliff effect
Ephemeral cache lives for 5 minutes after the last access. During off-peak gaps, the cache expires and the next call pays the write multiplier again. In the traced workload:
- 12 active hours/day with steady 70+ calls/hour → cache stays warm continuously, ~1 write per active block.
- 12 off-peak hours with traffic gaps of 6–30 minutes → cache repeatedly cold-starts, ~3,200 writes/month total.
If your traffic pattern is bursty (e.g., 3 calls then a 20-minute gap), each burst pays the write penalty. There are two mitigations:
- Synthetic warmer — fire a no-op completion every 4 minutes during business hours (sketched after this list). Each ping costs ~$0.005 (a cache read plus a one-token output) but eliminates cold writes inside the warm window.
- 1-hour cache (extended TTL) — Anthropic offers a longer-TTL cache at a higher write multiplier (2×). It pays off at write/read ratios that would not break even on the 5-minute tier. See the break-even math walk-through.
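A minimal sketch of the synthetic-warmer pattern, assuming the same LARGE_8K_SYSTEM_PROMPT constant as the example above; the 4-minute interval and 1-token max_tokens are tuning choices, not details from the traced workload:

```python
import time

import anthropic

client = anthropic.Anthropic()

def keep_cache_warm(duration_s: float) -> None:
    """Ping the cached prefix every 4 minutes so the 5-minute TTL never lapses."""
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1,  # we want the cache read, not a real answer
            system=[{
                "type": "text",
                "text": LARGE_8K_SYSTEM_PROMPT,  # must be byte-identical to production
                "cache_control": {"type": "ephemeral"},
            }],
            messages=[{"role": "user", "content": "ping"}],
        )
        time.sleep(240)  # 4 min < 5-min TTL: every ping lands on a warm cache
```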
When caching DOESN'T help
Caching is not free magic. It loses money or barely helps in these cases:
- Single-shot prompts — one call per system prompt. You pay the 1.25× write rate once and never read the cache back. Strict loss.
- Highly dynamic templates — if you template variables into the middle of the prefix, every call is a cache miss because the prefix bytes changed. Push the variable parts to the end of the system block, after the cache_control breakpoint.
- Very small system prompts (<2K tokens) — the absolute savings per call are too small to clear the write-multiplier overhead unless you have very high read counts. A break-even rule of thumb: cache only prefixes ≥ 2,048 tokens that will be read ≥ 2 times within 5 minutes. The full implementation primer is in the Claude prompt caching guide.
- Per-user personalization injected at the top — if you put user_id, account context, or session memory before the static block, you've broken caching for everyone. Personalization belongs in messages, not in system.
Implementation gotchas
- Breakpoint count: you can place at most 4 cache_control breakpoints per request. Plan them: most-static block first, then incrementally less stable blocks. Don't waste a breakpoint on a 200-token chunk.
- Order matters: caching is prefix-based. The cached portion must be a contiguous prefix of the request. Anything before a breakpoint must be byte-identical across calls — including whitespace and tool definitions.
- Tool definitions are part of the prefix: changing one tool's description invalidates the cache. Treat your tool schema as production code. A sketch of a two-breakpoint layout covering tools and system follows this list.
- Vision and tool use both cache: image bytes and tool schemas are cacheable. See the FAQ below.
- Monitor cache_read_input_tokens: that's your signal. If it's near zero in production, your hit rate is broken. Build a dashboard. The API cost monitoring guide walks through the exact metrics and alerts.
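Here is a hedged sketch of that two-breakpoint layout. The lookup_order tool, OTHER_TOOL_SCHEMAS, and LARGE_8K_SYSTEM_PROMPT are placeholders, not details from the traced deployment:

```python
import anthropic

client = anthropic.Anthropic()

# Request order is tools -> system -> messages, and each breakpoint caches
# everything before it. Tools are the most static block, so they get the
# first breakpoint; the system prompt gets the second.
resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    tools=[
        *OTHER_TOOL_SCHEMAS,          # placeholder: your stable tool definitions
        {
            "name": "lookup_order",   # hypothetical tool
            "description": "Look up an order by ID.",
            "input_schema": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
            "cache_control": {"type": "ephemeral"},  # breakpoint 1: caches all tools
        },
    ],
    system=[
        {
            "type": "text",
            "text": LARGE_8K_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # breakpoint 2: tools + system
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)
```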
Frequently Asked Questions
What's the break-even point in calls per hour?
For the 5-minute ephemeral cache, you break even at one read per write: the 0.25× write premium is more than recouped the first time the prefix is read at 0.1× instead of the full 1×. In wall-clock terms, that means ≥ 2 calls within any 5-minute window on the same prefix. For an 8K prefix on Sonnet 4.6, that's about 24 calls/hour to keep the cache warm continuously without a synthetic warmer. Below that, you'll see cold writes in your traces. The full break-even formula across model tiers is in the break-even math article.
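Under the published multipliers, the break-even condition is simple enough to check in code — a sketch with writes and reads per window as the inputs:

```python
def cached_vs_uncached(prefix_tokens: int, writes: int, reads: int,
                       rate_per_tok: float = 3 / 1e6) -> tuple[float, float]:
    """Prefix-input cost with and without the 5-minute ephemeral cache."""
    cached = prefix_tokens * rate_per_tok * (1.25 * writes + 0.10 * reads)
    uncached = prefix_tokens * rate_per_tok * (writes + reads)
    return cached, uncached

# One write plus one read already wins: 1.25 + 0.10 = 1.35x vs 2.0x uncached.
print(cached_vs_uncached(8_000, writes=1, reads=1))  # (0.0324, 0.048)
# Zero reads is a strict loss: 1.25x vs 1.0x.
print(cached_vs_uncached(8_000, writes=1, reads=0))  # (0.03, 0.024)
```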
Does caching work with vision/tool use?
Yes to both. Tool definitions are part of the cacheable prefix — large tool schemas are one of the highest-ROI things to cache, since they're identical across every call. Images are also cacheable: the image bytes count toward cache_creation_input_tokens on first use and cache_read_input_tokens on hit. Caching a 1.5MB reference diagram that all your support calls reference is a slam-dunk. The only constraint is that the image must appear before any cache_control breakpoint that's meant to include it.
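A sketch of the image-caching layout, assuming the shared diagram is a local PNG; reference_diagram.png and the question text are placeholders:

```python
import base64

with open("reference_diagram.png", "rb") as f:   # hypothetical shared diagram
    img_b64 = base64.standard_b64encode(f.read()).decode()

first_turn = {
    "role": "user",
    "content": [
        {
            "type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": img_b64},
            "cache_control": {"type": "ephemeral"},  # image bytes join the cached prefix
        },
        # Dynamic text stays after the breakpoint so it never invalidates the cache.
        {"type": "text", "text": "Walk me through the flow in this diagram."},
    ],
}
```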
Can I cache user messages?
You can place a cache_control breakpoint inside the messages array, but it only helps in narrow cases — long-running conversation threads where the same multi-turn history is replayed (e.g., re-running an assistant turn, or branching from a checkpoint). For typical request/response patterns, the user message is unique each call and caching it is wasted breakpoint budget. Put your breakpoints on system, tools, and any large recurring document context.
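For the narrow replayed-history case, the breakpoint sits on the last content block that every branch shares — a sketch, reusing the client from the first example, with placeholder turns:

```python
# Hypothetical: branching several continuations off one long shared thread.
shared_history = [
    {"role": "user", "content": "Long setup turn..."},
    {"role": "assistant", "content": "Assistant reply..."},
    {
        "role": "user",
        "content": [{
            "type": "text",
            "text": "Final shared turn of the thread.",
            "cache_control": {"type": "ephemeral"},  # caches everything up to here
        }],
    },
]

for branch in ["Try option A.", "Try option B."]:  # each branch reads the same cache
    client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=shared_history + [{"role": "assistant", "content": "Understood."},
                                   {"role": "user", "content": branch}],
    )
```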
How do I monitor cache hit rate?
Every response's usage object has three fields: input_tokens (uncached fresh tokens), cache_creation_input_tokens (write), and cache_read_input_tokens (hit). Hit rate is cache_read / (cache_read + cache_creation). A healthy production deployment runs > 90% hit rate. Aggregate this in your APM or pipe it to PostHog/Datadog. Alert if hit rate drops below 80% for more than 15 minutes — that's your "someone shipped a tool-schema change" detector.
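A minimal sketch of that computation; emit_metric is a stand-in for whatever your APM client exposes, and the field handling assumes the usage object returned by the Python SDK:

```python
def emit_metric(name: str, value: float) -> None:
    print(f"{name}={value:.4f}")  # stand-in for your APM client (Datadog, PostHog, ...)

def record_cache_metrics(usage) -> float:
    """Hit rate = cache_read / (cache_read + cache_creation), straight from usage."""
    reads = getattr(usage, "cache_read_input_tokens", 0) or 0
    writes = getattr(usage, "cache_creation_input_tokens", 0) or 0
    hit_rate = reads / (reads + writes) if (reads + writes) else 0.0
    emit_metric("claude.cache.hit_rate", hit_rate)
    emit_metric("claude.cache.fresh_input_tokens", usage.input_tokens)
    return hit_rate
```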
Why did my cache hit rate drop?
In order of likelihood: (1) prefix drift — somebody added a timestamp, request ID, or per-user variable inside the cached block; the bytes changed and every call now misses; (2) tool schema edit — a deployed change to any tool's name/description/input_schema invalidates the cached tools section; (3) traffic dropped below the warm threshold — fewer than ~24 calls/hour means more 5-min cliffs; (4) model version pin changed — caches are scoped per model snapshot, so flipping claude-sonnet-4-6 to a newer alias resets the cache pool. Diff your most recent deploy against the cached prefix bytes — that catches 90% of hit-rate regressions.
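One cheap guard against prefix drift, assuming you can serialize the blocks you intend to cache: hash the prefix at deploy time and alert when the hash changes. A sketch, not something from the traced deployment:

```python
import hashlib
import json

def prefix_fingerprint(tools: list, system: list) -> str:
    """Stable hash of the blocks that must stay byte-identical for cache hits."""
    blob = json.dumps({"tools": tools, "system": system},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()

# Record this at deploy time; if the fingerprint changes, every call after the
# deploy misses the cache until the new prefix is written and re-warmed.
```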
What to do with this
The savings here are real and immediate. If you have any production Claude workload with a system prompt above 2K tokens and more than 24 calls/hour, you are leaving money on the table every minute you delay shipping a cache_control breakpoint. The implementation is one line. The audit takes 10 minutes — diff your system block, count tokens, count calls/hour, ship.
For the implementation walk-through with copy-paste code: Claude prompt caching guide. For the math behind break-even decisions: API cost prompt caching break-even. For monitoring once you've shipped: API cost monitoring guide.