Claude API Cost Optimization: Complete Guide to Reducing Your Bill

Every technique for reducing Claude API costs — model routing, prompt caching, batch processing, token efficiency, and the cost monitoring architecture.

Claude API costs can be reduced by 60–90% through a combination of four techniques: model routing (use Haiku for 40–60% of requests), prompt caching (eliminate repeated system prompt costs), batch processing (50% discount for non-real-time work), and token efficiency (reduce input size). Each technique is independent — implement them in order of impact for your specific usage pattern, starting with model routing.


Your cost breakdown (start here)

Before optimising, understand where your costs come from. Most applications have costs concentrated in a few places:

import anthropic
from collections import defaultdict

client = anthropic.Anthropic()
cost_tracker = defaultdict(lambda: {"input": 0, "output": 0, "requests": 0, "model": None})

def tracked_create(request_type: str, **kwargs) -> anthropic.types.Message:
    """Wrapper that tracks costs by request type."""
    response = client.messages.create(**kwargs)
    
    cost_tracker[request_type]["model"] = kwargs.get("model", "unknown")
    cost_tracker[request_type]["input"] += response.usage.input_tokens
    cost_tracker[request_type]["output"] += response.usage.output_tokens
    cost_tracker[request_type]["requests"] += 1
    
    return response

def print_cost_report():
    """Print cost breakdown by request type."""
    # USD per million tokens: (input, output)
    rates = {
        "claude-haiku-4-5": (0.80, 4.00),
        "claude-sonnet-4-5": (3.00, 15.00),
        "claude-opus-4-0": (15.00, 75.00),
    }
    
    print("\n=== Cost Breakdown ===")
    for request_type, usage in cost_tracker.items():
        # Look up per-model rates; fall back to Sonnet pricing for unknown models
        in_rate, out_rate = rates.get(usage["model"], (3.00, 15.00))
        cost = (usage["input"] * in_rate + usage["output"] * out_rate) / 1_000_000
        print(f"{request_type}: {usage['requests']} requests, ${cost:.4f} ({usage['input']}i/{usage['output']}o tokens)")

Run this for a week to identify your top cost drivers before optimising.


Technique 1: Model routing (typical savings: 40–60%)

The highest-leverage change is routing cheaper requests to Claude Haiku:

Task type                                       | Recommended model | Savings vs always-Sonnet
Classification, routing, intent detection       | Haiku             | 75%
Data extraction from clear text                 | Haiku             | 75%
Short generation (<200 tokens) with constraints | Haiku             | 75%
Code generation, debugging                      | Sonnet            | 0%
Complex analysis, multi-step reasoning          | Sonnet            | 0%
Research, highest-stakes tasks                  | Opus              | -400%

Implementation: see the full model routing guide for the routing decision framework and code.

Quick win: if you have any classification or data extraction tasks currently on Sonnet, move them to Haiku today. This is almost always a zero-quality-impact change.
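
A minimal sketch of what routing can look like (the task-type labels here are illustrative assumptions; the full guide covers the actual decision framework):

# Route cheap, well-specified tasks to Haiku; default to Sonnet
HAIKU_TASKS = {"classification", "intent_detection", "extraction"}

def route_model(task_type: str) -> str:
    """Pick the cheapest model tier that handles the task type well."""
    return "claude-haiku-4-5" if task_type in HAIKU_TASKS else "claude-sonnet-4-5"

response = client.messages.create(
    model=route_model("classification"),
    max_tokens=64,
    messages=[{"role": "user", "content": "Classify this ticket: 'My invoice is wrong.'"}],
)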


Technique 2: Prompt caching (typical savings: 30–70%)

If your requests include the same system prompt, document context, or example data, prompt caching eliminates 90% of those input token costs on subsequent requests.

How it works: add cache_control to the content blocks you want cached. The cache read cost is $0.30/M for Sonnet (vs $3.00/M standard), a 90% reduction.

# System prompt cached across all requests
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_long_system_prompt,  # 500-2000 token system prompt
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_message}]
)

Caching has maximum impact when many requests share a long prefix: the same system prompt, document context, or example data. The cache TTL is 5 minutes (reset on each cache hit); you pay the full input cost on the first request, and subsequent requests within 5 minutes use cache read pricing.
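
To see the scale of the saving, some rough arithmetic (the prompt size and volume here are illustrative assumptions):

# Back-of-the-envelope caching math for a shared Sonnet system prompt
prompt_tokens = 1_500
requests_per_day = 10_000

uncached = prompt_tokens * requests_per_day / 1e6 * 3.00  # $3.00/M standard input
cached = prompt_tokens * requests_per_day / 1e6 * 0.30    # $0.30/M cache reads

print(f"${uncached:.2f}/day -> ${cached:.2f}/day on the shared prefix")
# -> $45.00/day -> $4.50/day (cold requests pay a small cache-write premium,
#    so real savings land just under 90%)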

For the complete implementation guide, see Claude Prompt Caching: Complete Guide.


Technique 3: Batch API (50% discount for non-real-time)

Any task that doesn't need real-time results should use the Batches API:

# Create a batch instead of individual requests
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"item-{i}",
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": item_text}],
            }
        }
        for i, item_text in enumerate(items_to_process)
    ]
)
# Results available in 1–24 hours at 50% of standard pricing
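
Retrieving results is a poll-then-fetch pattern. A minimal sketch (field names per the Python SDK at the time of writing; check the current docs):

import time

# Poll until the batch finishes processing, then stream per-request results
while True:
    batch_status = client.messages.batches.retrieve(batch.id)
    if batch_status.processing_status == "ended":
        break
    time.sleep(60)

for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        print(entry.custom_id, entry.result.message.content[0].text)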

Good candidates for batching: document processing, content moderation, data enrichment, weekly reports, bulk extraction from historical data.

Not batchable: user-facing chat, real-time classification, anything where latency matters.

For full implementation details, see the Anthropic Batch API guide.


Technique 4: Token efficiency (typical savings: 20–40%)

Reducing input token count cuts costs proportionally. The main levers:

Shorter system prompts

Every 1,000 tokens of system prompt × 1,000 requests/day = 1M tokens/day = $3/day for Sonnet. At scale, trimming system prompts matters.

# Verbose (2,000 tokens; excerpt):
system = """You are a helpful, knowledgeable, and friendly AI assistant. 
Your goal is to provide accurate, helpful, and thoughtful responses to users. 
You should always be polite and considerate. You should avoid making mistakes 
and should double-check your work. You should be concise when appropriate and 
detailed when the user needs more information..."""

# Concise (~30 tokens):
system = """You are a technical support assistant. Answer concisely. 
Flag uncertainty explicitly. Escalate billing questions."""

Truncate conversation history

Long conversation histories are expensive. Keep only the last N turns:

def trim_conversation(messages: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep only the last max_turns of conversation."""
    return messages[-(max_turns * 2):]  # * 2 for user+assistant pairs
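
Applied at the call site (conversation_history is a hypothetical list of prior user/assistant messages):

# Trim history before each request; the system prompt lives in `system`,
# so it survives trimming
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=trim_conversation(conversation_history, max_turns=10),
)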

Compress context with Haiku

For large documents, use Haiku to extract the relevant section before sending to Sonnet:

# Extract relevant section with Haiku (cheap)
relevant = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": f"Extract only the section relevant to: {query}\n\n{large_document}"
    }]
).content[0].text

# Answer from the extracted section with Sonnet (pricier, but on a much smaller input)
answer = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": f"{query}\n\nContext:\n{relevant}"}]
).content[0].text

Stacking techniques: worked example

An application processing 10,000 documents per day, currently spending $285/day:

Current state (all Sonnet, no caching, real-time): that's high for the workload. An audit shows most requests are document extraction, with no complex reasoning needed.

After optimisation:

  1. Model routing: move 70% to Haiku (extraction tasks) → saves $107/day
  2. Prompt caching: cache the 1,000-token system prompt → saves $30/day on cached inputs
  3. Batch processing: 80% of requests can tolerate 2-hour delay → batch at 50% → saves $67/day
  4. Token trimming: reduce average input by 300 tokens → saves $9/day

Result: $285/day → $72/day — 75% reduction.
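
A quick sanity check of the arithmetic (figures taken from the list above):

# Stacked savings from the worked example above
baseline = 285.00  # $/day: all Sonnet, no caching, real-time
savings = [107.00, 30.00, 67.00, 9.00]  # routing, caching, batching, trimming
final = baseline - sum(savings)
print(f"${baseline:.0f}/day -> ${final:.0f}/day ({1 - final / baseline:.0%} reduction)")
# -> $285/day -> $72/day (75% reduction)

The per-step figures are sequenced, not independent: batching, for example, discounts requests that routing has already made cheaper.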


Cost monitoring (prevent surprises)

Set up spend alerts in the Anthropic console:

  1. Console → Settings → Usage Limits
  2. Set a monthly budget alert at 80% of your target
  3. Set a hard stop limit at 120% of target (prevents runaway bills)

For programmatic monitoring:

# Check current spend via the Admin API usage report. The endpoint path and
# parameters here are assumptions from the docs at the time of writing;
# verify against the current API reference. (Usage is also visible at
# console.anthropic.com/settings/usage.)
import os
import httpx

resp = httpx.get(
    "https://api.anthropic.com/v1/organizations/usage_report/messages",
    headers={"x-api-key": os.environ["ANTHROPIC_ADMIN_KEY"],  # admin key required
             "anthropic-version": "2023-06-01"},
    params={"starting_at": "2026-04-01T00:00:00Z"},
)
resp.raise_for_status()
print(resp.json())

Frequently asked questions

What's the single biggest cost reduction for a new application? Model routing — moving classification and extraction tasks to Haiku. For most applications, 40–60% of requests qualify for Haiku, saving 75% on those requests. This alone cuts total cost by 30–45%.

Does prompt caching affect output quality? No. Cached input is processed identically to non-cached input. The only difference is cost and latency (cache reads are slightly faster).

Can I combine model routing with prompt caching? Yes, and you should. Route each request to the right model tier, then cache the system prompt on each tier separately. Haiku caching is even cheaper than Sonnet caching.

What's the minimum request volume before optimisation is worth the effort? Model routing is worthwhile at 100+ requests/day — the code changes are minimal and the savings are immediate. Prompt caching is worthwhile when you have a system prompt over 500 tokens used across many requests. Batch processing is worthwhile when you have any significant volume of non-real-time tasks.

Does using longer max_tokens increase cost even if the response is short? No. You're charged for actual tokens used, not max_tokens. Setting max_tokens=4096 on a request that produces a 200-token response costs the same as max_tokens=200.


Take It Further

Claude API Cost Optimization Toolkit — The complete optimization system: all four techniques with production-ready code, the cost audit template, the routing decision framework, the Excel calculator that models your specific usage, and the monitoring architecture.

→ Get the Cost Optimization Toolkit — $59

30-day money-back guarantee. Instant download.

AI Disclosure: Drafted with Claude Code; all techniques validated in production as of April 2026.
