Claude API Cost Optimization: Complete Guide to Reducing Your Bill

Every technique for reducing Claude API costs — model routing, prompt caching, batch processing, token efficiency, and the cost monitoring architecture.

Claude API costs can be reduced by 60–90% through a combination of four techniques: model routing (use Haiku for 40–60% of requests), prompt caching (eliminate repeated system prompt costs), batch processing (50% discount for non-real-time work), and token efficiency (reduce input size). Each technique is independent — implement them in order of impact for your specific usage pattern, starting with model routing.


Your cost breakdown (start here)

Before optimising, understand where your costs come from. Most applications have costs concentrated in a few places:

import anthropic
from collections import defaultdict

client = anthropic.Anthropic()
cost_tracker = defaultdict(lambda: {"input": 0, "output": 0, "requests": 0, "model": None})

def tracked_create(request_type: str, **kwargs) -> anthropic.types.Message:
    """Wrapper that tracks costs by request type."""
    response = client.messages.create(**kwargs)
    
    cost_tracker[request_type]["model"] = kwargs.get("model", "unknown")
    cost_tracker[request_type]["input"] += response.usage.input_tokens
    cost_tracker[request_type]["output"] += response.usage.output_tokens
    cost_tracker[request_type]["requests"] += 1
    
    return response

def print_cost_report():
    """Print cost breakdown by request type."""
    # USD per million tokens: (input, output)
    rates = {
        "claude-haiku-4-5": (0.80, 4.00),
        "claude-sonnet-4-5": (3.00, 15.00),
        "claude-opus-4-0": (15.00, 75.00),
    }
    
    print("\n=== Cost Breakdown ===")
    for request_type, usage in cost_tracker.items():
        # Look up per-model rates; fall back to Sonnet pricing for unknown models
        in_rate, out_rate = rates.get(usage["model"], (3.00, 15.00))
        cost = (usage["input"] * in_rate + usage["output"] * out_rate) / 1_000_000
        print(f"{request_type}: {usage['requests']} requests, ${cost:.4f} ({usage['input']}i/{usage['output']}o tokens)")

Run this for a week to identify your top cost drivers before optimising.


Technique 1: Model routing (typical savings: 40–60%)

The highest-leverage change is routing cheaper requests to Claude Haiku:

Task type                                       | Recommended model | Savings vs always-Sonnet
Classification, routing, intent detection       | Haiku             | 75%
Data extraction from clear text                 | Haiku             | 75%
Short generation (<200 tokens) with constraints | Haiku             | 75%
Code generation, debugging                      | Sonnet            | 0%
Complex analysis, multi-step reasoning          | Sonnet            | 0%
Research, highest-stakes tasks                  | Opus              | -400%

Implementation: see the full model routing guide for the routing decision framework and code.

Quick win: if you have any classification or data extraction tasks currently on Sonnet, move them to Haiku today. This is almost always a zero-quality-impact change.
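
A minimal sketch of what routing can look like (the task-type labels here are illustrative assumptions; the full guide covers the actual decision framework):

# Route cheap, well-specified tasks to Haiku; default to Sonnet
HAIKU_TASKS = {"classification", "intent_detection", "extraction"}

def route_model(task_type: str) -> str:
    """Pick the cheapest model tier that handles the task type well."""
    return "claude-haiku-4-5" if task_type in HAIKU_TASKS else "claude-sonnet-4-5"

response = client.messages.create(
    model=route_model("classification"),
    max_tokens=64,
    messages=[{"role": "user", "content": "Classify this ticket: 'My invoice is wrong.'"}],
)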


Technique 2: Prompt caching (typical savings: 30–70%)

If your requests include the same system prompt, document context, or example data, prompt caching eliminates 90% of those input token costs on subsequent requests.

How it works: add cache_control to the content blocks you want cached. The cache read cost is $0.30/M for Sonnet (vs $3.00/M standard), a 90% reduction.

# System prompt cached across all requests
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_long_system_prompt,  # 500-2000 token system prompt
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_message}]
)

Caching has maximum impact when many requests share a long prefix: the same system prompt, document context, or example data. The cache TTL is 5 minutes (reset on each cache hit); you pay the full input cost on the first request, and subsequent requests within 5 minutes use cache read pricing.
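
To see the scale of the saving, some rough arithmetic (the prompt size and volume here are illustrative assumptions):

# Back-of-the-envelope caching math for a shared Sonnet system prompt
prompt_tokens = 1_500
requests_per_day = 10_000

uncached = prompt_tokens * requests_per_day / 1e6 * 3.00  # $3.00/M standard input
cached = prompt_tokens * requests_per_day / 1e6 * 0.30    # $0.30/M cache reads

print(f"${uncached:.2f}/day -> ${cached:.2f}/day on the shared prefix")
# -> $45.00/day -> $4.50/day (cold requests pay a small cache-write premium,
#    so real savings land just under 90%)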

For the complete implementation guide, see Claude Prompt Caching: Complete Guide.


Technique 3: Batch API (50% discount for non-real-time)

Any task that doesn't need real-time results should use the Batches API:

# Create a batch instead of individual requests
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"item-{i}",
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": item_text}],
            }
        }
        for i, item_text in enumerate(items_to_process)
    ]
)
# Results available in 1–24 hours at 50% of standard pricing
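
Retrieving results is a poll-then-fetch pattern. A minimal sketch (field names per the Python SDK at the time of writing; check the current docs):

import time

# Poll until the batch finishes processing, then stream per-request results
while True:
    batch_status = client.messages.batches.retrieve(batch.id)
    if batch_status.processing_status == "ended":
        break
    time.sleep(60)

for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        print(entry.custom_id, entry.result.message.content[0].text)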

Good candidates for batching: document processing, content moderation, data enrichment, weekly reports, bulk extraction from historical data.

Not batchable: user-facing chat, real-time classification, anything where latency matters.

For full implementation details, see the Anthropic Batch API guide.


Technique 4: Token efficiency (typical savings: 20–40%)

Reducing input token count cuts costs proportionally. The main levers:

Shorter system prompts

Every 1,000 tokens of system prompt × 1,000 requests/day = 1M tokens/day = $3/day for Sonnet. At scale, trimming system prompts matters.

# Verbose (2,000 tokens; excerpt):
system = """You are a helpful, knowledgeable, and friendly AI assistant. 
Your goal is to provide accurate, helpful, and thoughtful responses to users. 
You should always be polite and considerate. You should avoid making mistakes 
and should double-check your work. You should be concise when appropriate and 
detailed when the user needs more information..."""

# Concise (~30 tokens):
system = """You are a technical support assistant. Answer concisely. 
Flag uncertainty explicitly. Escalate billing questions."""

Truncate conversation history

Long conversation histories are expensive. Keep only the last N turns:

def trim_conversation(messages: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep only the last max_turns of conversation."""
    return messages[-(max_turns * 2):]  # * 2 for user+assistant pairs
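
Applied at the call site (conversation_history is a hypothetical list of prior user/assistant messages):

# Trim history before each request; the system prompt lives in `system`,
# so it survives trimming
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=trim_conversation(conversation_history, max_turns=10),
)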

Compress context with Haiku

For large documents, use Haiku to extract the relevant section before sending to Sonnet:

# Extract relevant section with Haiku (cheap)
relevant = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": f"Extract only the section relevant to: {query}\n\n{large_document}"
    }]
).content[0].text

# Answer from the extracted section with Sonnet (pricier, but on a much smaller input)
answer = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": f"{query}\n\nContext:\n{relevant}"}]
).content[0].text

Stacking techniques: worked example

An application processing 10,000 documents per day, currently spending $285/day:

Current state (all Sonnet, no caching, real-time): that's high for the workload. An audit shows most requests are document extraction, with no complex reasoning needed.

After optimisation:

  1. Model routing: move 70% to Haiku (extraction tasks) → saves $107/day
  2. Prompt caching: cache the 1,000-token system prompt → saves $30/day on cached inputs
  3. Batch processing: 80% of requests can tolerate 2-hour delay → batch at 50% → saves $67/day
  4. Token trimming: reduce average input by 300 tokens → saves $9/day

Result: $285/day → $72/day — 75% reduction.
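
A quick sanity check of the arithmetic (figures taken from the list above):

# Stacked savings from the worked example above
baseline = 285.00  # $/day: all Sonnet, no caching, real-time
savings = [107.00, 30.00, 67.00, 9.00]  # routing, caching, batching, trimming
final = baseline - sum(savings)
print(f"${baseline:.0f}/day -> ${final:.0f}/day ({1 - final / baseline:.0%} reduction)")
# -> $285/day -> $72/day (75% reduction)

The per-step figures are sequenced, not independent: batching, for example, discounts requests that routing has already made cheaper.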


Cost monitoring (prevent surprises)

Set up spend alerts in the Anthropic console:

  1. Console → Settings → Usage Limits
  2. Set a monthly budget alert at 80% of your target
  3. Set a hard stop limit at 120% of target (prevents runaway bills)

For programmatic monitoring:

# Check current spend via the Admin API usage report. The endpoint path and
# parameters here are assumptions from the docs at the time of writing;
# verify against the current API reference. (Usage is also visible at
# console.anthropic.com/settings/usage.)
import os
import httpx

resp = httpx.get(
    "https://api.anthropic.com/v1/organizations/usage_report/messages",
    headers={"x-api-key": os.environ["ANTHROPIC_ADMIN_KEY"],  # admin key required
             "anthropic-version": "2023-06-01"},
    params={"starting_at": "2026-04-01T00:00:00Z"},
)
resp.raise_for_status()
print(resp.json())

Frequently asked questions

What's the single biggest cost reduction for a new application? Model routing — moving classification and extraction tasks to Haiku. For most applications, 40–60% of requests qualify for Haiku, saving 75% on those requests. This alone cuts total cost by 30–45%.

Does prompt caching affect output quality? No. Cached input is processed identically to non-cached input. The only difference is cost and latency (cache reads are slightly faster).

Can I combine model routing with prompt caching? Yes, and you should. Route each request to the right model tier, then cache the system prompt on each tier separately. Haiku caching is even cheaper than Sonnet caching.

What's the minimum request volume before optimisation is worth the effort? Model routing is worthwhile at 100+ requests/day — the code changes are minimal and the savings are immediate. Prompt caching is worthwhile when you have a system prompt over 500 tokens used across many requests. Batch processing is worthwhile when you have any significant volume of non-real-time tasks.

Does using longer max_tokens increase cost even if the response is short? No. You're charged for actual tokens used, not max_tokens. Setting max_tokens=4096 on a request that produces a 200-token response costs the same as max_tokens=200.


Take It Further

Claude API Cost Optimization Toolkit — The complete optimization system: all four techniques with production-ready code, the cost audit template, the routing decision framework, the Excel calculator that models your specific usage, and the monitoring architecture.

→ Get the Cost Optimization Toolkit — $59

30-day money-back guarantee. Instant download.

AI Disclosure: Drafted with Claude Code; all techniques validated in production as of April 2026.
