How to Limit Claude Agent Costs: 7 Concrete Strategies
The fastest way to control Claude agent costs: route simple tasks to Haiku ($0.80/MTok input), cache repeated system prompts (up to 90% off cached tokens), and set a hard token budget per session that triggers a graceful shutdown. Combined, these three strategies alone can cut a typical agentic workload bill by 70–85% before touching anything else. The remaining five strategies below push further. For the full pricing table and model-by-model breakdown, see Claude API pricing 2026.
Why Agent Costs Can Spiral
A single-turn Claude API call is easy to cost-estimate: tokens in × price + tokens out × price. Agents are different.
In an agentic loop, every tool call appends more tokens to the conversation history, and the full history is re-sent as input on every step. A 10-step agent run on Opus that grows a 50K-token context each turn doesn't cost 10× a single call; it costs roughly 1 + 2 + ... + 10 = 55× the input price of that base context for the accumulated history alone. Add tool outputs that include long JSON responses or file contents, and a single agent run can consume millions of tokens.
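To see how quickly this compounds, here is a rough back-of-the-envelope sketch (the 50K-per-step growth figure is an illustrative assumption, not a measured value):

# Rough estimate of how input tokens compound in an agent loop.
# The growth-per-step figure is an illustrative assumption.
OPUS_INPUT_PRICE_PER_MTOK = 15.00

def estimate_loop_input_cost(base_context: int, growth_per_step: int, steps: int) -> float:
    total_input_tokens = 0
    context = base_context
    for _ in range(steps):
        total_input_tokens += context      # the full history is re-sent each step
        context += growth_per_step         # tool results and assistant turns accumulate
    return (total_input_tokens / 1_000_000) * OPUS_INPUT_PRICE_PER_MTOK

# 10 steps, 50K starting context, ~50K added per step: ~2.75M input tokens, ~$41 on Opus
print(estimate_loop_input_cost(50_000, 50_000, 10))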
The compounding factors:
- Recursive tool calls: Each tool result is appended to context before the next call
- Large context windows: Passing full conversation history on every turn multiplies input costs
- Expensive model for all steps: Using Opus for a task that only needed Haiku
- Unstructured output: Claude asks clarifying questions instead of following a schema, adding extra round-trips
- No stopping criteria: An agent with no budget cap will keep calling tools until it runs out of ideas
The strategies below address each of these root causes directly.
Strategy 1: Model Routing
Savings potential: 60–90% on model cost
Claude's three production tiers as of April 2026:
| Model | Input price | Output price | Best for |
|---|---|---|---|
| claude-haiku-4-5 | $0.80/MTok | $4.00/MTok | Classification, extraction, simple Q&A |
| claude-sonnet-4-5 | $3.00/MTok | $15.00/MTok | Code generation, analysis, multi-step reasoning |
| claude-opus-4-5 | $15.00/MTok | $75.00/MTok | Complex judgment, novel problems, creative work |
Most production agents do 80% of their work at a Haiku level. Routing correctly means using Haiku by default and escalating only when needed. For a detailed comparison of when each model earns its cost, see Haiku vs Sonnet vs Opus: which model to use.
Routing heuristics:
def select_model(task_type: str, complexity_score: int) -> str:
    """
    complexity_score: 0-10 estimate of task difficulty
    """
    if task_type in ("classify", "extract", "summarize_short"):
        return "claude-haiku-4-5"
    elif complexity_score <= 5:
        return "claude-sonnet-4-5"
    else:
        return "claude-opus-4-5"
A concrete example: a document processing agent that classifies documents, extracts fields, and flags edge cases for human review. Use Haiku for classification and extraction (fast, cheap, accurate for structured tasks), Sonnet for summarization and report generation, and Opus only for flagged edge cases requiring nuanced judgment.
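A minimal sketch of that routing, with hypothetical stage names:

# Hypothetical stage-to-model mapping for the document pipeline described above.
STAGE_MODELS = {
    "classify": "claude-haiku-4-5",
    "extract_fields": "claude-haiku-4-5",
    "summarize": "claude-sonnet-4-5",
    "review_edge_case": "claude-opus-4-5",  # only for documents flagged by earlier stages
}

def model_for_stage(stage: str) -> str:
    # Default to the cheapest tier if a stage is not explicitly mapped
    return STAGE_MODELS.get(stage, "claude-haiku-4-5")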
Strategy 2: Prompt Caching
Savings potential: up to 90% on repeated context
Prompt caching lets Claude reuse previously processed tokens. Cached input tokens cost 10% of the standard input price on Anthropic's API. For agents with a large, stable system prompt (tool definitions, instructions, background context), caching that section means you pay for it once, at a small cache-write premium over the base input rate, and then 10% on every subsequent call within the cache window.
To enable caching, add cache_control breakpoints to your messages:
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": YOUR_LARGE_SYSTEM_PROMPT,  # 10K+ tokens
            "cache_control": {"type": "ephemeral"}  # cache this block
        }
    ],
    messages=conversation_history
)
What to cache:
- System prompt (tool definitions, instructions, persona)
- Static reference material (docs, schemas, codebases)
- The first N turns of a conversation that don't change (e.g., setup context)
What not to cache:
- The current user turn (changes every call)
- Tool results that change between calls
- Anything shorter than the model's minimum cacheable length (roughly 1,024–2,048 tokens, depending on the model; see the FAQ below)
Cache entries are stored for 5 minutes and reset on each hit. For agents that call Claude repeatedly within a session, the system prompt will stay warm throughout.
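To confirm the cache is actually being hit, check the cache fields on the usage object that comes back with every response:

# After any messages.create() call, the usage object reports cache activity.
usage = response.usage
print(f"Cache write: {usage.cache_creation_input_tokens} tokens")  # billed at a premium over the base input rate
print(f"Cache read:  {usage.cache_read_input_tokens} tokens")      # billed at ~10% of the input price
print(f"Uncached:    {usage.input_tokens} tokens")                  # billed at the standard rate

# A healthy agent loop shows a large cache-read count on every call after the first.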
Strategy 3: Budget Guardrails
Savings potential: prevents runaway costs entirely
A budget guardrail is a hard check that monitors token spend during an agent run and halts the agent gracefully before it exceeds your threshold.
Here is a complete Python implementation:
import anthropic
from dataclasses import dataclass

@dataclass
class BudgetGuardrail:
    max_input_tokens: int = 500_000
    max_output_tokens: int = 50_000
    total_input_used: int = 0
    total_output_used: int = 0

    def check_usage(self, usage: anthropic.types.Usage) -> None:
        self.total_input_used += usage.input_tokens
        self.total_output_used += usage.output_tokens

    @property
    def is_over_budget(self) -> bool:
        return (
            self.total_input_used >= self.max_input_tokens
            or self.total_output_used >= self.max_output_tokens
        )

    @property
    def cost_usd(self) -> float:
        """Estimate cost for claude-sonnet-4-5"""
        input_cost = (self.total_input_used / 1_000_000) * 3.00
        output_cost = (self.total_output_used / 1_000_000) * 15.00
        return input_cost + output_cost

    def summary(self) -> str:
        return (
            f"Input: {self.total_input_used:,} tokens | "
            f"Output: {self.total_output_used:,} tokens | "
            f"Est. cost: ${self.cost_usd:.4f}"
        )


def run_agent_with_budget(
    client: anthropic.Anthropic,
    initial_messages: list,
    system_prompt: str,
    tools: list,
    budget: BudgetGuardrail,
) -> str:
    messages = initial_messages.copy()

    while True:
        if budget.is_over_budget:
            return f"[BUDGET EXCEEDED] {budget.summary()}\nAgent stopped before task completion."

        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )
        budget.check_usage(response.usage)

        if response.stop_reason == "end_turn":
            return response.content[0].text

        if response.stop_reason == "tool_use":
            # process tool calls, append results, continue loop
            messages = _process_tool_calls(response, messages)
            continue

        break

    return "[AGENT STOPPED UNEXPECTEDLY]"
Set max_input_tokens based on your acceptable cost ceiling. For Sonnet at $3/MTok input, a 500K input token limit means your worst-case input cost per run is $1.50.
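If it is easier to reason in dollars, a tiny helper (assuming Sonnet's $3/MTok input price from the table above) converts a cost ceiling into the corresponding token budget:

def input_token_budget(cost_ceiling_usd: float, input_price_per_mtok: float = 3.00) -> int:
    # $1.50 ceiling at $3/MTok input -> 500,000 input tokens
    return int(cost_ceiling_usd / input_price_per_mtok * 1_000_000)

budget = BudgetGuardrail(max_input_tokens=input_token_budget(1.50))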
Refinements:
- Emit a warning at 80% of budget ("approaching limit") and give Claude a chance to wrap up (see the sketch after this list)
- Log budget summaries to your observability system so you can tune thresholds over time
- Set separate budgets per agent role (orchestrator vs. subagent) if using multi-agent architecture
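A minimal sketch of the first refinement, building on the BudgetGuardrail above (the 0.8 threshold is an illustrative assumption; tune it to your workload):

# Warn when spend crosses a soft threshold so the agent can wrap up gracefully.
SOFT_LIMIT_FRACTION = 0.8

def budget_warning(budget: BudgetGuardrail) -> str | None:
    input_frac = budget.total_input_used / budget.max_input_tokens
    output_frac = budget.total_output_used / budget.max_output_tokens
    if max(input_frac, output_frac) >= SOFT_LIMIT_FRACTION:
        return (
            "Budget notice: you have used over 80% of your token budget. "
            "Finish the task with the information you already have; avoid further tool calls."
        )
    return None

# In the agent loop, include the warning text in the next user turn
# (for example, alongside the tool results you are about to send back).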
Strategy 4: Token Compression
Savings potential: 40–70% on long-running sessions
In an agentic loop, conversation history grows unboundedly. Every turn appends the previous assistant response plus tool results. After 20+ turns, you may be passing 100K+ tokens of history that contains mostly intermediate reasoning that Claude no longer needs.
Token compression works by replacing old history with a summary:
def compress_history(
    client: anthropic.Anthropic,
    messages: list,
    keep_last_n: int = 4
) -> list:
    """
    Summarize messages[:-keep_last_n] and return a compressed history.
    Always keep the most recent N messages verbatim.
    """
    if len(messages) <= keep_last_n + 2:
        return messages  # nothing to compress

    to_summarize = messages[:-keep_last_n]
    recent = messages[-keep_last_n:]

    summary_response = client.messages.create(
        model="claude-haiku-4-5",  # cheap model for summarization
        max_tokens=512,
        messages=[
            {
                "role": "user",
                "content": (
                    "Summarize the following conversation history in 3-5 bullet points. "
                    "Focus on decisions made, information gathered, and current state. "
                    "Be concise.\n\n"
                    + "\n".join(
                        f"{m['role'].upper()}: {str(m['content'])[:500]}"
                        for m in to_summarize
                    )
                )
            }
        ]
    )
    summary_text = summary_response.content[0].text

    compressed = [
        {"role": "user", "content": f"[CONVERSATION SUMMARY]\n{summary_text}"},
        {"role": "assistant", "content": "Understood. Continuing from this context."},
    ] + recent
    return compressed
Run compression every N turns (e.g., every 10 turns) or when total history exceeds a token threshold. Choose the cut point so the kept messages start on a user turn and don't split a tool_use/tool_result pair, or the next API call will be rejected. Use Haiku for the summarization call; it's nearly 19× cheaper than Opus and perfectly capable of producing clean bullet-point summaries.
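One way to wire the trigger, using a rough character-based token estimate (the 4-characters-per-token heuristic and the 60K threshold are assumptions, not measured values):

# Compress when the estimated history size crosses a threshold.
# For an exact count, the Messages API also exposes client.messages.count_tokens.
COMPRESS_THRESHOLD_TOKENS = 60_000

def estimated_tokens(messages: list) -> int:
    return sum(len(str(m["content"])) for m in messages) // 4

def maybe_compress(client, messages: list) -> list:
    if estimated_tokens(messages) > COMPRESS_THRESHOLD_TOKENS:
        return compress_history(client, messages, keep_last_n=4)
    return messages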
Strategy 5: Batch API
Savings potential: 50% on qualifying workloads
The Anthropic Batch API processes requests asynchronously and returns results within 24 hours. In exchange, you get a 50% discount on both input and output tokens. For a complete code walkthrough and polling patterns, see the Batch API guide.
import anthropic
client = anthropic.Anthropic()
# Create a batch of up to 10,000 requests
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"request-{i}",
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": task}]
            }
        }
        for i, task in enumerate(task_list)
    ]
)
print(f"Batch ID: {batch.id}")
print(f"Processing {len(task_list)} requests at 50% discount")
Qualifying workloads:
- Document classification or data extraction pipelines
- Nightly report generation
- Bulk code review or linting
- Embeddings-adjacent tasks (entity extraction, tagging)
- A/B testing prompt variants
Not suitable for:
- Interactive user-facing features (requires real-time response)
- Tasks with dependencies between requests (batch requests are independent)
- Anything that needs results in under a few minutes
For a processing pipeline that currently costs $200/month, switching to the Batch API where possible often brings that down to $100–120/month, with no changes beyond restructuring the calls into batches.
Strategy 6: Tool Call Limiting
Savings potential: prevents runaway loops
Agents can get stuck in loops — repeatedly calling the same tool with slightly different arguments, failing to make progress. Without a cap, this exhausts your budget on wasted work.
MAX_TOOL_CALLS = 25  # reasonable limit for most tasks
tool_call_count = 0

while response.stop_reason == "tool_use":
    tool_call_count += 1

    if tool_call_count >= MAX_TOOL_CALLS:
        # Force the agent to stop and report what it has.
        # Append the assistant turn, then answer each pending tool_use with a
        # tool_result so the conversation stays valid, without running the tools.
        messages.append({"role": "assistant", "content": response.content})
        tool_results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": "Tool call cancelled: tool-call limit reached.",
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({
            "role": "user",
            "content": tool_results + [{
                "type": "text",
                "text": (
                    f"You have made {tool_call_count} tool calls. "
                    "Stop using tools and give your best answer with the information you have."
                ),
            }],
        })
        response = client.messages.create(
            model=model,
            max_tokens=2048,
            system=system_prompt,
            tools=tools,  # tools must stay defined because the history contains tool blocks
            tool_choice={"type": "none"},  # but disallow further tool use, forcing a text answer
            messages=messages,
        )
        break

    # normal tool processing...
Calibrate the limit to your task type. Simple tasks might need ≤5 tool calls; complex research agents might legitimately need 30–50. Start conservative and adjust upward based on observed behavior.
Complementary: tool result truncation
If a tool returns a large payload (e.g., a 50K-character file), truncate it before appending to context:
MAX_TOOL_RESULT_CHARS = 5000
def truncate_tool_result(result: str) -> str:
    if len(result) <= MAX_TOOL_RESULT_CHARS:
        return result
    return result[:MAX_TOOL_RESULT_CHARS] + f"\n[...truncated {len(result) - MAX_TOOL_RESULT_CHARS} chars]"
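Applied at the point where tool results are appended to the conversation (tool_use_block and raw_tool_output here stand in for whatever your loop already has):

# Truncate before the result enters the context, so every later turn stays smaller too.
messages.append({
    "role": "user",
    "content": [{
        "type": "tool_result",
        "tool_use_id": tool_use_block.id,
        "content": truncate_tool_result(raw_tool_output),
    }],
})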
Strategy 7: Structured Output
Savings potential: 20–40% fewer back-and-forth turns
When Claude can return a machine-readable response on the first try, you eliminate the clarification loop — the pattern where an agent asks a vague question, gets a partial answer, asks a follow-up, and so on.
Use the tools parameter with a strict input schema even when you don't need a function call — just to force structured output:
EXTRACT_TOOL = {
    "name": "extract_result",
    "description": "Extract the structured result from your analysis",
    "input_schema": {
        "type": "object",
        "properties": {
            "status": {
                "type": "string",
                "enum": ["pass", "fail", "needs_review"]
            },
            "confidence": {
                "type": "number",
                "minimum": 0,
                "maximum": 1
            },
            "summary": {
                "type": "string",
                "maxLength": 500
            },
            "issues": {
                "type": "array",
                "items": {"type": "string"}
            }
        },
        "required": ["status", "confidence", "summary"]
    }
}
response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    tools=[EXTRACT_TOOL],
    tool_choice={"type": "tool", "name": "extract_result"},  # force this tool
    messages=[{"role": "user", "content": document_to_analyze}]
)
# Response will always be a structured tool call, never a rambling text response
result = response.content[0].input
tool_choice: {"type": "tool", "name": "..."} forces Claude to always call the specified tool rather than choosing to respond in text. This guarantees you get parseable output every time and eliminates the need for output parsing or retry logic.
Putting It Together: Cost Comparison
A document processing agent running 1,000 documents/month:
| Approach | Est. monthly cost |
|---|---|
| Unoptimized (Opus, no caching, no routing) | ~$850 |
| After model routing (Haiku for extraction) | ~$120 |
| + Prompt caching on system prompt | ~$85 |
| + Batch API for overnight processing | ~$43 |
| + Token compression + tool limiting | ~$35 |
Real numbers will vary, but a reduction of 8–10× or more from a naive implementation to a fully optimized one is representative of production workloads.
FAQ
Q: Which strategy has the highest ROI for a new agent?
Start with model routing (Strategy 1) and prompt caching (Strategy 2). These require minimal code changes and consistently deliver the largest savings. Budget guardrails (Strategy 3) are also important to add early — they're your insurance against a bug causing runaway costs.
Q: Does prompt caching work for tool definitions?
Yes. Tool definitions can be large (hundreds of tokens each if you have detailed parameter descriptions), and they can be cached in place: add a cache_control breakpoint to the last tool in the tools array to cache all the tool definitions up to that point. Check the Anthropic caching documentation for the latest details on supported cache locations.
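A minimal sketch, assuming you already have a list of tool definitions like EXTRACT_TOOL above (search_tool and read_file_tool are hypothetical):

# Mark the last tool definition with cache_control; everything up to and including
# it (i.e. all the tool definitions) becomes part of the cached prefix.
tools = [search_tool, read_file_tool, extract_tool]  # hypothetical tool definitions
tools[-1] = {**tools[-1], "cache_control": {"type": "ephemeral"}}

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=tools,
    system=system_prompt,
    messages=conversation_history,
)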
Q: What's the minimum token count for a prompt cache hit?
The minimum cacheable block is 1,024 tokens for Sonnet and Opus and 2,048 tokens for Haiku as of April 2026. Blocks shorter than the minimum are processed normally with no caching discount.
Q: Can I use the Batch API with streaming?
No. The Batch API is asynchronous and does not support streaming. Use the standard Messages API for any real-time or streaming use case.
Q: How do I track costs per agent run in production?
Use the usage object returned in every API response — it contains input_tokens, output_tokens, cache_creation_input_tokens, and cache_read_input_tokens. Sum these across all calls in a run and multiply by the per-token prices for your model. Log these to your observability system (Datadog, Grafana, etc.) tagged by agent type and session ID.
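A minimal per-run tracker along these lines, using the Sonnet prices from the table above (the cache-write rate assumes the standard 1.25× premium for the 5-minute cache):

from dataclasses import dataclass

# Per-model prices in $/MTok: cache reads at 10% of input, cache writes at 1.25x input.
SONNET_PRICES = {"input": 3.00, "output": 15.00, "cache_read": 0.30, "cache_write": 3.75}

@dataclass
class RunCost:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read: int = 0
    cache_write: int = 0

    def add(self, usage) -> None:
        self.input_tokens += usage.input_tokens
        self.output_tokens += usage.output_tokens
        self.cache_read += usage.cache_read_input_tokens or 0
        self.cache_write += usage.cache_creation_input_tokens or 0

    def usd(self, prices: dict = SONNET_PRICES) -> float:
        return (
            self.input_tokens * prices["input"]
            + self.output_tokens * prices["output"]
            + self.cache_read * prices["cache_read"]
            + self.cache_write * prices["cache_write"]
        ) / 1_000_000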
Sources
- Anthropic API pricing (April 2026): anthropic.com/pricing
- Prompt caching documentation: docs.anthropic.com/claude/prompt-caching
- Message Batches API: docs.anthropic.com/claude/message-batches
- Tool use documentation: docs.anthropic.com/claude/tool-use
- Claude Agent SDK overview: docs.anthropic.com/claude/agent-sdk
Related guides
- Prompt Caching: The 90% Discount Most Devs Miss — step-by-step implementation with cache_control
- Claude Batch API Guide — 50% discount for async workloads
- How to Ship 10x Faster with AI — workflow patterns that cut development time, not just API costs
Take It Further
Claude Agent SDK Cookbook: 40 Production Patterns — 40 battle-tested patterns for Claude agents in production. Retry logic, tool error recovery, parallel sub-agents, cost guardrails, deterministic testing.
167 pages. Complete, runnable Python and TypeScript code throughout.
→ Get Agent SDK Cookbook — $49
30-day money-back guarantee. Instant download.