
How to Limit Claude Agent Costs: 7 Concrete Strategies


The fastest way to control Claude agent costs: route simple tasks to Haiku ($0.80/MTok input), cache repeated system prompts (up to 90% off cached tokens), and set a hard token budget per session that triggers a graceful shutdown. Combined, these three strategies alone can cut a typical agentic workload bill by 70–85% before touching anything else. The remaining five strategies below push further. For the full pricing table and model-by-model breakdown, see Claude API pricing 2026.


Why Agent Costs Can Spiral

A single-turn Claude API call is easy to cost-estimate: tokens in × price + tokens out × price. Agents are different.

In an agentic loop, every tool call appends more tokens to the conversation history. A 10-step agent run on Opus with a 50K-token context doesn't cost 10× a single call — it costs roughly 10 + 9 + 8 + ... + 1 = 55× the base input price for the accumulated context alone. Add tool outputs that include long JSON responses or file contents, and a single agent run can consume millions of tokens.
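This growth is easy to model. A minimal sketch, assuming Sonnet input pricing, a 50K-token starting context, and ~2K tokens appended per step (both numbers are illustrative assumptions):

```python
# Estimate how input-token cost accumulates across an agentic loop:
# every step re-sends the entire conversation history so far.

SONNET_INPUT_PRICE_PER_MTOK = 3.00  # $/MTok
BASE_CONTEXT_TOKENS = 50_000
TOKENS_ADDED_PER_STEP = 2_000       # assumed tool output + assistant turn

def estimate_loop_input_cost(steps: int) -> float:
    total_tokens = 0
    context = BASE_CONTEXT_TOKENS
    for _ in range(steps):
        total_tokens += context           # whole history is re-sent each call
        context += TOKENS_ADDED_PER_STEP  # history grows every turn
    return (total_tokens / 1_000_000) * SONNET_INPUT_PRICE_PER_MTOK

print(f"1 step:   ${estimate_loop_input_cost(1):.2f}")
print(f"10 steps: ${estimate_loop_input_cost(10):.2f}")
```

Under these assumptions a single call costs $0.15 in input tokens while the 10-step run costs $1.77 — roughly 12× more, not 10×, because the context keeps growing as it is re-sent.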

The compounding factors:

- Context accumulation: every step re-sends the full history, so input cost grows quadratically with step count
- Verbose tool outputs: raw JSON payloads and file contents inflate the history far faster than the model's own text
- One-size-fits-all model choice: running every step on Opus when most steps are Haiku-grade work
- No spending cap: a stuck loop or a buggy prompt can burn tokens indefinitely

The strategies below address each of these root causes directly.


Strategy 1: Model Routing

Savings potential: 60–90% on model cost

Claude's three production tiers as of April 2026:

Model | Input price | Output price | Best for
claude-haiku-4-5 | $0.80/MTok | $4.00/MTok | Classification, extraction, simple Q&A
claude-sonnet-4-5 | $3.00/MTok | $15.00/MTok | Code generation, analysis, multi-step reasoning
claude-opus-4-5 | $15.00/MTok | $75.00/MTok | Complex judgment, novel problems, creative work

Most production agents do 80% of their work at a Haiku level. Routing correctly means using Haiku by default and escalating only when needed. For a detailed comparison of when each model earns its cost, see Haiku vs Sonnet vs Opus: which model to use.

Routing heuristics:

def select_model(task_type: str, complexity_score: int) -> str:
    """
    complexity_score: 0-10 estimate of task difficulty
    """
    if task_type in ("classify", "extract", "summarize_short"):
        return "claude-haiku-4-5"
    elif complexity_score <= 5:
        return "claude-sonnet-4-5"
    else:
        return "claude-opus-4-5"

A concrete example: a document processing agent that classifies documents, extracts fields, and flags edge cases for human review. Use Haiku for classification and extraction (fast, cheap, accurate for structured tasks), Sonnet for summarization and report generation, and Opus only for flagged edge cases requiring nuanced judgment.
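That staged split can be captured in a simple lookup table — the stage names below are illustrative, not part of any SDK:

```python
# Map each pipeline stage of the hypothetical document agent to the
# cheapest model that handles it reliably.
STAGE_MODEL = {
    "classify": "claude-haiku-4-5",
    "extract": "claude-haiku-4-5",
    "summarize": "claude-sonnet-4-5",
    "report": "claude-sonnet-4-5",
    "edge_case_review": "claude-opus-4-5",
}

def model_for_stage(stage: str) -> str:
    # Fall back to Sonnet for stages not listed explicitly
    return STAGE_MODEL.get(stage, "claude-sonnet-4-5")
```

A table like this is easier to audit than scattered if/else routing: the cost profile of the whole pipeline is visible in one place.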


Strategy 2: Prompt Caching

Savings potential: up to 90% on repeated context

Prompt caching lets Claude reuse previously processed tokens. Cached input tokens cost 10% of the standard input price on Anthropic's API; writing a block to the cache costs a one-time 25% premium over the base input price. For agents with a large, stable system prompt — tool definitions, instructions, background context — caching that section means you pay the write premium once and 10% on every subsequent call.

To enable caching, add cache_control breakpoints to your messages:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": YOUR_LARGE_SYSTEM_PROMPT,  # 10K+ tokens
            "cache_control": {"type": "ephemeral"}  # cache this block
        }
    ],
    messages=conversation_history
)

What to cache:

- The system prompt: instructions, persona, background context
- Tool definitions, which can run to thousands of tokens
- Large reference documents that stay constant across calls

What not to cache:

- The moving tail of the conversation, which changes every turn
- Content that varies per request: user data, timestamps, retrieved documents
- Blocks below the minimum cacheable size, which get no discount anyway

Cache entries live for 5 minutes, and the TTL refreshes on every cache hit. For agents that call Claude repeatedly within a session, the system prompt will stay warm throughout.
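A back-of-envelope comparison makes the payoff concrete. This sketch assumes Sonnet pricing and the published cache multipliers (1.25× base input for a cache write, 0.10× for a cache read):

```python
# Compare session cost for a stable system prompt with and without caching.
BASE = 3.00          # $/MTok standard Sonnet input
WRITE = BASE * 1.25  # one-time cache write premium
READ = BASE * 0.10   # every subsequent cache hit

def session_cost(prompt_mtok: float, calls: int, cached: bool) -> float:
    if not cached:
        return prompt_mtok * BASE * calls
    # First call writes the cache; the remaining calls read it.
    return prompt_mtok * (WRITE + READ * (calls - 1))

# 10K-token system prompt, 50 calls in one session:
print(f"uncached: ${session_cost(0.010, 50, cached=False):.3f}")
print(f"cached:   ${session_cost(0.010, 50, cached=True):.3f}")
```

For this example the cached session costs roughly 12% of the uncached one — close to the headline "up to 90% off" figure, with the gap explained by the one-time write premium.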


Strategy 3: Budget Guardrails

Savings potential: prevents runaway costs entirely

A budget guardrail is a hard check that monitors token spend during an agent run and halts the agent gracefully before it exceeds your threshold.

Here is a complete Python implementation:

import anthropic
from dataclasses import dataclass, field

@dataclass
class BudgetGuardrail:
    max_input_tokens: int = 500_000
    max_output_tokens: int = 50_000
    total_input_used: int = 0
    total_output_used: int = 0

    def check_usage(self, usage: anthropic.types.Usage) -> None:
        self.total_input_used += usage.input_tokens
        self.total_output_used += usage.output_tokens

    @property
    def is_over_budget(self) -> bool:
        return (
            self.total_input_used >= self.max_input_tokens
            or self.total_output_used >= self.max_output_tokens
        )

    @property
    def cost_usd(self) -> float:
        """Estimate cost for claude-sonnet-4-5"""
        input_cost = (self.total_input_used / 1_000_000) * 3.00
        output_cost = (self.total_output_used / 1_000_000) * 15.00
        return input_cost + output_cost

    def summary(self) -> str:
        return (
            f"Input: {self.total_input_used:,} tokens | "
            f"Output: {self.total_output_used:,} tokens | "
            f"Est. cost: ${self.cost_usd:.4f}"
        )


def run_agent_with_budget(
    client: anthropic.Anthropic,
    initial_messages: list,
    system_prompt: str,
    tools: list,
    budget: BudgetGuardrail,
) -> str:
    messages = initial_messages.copy()

    while True:
        if budget.is_over_budget:
            return f"[BUDGET EXCEEDED] {budget.summary()}\nAgent stopped before task completion."

        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )

        budget.check_usage(response.usage)

        if response.stop_reason == "end_turn":
            return response.content[0].text

        if response.stop_reason == "tool_use":
            # process tool calls, append results, continue loop
            messages = _process_tool_calls(response, messages)
            continue

        break

    return "[AGENT STOPPED UNEXPECTEDLY]"

Set max_input_tokens based on your acceptable cost ceiling. For Sonnet at $3/MTok input, a 500K input token limit means your worst-case input cost per run is $1.50.

Refinements:

- Count cache_creation_input_tokens and cache_read_input_tokens separately, at their own rates, so the cost estimate stays accurate once caching is enabled
- Emit a warning at a soft threshold (e.g. 80% of budget) so the agent can wrap up cleanly instead of being cut off mid-task
- Parameterize the price constants per model instead of hard-coding Sonnet rates


Strategy 4: Token Compression

Savings potential: 40–70% on long-running sessions

In an agentic loop, conversation history grows unboundedly. Every turn appends the previous assistant response plus tool results. After 20+ turns, you may be passing 100K+ tokens of history that contains mostly intermediate reasoning that Claude no longer needs.

Token compression works by replacing old history with a summary:

def compress_history(
    client: anthropic.Anthropic,
    messages: list,
    keep_last_n: int = 4
) -> list:
    """
    Summarize messages[:-keep_last_n] and return a compressed history.
    Always keep the most recent N messages verbatim.
    """
    if len(messages) <= keep_last_n + 2:
        return messages  # nothing to compress

    to_summarize = messages[:-keep_last_n]
    recent = messages[-keep_last_n:]

    summary_response = client.messages.create(
        model="claude-haiku-4-5",  # cheap model for summarization
        max_tokens=512,
        messages=[
            {
                "role": "user",
                "content": (
                    "Summarize the following conversation history in 3-5 bullet points. "
                    "Focus on decisions made, information gathered, and current state. "
                    "Be concise.\n\n"
                    + "\n".join(
                        f"{m['role'].upper()}: {str(m['content'])[:500]}"
                        for m in to_summarize
                    )
                )
            }
        ]
    )

    summary_text = summary_response.content[0].text
    compressed = [
        {"role": "user", "content": f"[CONVERSATION SUMMARY]\n{summary_text}"},
        {"role": "assistant", "content": "Understood. Continuing from this context."},
    ] + recent

    return compressed

Run compression every N turns (e.g., every 10 turns) or when total history exceeds a token threshold. Use Haiku for the summarization call — it's 18× cheaper than Opus and perfectly capable of producing clean bullet-point summaries.
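One possible trigger policy — the cadence and thresholds here are assumptions to tune for your workload, not SDK recommendations:

```python
# Decide when to run compress_history: on a fixed turn cadence, or when a
# rough token estimate says the history has grown too large.

def estimate_tokens(messages: list) -> int:
    # Crude heuristic: roughly 4 characters per token for English text
    return sum(len(str(m["content"])) for m in messages) // 4

def should_compress(messages: list, turn: int) -> bool:
    return (turn > 0 and turn % 10 == 0) or estimate_tokens(messages) > 60_000
```

The character-count heuristic avoids an extra tokenization pass; if you need precision, count tokens from the usage object of the previous API response instead.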


Strategy 5: Batch API

Savings potential: 50% on qualifying workloads

The Anthropic Batch API processes requests asynchronously and returns results within 24 hours. In exchange, you get a 50% discount on both input and output tokens. For a complete code walkthrough and polling patterns, see the Batch API guide.

import anthropic

client = anthropic.Anthropic()

# Create a batch of up to 10,000 requests
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"request-{i}",
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": task}]
            }
        }
        for i, task in enumerate(task_list)
    ]
)

print(f"Batch ID: {batch.id}")
print(f"Processing {len(task_list)} requests at 50% discount")

Qualifying workloads:

- Overnight or scheduled document processing
- Bulk classification, extraction, and data-labeling jobs
- Evaluation suites and regression runs

Not suitable for:

- Interactive chat or anything user-facing
- Streaming responses (the Batch API does not support streaming)
- Pipelines where a downstream step blocks on the result within minutes

For a processing pipeline that currently costs $200/month, moving the eligible work to the Batch API often brings that to $100–120/month, with no changes beyond the submission and retrieval pattern.


Strategy 6: Tool Call Limiting

Savings potential: prevents runaway loops

Agents can get stuck in loops — repeatedly calling the same tool with slightly different arguments, failing to make progress. Without a cap, this exhausts your budget on wasted work.

MAX_TOOL_CALLS = 25  # reasonable limit for most tasks

tool_call_count = 0

while response.stop_reason == "tool_use":
    tool_call_count += 1

    if tool_call_count >= MAX_TOOL_CALLS:
        # Force the agent to stop and report what it has
        messages.append({
            "role": "user",
            "content": (
                f"You have made {tool_call_count} tool calls. "
                "Stop using tools and give your best answer with the information you have."
            )
        })
        response = client.messages.create(
            model=model,
            max_tokens=2048,
            system=system_prompt,
            messages=messages,
            # Note: do NOT pass tools here — forces text response
        )
        break

    # normal tool processing...

Calibrate the limit to your task type. Simple tasks might need ≤5 tool calls; complex research agents might legitimately need 30–50. Start conservative and adjust upward based on observed behavior.

Complementary: tool result truncation
If a tool returns a large payload (e.g., a 50K-character file), truncate it before appending to context:

MAX_TOOL_RESULT_CHARS = 5000

def truncate_tool_result(result: str) -> str:
    if len(result) <= MAX_TOOL_RESULT_CHARS:
        return result
    return result[:MAX_TOOL_RESULT_CHARS] + f"\n[...truncated {len(result) - MAX_TOOL_RESULT_CHARS} chars]"

Strategy 7: Structured Output

Savings potential: 20–40% fewer back-and-forth turns

When Claude can return a machine-readable response on the first try, you eliminate the clarification loop — the pattern where an agent asks a vague question, gets a partial answer, asks a follow-up, and so on.

Use the tools parameter with a strict input schema even when you don't need a function call — just to force structured output:

EXTRACT_TOOL = {
    "name": "extract_result",
    "description": "Extract the structured result from your analysis",
    "input_schema": {
        "type": "object",
        "properties": {
            "status": {
                "type": "string",
                "enum": ["pass", "fail", "needs_review"]
            },
            "confidence": {
                "type": "number",
                "minimum": 0,
                "maximum": 1
            },
            "summary": {
                "type": "string",
                "maxLength": 500
            },
            "issues": {
                "type": "array",
                "items": {"type": "string"}
            }
        },
        "required": ["status", "confidence", "summary"]
    }
}

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    tools=[EXTRACT_TOOL],
    tool_choice={"type": "tool", "name": "extract_result"},  # force this tool
    messages=[{"role": "user", "content": document_to_analyze}]
)

# Response will always be a structured tool call, never a rambling text response
result = response.content[0].input

tool_choice: {"type": "tool", "name": "..."} forces Claude to always call the specified tool rather than choosing to respond in text. This guarantees you get parseable output every time and eliminates the need for output parsing or retry logic.


Putting It Together: Cost Comparison

A document processing agent running 1,000 documents/month:

Approach | Est. monthly cost
Unoptimized (Opus, no caching, no routing) | ~$850
After model routing (Haiku for extraction) | ~$120
+ Prompt caching on system prompt | ~$85
+ Batch API for overnight processing | ~$43
+ Token compression + tool limiting | ~$35

Real numbers will vary — but a 10–25× reduction from a naive implementation to a fully optimized one, as in the progression above, is representative of production workloads.


FAQ

Q: Which strategy has the highest ROI for a new agent?
Start with model routing (Strategy 1) and prompt caching (Strategy 2). These require minimal code changes and consistently deliver the largest savings. Budget guardrails (Strategy 3) are also important to add early — they're your insurance against a bug causing runaway costs.

Q: Does prompt caching work for tool definitions?
Yes. Tool definitions can be large (hundreds of tokens each if you have detailed parameter descriptions). Including them in a cached system block instead of the tools array can reduce costs if you're making many calls with the same tools. Check the Anthropic caching documentation for the latest supported cache locations.

Q: What's the minimum token count for a prompt cache hit?
The minimum cacheable block is 1,024 tokens for Sonnet and Opus and 2,048 tokens for Haiku as of April 2026. Blocks shorter than the minimum are processed normally with no caching discount.

Q: Can I use the Batch API with streaming?
No. The Batch API is asynchronous and does not support streaming. Use the standard Messages API for any real-time or streaming use case.

Q: How do I track costs per agent run in production?
Use the usage object returned in every API response — it contains input_tokens, output_tokens, cache_creation_input_tokens, and cache_read_input_tokens. Sum these across all calls in a run and multiply by the per-token prices for your model. Log these to your observability system (Datadog, Grafana, etc.) tagged by agent type and session ID.
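A sketch of that accounting, assuming Sonnet base prices and the published cache multipliers (0.10× input for cache reads, 1.25× for cache writes); the usage dicts below are illustrative stand-ins for the API's usage objects:

```python
# Sum per-call cost from the four usage fields returned by the Messages API.
PRICES = {"input": 3.00, "output": 15.00}  # $/MTok, claude-sonnet-4-5

def call_cost(usage: dict) -> float:
    mtok = 1_000_000
    return (
        usage.get("input_tokens", 0) / mtok * PRICES["input"]
        + usage.get("output_tokens", 0) / mtok * PRICES["output"]
        + usage.get("cache_creation_input_tokens", 0) / mtok * PRICES["input"] * 1.25
        + usage.get("cache_read_input_tokens", 0) / mtok * PRICES["input"] * 0.10
    )

# Two calls in one agent run: an uncached first call, then a cache-assisted one.
run_total = sum(call_cost(u) for u in [
    {"input_tokens": 12_000, "output_tokens": 800},
    {"input_tokens": 2_000, "output_tokens": 600, "cache_read_input_tokens": 10_000},
])
print(f"run cost: ${run_total:.4f}")
```

Log the per-call and per-run totals tagged by agent type and session ID, and cost regressions show up in your dashboards before they show up on the invoice.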




Take It Further

Claude Agent SDK Cookbook: 40 Production Patterns — 40 battle-tested patterns for Claude agents in production. Retry logic, tool error recovery, parallel sub-agents, cost guardrails, deterministic testing.

167 pages. Complete, runnable Python and TypeScript code throughout.

→ Get Agent SDK Cookbook — $49

30-day money-back guarantee. Instant download.

AI Disclosure: Drafted with Claude Code; all pricing and feature details from official documentation as of April 2026.
