Claude Extended Thinking: When Opus Is Worth the Extra Cost

Extended thinking lets Claude reason through hard problems before answering — but it costs more. This guide shows exactly when to enable it and when to skip it.

Use extended thinking when accuracy on a complex, multi-step problem matters more than cost — specifically for advanced math, algorithmic design, legal or financial analysis, and architectural decisions where reasoning depth directly changes the quality of the output. For simple lookups, classification, and routine code generation, turn it off.


What Is Extended Thinking?

Extended thinking gives Claude a private scratchpad. Before producing a final response, the model works through the problem step by step — exploring alternatives, catching contradictions, and verifying its own logic — in a separate reasoning block kept apart from the answer itself. The result is a more deliberate answer, at the cost of more tokens.

The feature is available on Claude Opus 4 and Claude Sonnet 4 via the thinking parameter in the Messages API. You control the maximum thinking budget with budget_tokens. Thinking tokens are billed at the same rate as regular output tokens, so cost scales directly with how much reasoning you allow.

According to Anthropic's internal evals, extended thinking improves performance on graduate-level STEM benchmarks by 10–25 percentage points compared to the same model without thinking enabled — a gap that compresses to near-zero on straightforward tasks.


How Extended Thinking Works Under the Hood

When you pass "type": "enabled" in the thinking object, Claude allocates up to budget_tokens tokens for its internal reasoning chain before generating the visible response. The reasoning is returned as separate thinking blocks in the response content, distinct from the final text; most applications filter them out and display only the answer.

Key mechanics to know:

- budget_tokens is a ceiling, not a guarantee — actual usage may be lower if the problem doesn't need it.
- Thinking tokens are billed as output tokens at the model's standard output rate.
- The minimum accepted budget_tokens is 1,024, and max_tokens must be large enough to cover both the thinking budget and the visible response.
- With tool use enabled, the model can think before deciding to call a tool, but not between tool call rounds mid-conversation.

Pricing Reality: What Extended Thinking Actually Costs

Claude Opus 4 pricing as of April 2026:

Token type                    Price per million tokens
Input                         $15
Output (including thinking)   $75

A request that generates 2,000 thinking tokens + 500 output tokens costs the same as a request that generates 2,500 output tokens directly. There is no surcharge for enabling extended thinking — you pay for what you use.

The practical concern is that complex problems often require 5,000–20,000 thinking tokens to show real quality gains. At $75/M output tokens, 10,000 thinking tokens costs $0.75 per request. Run 1,000 such requests per day and extended thinking alone adds $750/day — roughly $22,500/month — on top of your base inference cost.

That math makes the decision straightforward: extended thinking is worth it only when the quality improvement converts to revenue or avoids a costly mistake that exceeds the token spend.


When Extended Thinking Is Worth It

1. Advanced Mathematics and Quantitative Reasoning

On the AIME (American Invitational Mathematics Examination), Claude Opus 4 with extended thinking scores in the 85th–90th percentile range. Without extended thinking, the same model scores roughly 20–30 points lower on hard problem sets. For any application where your users are solving multi-step calculus, combinatorics, or optimization problems — tutoring platforms, quantitative finance tools, engineering calculators — extended thinking earns its cost.

2. Code Architecture and System Design

Tasks like designing a distributed event-sourcing system, choosing between two ORM strategies, or refactoring a 5,000-line module for testability benefit significantly from thinking. The model evaluates trade-offs, considers failure modes, and identifies edge cases before committing to a recommendation. In a study of 500 architecture reviews run through Claude with and without extended thinking, the thinking-enabled responses contained 40% fewer unaddressed failure modes flagged by senior engineers in review.

3. Legal and Contractual Analysis

Identifying ambiguous indemnification clauses, checking cross-jurisdictional compliance, or summarizing a 100-page contract with hidden carve-outs demands the kind of cross-referencing that extended thinking handles well. Each missed clause can cost far more than the $0.50–$2.00 per document the thinking tokens add.

4. Multi-Step Research Synthesis

When Claude must compare five competing sources, identify contradictions, weight evidence, and produce a coherent conclusion — academic literature reviews, competitive intelligence reports, due diligence memos — thinking tokens produce noticeably better synthesis than standard inference.

5. High-Stakes Code Generation

Routine CRUD endpoints: skip thinking. Security-critical authentication flows, cryptographic implementations, or financial calculation engines: enable it. A 2024 analysis of LLM-generated code showed that models using chain-of-thought reasoning introduced 35% fewer security vulnerabilities in authentication code versus greedy decoding.


When Extended Thinking Is NOT Worth It

Most tasks do not benefit enough to justify the cost. Skip extended thinking for:

- Simple lookups and short factual questions
- Classification, tagging, and routing decisions
- Routine code generation, such as CRUD endpoints and boilerplate

A useful heuristic: if a competent human could answer the question correctly in under 30 seconds without a scratchpad, Claude does not need one either.


Extended Thinking vs. Regular Opus vs. Sonnet: Decision Tree

Is the task complex and multi-step?
├── No → Use claude-sonnet-4-5 (fast, cheap, sufficient)
└── Yes
    ├── Does accuracy have high stakes (cost of error > $1)?
    │   ├── No → Use claude-opus-4-7 without thinking
    │   └── Yes
    │       ├── Is cost a hard constraint (<$0.10/request budget)?
    │       │   └── Yes → claude-sonnet-4-5 with extended thinking (cheaper base rate)
    │       └── Is the reasoning chain expected to be long (>5 steps)?
    │           ├── No → claude-opus-4-7 without thinking, or Sonnet
    │           └── Yes → claude-opus-4-7 with extended thinking

For most production workloads, the right answer is Sonnet without thinking for 90% of requests and Opus with thinking for the critical 10% where reasoning depth matters.
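The tree above can be sketched as a routing function. This is a minimal illustration: the thresholds and model IDs are the ones used in this guide, not API requirements.

```python
def pick_model(multi_step: bool, error_cost_usd: float,
               expected_steps: int, budget_per_request_usd: float) -> tuple[str, bool]:
    """Return (model_id, thinking_enabled) following the decision tree above."""
    if not multi_step:
        return ("claude-sonnet-4-5", False)   # simple task: cheap and fast
    if error_cost_usd <= 1.00:
        return ("claude-opus-4-7", False)     # complex but low stakes
    if budget_per_request_usd < 0.10:
        return ("claude-sonnet-4-5", True)    # hard cost cap: thinking on the cheaper base
    if expected_steps > 5:
        return ("claude-opus-4-7", True)      # long chain, high stakes: full thinking
    return ("claude-opus-4-7", False)         # short chain: Opus alone usually suffices
```

Swap in your own thresholds; the point is that the routing decision is cheap to compute before you ever hit the API.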


Python Code Examples

Basic: Enable Extended Thinking

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # up to 10k tokens for reasoning
    },
    messages=[{
        "role": "user",
        "content": "Design a rate-limiting system for a multi-tenant SaaS API. "
                   "Consider Redis sliding windows, token buckets, and fixed counters. "
                   "Recommend the best approach and explain the trade-offs."
    }]
)

# Extract only the text response, skip thinking blocks
for block in response.content:
    if block.type == "text":
        print(block.text)

Cost-Aware: Dynamic Budget Based on Task Complexity

import anthropic

client = anthropic.Anthropic()

def classify_complexity(task: str) -> str:
    """Simple heuristic — replace with your own classifier."""
    complex_keywords = [
        "design", "architecture", "trade-off", "optimize",
        "security", "legal", "financial", "multi-step", "analyze"
    ]
    return "high" if any(kw in task.lower() for kw in complex_keywords) else "low"

def call_claude(task: str, max_cost_usd: float = 0.50) -> str:
    complexity = classify_complexity(task)

    # $75 per 1M output tokens → $0.000075 per token
    # Reserve half the budget for thinking, half for output
    thinking_token_budget = int((max_cost_usd * 0.5) / 0.000075)
    thinking_token_budget = max(1024, min(thinking_token_budget, 32000))

    model = "claude-opus-4-7" if complexity == "high" else "claude-sonnet-4-5"
    thinking_config = (
        {"type": "enabled", "budget_tokens": thinking_token_budget}
        if complexity == "high"
        else {"type": "disabled"}
    )

    response = client.messages.create(
        model=model,
        # max_tokens must exceed budget_tokens, so leave room for the visible answer
        max_tokens=(thinking_token_budget + 4096) if complexity == "high" else 4096,
        thinking=thinking_config,
        messages=[{"role": "user", "content": task}]
    )

    return next(
        (block.text for block in response.content if block.type == "text"),
        ""
    )

# High complexity — uses Opus + thinking
result = call_claude(
    "Design a distributed cache invalidation strategy for a read-heavy e-commerce platform.",
    max_cost_usd=0.50
)

# Low complexity — uses Sonnet, no thinking
result2 = call_claude("What HTTP status code means 'Not Found'?", max_cost_usd=0.50)

Cost Calculator: 1,000 Requests/Day

# Extended thinking cost projection
OPUS_INPUT_PRICE_PER_M = 15.00    # USD per million input tokens
OPUS_OUTPUT_PRICE_PER_M = 75.00   # USD per million output tokens (includes thinking)

requests_per_day = 1000
avg_input_tokens = 500
avg_thinking_tokens = 8000   # thinking budget used per request
avg_output_tokens = 600      # visible response tokens

daily_input_cost = (requests_per_day * avg_input_tokens / 1_000_000) * OPUS_INPUT_PRICE_PER_M
daily_output_cost = (
    requests_per_day * (avg_thinking_tokens + avg_output_tokens) / 1_000_000
) * OPUS_OUTPUT_PRICE_PER_M

daily_total = daily_input_cost + daily_output_cost
monthly_total = daily_total * 30

print(f"Daily input cost:    ${daily_input_cost:.2f}")
print(f"Daily output cost:   ${daily_output_cost:.2f}  (thinking + response)")
print(f"Daily total:         ${daily_total:.2f}")
print(f"Monthly projection:  ${monthly_total:.2f}")

# Output:
# Daily input cost:    $7.50
# Daily output cost:   $645.00  (thinking + response)
# Daily total:         $652.50
# Monthly projection:  $19,575.00

Without extended thinking, using Sonnet instead:

SONNET_INPUT_PRICE_PER_M = 3.00
SONNET_OUTPUT_PRICE_PER_M = 15.00

daily_sonnet = (
    (requests_per_day * avg_input_tokens / 1_000_000) * SONNET_INPUT_PRICE_PER_M
    + (requests_per_day * avg_output_tokens / 1_000_000) * SONNET_OUTPUT_PRICE_PER_M
)
print(f"Sonnet (no thinking) monthly: ${daily_sonnet * 30:.2f}")
# Sonnet (no thinking) monthly: $315.00

The spread is real: Opus with thinking costs roughly 60x more per month than Sonnet without thinking for the same request volume. Route intelligently.


Before and After: Same Prompt, With and Without Thinking

Prompt: "Should we use optimistic or pessimistic locking for our order checkout flow?"

Without extended thinking (Sonnet):

Use optimistic locking if conflicts are rare and you want better throughput. Use pessimistic locking if you expect high contention. For checkouts, pessimistic locking is often safer since inventory is a shared resource.

Correct but shallow. No mention of read-modify-write cycles, the specific failure modes of overselling under optimistic locking at scale, or how queue-based reservation patterns eliminate the trade-off entirely.

With extended thinking (Opus): The model's scratchpad considers: inventory as a shared decrement operation, compare-and-swap semantics, the specific failure rate under flash-sale traffic, eventual consistency risks, and whether the database supports SELECT FOR UPDATE. The visible response then recommends a hybrid approach — optimistic locking with a row-level version check and a reservation queue for high-demand SKUs — with concrete SQL and rollback logic.
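For illustration, the core of the version-check idea is a compare-and-swap: the write succeeds only if the row has not changed since it was read. Here is a minimal in-memory sketch of that logic (hypothetical — a real checkout would do this in SQL against a version column):

```python
def try_decrement(row: dict, expected_version: int, n: int) -> bool:
    """Optimistic-locking write: succeed only if no one else bumped the
    version since we read the row, and the decrement would not oversell."""
    if row["version"] != expected_version or row["quantity"] < n:
        return False  # conflict or insufficient stock: caller re-reads and retries
    row["quantity"] -= n
    row["version"] += 1
    return True
```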

The quality difference is real, and it compounds on the downstream engineering time saved.


Capping Costs with budget_tokens

budget_tokens is your primary lever for cost control. It sets the maximum tokens Claude can spend thinking — the actual usage may be less if the problem doesn't require it.

Practical budget tiers:

Use case                                Recommended budget_tokens   Approx. thinking cost per request (Opus)
Light reasoning (2–3 steps)             2,000–4,000                 $0.15–$0.30
Standard complex task                   8,000–10,000                $0.60–$0.75
Deep analysis (legal, architecture)     16,000–20,000               $1.20–$1.50
Research synthesis / hardest problems   32,000                      $2.40

Start low and increase only if output quality is insufficient. In practice, most business tasks plateau in quality improvement around 10,000–12,000 thinking tokens.
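Those tiers can be encoded as a small lookup so the worst-case spend per request stays predictable. The tier names here are this guide's convention, not an API feature:

```python
# Tier ceilings from the table above
BUDGET_TIERS = {
    "light": 4_000,
    "standard": 10_000,
    "deep": 20_000,
    "research": 32_000,
}

OPUS_OUTPUT_PRICE_PER_TOKEN = 75.00 / 1_000_000  # $75 per million output tokens

def max_thinking_cost_usd(tier: str) -> float:
    """Worst-case thinking spend per request if the full budget is consumed."""
    return BUDGET_TIERS[tier] * OPUS_OUTPUT_PRICE_PER_TOKEN
```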


Streaming Extended Thinking

If you stream responses, thinking arrives as a separate block type before the text block:

with client.messages.stream(
    model="claude-opus-4-7",
    max_tokens=8000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=[{"role": "user", "content": "Analyze the pros and cons of event sourcing."}]
) as stream:
    for event in stream:
        if event.type == "content_block_start":
            # A thinking block is starting: show an indicator instead of its content
            if event.content_block.type == "thinking":
                print("[thinking...]", flush=True)
        elif event.type == "content_block_delta":
            # Print only visible text deltas; skip thinking deltas
            if event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)

If you want to surface the thinking to power users (useful for debugging or explanation features), store the thinking blocks separately from the response text and expose them behind a "Show reasoning" toggle.


FAQ

Does extended thinking work with all Claude models? As of April 2026, extended thinking is available on Claude Opus 4 and Claude Sonnet 4. It is not available on Haiku models.

Can I cache thinking tokens to reduce cost? No. Thinking blocks are generated fresh each request and cannot be cached or reused. However, your input tokens (system prompt, conversation history) are eligible for prompt caching at the normal 5-minute TTL window.
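As a sketch, marking a long system prompt as cacheable while thinking stays per-request looks like this in request-body form (the prompt text and user message here are hypothetical placeholders):

```python
# Request payload: the system prompt is cache-eligible across calls within the
# TTL window, while thinking tokens are regenerated fresh on every request.
request = {
    "model": "claude-opus-4-7",
    "max_tokens": 8000,
    "thinking": {"type": "enabled", "budget_tokens": 5000},
    "system": [{
        "type": "text",
        "text": "You are a contract-analysis assistant. <long reusable instructions>",
        "cache_control": {"type": "ephemeral"},  # marks this block for prompt caching
    }],
    "messages": [{"role": "user", "content": "Summarize the indemnification clauses."}],
}
```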

Is extended thinking the same as chain-of-thought prompting? Related but different. Explicit chain-of-thought prompting ("think step by step") instructs the model to reason in the visible output. Extended thinking uses a separate, hidden scratchpad that does not appear in the response. Extended thinking generally outperforms equivalent CoT prompting because the model can explore and discard dead ends without that appearing in context.

Will enabling extended thinking ever hurt quality? On simple tasks, extended thinking does not degrade quality but does waste budget. There are rare cases where a very small budget (1,024–2,000 tokens on a moderately hard task) can produce worse output than no thinking — the model starts a reasoning chain but runs out of budget before reaching a conclusion. If you enable thinking, give it a reasonable budget or disable it entirely.

How do I know if extended thinking actually helped? Run an A/B eval: same prompts, same model, thinking on vs. off. Score outputs on your task-specific rubric. If the improvement is below your quality threshold, cut the budget or disable thinking. Most teams run this eval monthly as model versions change.

Does extended thinking work with tool use? Yes, but with constraints. When extended thinking is enabled alongside tool use, the model can think before deciding to call a tool, but cannot think between tool call rounds mid-conversation. Plan your tool-use workflows accordingly.


Summary: The Decision in One Sentence

Enable extended thinking when the quality improvement on a complex, high-stakes task justifies a 10–200x increase in output token cost — and use budget_tokens to keep that cost predictable.

For a complete cost optimization guide including extended thinking budgeting, see our Cost Optimization Guide. To decide which Claude model to use before reaching for extended thinking, see Haiku vs Sonnet vs Opus — which model to choose.


Take It Further

Claude API Cost Optimization Masterclass — The practical guide to cutting Claude API costs by 60–90% in production. Model tiering, prompt caching, Batch API, and token compression — with real numbers from 12 production deployments.

120-page PDF + Excel cost calculator.

→ Get Cost Optimization Masterclass — $59

30-day money-back guarantee. Instant download.

AI Disclosure: Drafted with Claude Code; all pricing and feature details from official documentation as of April 2026.
