
Claude Extended Thinking: How to Use Thinking Mode (2026 Guide)

Claude Extended Thinking lets the model reason step-by-step before answering. Learn how budget_tokens, thinking blocks, streaming, and cost work in practice.


Claude Extended Thinking is a feature that gives the model a private scratchpad to reason through a problem before returning its final answer. When you enable it, Claude emits one or more "thinking blocks" — internal chain-of-thought text — before the visible response. You control how much compute the model can spend on this reasoning by setting a budget_tokens parameter. The result is meaningfully higher accuracy on hard math, multi-step logic, code debugging, and legal analysis tasks where a single-pass answer is error-prone. Extended Thinking is available on claude-sonnet-4-5 and later, and on claude-opus-4 models.


How Extended Thinking works

The mental model is straightforward: instead of asking Claude to produce an answer in a single forward pass, you give it permission to "think out loud" in a hidden buffer. That buffer is bounded by budget_tokens.

The two response block types

When thinking is enabled, the API response contains an array of content blocks that can include two types:

| Block type | What it contains | Visible to end users? |
| --- | --- | --- |
| thinking | Internal reasoning — hypotheses, scratch calculations, dead ends | No (you choose) |
| text | The final, polished answer | Yes |

The thinking block text is the model's unfiltered reasoning chain. It may include things like "wait, that contradicts what I said earlier — let me recalculate." That self-correction is precisely why accuracy improves: the model can catch its own mistakes before committing to an answer.

The budget_tokens parameter

budget_tokens is the upper limit on how many tokens Claude can spend on internal reasoning. It is not a guarantee — the model may use fewer tokens if the problem is simpler than expected.

Practical ranges as of May 2026:

| Budget | Best for |
| --- | --- |
| 1,000–2,000 tokens | Straightforward reasoning, moderate code tasks |
| 5,000–10,000 tokens | Complex math proofs, multi-step legal analysis |
| 16,000–32,000 tokens | Hard competition-style problems, large-codebase debugging |

Important: budget_tokens must be at least 1,024 and must be less than the model's max_tokens limit. Set max_tokens high enough to accommodate both the thinking budget and the final response.
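These constraints are easy to enforce before you ever hit the API. A minimal guard that mirrors the limits above (the helper name is our own, not part of the SDK):

```python
def validate_thinking_config(budget_tokens: int, max_tokens: int) -> None:
    """Raise if a thinking configuration would be rejected by the API."""
    if budget_tokens < 1024:
        # Minimum enforced by the API
        raise ValueError("budget_tokens must be at least 1,024")
    if budget_tokens >= max_tokens:
        # max_tokens must cover both thinking and the final answer
        raise ValueError("budget_tokens must be less than max_tokens")

validate_thinking_config(budget_tokens=10_000, max_tokens=16_000)  # passes silently
```

Calling this before `messages.create` turns a 400-level API error into an immediate, descriptive exception in your own code.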


When to use Extended Thinking

Extended Thinking is not a universal upgrade. It adds latency and cost, so apply it selectively to tasks where single-pass reasoning fails. The strongest use cases are:

Mathematics and quantitative reasoning — Olympiad-style problems, financial modeling with multi-step calculations, statistical proofs. On AIME 2024, Claude 3.7 Sonnet with thinking enabled scored 80% vs. 55% without.

Complex coding tasks — Debugging multi-file systems where the root cause requires tracing execution across multiple call stacks, or generating algorithms that require correctness proofs.

Legal and contract analysis — Identifying clause interactions across a long document, checking for conflicting indemnification and limitation-of-liability provisions.

Multi-step research synthesis — Tasks where the model must hold multiple facts in "working memory" and draw non-obvious conclusions from their combination.

Tasks where you need to inspect the reasoning — If your application needs to audit why the model reached a conclusion (compliance, medical triage, financial advice), the thinking block provides a readable trace.

For simpler classification, summarization, or extraction tasks, thinking mode adds overhead without meaningful quality lift. Apply the model routing logic first — if Haiku handles the task at acceptable quality, Extended Thinking is overkill.
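One lightweight way to gate the feature is a heuristic router in front of the API call. This is a sketch under our own assumptions: the marker list and budget values are illustrative placeholders, not a tested classifier:

```python
def route_request(prompt: str) -> dict:
    """Return messages.create kwargs; enable thinking only for hard-looking prompts."""
    # Hypothetical markers for tasks where single-pass reasoning tends to fail
    hard_markers = ("prove", "debug", "derive", "contract", "multi-step")
    if any(marker in prompt.lower() for marker in hard_markers):
        return {
            "model": "claude-sonnet-4-5",
            "max_tokens": 16_000,
            "thinking": {"type": "enabled", "budget_tokens": 8_000},
        }
    # Simple tasks: no thinking block, smaller response cap
    return {"model": "claude-sonnet-4-5", "max_tokens": 2_000}
```

In production you would replace the keyword check with a cheap classifier call, but the shape stays the same: the router returns the kwargs, and the caller splats them into `client.messages.create(**route_request(prompt), messages=...)`.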


Python code example: enabling thinking mode

Install the Anthropic SDK if you haven't already:

pip install anthropic

Basic thinking request

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,          # must exceed budget_tokens + expected answer length
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # up to 10k tokens of internal reasoning
    },
    messages=[{
        "role": "user",
        "content": (
            "A bag contains 3 red, 4 blue, and 5 green marbles. "
            "Two marbles are drawn without replacement. "
            "What is the probability that both marbles are the same color? "
            "Show your exact fraction in lowest terms."
        )
    }]
)

# Iterate over content blocks
for block in response.content:
    if block.type == "thinking":
        print("=== THINKING ===")
        print(block.thinking)
        print()
    elif block.type == "text":
        print("=== ANSWER ===")
        print(block.text)

Extracting just the final answer

If you don't need to expose the thinking chain to users, filter it out:

def get_final_answer(response) -> str:
    """Return only the text block from a thinking-enabled response."""
    for block in response.content:
        if block.type == "text":
            return block.text
    return ""

answer = get_final_answer(response)
print(answer)

Passing thinking blocks in multi-turn conversations

When continuing a multi-turn conversation, pass the full response.content list (thinking blocks included) back as the assistant turn. This is required whenever the turn involves tool use, and it is the simplest way to preserve Claude's reasoning context in general. The SDK handles serialization automatically if you pass the full response.content list:

import anthropic

client = anthropic.Anthropic()

# Turn 1
first_response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{
        "role": "user",
        "content": "Analyze this contract clause for ambiguity: [clause text here]"
    }]
)

# Build history — include the full content list (thinking + text blocks)
history = [
    {"role": "user", "content": "Analyze this contract clause for ambiguity: [clause text here]"},
    {"role": "assistant", "content": first_response.content},  # includes thinking blocks
]

# Turn 2 — follow-up question
second_response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=history + [{
        "role": "user",
        "content": "How would a court likely interpret the ambiguity you identified?"
    }]
)

Cost breakdown: thinking tokens are billed as output

This is the most important thing to understand before enabling thinking in production: thinking tokens are billed at the output token rate, which is typically 5x the input rate.

Using claude-sonnet-4-5 pricing as of May 2026 ($3/MTok input, $15/MTok output):

| Scenario | Input tokens | Thinking tokens | Output tokens | Cost per call |
| --- | --- | --- | --- | --- |
| No thinking | 500 | 0 | 300 | $0.0060 |
| budget_tokens=2,000 (actual use: 1,800) | 500 | 1,800 | 300 | $0.0330 |
| budget_tokens=10,000 (actual use: 7,200) | 500 | 7,200 | 300 | $0.1140 |

A call with 10k thinking tokens costs roughly 19x more than the same call without thinking. At 1,000 calls/day, that is $114/day vs. $6/day.
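The arithmetic behind those figures fits in a few lines. The rates below are the Sonnet numbers quoted above; swap in your model's pricing:

```python
def thinking_call_cost(input_tokens: int, thinking_tokens: int, output_tokens: int,
                       input_rate: float = 3.0, output_rate: float = 15.0) -> float:
    """Dollar cost of one call. Rates are $/MTok; thinking bills at the output rate."""
    billable_output = thinking_tokens + output_tokens
    return (input_tokens * input_rate + billable_output * output_rate) / 1_000_000

# Reproduces the 10k-budget row of the table above
cost = thinking_call_cost(input_tokens=500, thinking_tokens=7_200, output_tokens=300)
```

Multiplying the result by your daily call volume gives the run-rate comparison directly, which makes this a useful building block for a pre-deployment cost estimate.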

This is why applying Extended Thinking selectively matters so much. If your use case mixes simple and complex requests, use a classifier to route only the hard ones to thinking mode. The P5 Cost Optimization Masterclass covers exactly this pattern — routing, cost monitoring, and circuit breakers for thinking-enabled pipelines — with production-ready Python code.

For a real-world example of how model selection and feature gating cut API bills by 85%, see the Claude API cost case study.


Comparison: with vs. without thinking on the same task

To make the tradeoff concrete, here is a side-by-side on a multi-step logic problem.

Prompt: "Three friends — Alice, Bob, and Carol — each have a different number of siblings: 1, 2, and 3. Alice does not have the most siblings. Bob has more siblings than Carol. Who has 3 siblings?"

Without thinking (standard mode):

Bob has 3 siblings.

Correct — but no audit trail. If the problem were harder, you would not know where the reasoning failed.

With thinking enabled (budget_tokens=1000):

Thinking block:

Let me assign variables. The numbers are 1, 2, 3.
Alice does not have the most → Alice ≠ 3.
Bob > Carol in siblings.
If Carol=1 and Bob=2, then Alice=3. But Alice ≠ 3. Contradiction.
If Carol=1 and Bob=3, then Alice=2. Alice ≠ 3 ✓. Bob > Carol ✓.
If Carol=2 and Bob=3, then Alice=1. Alice ≠ 3 ✓. Bob > Carol ✓.
Both scenarios satisfy constraints. But we need a unique answer.
Wait — re-reading: "each have a different number" — yes, already assumed.
Second scenario: Carol=2, Bob=3, Alice=1. Check: Bob(3) > Carol(2) ✓. Alice(1) ≠ 3 ✓.
First scenario also works: Carol=1, Bob=3, Alice=2. Bob(3) > Carol(1) ✓. Alice(2) ≠ 3 ✓.
Hmm, both scenarios give Bob=3. The answer is the same in both cases.
Answer: Bob has 3 siblings.

Text block:

Bob has 3 siblings. Here's the reasoning: Alice doesn't have the most (so Alice ≠ 3). 
Bob has more than Carol (so Bob > Carol). The only value that satisfies both is Bob = 3, 
which forces Alice and Carol to share 1 and 2. Both assignments of 1 and 2 to Alice/Carol 
are consistent with the constraints, but Bob = 3 holds in both cases.

The thinking block reveals the model caught a potential ambiguity (two valid assignments for Alice/Carol) and verified the answer was unique despite that. Without thinking, you get the right answer but no visibility into whether the model truly reasoned through it or pattern-matched.


Streaming with thinking enabled

For production applications, streaming is essential for perceived responsiveness. With Extended Thinking, the stream emits thinking deltas before the text deltas. You can render a "thinking…" indicator until the text blocks start.

import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{
        "role": "user",
        "content": "Explain why the halting problem is undecidable, with a formal proof sketch."
    }]
) as stream:
    current_block_type = None

    for event in stream:
        # Track block transitions
        if hasattr(event, "type"):
            if event.type == "content_block_start":
                block = event.content_block
                current_block_type = block.type
                if block.type == "thinking":
                    print("\n[Thinking...]", flush=True)
                elif block.type == "text":
                    print("\n[Answer]\n", flush=True)

            elif event.type == "content_block_delta":
                delta = event.delta
                if delta.type == "thinking_delta":
                    # Optionally suppress thinking from UI; log for debugging
                    pass  # print(delta.thinking, end="", flush=True)
                elif delta.type == "text_delta":
                    print(delta.text, end="", flush=True)

    print()  # final newline

Streaming with the async client (production pattern)

import asyncio
import anthropic

async def stream_with_thinking(prompt: str, budget: int = 8000) -> str:
    """Stream a thinking-enabled response; return the final text answer."""
    client = anthropic.AsyncAnthropic()
    full_text = []

    async with client.messages.stream(
        model="claude-sonnet-4-5",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": budget},
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        async for event in stream:
            if hasattr(event, "type") and event.type == "content_block_delta":
                delta = event.delta
                if delta.type == "text_delta":
                    full_text.append(delta.text)

    return "".join(full_text)

# Usage
answer = asyncio.run(stream_with_thinking(
    "Prove that there are infinitely many prime numbers.",
    budget=5000
))
print(answer)

For a full async production setup with connection pooling, rate-limit handling, and fallback chains, see the Claude API production architecture guide.


Integrating Extended Thinking into agent pipelines

If you are building multi-step agents with the Claude Agent SDK, Extended Thinking is most useful at the planning and reasoning nodes of a pipeline — not at every step. A common pattern:

  1. Planner node (thinking enabled, high budget) — breaks the task into subtasks
  2. Executor nodes (thinking disabled, Haiku or Sonnet) — carry out individual subtasks
  3. Verifier node (thinking enabled, medium budget) — checks results for consistency

This keeps cost proportional to task complexity. The Claude Agent SDK guide shows how to wire these nodes together with tool use and handoffs.
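A minimal way to express that pattern is a per-node configuration table. The node names and budget values below are illustrative choices, not prescribed by the SDK:

```python
# Hypothetical per-node thinking budgets for a planner/executor/verifier pipeline
NODE_CONFIGS = {
    "planner":  {"thinking_budget": 16_000},  # high budget: task decomposition
    "executor": {"thinking_budget": None},    # thinking disabled for subtasks
    "verifier": {"thinking_budget": 5_000},   # medium budget: consistency checks
}

def build_request(node: str, messages: list, model: str = "claude-sonnet-4-5") -> dict:
    """Assemble messages.create kwargs for one pipeline node."""
    request = {"model": model, "max_tokens": 16_000, "messages": messages}
    budget = NODE_CONFIGS[node]["thinking_budget"]
    if budget is not None:
        request["thinking"] = {"type": "enabled", "budget_tokens": budget}
    return request
```

Centralizing the budgets in one table makes it easy to tune them per node as you measure actual thinking usage, without touching the call sites.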


Frequently Asked Questions

Is Extended Thinking available on all Claude models?

No. As of May 2026, Extended Thinking is supported on claude-sonnet-4-5, claude-sonnet-4-5-20251101, and claude-opus-4 model families. It is not available on Haiku models. Check the Anthropic model documentation for the latest list, as support expands with each model release.

Can I cache thinking blocks with prompt caching?

Yes, but with a constraint: you can cache the input context (system prompt, prior conversation) using the standard cache_control: {"type": "ephemeral"} pattern. Thinking blocks in the assistant turn of the message history are also preserved and sent back with the request. However, thinking blocks themselves are not eligible to be read from cache as an assistant prefill; you must include them verbatim from the prior response. This means multi-turn conversations with thinking enabled benefit from input-side caching but do not get thinking-block caching. For a deep dive on caching economics and break-even math, see the Claude API cost case study.
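For reference, input-side caching with thinking enabled changes only the request shape, not the thinking config. A sketch of where cache_control sits (the system prompt text is a placeholder):

```python
# Request dict as it would be splatted into client.messages.create(**request)
request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 16_000,
    "thinking": {"type": "enabled", "budget_tokens": 8_000},
    "system": [{
        "type": "text",
        "text": "You are a contract-analysis assistant. [long shared context here]",
        "cache_control": {"type": "ephemeral"},  # caches the prompt up to this block
    }],
    "messages": [{"role": "user", "content": "Analyze clause 7 for ambiguity."}],
}
```

The cache breakpoint covers everything up to and including the block that carries it, so put it after the large, stable context you want reused across calls.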

How do I pick the right budget_tokens value?

Start with a value 3–5x what you estimate the final answer needs, then measure. The model typically uses 60–80% of the budget on hard problems. Run 20–30 test prompts representative of your production traffic and look at response.usage.output_tokens (which includes thinking tokens) and the thinking block lengths. If the model consistently hits the ceiling (thinking blocks near budget_tokens), increase the budget. If thinking blocks are consistently less than 30% of the budget, reduce it to cut cost. A good monitoring approach is to log len(block.thinking) (a character-count proxy for tokens) for every call and set up an alert if the median crosses 80% of your budget limit; that signals you need to raise the cap, or that problem difficulty has increased.
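That monitoring rule is a one-liner with the statistics module. This sketch assumes you log an approximate token count per thinking block (a character-to-token heuristic such as len(block.thinking) // 4 is our assumption, not an SDK feature):

```python
import statistics

def thinking_utilization(thinking_token_counts: list[int], budget_tokens: int) -> float:
    """Median thinking usage as a fraction of the configured budget."""
    return statistics.median(thinking_token_counts) / budget_tokens

# Example: logged approximate thinking-token counts against a 10k budget
usage = thinking_utilization([4_000, 6_000, 8_500], budget_tokens=10_000)
if usage > 0.80:
    print("ALERT: median thinking usage above 80% of budget; consider raising the cap")
```

Wire the check into whatever metrics pipeline you already run; the only state it needs is the list of per-call thinking lengths.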


Summary

Extended Thinking gives Claude a private reasoning buffer before it commits to an answer. The key implementation decisions are:

  1. Enable it selectively: reserve thinking for tasks where single-pass reasoning fails.
  2. Size budget_tokens empirically: start at 3–5x the expected answer length, then measure actual usage.
  3. Budget for cost: thinking tokens are billed at the output token rate.
  4. Handle the blocks correctly: pass thinking blocks back in multi-turn conversations and stream deltas for responsive UIs.

Managing the cost side of Extended Thinking — routing, monitoring, and circuit breakers — is covered end-to-end in the P5 Cost Optimization Masterclass. It includes production-ready Python templates for thinking-aware pipelines and a cost dashboard you can deploy in under an hour.

If you are looking to improve the prompts that feed into thinking mode, the P1 Power Prompts 300 library includes 300 tested prompt patterns optimized for Claude's extended reasoning — covering math, code, legal, and research domains.

AI Disclosure: Drafted with Claude Code; feature details from Anthropic documentation as of May 2026.
