
Claude API Python Tutorial: Complete Guide with Code Examples

Complete Python tutorial for the Claude API — SDK installation, messages API, multi-turn conversations, streaming, tool use, and prompt caching, with working code examples throughout.


The Claude API Python SDK (anthropic) lets you call Claude models from any Python script or application with a single pip install. This guide covers every major feature — messages, system prompts, multi-turn conversations, streaming, tool use, prompt caching, and async — with working code you can paste and run today.


Installation and environment setup

Install the SDK and python-dotenv for managing your API key securely:

pip install anthropic python-dotenv

Store your key in a .env file in your project root:

ANTHROPIC_API_KEY=sk-ant-...

Load it at the top of any script:

from dotenv import load_dotenv
import os

load_dotenv()
# ANTHROPIC_API_KEY is now available as an environment variable
# The Anthropic client picks it up automatically

The Anthropic() client reads ANTHROPIC_API_KEY from the environment by default, so you never need to pass it explicitly in your code.
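If you would rather pass the key explicitly (for example, when switching between several keys), the constructor also accepts an api_key argument:

import os
import anthropic

# Explicit key passing instead of relying on the environment default
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])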


Basic messages.create() call

The messages.create() method is the core of the API. Every request sends a list of messages and receives a response object:

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain the difference between a list and a tuple in Python."}
    ]
)

print(message.content[0].text)

The response object contains the generated content (a list of content blocks) plus metadata: id, model, role, stop_reason, stop_sequence, and usage (input and output token counts).
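For example, a quick look at the metadata you will use most often:

# Metadata that ships alongside the generated text
print(message.stop_reason)          # e.g. "end_turn" or "max_tokens"
print(message.usage.input_tokens)   # tokens in the prompt
print(message.usage.output_tokens)  # tokens generated
print(message.model)                # exact model version that handled the request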


System prompts

A system prompt sets the model's persona and constraints. Pass it as the system parameter (not as a message):

message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system="You are a senior Python engineer. Respond with concise, production-quality code. Include type hints. No explanatory prose unless asked.",
    messages=[
        {"role": "user", "content": "Write a function that retries a callable with exponential backoff."}
    ]
)

print(message.content[0].text)

A well-crafted system prompt reduces output tokens and improves consistency across requests. See How to Write System Prompts for Claude for patterns that work.


Multi-turn conversations

The API is stateless — you maintain conversation history yourself by appending each turn to the messages list:

import anthropic

client = anthropic.Anthropic()

conversation = []

def chat(user_message: str) -> str:
    conversation.append({"role": "user", "content": user_message})

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system="You are a helpful Python tutor.",
        messages=conversation
    )

    assistant_reply = response.content[0].text
    conversation.append({"role": "assistant", "content": assistant_reply})
    return assistant_reply

# Usage
print(chat("What is a decorator in Python?"))
print(chat("Can you show me a simple example?"))
print(chat("How would I stack two decorators?"))

Each call to chat() sends the full conversation history, so Claude has complete context. Keep an eye on token usage — very long conversations approach the context window limit and increase cost per call.
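One simple mitigation is to cap how much history you resend. The trimmed() helper below is a hypothetical sketch (the cutoff and the trimming strategy are yours to tune, not part of the SDK):

def trimmed(history: list, max_turns: int = 20) -> list:
    # Keep only the most recent turns; one turn = a user message plus an assistant reply,
    # so each turn occupies two entries in the history list.
    return history[-(max_turns * 2):]

# Send the trimmed view instead of the full history:
# response = client.messages.create(..., messages=trimmed(conversation))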


Streaming responses

Streaming delivers tokens to the client as they are generated, which cuts perceived latency significantly for long responses. Use the stream() context manager:

import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Write a Python script that parses a CSV file and computes column statistics."}
    ]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

    # Access the final message inside the with block, once the stream is fully consumed
    final_message = stream.get_final_message()
    print(f"\n\nTotal tokens: {final_message.usage.input_tokens + final_message.usage.output_tokens}")

The stream.text_stream iterator yields text deltas. Call stream.get_final_message() after the loop to access usage stats and stop reason.
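If you need the underlying events rather than just text (for example, to detect content block boundaries), you can pass stream=True to messages.create() and iterate the raw event stream yourself; a minimal sketch:

# Lower-level alternative: iterate raw streaming events instead of using the stream() helper
raw_stream = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=512,
    stream=True,
    messages=[{"role": "user", "content": "Name three Python web frameworks."}]
)

for event in raw_stream:
    # Text arrives in content_block_delta events; other event types carry metadata
    if event.type == "content_block_delta" and event.delta.type == "text_delta":
        print(event.delta.text, end="", flush=True)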


Tool use (function calling)

Tool use lets Claude call functions you define. You pass a list of tool schemas; Claude decides when to call them and with what arguments. Your code executes the function and returns the result.

import anthropic
import json

client = anthropic.Anthropic()

# Define the tool schema
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city. Returns temperature in Celsius and conditions.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "City name, e.g. 'Seoul' or 'Tokyo'"
                }
            },
            "required": ["city"]
        }
    }
]

# Stub implementation — replace with a real weather API call
def get_weather(city: str) -> dict:
    mock_data = {
        "Seoul": {"temperature": 18, "conditions": "Partly cloudy"},
        "Tokyo": {"temperature": 22, "conditions": "Sunny"},
    }
    return mock_data.get(city, {"temperature": 15, "conditions": "Unknown"})

messages = [{"role": "user", "content": "What's the weather like in Seoul right now?"}]

# First call — Claude may decide to use the tool
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=tools,
    messages=messages
)

# Handle tool use
if response.stop_reason == "tool_use":
    tool_use_block = next(b for b in response.content if b.type == "tool_use")
    tool_name = tool_use_block.name
    tool_input = tool_use_block.input

    # Execute the function
    result = get_weather(**tool_input)

    # Append Claude's response and the tool result to the conversation
    messages.append({"role": "assistant", "content": response.content})
    messages.append({
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": tool_use_block.id,
                "content": json.dumps(result)
            }
        ]
    })

    # Second call — Claude uses the tool result to form its answer
    final_response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        tools=tools,
        messages=messages
    )
    print(final_response.content[0].text)

For more complex patterns — chaining multiple tools, parallel tool calls, and structured tool output validation — see Claude Tool Use and Function Calling.


Prompt caching

Prompt caching stores a prefix of your prompt in Anthropic's infrastructure and reuses it across calls. Writing the cache carries a 25% premium over the base input token price, but every cache hit costs only 10% of it, a significant saving when you have a large, repeated context (system prompt, documents, few-shot examples).

How to enable caching

Add "cache_control": {"type": "ephemeral"} to the last content block of the prefix you want cached:

import anthropic

client = anthropic.Anthropic()

# Large system prompt — a good candidate for caching
SYSTEM_DOCS = """
You are an expert Python code reviewer. You follow PEP 8, prefer explicit over implicit,
and always suggest type hints. [... imagine 2,000 more tokens of style guidelines here ...]
"""

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_DOCS,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Review this function:\n\ndef add(a, b):\n    return a+b"}
    ]
)

# Check cache usage in the response
usage = response.usage
print(f"Input tokens:        {usage.input_tokens}")
print(f"Cache creation:      {usage.cache_creation_input_tokens}")
print(f"Cache read (saved):  {usage.cache_read_input_tokens}")

Cost savings math

Assume your system prompt is 2,000 tokens and you make 1,000 calls per day. With claude-sonnet-4-5 pricing (approximate):

Scenario                          Tokens charged                                          Daily cost
No caching                        2,000 x 1,000 = 2,000,000 input tokens                  ~$6.00
With caching (after first call)   199,800 (999 reads at 10%) + 2,500 (1 write at 125%)    ~$0.61

That is roughly a 90% reduction on the cached portion. The cache TTL is 5 minutes and refreshes each time the cached prefix is read, so the savings hold as long as requests sharing that prefix keep arriving within the window.
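The same estimate as a few lines of Python, using the approximate $3 per million input tokens assumed above:

PRICE_PER_MTOK = 3.00    # approximate claude-sonnet-4-5 input price (USD per million tokens)
PROMPT_TOKENS = 2_000
CALLS_PER_DAY = 1_000

no_cache = PROMPT_TOKENS * CALLS_PER_DAY                                          # 2,000,000 tokens
with_cache = PROMPT_TOKENS * 1.25 + PROMPT_TOKENS * 0.10 * (CALLS_PER_DAY - 1)    # 1 write + 999 reads

print(f"No caching:   ${no_cache / 1_000_000 * PRICE_PER_MTOK:.2f} per day")      # ~$6.00
print(f"With caching: ${with_cache / 1_000_000 * PRICE_PER_MTOK:.2f} per day")    # ~$0.61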


Async with AsyncAnthropic

For applications that serve concurrent requests — web APIs, async pipelines — use AsyncAnthropic:

import asyncio
import anthropic

async_client = anthropic.AsyncAnthropic()

async def summarize(text: str) -> str:
    response = await async_client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=256,
        messages=[{"role": "user", "content": f"Summarize in one sentence:\n\n{text}"}]
    )
    return response.content[0].text

async def main():
    texts = [
        "Python is a high-level, interpreted programming language...",
        "Asyncio is a library for writing concurrent code using async/await syntax...",
        "Type hints in Python allow you to annotate variables and function signatures...",
    ]

    # Run all summaries concurrently
    summaries = await asyncio.gather(*[summarize(t) for t in texts])

    for s in summaries:
        print(s)

asyncio.run(main())

AsyncAnthropic supports the same parameters as the sync client. Use claude-haiku-4-5 for lightweight, high-throughput tasks where speed matters more than depth.
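When fanning out many concurrent calls, it is worth capping concurrency so you stay inside your rate limits. Building on the summarize() helper above, here is a sketch using asyncio.Semaphore (the limit of 5 is an arbitrary example, not an SDK setting):

semaphore = asyncio.Semaphore(5)  # arbitrary cap; tune to your rate limits

async def summarize_limited(text: str) -> str:
    # At most 5 requests are in flight at any moment
    async with semaphore:
        return await summarize(text)

# summaries = await asyncio.gather(*[summarize_limited(t) for t in texts])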


Error handling

Production integrations must handle rate limits and transient errors. The SDK raises typed exceptions you can catch individually:

import anthropic
import time

client = anthropic.Anthropic()

def call_with_retry(messages: list, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=1024,
                messages=messages
            )
            return response.content[0].text

        except anthropic.RateLimitError as e:
            # 429 — back off and retry
            wait = 2 ** attempt
            print(f"Rate limit hit. Waiting {wait}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)

        except anthropic.APIStatusError as e:
            if e.status_code >= 500:
                # Transient server error — retry
                wait = 2 ** attempt
                print(f"Server error {e.status_code}. Waiting {wait}s")
                time.sleep(wait)
            else:
                # 4xx client error — do not retry, surface to caller
                raise

        except anthropic.APIConnectionError:
            # Network issue — retry
            time.sleep(2 ** attempt)

    raise RuntimeError(f"Failed after {max_retries} retries")

Key rule: only retry 429 and 5xx responses. A 400 or 422 means your request has a bug — retrying will not help.
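The client also has built-in retry behaviour you can configure instead of (or alongside) a hand-rolled loop: the constructor accepts max_retries, which automatically retries rate limits and transient errors with backoff, plus a timeout in seconds:

import anthropic

# Built-in retries and per-request timeout, configured once on the client
client = anthropic.Anthropic(
    max_retries=4,    # the SDK default is 2
    timeout=30.0      # seconds
)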


Frequently asked questions

How do I get my Claude API key? Sign up at console.anthropic.com, go to "API Keys" in the sidebar, and create a new key. Store it in a .env file and never commit it to version control.

Which model should I use for Python projects? Use claude-sonnet-4-5 as your default — it balances capability and cost well for most tasks. Switch to claude-haiku-4-5 for high-volume, lower-complexity calls (classification, summarization, extraction) where you need to minimize cost and latency.

How do I count tokens before making a call? Use the client.messages.count_tokens() method to measure a request's input tokens without running a generation:

response = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)
print(response.input_tokens)

Why is prompt caching not reducing my costs? The cache TTL is 5 minutes. If more than 5 minutes pass between requests that share the same prefix, the cache expires and must be recreated (charged at a 25% premium over the base input token price). High-frequency or long-running applications benefit most. Prompts shorter than the minimum cacheable length (1,024 tokens for Sonnet-class models) are never cached at all. Also confirm you are reading cache_read_input_tokens in the response usage object — if it is always 0, the cache control block may not be in the correct position.

Can I use the Python SDK inside a FastAPI app? Yes. Use AsyncAnthropic with await inside async route handlers. Pass the client as a dependency or instantiate it once at module level (the client is safe to share across requests).
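A minimal sketch of that pattern (the /ask route and the AskRequest model are illustrative, not part of either library):

from fastapi import FastAPI
from pydantic import BaseModel
import anthropic

app = FastAPI()
client = anthropic.AsyncAnthropic()  # created once at module level, shared across requests

class AskRequest(BaseModel):
    question: str

@app.post("/ask")
async def ask(req: AskRequest):
    response = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": req.question}]
    )
    return {"answer": response.content[0].text}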




Take It Further

Claude Agent SDK Cookbook: 40 Production Patterns — 40 working Python patterns covering the full SDK surface: streaming pipelines, tool use chains, multi-agent coordination, batch processing, and production error handling — all tested and ready to adapt.

→ Get the Agent SDK Cookbook — $49

30-day money-back guarantee. Instant download.

AI Disclosure: Drafted with Claude Code; all Python examples tested with anthropic SDK 0.40+ as of April 2026.
