Claude API Rate Limits: Complete Production Guide
Claude API rate limits operate on two dimensions: requests per minute (RPM) and tokens per minute (TPM). New accounts start at low limits (Tier 1) and automatically unlock higher tiers as spend increases. In production, token-per-minute limits are almost always the binding constraint, not request counts. The Anthropic SDK handles retries automatically, but you still need to understand the limit structure to stay under the limits at scale and to architect your system correctly.
Rate limit tiers (2026)
Anthropic uses a tiered system. Tiers unlock automatically when your account reaches spend thresholds.
| Tier | Monthly spend | Claude Sonnet RPM | Claude Sonnet TPM |
|---|---|---|---|
| Tier 1 | $0–$100 | 50 | 40,000 |
| Tier 2 | $100–$500 | 1,000 | 80,000 |
| Tier 3 | $500–$5,000 | 2,000 | 160,000 |
| Tier 4 | $5,000–$15,000 | 4,000 | 400,000 |
| Custom | $15,000+ | Contact Anthropic | Contact Anthropic |
Key insight: at Tier 1, 40,000 TPM means roughly 40 requests with 1,000-token contexts per minute. For a low-traffic application, this is fine. For a batch processing job, you'll hit this immediately.
Claude Haiku has higher TPM limits at each tier because it's the designated high-throughput model.
Which limit you'll actually hit
Most developers assume they'll hit requests-per-minute first. In practice, almost everyone hits tokens-per-minute because:
- System prompts add 200–2,000 tokens to every request
- Long conversations accumulate context
- Document processing requests can be 50,000+ tokens each
How to calculate your effective TPM utilisation:
def estimate_tpm_utilisation(
requests_per_minute: int,
avg_input_tokens: int,
avg_output_tokens: int,
tpm_limit: int,
) -> float:
"""Estimate what fraction of TPM limit you'll use."""
total_tokens_per_minute = requests_per_minute * (avg_input_tokens + avg_output_tokens)
return total_tokens_per_minute / tpm_limit
# Example: 20 req/min, 1500 input, 500 output, Tier 1 Sonnet (40k TPM)
utilisation = estimate_tpm_utilisation(20, 1500, 500, 40_000)
print(f"{utilisation:.1%}") # 100.0% — you will hit the limit
Rate limit response format
When you exceed a rate limit, the API returns HTTP 429:
{
"type": "error",
"error": {
"type": "rate_limit_error",
"message": "Number of request tokens has exceeded your per-minute rate limit (40000 TPM) for claude-sonnet-4-5. Current usage: 38743, limit: 40000."
}
}
The response headers tell you your current state:
x-ratelimit-limit-requests: 50
x-ratelimit-limit-tokens: 40000
x-ratelimit-remaining-requests: 12
x-ratelimit-remaining-tokens: 8743
x-ratelimit-reset-requests: 2026-04-27T10:15:30Z
x-ratelimit-reset-tokens: 2026-04-27T10:15:00Z
These headers are available on every response, not just 429s. Reading them proactively is better than waiting for errors.
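In the Python SDK, the headers live on the raw HTTP response, which you can reach through the with_raw_response variant of any call. A minimal sketch (model and prompt are placeholders):
import anthropic

client = anthropic.Anthropic()

# with_raw_response returns the HTTP response; .parse() yields the Message
raw = client.messages.with_raw_response.create(
    model="claude-sonnet-4-5",
    max_tokens=64,
    messages=[{"role": "user", "content": "ping"}],
)
print(raw.headers.get("x-ratelimit-remaining-tokens"))
message = raw.parse()
Pattern 1 below builds this header reading into a reusable helper.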
SDK automatic retry behaviour
The Anthropic Python SDK handles 429 errors with automatic exponential backoff by default:
import anthropic
import httpx

# Default: 2 retries with exponential backoff
client = anthropic.Anthropic()

# Configured: increase retries for production
client = anthropic.Anthropic(
    max_retries=4,  # Retry up to 4 times
    timeout=httpx.Timeout(
        connect=5.0,
        read=600.0,  # 10 min read timeout for long generations
        write=10.0,
        pool=5.0,
    ),
)
Retry timing: the SDK backs off at 1s, 2s, 4s, 8s (with jitter). For 4 retries, worst case is ~15 seconds of delay before final failure.
When max_retries is exhausted, the SDK raises anthropic.RateLimitError.
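The SDK also supports per-request overrides, which is useful when one code path (a batch backfill, say) deserves more patience than the rest. A sketch using with_options (model and prompt are placeholders):
import anthropic

client = anthropic.Anthropic(max_retries=4)

# This call retries up to 6 times; every other call keeps the client default
message = client.with_options(max_retries=6).messages.create(
    model="claude-sonnet-4-5",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello"}],
)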
Production rate limit handling patterns
Pattern 1: Proactive rate limit monitoring
Read headers on every response to throttle before hitting limits:
import time
import anthropic
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

def parse_reset_header(value: Optional[str]) -> Optional[float]:
    """Convert an RFC 3339 reset timestamp header into a Unix timestamp."""
    if value is None:
        return None
    return datetime.fromisoformat(value.replace("Z", "+00:00")).timestamp()

@dataclass
class RateLimitState:
    remaining_requests: int = 50
    remaining_tokens: int = 40_000
    reset_requests_at: Optional[float] = None
    reset_tokens_at: Optional[float] = None

rate_state = RateLimitState()

def create_message_with_monitoring(
    client: anthropic.Anthropic,
    **kwargs,
) -> anthropic.types.Message:
    """
    Create a message and update rate limit state from response headers.
    Proactively throttles before limits are hit.
    """
    # Proactive throttle: if we're close to the limit, wait for the reset
    if rate_state.remaining_tokens < 5_000:
        wait_time = (rate_state.reset_tokens_at or 0) - time.time()
        if wait_time > 0:
            time.sleep(wait_time + 0.5)  # 0.5s buffer
    # with_raw_response exposes HTTP headers alongside the parsed message
    raw = client.messages.with_raw_response.create(**kwargs)
    headers = raw.headers
    # Update rate limit state from headers
    rate_state.remaining_requests = int(
        headers.get("x-ratelimit-remaining-requests", rate_state.remaining_requests)
    )
    rate_state.remaining_tokens = int(
        headers.get("x-ratelimit-remaining-tokens", rate_state.remaining_tokens)
    )
    rate_state.reset_tokens_at = (
        parse_reset_header(headers.get("x-ratelimit-reset-tokens"))
        or rate_state.reset_tokens_at
    )
    return raw.parse()
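Usage is a drop-in replacement for client.messages.create; the model and prompt here are placeholders:
client = anthropic.Anthropic()

response = create_message_with_monitoring(
    client,
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarise this ticket..."}],
)
print(rate_state.remaining_tokens)  # updated from the response headers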
Pattern 2: Request queue with TPM budgeting
For batch processing, queue requests and pace them to stay within TPM:
import asyncio
import time
from collections import deque

import anthropic

class TokenBudgetedQueue:
    """
    Process requests at a rate that stays within TPM limits.
    """

    def __init__(self, tpm_limit: int, safety_factor: float = 0.8):
        self.tpm_limit = tpm_limit
        # Only use 80% of the limit by default to avoid hitting the edge
        self.effective_tpm = int(tpm_limit * safety_factor)
        # (timestamp, tokens) pairs for requests completed in the last minute
        self.tokens_used_this_minute: deque = deque()

    def _tokens_used_last_60s(self) -> int:
        now = time.time()
        # Drop entries older than 60 seconds
        while (
            self.tokens_used_this_minute
            and self.tokens_used_this_minute[0][0] < now - 60
        ):
            self.tokens_used_this_minute.popleft()
        return sum(t for _, t in self.tokens_used_this_minute)

    async def submit(
        self,
        client: anthropic.AsyncAnthropic,
        estimated_tokens: int,
        **kwargs,
    ) -> anthropic.types.Message:
        """Submit a request, waiting if necessary to stay within budget."""
        while self._tokens_used_last_60s() + estimated_tokens > self.effective_tpm:
            await asyncio.sleep(1)
        response = await client.messages.create(**kwargs)
        actual_tokens = response.usage.input_tokens + response.usage.output_tokens
        self.tokens_used_this_minute.append((time.time(), actual_tokens))
        return response
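A sketch of how the queue might drive a batch job; the documents list, the summarisation prompt, and the four-characters-per-token estimate are illustrative assumptions:
import asyncio
import anthropic

async def process_batch(documents: list[str]) -> list:
    client = anthropic.AsyncAnthropic()
    queue = TokenBudgetedQueue(tpm_limit=40_000)  # Tier 1 Sonnet

    async def summarise(doc: str):
        # Rough estimate: ~1 token per 4 characters of input, plus output budget
        estimated = len(doc) // 4 + 1024
        return await queue.submit(
            client,
            estimated_tokens=estimated,
            model="claude-sonnet-4-5",
            max_tokens=1024,
            messages=[{"role": "user", "content": f"Summarise:\n\n{doc}"}],
        )

    return await asyncio.gather(*(summarise(d) for d in documents))
Note that the queue records actual token usage only after each response returns, so highly concurrent submissions can briefly overshoot the budget; the 0.8 safety factor exists to absorb that.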
Pattern 3: Graceful degradation on rate limit
For user-facing applications, degrade gracefully rather than erroring:
import anthropic

async def get_ai_response_with_fallback(
client: anthropic.AsyncAnthropic,
messages: list[dict],
fallback_message: str = "I'm processing a high volume of requests right now. Please try again in a moment.",
) -> dict:
"""
Return AI response, or a graceful fallback if rate-limited.
"""
try:
response = await client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
messages=messages,
)
return {
"success": True,
"content": response.content[0].text,
}
except anthropic.RateLimitError as e:
# Log for monitoring
print(f"Rate limit hit: {e}")
return {
"success": False,
"content": fallback_message,
"error": "rate_limit",
}
except anthropic.APITimeoutError:
return {
"success": False,
"content": fallback_message,
"error": "timeout",
}
Token optimisation to avoid limits
Hitting rate limits is often a token efficiency problem:
1. Truncate conversation history before hitting limits (rather than waiting for a 429):
def trim_history_for_rate_limit(
    messages: list[dict],
    remaining_tokens: int,
    threshold_tokens: int = 15_000,
    keep_last: int = 6,
) -> list[dict]:
    """Trim history when we're close to the TPM limit."""
    if remaining_tokens < threshold_tokens:
        # Keep only the most recent messages
        return messages[-keep_last:]
    return messages
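Assuming Pattern 1's rate_state and helper are in scope, the trim slots in just before the request:
messages = trim_history_for_rate_limit(
    messages,
    remaining_tokens=rate_state.remaining_tokens,
)
response = create_message_with_monitoring(
    client,
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=messages,
)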
2. Use Haiku for pre-processing to reduce tokens sent to Sonnet:
# Instead of sending a 10,000-token document to Sonnet,
# use Haiku to extract only the relevant section first
relevant_section = client.messages.create(
model="claude-haiku-4-5",
max_tokens=2048,
messages=[{
"role": "user",
"content": f"Extract only the section about pricing from this document:\n\n{large_document}"
}]
).content[0].text
# Now send only the relevant section to Sonnet
answer = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
messages=[{
"role": "user",
"content": f"What are the pricing tiers?\n\n{relevant_section}"
}]
)
Frequently asked questions
How do I request a rate limit increase? Rate limits increase automatically when you hit the spend threshold for the next tier. For custom limits above Tier 4, contact Anthropic sales. There's no manual form to request increases on lower tiers — you need to spend your way up.
Are rate limits per API key or per account? Per account. All API keys on the same account draw from the same limits, so you can't increase limits by creating multiple keys.
What happens to requests in flight when the rate limit resets? The rate limit window resets every 60 seconds for TPM. Requests that are in-flight when the window resets count toward the new window, not the old one.
Can I use prompt caching to avoid rate limits? Yes, partially. Cache reads (cache_read_input_tokens) count toward token limits at a reduced rate. See the prompt caching guide for implementation details.
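As a minimal sketch, caching a large, stable system prompt with cache_control (LONG_SYSTEM_PROMPT is a placeholder for your own prompt):
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # placeholder: large, stable prompt
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    messages=[{"role": "user", "content": "What are the pricing tiers?"}],
)
# usage.cache_read_input_tokens shows how many tokens came from cache
print(response.usage.cache_read_input_tokens)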
Is there a daily or monthly token limit? No per-day or per-month hard limits — only per-minute limits. Your monthly spend controls which tier you're on, which controls per-minute limits.
Related guides
- Claude API Error Handling: Production Patterns — comprehensive error handling including rate limit errors
- Claude Prompt Caching: Complete Guide to 90% Cost Reduction — reduce token usage with caching
Take It Further
Claude Agent SDK Cookbook: 40 Production Patterns — Pattern 7 covers the complete Rate Limit Architecture: the TokenBudgetedQueue implementation, proactive monitoring, distributed rate limiting with Redis, and the tier unlock strategy for production scale.
→ Get the Agent SDK Cookbook — $49
30-day money-back guarantee. Instant download.