Claude API Rate Limits: Complete Production Guide
Claude API rate limits operate on two dimensions: requests per minute (RPM) and tokens per minute (TPM). New accounts start at low limits (Tier 1) and automatically unlock higher tiers as spend increases. In production, token-per-minute limits are almost always the binding constraint, not request counts. The Anthropic SDK handles retries automatically, but you still need to understand the limit structure to stay under the limits at scale and to architect your system correctly.
Rate limit tiers (2026)
Anthropic uses a tiered system. Tiers unlock automatically when your account reaches spend thresholds.
| Tier | Monthly spend | Claude Sonnet RPM | Claude Sonnet TPM |
|---|---|---|---|
| Tier 1 | $0–$100 | 50 | 40,000 |
| Tier 2 | $100–$500 | 1,000 | 80,000 |
| Tier 3 | $500–$5,000 | 2,000 | 160,000 |
| Tier 4 | $5,000–$15,000 | 4,000 | 400,000 |
| Custom | $15,000+ | Contact Anthropic | Contact Anthropic |
Key insight: at Tier 1, 40,000 TPM means roughly 40 requests with 1,000-token contexts per minute. For a low-traffic application, this is fine. For a batch processing job, you'll hit this immediately.
Claude Haiku has higher TPM limits at each tier because it's the designated high-throughput model.
Which limit you'll actually hit
Most developers assume they'll hit requests-per-minute first. In practice, almost everyone hits tokens-per-minute because:
- System prompts add 200–2,000 tokens to every request
- Long conversations accumulate context
- Document processing requests can be 50,000+ tokens each
How to calculate your effective TPM utilisation:
def estimate_tpm_utilisation(
requests_per_minute: int,
avg_input_tokens: int,
avg_output_tokens: int,
tpm_limit: int,
) -> float:
"""Estimate what fraction of TPM limit you'll use."""
total_tokens_per_minute = requests_per_minute * (avg_input_tokens + avg_output_tokens)
return total_tokens_per_minute / tpm_limit
# Example: 20 req/min, 1500 input, 500 output, Tier 1 Sonnet (40k TPM)
utilisation = estimate_tpm_utilisation(20, 1500, 500, 40_000)
print(f"{utilisation:.1%}") # 100.0% — you will hit the limit
Rate limit response format
When you exceed a rate limit, the API returns HTTP 429:
{
"type": "error",
"error": {
"type": "rate_limit_error",
"message": "Number of request tokens has exceeded your per-minute rate limit (40000 TPM) for claude-sonnet-4-5. Current usage: 38743, limit: 40000."
}
}
The response headers tell you your current state:
x-ratelimit-limit-requests: 50
x-ratelimit-limit-tokens: 40000
x-ratelimit-remaining-requests: 12
x-ratelimit-remaining-tokens: 8743
x-ratelimit-reset-requests: 2026-04-27T10:15:30Z
x-ratelimit-reset-tokens: 2026-04-27T10:15:00Z
These headers are available on every response, not just 429s. Reading them proactively is better than waiting for errors.
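In the Python SDK, the headers live on the raw HTTP response, which you can reach through the with_raw_response variant of any call. A minimal sketch (model and prompt are placeholders):
import anthropic

client = anthropic.Anthropic()

# with_raw_response returns the HTTP response; .parse() yields the Message
raw = client.messages.with_raw_response.create(
    model="claude-sonnet-4-5",
    max_tokens=64,
    messages=[{"role": "user", "content": "ping"}],
)
print(raw.headers.get("x-ratelimit-remaining-tokens"))
message = raw.parse()
Pattern 1 below builds this header reading into a reusable helper.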
SDK automatic retry behaviour
The Anthropic Python SDK handles 429 errors with automatic exponential backoff by default:
import anthropic
import httpx

# Default: 2 retries with exponential backoff
client = anthropic.Anthropic()

# Configured: increase retries for production
client = anthropic.Anthropic(
    max_retries=4,  # Retry up to 4 times
    timeout=httpx.Timeout(
        connect=5.0,
        read=600.0,  # 10 min read timeout for long generations
        write=10.0,
        pool=5.0,
    ),
)
Retry timing: the SDK backs off at 1s, 2s, 4s, 8s (with jitter). For 4 retries, worst case is ~15 seconds of delay before final failure.
When max_retries is exhausted, the SDK raises anthropic.RateLimitError.
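The SDK also supports per-request overrides, which is useful when one code path (a batch backfill, say) deserves more patience than the rest. A sketch using with_options (model and prompt are placeholders):
import anthropic

client = anthropic.Anthropic(max_retries=4)

# This call retries up to 6 times; every other call keeps the client default
message = client.with_options(max_retries=6).messages.create(
    model="claude-sonnet-4-5",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello"}],
)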
Production rate limit handling patterns
Pattern 1: Proactive rate limit monitoring
Read headers on every response to throttle before hitting limits:
import time
import anthropic
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

def parse_reset_header(value: Optional[str]) -> Optional[float]:
    """Convert an RFC 3339 reset timestamp header into a Unix timestamp."""
    if value is None:
        return None
    return datetime.fromisoformat(value.replace("Z", "+00:00")).timestamp()

@dataclass
class RateLimitState:
    remaining_requests: int = 50
    remaining_tokens: int = 40_000
    reset_requests_at: Optional[float] = None
    reset_tokens_at: Optional[float] = None

rate_state = RateLimitState()

def create_message_with_monitoring(
    client: anthropic.Anthropic,
    **kwargs,
) -> anthropic.types.Message:
    """
    Create a message and update rate limit state from response headers.
    Proactively throttles before limits are hit.
    """
    # Proactive throttle: if we're close to the limit, wait for the reset
    if rate_state.remaining_tokens < 5_000:
        wait_time = (rate_state.reset_tokens_at or 0) - time.time()
        if wait_time > 0:
            time.sleep(wait_time + 0.5)  # 0.5s buffer
    # with_raw_response exposes HTTP headers alongside the parsed message
    raw = client.messages.with_raw_response.create(**kwargs)
    headers = raw.headers
    # Update rate limit state from headers
    rate_state.remaining_requests = int(
        headers.get("x-ratelimit-remaining-requests", rate_state.remaining_requests)
    )
    rate_state.remaining_tokens = int(
        headers.get("x-ratelimit-remaining-tokens", rate_state.remaining_tokens)
    )
    rate_state.reset_tokens_at = (
        parse_reset_header(headers.get("x-ratelimit-reset-tokens"))
        or rate_state.reset_tokens_at
    )
    return raw.parse()
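Usage is a drop-in replacement for client.messages.create; the model and prompt here are placeholders:
client = anthropic.Anthropic()

response = create_message_with_monitoring(
    client,
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarise this ticket..."}],
)
print(rate_state.remaining_tokens)  # updated from the response headers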
Pattern 2: Request queue with TPM budgeting
For batch processing, queue requests and pace them to stay within TPM:
import asyncio
import time
from collections import deque

import anthropic

class TokenBudgetedQueue:
    """
    Process requests at a rate that stays within TPM limits.
    """

    def __init__(self, tpm_limit: int, safety_factor: float = 0.8):
        self.tpm_limit = tpm_limit
        # Only use 80% of the limit by default to avoid hitting the edge
        self.effective_tpm = int(tpm_limit * safety_factor)
        # (timestamp, tokens) pairs for requests completed in the last minute
        self.tokens_used_this_minute: deque = deque()

    def _tokens_used_last_60s(self) -> int:
        now = time.time()
        # Drop entries older than 60 seconds
        while (
            self.tokens_used_this_minute
            and self.tokens_used_this_minute[0][0] < now - 60
        ):
            self.tokens_used_this_minute.popleft()
        return sum(t for _, t in self.tokens_used_this_minute)

    async def submit(
        self,
        client: anthropic.AsyncAnthropic,
        estimated_tokens: int,
        **kwargs,
    ) -> anthropic.types.Message:
        """Submit a request, waiting if necessary to stay within budget."""
        while self._tokens_used_last_60s() + estimated_tokens > self.effective_tpm:
            await asyncio.sleep(1)
        response = await client.messages.create(**kwargs)
        actual_tokens = response.usage.input_tokens + response.usage.output_tokens
        self.tokens_used_this_minute.append((time.time(), actual_tokens))
        return response
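A sketch of how the queue might drive a batch job; the documents list, the summarisation prompt, and the four-characters-per-token estimate are illustrative assumptions:
import asyncio
import anthropic

async def process_batch(documents: list[str]) -> list:
    client = anthropic.AsyncAnthropic()
    queue = TokenBudgetedQueue(tpm_limit=40_000)  # Tier 1 Sonnet

    async def summarise(doc: str):
        # Rough estimate: ~1 token per 4 characters of input, plus output budget
        estimated = len(doc) // 4 + 1024
        return await queue.submit(
            client,
            estimated_tokens=estimated,
            model="claude-sonnet-4-5",
            max_tokens=1024,
            messages=[{"role": "user", "content": f"Summarise:\n\n{doc}"}],
        )

    return await asyncio.gather(*(summarise(d) for d in documents))
Note that the queue records actual token usage only after each response returns, so highly concurrent submissions can briefly overshoot the budget; the 0.8 safety factor exists to absorb that.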
Pattern 3: Graceful degradation on rate limit
For user-facing applications, degrade gracefully rather than erroring:
import anthropic

async def get_ai_response_with_fallback(
client: anthropic.AsyncAnthropic,
messages: list[dict],
fallback_message: str = "I'm processing a high volume of requests right now. Please try again in a moment.",
) -> dict:
"""
Return AI response, or a graceful fallback if rate-limited.
"""
try:
response = await client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
messages=messages,
)
return {
"success": True,
"content": response.content[0].text,
}
except anthropic.RateLimitError as e:
# Log for monitoring
print(f"Rate limit hit: {e}")
return {
"success": False,
"content": fallback_message,
"error": "rate_limit",
}
except anthropic.APITimeoutError:
return {
"success": False,
"content": fallback_message,
"error": "timeout",
}
Token optimisation to avoid limits
Hitting rate limits is often a token efficiency problem:
1. Truncate conversation history before hitting limits (rather than waiting for a 429):
def trim_history_for_rate_limit(
    messages: list[dict],
    remaining_tokens: int,
    threshold_tokens: int = 15_000,
    keep_last: int = 6,
) -> list[dict]:
    """Trim history when we're close to the TPM limit."""
    if remaining_tokens < threshold_tokens:
        # Keep only the most recent messages
        return messages[-keep_last:]
    return messages
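Assuming Pattern 1's rate_state and helper are in scope, the trim slots in just before the request:
messages = trim_history_for_rate_limit(
    messages,
    remaining_tokens=rate_state.remaining_tokens,
)
response = create_message_with_monitoring(
    client,
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=messages,
)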
2. Use Haiku for pre-processing to reduce tokens sent to Sonnet:
# Instead of sending a 10,000-token document to Sonnet,
# use Haiku to extract only the relevant section first
relevant_section = client.messages.create(
model="claude-haiku-4-5",
max_tokens=2048,
messages=[{
"role": "user",
"content": f"Extract only the section about pricing from this document:\n\n{large_document}"
}]
).content[0].text
# Now send only the relevant section to Sonnet
answer = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
messages=[{
"role": "user",
"content": f"What are the pricing tiers?\n\n{relevant_section}"
}]
)
Frequently asked questions
How do I request a rate limit increase? Rate limits increase automatically when you hit the spend threshold for the next tier. For custom limits above Tier 4, contact Anthropic sales. There's no manual form to request increases on lower tiers — you need to spend your way up.
Are rate limits per API key or per account? Per account. All API keys on the same account draw from the same limits, so you can't increase limits by creating multiple keys.
What happens to requests in flight when the rate limit resets? The rate limit window resets every 60 seconds for TPM. Requests that are in-flight when the window resets count toward the new window, not the old one.
Can I use prompt caching to avoid rate limits? Yes, partially. Cache reads (cache_read_input_tokens) count toward token limits at a reduced rate. See the prompt caching guide for implementation details.
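As a minimal sketch, caching a large, stable system prompt with cache_control (LONG_SYSTEM_PROMPT is a placeholder for your own prompt):
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # placeholder: large, stable prompt
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    messages=[{"role": "user", "content": "What are the pricing tiers?"}],
)
# usage.cache_read_input_tokens shows how many tokens came from cache
print(response.usage.cache_read_input_tokens)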
Is there a daily or monthly token limit? No per-day or per-month hard limits — only per-minute limits. Your monthly spend controls which tier you're on, which controls per-minute limits.
Related guides
- Claude API Error Handling: Production Patterns — comprehensive error handling including rate limit errors
- Claude Prompt Caching: Complete Guide to 90% Cost Reduction — reduce token usage with caching
Take It Further
Claude Agent SDK Cookbook: 40 Production Patterns — Pattern 7 covers the complete Rate Limit Architecture: the TokenBudgetedQueue implementation, proactive monitoring, distributed rate limiting with Redis, and the tier unlock strategy for production scale.
→ Get the Agent SDK Cookbook — $49
30-day money-back guarantee. Instant download.