
Claude API Rate Limits: Tiers, Headers, and Retry Strategies (2026)

Claude API rate limits by tier, what the response headers tell you, and the retry patterns that keep production systems running when you hit them.

The Claude API enforces rate limits on tokens per minute (TPM) and requests per minute (RPM) per API key. Understanding the limit structure — and what the response headers tell you — is the difference between a production system that degrades gracefully and one that goes down.

Rate limit tiers

Anthropic uses a tier system based on account age and spend history. Limits increase automatically as you meet the thresholds.

Usage Tier 1 (new accounts)

Granted on signup with a valid credit card.

Model               RPM   TPM      TPD
claude-3-5-haiku    50    50,000   5M
claude-3-5-sonnet   50    40,000   4M
claude-3-7-sonnet   50    40,000   4M
claude-opus-4       50    20,000   2M

Usage Tier 2

Requires $0–$100 spend, account age ≥7 days.

Model               RPM     TPM
claude-3-5-haiku    1,000   200,000
claude-3-5-sonnet   1,000   160,000
claude-3-7-sonnet   1,000   160,000

Usage Tier 3

Requires $100+ spend, account age ≥14 days.

Model               RPM     TPM
claude-3-5-haiku    2,000   400,000
claude-3-5-sonnet   2,000   320,000

Tier 4 and above

Tier 4 requires $500–$1,000+ in cumulative spend. For limits beyond Tier 4, contact Anthropic about enterprise arrangements.

Check your current tier at console.anthropic.com → Settings → Limits.


Rate limit response headers

Every API response includes headers that tell you your current rate limit state. Read these before you need them:

anthropic-ratelimit-requests-limit: 1000
anthropic-ratelimit-requests-remaining: 999
anthropic-ratelimit-requests-reset: 2026-04-26T12:01:00Z

anthropic-ratelimit-tokens-limit: 200000
anthropic-ratelimit-tokens-remaining: 187432
anthropic-ratelimit-tokens-reset: 2026-04-26T12:01:00Z

anthropic-ratelimit-input-tokens-limit: 150000
anthropic-ratelimit-input-tokens-remaining: 143000
anthropic-ratelimit-input-tokens-reset: 2026-04-26T12:01:00Z

anthropic-ratelimit-output-tokens-limit: 50000
anthropic-ratelimit-output-tokens-remaining: 44432
anthropic-ratelimit-output-tokens-reset: 2026-04-26T12:01:00Z

retry-after: 5

The retry-after header appears only on 429 responses. It tells you exactly how many seconds to wait before retrying.
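On non-429 responses there is no retry-after, but the reset timestamps above let you compute a wait yourself. A small stdlib-only helper:

```python
from datetime import datetime, timezone

def seconds_until_reset(reset_header, now=None):
    """Seconds to wait until the window resets, from an
    anthropic-ratelimit-*-reset header (an RFC 3339 timestamp).
    Returns 0 if the reset time is already in the past."""
    # .replace handles the trailing "Z" on Python versions before 3.11
    reset = datetime.fromisoformat(reset_header.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return max((reset - now).total_seconds(), 0.0)
```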

Reading headers with the SDK

The Python SDK exposes headers on the response:

import anthropic

client = anthropic.Anthropic()

response = client.messages.with_raw_response.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}]
)

headers = response.headers
remaining_requests = int(headers.get("anthropic-ratelimit-requests-remaining", 0))
remaining_tokens = int(headers.get("anthropic-ratelimit-tokens-remaining", 0))

print(f"Requests remaining: {remaining_requests}")
print(f"Tokens remaining: {remaining_tokens}")

message = response.parse()
print(message.content[0].text)

When you hit a rate limit: the error

{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Rate limit exceeded. Please try again in 30 seconds."
  }
}

HTTP status: 429 Too Many Requests

This is recoverable. The correct response is to wait and retry — not to fail permanently.


Retry patterns

Exponential backoff with jitter (the standard)

import anthropic
import time
import random

client = anthropic.Anthropic()

def call_with_backoff(messages, model="claude-3-5-sonnet-20241022", max_retries=5):
    """Retry with exponential backoff + jitter on rate limit errors."""
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model=model,
                max_tokens=1024,
                messages=messages,
            )
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise  # Give up after max_retries
            
            # Exponential backoff: 2^attempt seconds + jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Retrying in {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)
        except anthropic.APIStatusError as e:
            if e.status_code != 529:
                raise
            if attempt == max_retries - 1:
                raise  # Overloaded and out of retries
            # 529 = overloaded; back off the same way as a rate limit
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)

Using retry-after header

When you have the exact wait time from the header:

import anthropic
import time

client = anthropic.Anthropic()

def call_with_retry_after(messages):
    """Use the retry-after header for precise wait time."""
    while True:
        try:
            response = client.messages.with_raw_response.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                messages=messages,
            )
            return response.parse()
        except anthropic.RateLimitError as e:
            # The SDK wraps the response — extract retry-after
            retry_after = 30  # default fallback
            if hasattr(e, 'response') and e.response:
                retry_after = int(e.response.headers.get("retry-after", 30))
            print(f"Rate limited. Waiting {retry_after}s...")
            time.sleep(retry_after + 1)  # +1 for safety

The SDK's built-in retry

The official SDK retries automatically on 429s and 529s with exponential backoff. The default is 2 retries:

import anthropic
import httpx

# Increase built-in retries
client = anthropic.Anthropic(
    max_retries=4,
    timeout=httpx.Timeout(60.0, connect=5.0)
)

The SDK's built-in retry is sufficient for most cases. Add your own layer only when you need custom logic (logging, fallback models, circuit breaking).


Staying under limits proactively

Monitor remaining tokens before large requests

import anthropic

client = anthropic.Anthropic()

def safe_large_request(large_prompt, threshold=0.2):
    """Only send if >20% of token budget remains."""
    # Check headroom with a lightweight call
    response = client.messages.with_raw_response.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=1,
        messages=[{"role": "user", "content": "ping"}]
    )
    remaining = int(response.headers.get("anthropic-ratelimit-tokens-remaining", 0))
    limit = int(response.headers.get("anthropic-ratelimit-tokens-limit", 1))
    
    if remaining / limit < threshold:
        raise RuntimeError(f"Token budget low ({remaining}/{limit}). Waiting for reset.")
    
    return client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[{"role": "user", "content": large_prompt}]
    )

Parallel requests and burst limits

Rate limits apply per minute across all requests from the same API key. If you send 50 parallel requests all at once, they can collectively exhaust your TPM budget in seconds.

Fix: Use a semaphore to cap concurrency:

import asyncio
import anthropic

client = anthropic.AsyncAnthropic()

async def process_batch(items, max_concurrent=10):
    semaphore = asyncio.Semaphore(max_concurrent)
    
    async def process_one(item):
        async with semaphore:
            return await client.messages.create(
                model="claude-3-5-haiku-20241022",
                max_tokens=512,
                messages=[{"role": "user", "content": item}]
            )
    
    return await asyncio.gather(*[process_one(item) for item in items])
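A concurrency cap bounds how many requests are in flight, but not token volume. A client-side token budget can pace by estimated TPM as well. This is a sketch: the limit value and your token estimates are assumptions, not something the API enforces for you:

```python
import threading
import time

class TokenBudget:
    """Track estimated tokens spent in the current minute and block
    before a request would exceed the per-minute budget."""

    def __init__(self, tpm_limit=160_000):  # set to your tier's TPM limit
        self.tpm_limit = tpm_limit
        self.window_start = time.monotonic()
        self.spent = 0
        self.lock = threading.Lock()

    def acquire(self, estimated_tokens):
        with self.lock:
            now = time.monotonic()
            if now - self.window_start >= 60:
                # New minute window: reset the spend counter
                self.window_start = now
                self.spent = 0
            if self.spent + estimated_tokens > self.tpm_limit:
                # Sleep out the rest of the window, then start fresh
                time.sleep(max(60 - (now - self.window_start), 0))
                self.window_start = time.monotonic()
                self.spent = 0
            self.spent += estimated_tokens
```

Call `budget.acquire(estimate)` before each request; a rough chars-divided-by-4 heuristic is usually close enough for pacing purposes.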

For very large batch workloads (>1000 requests), use the Batch API instead — it doesn't count against your rate limits and gets a 50% price discount.
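A minimal sketch of that path. The request shape follows the Message Batches API as documented; verify field names against the current docs before relying on them:

```python
def build_batch_requests(prompts, model="claude-3-5-haiku-20241022"):
    """Shape a list of prompts into Batches API request entries.
    Each entry pairs a custom_id with ordinary Messages-API params."""
    return [
        {
            "custom_id": f"req-{i}",
            "params": {
                "model": model,
                "max_tokens": 512,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts)
    ]

def submit_batch(client, prompts):
    """client is an anthropic.Anthropic() instance. The batch is
    processed asynchronously; poll it and read results by custom_id."""
    return client.messages.batches.create(requests=build_batch_requests(prompts))
```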


Rate limits vs. overload errors

Two different errors look similar but require different handling:

Error              Status   Cause                          Fix
rate_limit_error   429      You exceeded your tier limit   Wait for reset (check retry-after)
overloaded_error   529      Anthropic servers are busy     Retry with backoff

Rate limits are your fault (use your budget more efficiently or upgrade tier).
Overload is Anthropic's problem (retry, they'll recover in seconds to minutes).

Both are transient — retry, don't fail permanently.


Getting higher limits

If you consistently hit rate limits:

  1. Wait for automatic tier upgrades — limits increase automatically with spending history
  2. Use multiple API keys — limits are per key; distribute load across keys
  3. Switch to Batch API — for async workloads, batch requests bypass real-time rate limits
  4. Contact Anthropic — enterprise accounts get custom limits
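Pattern 2 above can be as simple as a round-robin pool. A sketch, where each pool entry would be an `anthropic.Anthropic(api_key=...)` client built with a different key:

```python
import itertools

def make_client_pool(clients):
    """Cycle through a pool of clients, one per API key, so load is
    spread evenly and each key's limits are used independently."""
    pool = itertools.cycle(clients)
    return lambda: next(pool)

def call_rotating(next_client, messages, model="claude-3-5-haiku-20241022"):
    # Pick the next key's client for this request
    return next_client().messages.create(
        model=model, max_tokens=512, messages=messages
    )
```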

For high-volume workloads, the Batch API is often the right answer. See the full batch API guide for implementation.


For the full production error taxonomy — all error types, when they're retryable, and the exact retry strategy for each — see the Claude API error handling guide.

Drafted with Claude Code. Rate limit tiers verified against platform.claude.com documentation as of 2026-04-26. Limits are subject to change — check console.anthropic.com for your current limits.

Frequently Asked Questions

What is the difference between RPM, TPM, and TPD rate limits?

RPM (requests per minute) limits how many API calls you can make per minute regardless of token size. TPM (tokens per minute) limits the total token volume — input plus output — processed per minute. TPD (tokens per day) caps total daily token usage and appears only for lower tiers. Hitting any one of these three limits triggers a 429 response.

How do I check which rate limit tier my account is on?

Go to console.anthropic.com → Settings → Limits. Your current tier, the limits for each model, and your spend history are all displayed there. Tiers upgrade automatically once you meet the spend and account-age thresholds — there is no manual approval step.

What is the difference between a 429 rate_limit_error and a 529 overloaded_error?

A 429 means you personally exhausted your quota — you need to wait for your rate limit window to reset before retrying. A 529 means Anthropic's servers are under load globally — it's not your quota, and retrying with backoff will succeed once the load drops, typically within seconds to minutes.

Does using the Batch API count against my real-time rate limits?

No. Batch API requests are processed asynchronously and are not counted against your real-time RPM or TPM limits. This is one of the main reasons to use the Batch API for high-volume async workloads — it avoids rate limit contention with your interactive traffic.

Can I use multiple API keys to bypass rate limits?

Yes. Rate limits are enforced per API key, not per account. You can create multiple keys in the Anthropic console and distribute requests across them. Each key gets the full Tier N limits independently. This is a common pattern for teams running parallel workloads that saturate a single key's TPM budget.


Take It Further

Claude API Cost Optimization Masterclass — Cut your Claude API bill by 60–90% without sacrificing quality. 12 production deployments analyzed. The concrete order-of-operations: prompt caching, model tiering, Batch API, token compression.

120-page PDF + 6-sheet Excel cost calculator. Real results: $2,100 → $187/month on customer support agent.

→ Get Cost Optimization Masterclass — $59

30-day money-back guarantee. Instant download.
