
Claude API Rate Limits: Tier 1-4, 429 Recovery (2026)

Claude API rate limits: Tier 1 = 50 RPM / 50K TPM / $100 daily. What triggers 429, retry-after backoff code, and how to reach Tier 4 in exactly 14 days.


Anthropic enforces four independent counters per organization, scaled by usage tier: RPM (requests per minute, 50 on Tier 1), input TPM (tokens per minute, 50K), output TPM (10K), and a $100 daily spend cap. Exceed any one and the API returns 429 Too Many Requests. The correct recovery is exponential backoff that honors the retry-after header, not a fixed sleep. Tier upgrades are automatic, triggered by both cumulative spend and days since first payment, so you can't just throw $400 at the API and immediately land on Tier 4.

This guide unpacks every limit, shows production-grade retry code, and explains the burst patterns that keep throughput high without tripping 429s.

The four limits that matter

Most people think "rate limit" means RPM. Anthropic enforces four independent counters, and you hit 429 the moment any one trips:

  1. RPM: requests per minute. Each API call counts as 1, regardless of size.
  2. Input TPM: input tokens per minute. Includes system prompt, user message, tool definitions, prior assistant turns, and any cached prefix. Tool-heavy agents burn input TPM faster than chat because every tool result re-enters as input on the next turn.
  3. Output TPM: output tokens per minute. The one that bites long-form generation.
  4. Daily $ cap (Tier 1 and 2 only): hard ceiling on cumulative spend per UTC day. Resets at 00:00 UTC, not local midnight.

Output TPM is the most common surprise: a single 4K-token response on Tier 1 burns 40% of your minute. Plan for whichever limit your workload hits first, not the headline RPM number.
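
To see which counter a workload hits first, a rough sketch like the one below can help. It assumes the Tier 1 numbers quoted in this guide and a steady request rate; `binding_limit` and its parameters are hypothetical names for illustration, not part of any SDK.

```python
# Tier 1 limits as quoted in this guide (an assumption; confirm in your Console).
TIER1 = {"rpm": 50, "input_tpm": 50_000, "output_tpm": 10_000}

def binding_limit(req_per_min, avg_in_tokens, avg_out_tokens, limits=TIER1):
    """Return (counter_name, utilization) for the most heavily used counter.

    Utilization above 1.0 means the workload exceeds that limit.
    """
    usage = {
        "rpm": req_per_min / limits["rpm"],
        "input_tpm": req_per_min * avg_in_tokens / limits["input_tpm"],
        "output_tpm": req_per_min * avg_out_tokens / limits["output_tpm"],
    }
    name = max(usage, key=usage.get)
    return name, usage[name]

# 10 requests/min at 2,000 input and 1,500 output tokens each:
print(binding_limit(10, 2_000, 1_500))  # ('output_tpm', 1.5)
```

Here RPM sits at 20% while output TPM is already 50% over budget, which is the surprise described above.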

Tier table (Tier 1 through Tier 4)

Tier     Spend required   Days required   RPM     Input TPM   Output TPM   Daily $ cap
Tier 1   $0               0               50      50K         10K          $100
Tier 2   $40              7               1,000   100K        20K          $500
Tier 3   $200             7               2,000   200K        40K          $1,000
Tier 4   $400             14              4,000   400K        80K          $5,000

Tier 4 removes the daily cap entirely. Above Tier 4 ("custom tier") requires sales contact.

The days requirement catches teams. You can spend $400 in week one, but Tier 4 needs 14 days since first payment, so the earliest you reach it is day 14 with $400 cumulative spend. If your launch needs Tier 4, start spending real money 2-3 weeks ahead.
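
The two-condition rule is easy to encode. The sketch below assumes the thresholds in the table above; `eligible_tier` and `latest_first_payment` are illustrative helpers, not an Anthropic API.

```python
from datetime import date, timedelta

# (tier, cumulative spend required in USD, days since first payment),
# copied from the tier table in this guide.
TIER_RULES = [(4, 400, 14), (3, 200, 7), (2, 40, 7), (1, 0, 0)]

def eligible_tier(cumulative_spend, days_since_first_payment):
    """Highest tier whose spend AND days requirements are both met."""
    for tier, spend_req, days_req in TIER_RULES:
        if cumulative_spend >= spend_req and days_since_first_payment >= days_req:
            return tier
    return 1

def latest_first_payment(launch_day, target_tier=4):
    """Latest date the first payment can land and still satisfy the days rule."""
    days_req = next(d for t, _, d in TIER_RULES if t == target_tier)
    return launch_day - timedelta(days=days_req)

print(eligible_tier(400, 6))                    # 1: spend qualifies, days do not
print(eligible_tier(400, 14))                   # 4
print(latest_first_payment(date(2026, 6, 15)))  # 2026-06-01
```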

Anatomy of a 429 response

When you trip a limit, Anthropic returns:

HTTP/1.1 429 Too Many Requests
retry-after: 23
anthropic-ratelimit-requests-limit: 1000
anthropic-ratelimit-requests-remaining: 0
anthropic-ratelimit-requests-reset: 2026-05-09T14:23:45Z
anthropic-ratelimit-input-tokens-limit: 100000
anthropic-ratelimit-input-tokens-remaining: 0
anthropic-ratelimit-input-tokens-reset: 2026-05-09T14:23:12Z
anthropic-ratelimit-tokens-limit: 100000
anthropic-ratelimit-tokens-remaining: 0
anthropic-ratelimit-tokens-reset: 2026-05-09T14:23:12Z

{"type":"error","error":{"type":"rate_limit_error","message":"..."}}

The two things to read every time:

  1. retry-after: the number of seconds to wait before retrying.
  2. The reset timestamps. There are three, not one: RPM, input TPM, and output TPM reset on independent rolling windows. Read all three before deciding when to retry.
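
As a minimal sketch of that decision, the function below takes the longest wait implied by retry-after and by the reset timestamp of any exhausted counter. It assumes headers arrive as a plain dict of strings with the names shown in the sample response; `seconds_until_retry` is a hypothetical helper, and skipping counters that still have budget is a design choice, not documented behavior.

```python
from datetime import datetime, timezone

RESET_HEADERS = (
    "anthropic-ratelimit-requests-reset",
    "anthropic-ratelimit-input-tokens-reset",
    "anthropic-ratelimit-tokens-reset",
)

def seconds_until_retry(headers, now=None):
    """Longest wait implied by retry-after and by exhausted counters' resets."""
    now = now or datetime.now(timezone.utc)
    waits = []
    if "retry-after" in headers:
        waits.append(float(headers["retry-after"]))
    for name in RESET_HEADERS:
        remaining = headers.get(name.replace("-reset", "-remaining"))
        # Only counters with nothing left constrain the retry time.
        if name in headers and remaining == "0":
            reset = datetime.fromisoformat(headers[name].replace("Z", "+00:00"))
            waits.append((reset - now).total_seconds())
    # Never return a negative wait, even if every reset is already in the past.
    return max(0.0, max(waits, default=0.0))
```

Add jitter on top of the returned value before sleeping so that clients sharing a limit don't all wake at the same instant.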

Exponential backoff that actually works (TypeScript)

The naive await sleep(1000 * 2 ** attempt) ignores retry-after and synchronizes all your retries, which is bad. Production code honors the header, adds jitter, and caps total attempts:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function callWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 6,
): Promise<T> {
  let attempt = 0;
  while (true) {
    try {
      return await fn();
    } catch (err: any) {
      attempt++;
      const status = err?.status ?? err?.response?.status;
      const isRetryable = status === 429 || status === 529 || status >= 500;
      if (!isRetryable || attempt >= maxAttempts) throw err;

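      // 1. Honor retry-after if present (seconds). Depending on SDK version,
      //    err.headers may be a plain object or a Headers instance.
      const h = err?.headers;
      const raw = typeof h?.get === "function" ? h.get("retry-after") : h?.["retry-after"];
      const retryAfterMs = raw == null ? NaN : Number(raw) * 1000;
      // 2. Otherwise exponential: 1s, 2s, 4s, 8s, 16s (capped at 32s)
      const backoffMs = Math.min(32_000, 1000 * 2 ** (attempt - 1));
      // 3. Never retry earlier than retry-after: add up to 1s of jitter on top.
      //    Without the header, use full jitter in [0, backoff] to de-sync retries.
      const sleep = Number.isFinite(retryAfterMs)
        ? retryAfterMs + Math.random() * 1000
        : Math.random() * backoffMs;
      await new Promise((r) => setTimeout(r, sleep));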
    }
  }
}

// Usage
const msg = await callWithBackoff(() =>
  client.messages.create({
    model: "claude-sonnet-4-6-20251120",
    max_tokens: 1024,
    messages: [{ role: "user", content: "Hello" }],
  }),
);

Three things make this production-ready: retry-after takes priority over the computed backoff, jitter de-synchronizes retrying clients (full jitter generally spreads load better than equal jitter because it removes more of the correlation between them), and a hard attempt cap means an outage doesn't spin forever. Add a circuit breaker and a retry-rate metric for full production hygiene.
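
The circuit breaker mentioned above can be quite small. This is a minimal Python sketch, assuming a consecutive-failure threshold and a fixed cooldown; the class name, defaults, and injectable clock are illustrative choices, not a prescribed design.

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow calls again after `cooldown` seconds."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cooldown, calls are allowed again; the next failure
        # re-opens the breaker and restarts the cooldown.
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

Check `allow()` before each call, and call `record_success()` or `record_failure()` once the retry wrapper returns or gives up.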

Burst patterns: token bucket beats naive concurrency

If you fire 100 requests in parallel from a Tier 2 account, most of them can come back as 429 within a few hundred milliseconds, because a burst can trip the limit even when the per-minute total would not. The fix is a client-side token bucket that rate-limits before the request leaves your process:

import asyncio, time
from anthropic import AsyncAnthropic

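class TokenBucket:
    def __init__(self, rate_per_min: int, burst: int = 0):
        # burst = most units spendable at once; 0 means a full minute's budget.
        self.capacity = burst or rate_per_min
        self.tokens = float(self.capacity)
        self.refill_per_sec = rate_per_min / 60.0
        self.last = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self, n: int = 1):
        if n > self.capacity:
            # The bucket can never hold this much; fail instead of looping forever.
            raise ValueError(f"requested {n} exceeds bucket capacity {self.capacity}")
        while True:
            async with self.lock:
                now = time.monotonic()
                # Refill for the time elapsed since the last check, up to capacity.
                self.tokens = min(
                    self.capacity,
                    self.tokens + (now - self.last) * self.refill_per_sec,
                )
                self.last = now
                if self.tokens >= n:
                    self.tokens -= n
                    return
                wait = (n - self.tokens) / self.refill_per_sec
            # Sleep outside the lock so other callers can make progress.
            await asyncio.sleep(wait)

# Tier 2: 1000 RPM, 100K input TPM
# Cap the request burst at roughly one second of budget so parallel calls are paced.
rpm_bucket = TokenBucket(1000, burst=20)
tpm_bucket = TokenBucket(100_000)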

client = AsyncAnthropic()

async def call(prompt: str):
    estimated_input = len(prompt) // 4  # rough char→token
    await rpm_bucket.acquire(1)
    await tpm_bucket.acquire(estimated_input)
    return await client.messages.create(
        model="claude-sonnet-4-6-20251120",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )

Two buckets, one for RPM and one for TPM, gated before the network call. Once a bucket is empty, further calls drain at the refill rate, which keeps 429s rare, throughput predictable, and p99 latency stable. A production version adds a third bucket for output TPM (estimate via max_tokens) and reconciles each bucket against the response's actual usage fields to prevent drift.
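
Reconciliation can be sketched as a correction applied after each response. The example assumes a bucket object exposing `tokens` and `capacity` as in the class above, and that the actual count comes from the response's usage fields (for example `usage.input_tokens`); `reconcile` itself is a hypothetical helper.

```python
from types import SimpleNamespace

def reconcile(bucket, estimated: int, actual: int) -> None:
    """Correct a bucket after a call whose real token usage is now known.

    Over-estimates are refunded (never above capacity); under-estimates are
    charged, and may push the balance negative so later callers wait longer.
    """
    bucket.tokens = min(bucket.capacity, bucket.tokens + (estimated - actual))

# Stand-in for a TokenBucket that has already reserved 1,000 tokens.
bucket = SimpleNamespace(capacity=100_000, tokens=99_000.0)
reconcile(bucket, estimated=1_000, actual=1_400)
print(bucket.tokens)  # 98600.0
```

With a real asyncio bucket, apply the correction while holding the bucket's lock so it cannot interleave with a refill.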




Long-running batches: use the Batch API

If your job isn't latency-sensitive, don't fight rate limits. Use the Message Batches API: 50% cheaper on both input and output, no synchronous rate limit, up to 10,000 requests per batch processed within 24 hours (usually under an hour). Heuristic: if a human won't see the output within 60 seconds, route it to batch. Combining batch with prompt caching can cut token costs 60-70% versus synchronous calls. See the cost monitoring guide for automatic routing based on a per-request latency_budget field.
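
The 60-second heuristic can be sketched as a small router. It assumes each job carries a latency budget in seconds (the latency_budget field mentioned above) and uses the 10,000-request batch size quoted in this guide; `route` and `chunk` are illustrative names, and treating a missing budget as synchronous is a conservative assumption.

```python
def route(latency_budget_s, threshold_s=60.0):
    """Pick 'sync' when someone is waiting, 'batch' otherwise.

    A missing budget is treated as latency-sensitive (an assumption).
    """
    if latency_budget_s is None or latency_budget_s <= threshold_s:
        return "sync"
    return "batch"

def chunk(requests, size=10_000):
    """Split queued requests into lists no larger than one batch."""
    return [requests[i:i + size] for i in range(0, len(requests), size)]

print(route(5))      # sync
print(route(3600))   # batch
```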

Multi-key strategy: real, with caveats

Multiple API keys in one organization each get their own RPM counter, but share the same TPM and daily $ budget. So multi-key buys parallelism for request-bound workloads but does nothing for token-bound ones. Don't use multi-key to bypass tier limits: Anthropic's enforcement is org-wide on tokens and spend.

Legitimate use: separating keys per workspace so one runaway script doesn't exhaust RPM for production traffic. One nuance: rotating keys does not reset rate limits; if you've burned through TPM, a new key won't help.

Tier upgrade triggers and manual requests

Upgrades are automatic. The system checks daily and promotes you when both conditions are true: cumulative spend ≥ tier threshold and days since first payment ≥ tier minimum. No button to push.

For enterprise volume or pre-launch teams with funded runway, email Anthropic support with: org ID, expected monthly RPM/TPM, use case description, billing contact. Manual upgrades resolve in 1-3 business days. YC and Anthropic for Startups members can often skip Tier 1 entirely.

Five common pitfalls

  1. Fast retries with no jitter. Ten clients retrying at t+1s cause a thundering herd. Use full jitter (random * base), not equal jitter.
  2. Ignoring retry-after. Your guess is worse than the server's. Read it first; fall back to exponential math only when missing.
  3. Not honoring the daily cap. Tier 1/2 have a $100/$500 ceiling, and a runaway loop with max_tokens: 8192 on Opus can blow it in under a minute. Add a client-side kill-switch at 80% of cap.
  4. Treating workspaces as separate budgets. Workspaces are observability boundaries; rate limits and dollar caps are organization-wide.
  5. Forgetting streaming counts. Streamed and non-streamed output consume identical TPM. A client that disconnects mid-stream still pays for tokens already generated. See error handling patterns for streaming-specific 429 recovery and the overloaded_error 529 (distinct from a rate limit).
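
The kill-switch from pitfall 3 can be sketched as a small guard that tracks spend per UTC day. It assumes you compute each request's cost yourself from token usage and your model's pricing; `DailySpendGuard` and its parameters are hypothetical, not part of the SDK.

```python
from datetime import datetime, timezone

class DailySpendGuard:
    """Refuse new calls once spend reaches a fraction of the daily cap."""

    def __init__(self, daily_cap_usd, stop_fraction=0.8):
        self.limit = daily_cap_usd * stop_fraction
        self.spent = 0.0
        self.day = None

    def _roll(self, now):
        # The cap resets at 00:00 UTC, so bucket spend by UTC date.
        today = now.astimezone(timezone.utc).date()
        if today != self.day:
            self.day, self.spent = today, 0.0

    def record(self, cost_usd, now=None):
        self._roll(now or datetime.now(timezone.utc))
        self.spent += cost_usd

    def allow(self, now=None):
        self._roll(now or datetime.now(timezone.utc))
        return self.spent < self.limit
```

Call `allow()` before each request and `record()` after each response; when `allow()` returns False, stop and page a human rather than retrying.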

Frequently Asked Questions

What's the fastest way to get to Tier 4?

Minimum is 14 days from first payment with $400+ cumulative spend. The day requirement is the binding constraint, and you can't shortcut it by spending faster. If you need Tier 4 by date X, make your first paid call by X-14. For enterprise volume, email support for a manual promotion.

Do streaming responses count toward TPM differently?

No. Streamed and non-streamed completions consume identical input/output tokens; streaming is just a delivery mode. The only difference: a client that disconnects mid-stream still pays for tokens already generated.

Can I share rate limit across multiple keys?

Sort of. Multiple keys in the same organization each get independent RPM counters (so you can parallelize request-bound workloads across keys), but they share the organization-wide TPM and daily spend cap. Multi-key is a parallelism tool, not a tier-upgrade workaround.

How do I monitor my current rate?

Read anthropic-ratelimit-*-remaining on every successful response; it's free telemetry. Pipe the values to Datadog/Prometheus/PostHog and alert at 20% remaining. The Anthropic Console also shows real-time RPM/TPM graphs per key.
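
A minimal sketch of that alert check, assuming response headers are available as a plain dict of strings and follow the anthropic-ratelimit-<counter>-limit / -remaining naming shown in the sample 429 response; `low_headroom` is a hypothetical helper, and wiring it to a metrics backend is left out.

```python
def low_headroom(headers, alert_fraction=0.2):
    """Return {counter: fraction_remaining} for counters at or below the alert level."""
    prefix, suffix = "anthropic-ratelimit-", "-remaining"
    alerts = {}
    for name, value in headers.items():
        if name.startswith(prefix) and name.endswith(suffix):
            counter = name[len(prefix):-len(suffix)]
            limit = float(headers.get(f"{prefix}{counter}-limit", 0))
            if limit > 0:  # skip counters whose limit header is missing
                fraction = float(value) / limit
                if fraction <= alert_fraction:
                    alerts[counter] = fraction
    return alerts

sample = {
    "anthropic-ratelimit-requests-limit": "1000",
    "anthropic-ratelimit-requests-remaining": "150",
    "anthropic-ratelimit-input-tokens-limit": "100000",
    "anthropic-ratelimit-input-tokens-remaining": "60000",
}
print(low_headroom(sample))  # {'requests': 0.15}
```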

What about prompt caching: does cache_read count toward TPM?

Yes. Cache reads consume the same TPM budget as fresh input tokens; the discount (10% of the normal input price) applies to cost only. For TPM purposes a cached prefix and a fresh prefix are equivalent, since both eat your input TPM. Caching is a cost and latency lever, not a rate-limit lever. If you're hitting input TPM constantly, caching won't save you; you need a tier upgrade or a token bucket. See the prompt caching guide for which workloads benefit.

The one-line takeaway

Read the rate-limit headers on every response, honor retry-after on every 429, gate your client with a token bucket, and route long-running work to the Batch API. Do those four things and rate-limit incidents in production become rare.


AI Disclosure: Drafted with Claude Code; rate limit numbers from Anthropic public docs as of May 2026.

Tools and references