
Claude API Concurrent Requests and Rate Limit Handling Guide (2026)

How to handle Claude API concurrent requests, rate limits, and throughput optimization — with code for parallel calls, exponential backoff, queuing.


The Claude API enforces rate limits by requests per minute (RPM) and tokens per minute (TPM) per tier — running concurrent requests with exponential backoff and a semaphore-based concurrency limit is the correct approach for high-throughput workloads. This guide covers rate limit tiers, parallel request patterns in Python and Node.js, queuing strategies, and the cost math for optimizing throughput without triggering 429 errors.


Understanding Claude API Rate Limits

Anthropic enforces limits along three dimensions:

Requests per minute (RPM): how many API calls you can make per minute
Tokens per minute (TPM): total input + output tokens per minute
Tokens per day (TPD): a daily token ceiling that applies on lower tiers

Rate limit tiers by usage level (approximate — check Anthropic console for your current tier):

Free: 5 RPM, 25,000 TPM (Sonnet)
Build (Tier 1): 50 RPM, 40,000 TPM
Scale (Tier 2): 1,000 RPM, 80,000 TPM
Scale (Tier 3): 2,000 RPM, 160,000 TPM
Scale (Tier 4): 4,000 RPM, 400,000 TPM

Rate limit errors return HTTP 429 with a Retry-After header indicating when to retry.


The Core Pattern: Semaphore + Exponential Backoff

The standard approach for concurrent requests:

  1. Semaphore: Limit max concurrent in-flight requests to stay under RPM
  2. Exponential backoff: When you hit 429, wait and retry with increasing delays
  3. Token budget tracking: Monitor TPM to avoid token-based rate limits

Python Implementation

import anthropic
import asyncio
import time
from typing import Optional

client = anthropic.AsyncAnthropic()

async def call_with_backoff(
    prompt: str,
    semaphore: asyncio.Semaphore,
    max_retries: int = 5
) -> Optional[str]:
    """Single API call with semaphore and exponential backoff."""
    async with semaphore:
        for attempt in range(max_retries):
            try:
                message = await client.messages.create(
                    model="claude-haiku-4-5",
                    max_tokens=1024,
                    messages=[{"role": "user", "content": prompt}]
                )
                return message.content[0].text
            except anthropic.RateLimitError as e:
                if attempt == max_retries - 1:
                    raise
                # Exponential backoff with a small jitter: ~1s, 2.1s, 4.2s, 8.3s, 16.4s
                wait_time = (2 ** attempt) + (0.1 * attempt)
                print(f"Rate limited. Waiting {wait_time:.1f}s (attempt {attempt + 1})")
                await asyncio.sleep(wait_time)
            except anthropic.APIConnectionError:
                if attempt == max_retries - 1:
                    raise
                await asyncio.sleep(1)
    return None


async def process_batch(prompts: list[str], max_concurrent: int = 10) -> list[str]:
    """Process a list of prompts concurrently with rate limit safety."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [call_with_backoff(p, semaphore) for p in prompts]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if isinstance(r, str)]


# Usage
prompts = [f"Summarize topic {i}" for i in range(100)]
results = asyncio.run(process_batch(prompts, max_concurrent=10))
print(f"Processed {len(results)} prompts")

Throughput note: sustained RPM is roughly max_concurrent × 60 / average request latency, so max_concurrent=10 with ~3-second calls delivers on the order of 200 requests/minute, comfortably under Tier 2's 1,000 RPM. In practice the 80,000 TPM ceiling binds first at this tier: at roughly 1,500 tokens per call you can sustain only ~50 calls/minute, so size max_concurrent to leave about 10% headroom on whichever limit you hit first.
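
To size the cap for your own workload, you can turn that reasoning into a small helper. This is a rough sizing sketch, not an official formula; the latency and token figures below are assumptions you should replace with your own measurements:

def safe_concurrency(rpm_limit: int, tpm_limit: int,
                     avg_latency_s: float, tokens_per_call: int,
                     headroom: float = 0.85) -> int:
    """Pick a max_concurrent that respects both RPM and TPM with some headroom."""
    rpm_rate = rpm_limit                       # requests/minute allowed by RPM
    tpm_rate = tpm_limit / tokens_per_call     # requests/minute allowed by TPM
    allowed_rate = min(rpm_rate, tpm_rate) * headroom
    # Little's law: concurrency = arrival rate (req/s) * average latency (s)
    return max(1, int(allowed_rate / 60 * avg_latency_s))

# Assumed numbers: Tier 2 Sonnet-class limits, ~4s per call, ~1,500 tokens per call
print(safe_concurrency(rpm_limit=1_000, tpm_limit=80_000,
                       avg_latency_s=4.0, tokens_per_call=1_500))  # -> 3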


Node.js / TypeScript Concurrent Requests

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

class Semaphore {
  private permits: number;
  private queue: Array<() => void> = [];

  constructor(permits: number) {
    this.permits = permits;
  }

  async acquire(): Promise<void> {
    if (this.permits > 0) {
      this.permits--;
      return;
    }
    await new Promise<void>(resolve => this.queue.push(resolve));
  }

  release(): void {
    if (this.queue.length > 0) {
      const next = this.queue.shift()!;
      next();
    } else {
      this.permits++;
    }
  }
}

async function callWithBackoff(
  prompt: string,
  semaphore: Semaphore,
  maxRetries = 5
): Promise<string | null> {
  await semaphore.acquire();
  try {
    for (let attempt = 0; attempt < maxRetries; attempt++) {
      try {
        const message = await client.messages.create({
          model: 'claude-haiku-4-5',
          max_tokens: 1024,
          messages: [{ role: 'user', content: prompt }]
        });
        return message.content[0].type === 'text' ? message.content[0].text : null;
      } catch (error) {
        if (error instanceof Anthropic.RateLimitError) {
          if (attempt === maxRetries - 1) throw error;
          const delay = Math.pow(2, attempt) * 1000;
          console.log(`Rate limited. Retrying in ${delay}ms...`);
          await new Promise(resolve => setTimeout(resolve, delay));
        } else if (error instanceof Anthropic.APIConnectionError) {
          if (attempt === maxRetries - 1) throw error;
          await new Promise(resolve => setTimeout(resolve, 1000));
        } else {
          throw error;
        }
      }
    }
    return null;
  } finally {
    semaphore.release();
  }
}

async function processBatch(prompts: string[], maxConcurrent = 10): Promise<string[]> {
  const semaphore = new Semaphore(maxConcurrent);
  const results = await Promise.allSettled(
    prompts.map(p => callWithBackoff(p, semaphore))
  );
  return results
    .filter((r): r is PromiseFulfilledResult<string> => r.status === 'fulfilled' && r.value !== null)
    .map(r => r.value);
}

Cost optimization patterns for high-throughput Claude API workloads

Cost Optimization Guide ($59) covers rate limit navigation, model routing, batch API for 50% discounts, prompt caching strategies, and token budget frameworks.

Get Cost Optimization Guide — $59


Token-Per-Minute (TPM) Tracking

RPM limits are easy to visualize, but TPM limits often bite first with large prompts. Track token usage per window:

import asyncio
import time
from collections import deque

class TokenBudgetTracker:
    """Track tokens used in a rolling 60-second window."""
    
    def __init__(self, tpm_limit: int):
        self.tpm_limit = tpm_limit
        self.usage_window: deque = deque()  # (timestamp, tokens) pairs
    
    def record_usage(self, tokens: int):
        now = time.time()
        self.usage_window.append((now, tokens))
        # Remove entries older than 60 seconds
        while self.usage_window and self.usage_window[0][0] < now - 60:
            self.usage_window.popleft()
    
    def tokens_used_last_minute(self) -> int:
        now = time.time()
        return sum(t for ts, t in self.usage_window if ts > now - 60)
    
    async def wait_if_needed(self, estimated_tokens: int):
        """Wait if adding estimated_tokens would exceed TPM limit."""
        while self.tokens_used_last_minute() + estimated_tokens > self.tpm_limit * 0.9:
            print(f"TPM budget at {self.tokens_used_last_minute()}/{self.tpm_limit}. Waiting...")
            await asyncio.sleep(5)


# Usage with tracker
tracker = TokenBudgetTracker(tpm_limit=80_000)  # Tier 2

async def call_with_budget(prompt: str, semaphore: asyncio.Semaphore):
    estimated = len(prompt.split()) * 1.3 + 1024  # Rough token estimate
    await tracker.wait_if_needed(int(estimated))
    
    async with semaphore:
        response = await client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        actual_tokens = response.usage.input_tokens + response.usage.output_tokens
        tracker.record_usage(actual_tokens)
        return response.content[0].text
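
To run a whole batch through the budget-aware call, fan it out the same way as before. A minimal driver sketch reusing call_with_budget from above; the prompt list and concurrency value here are placeholders:

async def process_with_budget(prompts: list[str], max_concurrent: int = 10):
    """Run call_with_budget over a batch, capped by a shared semaphore."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [call_with_budget(p, semaphore) for p in prompts]
    return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(process_with_budget([f"Summarize topic {i}" for i in range(50)]))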

Using the Anthropic Batch API for High Volume

For workloads where latency doesn't matter (overnight processing, bulk enrichment), the Message Batches API is the right tool:

import anthropic
import time

client = anthropic.Anthropic()

# Submit up to 10,000 requests in a single batch
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"request-{i}",
            "params": {
                "model": "claude-haiku-4-5",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": f"Classify sentiment: '{text}'"}]
            }
        }
        for i, text in enumerate(texts)
    ]
)

print(f"Batch ID: {batch.id}")
print(f"Status: {batch.processing_status}")

# Poll for completion
while True:
    result = client.messages.batches.retrieve(batch.id)
    if result.processing_status == "ended":
        break
    print(f"Processing... {result.request_counts.processing} remaining")
    time.sleep(60)

# Retrieve results (skip any errored or expired entries)
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(f"{result.custom_id}: {result.result.message.content[0].text[:100]}")

Cost impact: Batch API pricing is 50% of the synchronous rate. For 10,000 Haiku calls averaging 500 input and 500 output tokens each (at list prices of roughly $1/MTok input and $5/MTok output), synchronous cost ≈ $30 and batch cost ≈ $15. At 1M such calls per month, the discount is worth roughly $1,500/month. See Claude API Cost and Prompt Caching Break-Even for the full pricing model.
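
The arithmetic is easy to adapt. A back-of-envelope sketch; the per-token prices are assumptions, so substitute the current Haiku rates from the Anthropic pricing page:

# Assumed Haiku list prices in USD per million tokens; verify before relying on them
INPUT_PER_MTOK, OUTPUT_PER_MTOK = 1.00, 5.00
calls, in_tok, out_tok = 10_000, 500, 500

sync_cost = calls * (in_tok * INPUT_PER_MTOK + out_tok * OUTPUT_PER_MTOK) / 1_000_000
batch_cost = sync_cost * 0.5   # Batch API: 50% discount
print(f"sync ≈ ${sync_cost:.2f}, batch ≈ ${batch_cost:.2f}, saved ≈ ${sync_cost - batch_cost:.2f}")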


Optimizing Throughput: Model Routing

Not every concurrent request needs the same model. Route by task complexity to maximize throughput within your token budget:

from enum import Enum

class TaskType(Enum):
    CLASSIFICATION = "classification"      # Haiku
    SUMMARIZATION = "summarization"        # Haiku or Sonnet
    CODE_GENERATION = "code_generation"    # Sonnet
    COMPLEX_REASONING = "complex"          # Opus

MODEL_MAP = {
    TaskType.CLASSIFICATION: "claude-haiku-4-5",
    TaskType.SUMMARIZATION: "claude-haiku-4-5",
    TaskType.CODE_GENERATION: "claude-sonnet-4-5",
    TaskType.COMPLEX_REASONING: "claude-opus-4-5"
}

# At current list pricing, Haiku is roughly 3x cheaper than Sonnet per token and
# ~5x cheaper than Opus. Rate limits are also tracked per model, so routing most
# requests to Haiku frees your Sonnet/Opus RPM and TPM budgets.

This 80/15/5 routing (80% Haiku, 15% Sonnet, 5% Opus) is the core cost optimization principle. Because rate limits are generally tracked per model, traffic routed to Haiku draws on its own RPM/TPM budget rather than competing with Sonnet and Opus, and every Haiku token costs roughly a third of a Sonnet token.
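
A minimal async routing wrapper along these lines, reusing MODEL_MAP and the semaphore pattern from earlier (how you decide the TaskType for a given prompt is up to you):

async def route_and_call(prompt: str, task_type: TaskType,
                         semaphore: asyncio.Semaphore) -> str:
    """Dispatch a prompt to the model mapped for its task type."""
    async with semaphore:
        response = await client.messages.create(
            model=MODEL_MAP[task_type],
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text

# e.g. a simple classification job goes to Haiku:
# await route_and_call("Classify sentiment: 'great product'", TaskType.CLASSIFICATION, semaphore)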

For the full model comparison and when to use each, see Claude Haiku vs Sonnet vs Opus: Which Model to Use.


Handling 429 Errors: What the Headers Tell You

When you receive a 429, the response headers contain critical information:

try:
    response = await client.messages.create(...)
except anthropic.RateLimitError as e:
    # e.response contains the raw HTTP response
    retry_after = e.response.headers.get('retry-after')
    remaining_rpm = e.response.headers.get('anthropic-ratelimit-requests-remaining')
    reset_time = e.response.headers.get('anthropic-ratelimit-requests-reset')
    
    print(f"Retry after: {retry_after}s")
    print(f"Requests remaining: {remaining_rpm}")
    print(f"Limit resets at: {reset_time}")
    
    if retry_after:
        await asyncio.sleep(float(retry_after) + 0.5)  # Add 500ms buffer

Always honor the retry-after header instead of using a fixed backoff — it's the most accurate signal of when your limit resets.
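
One way to fold that into the earlier backoff loop: a small helper, sketched here, that prefers the server's retry-after value and falls back to exponential delays only when the header is absent:

def backoff_delay(error: anthropic.RateLimitError, attempt: int) -> float:
    """Prefer the server-provided retry-after; otherwise fall back to exponential backoff."""
    retry_after = error.response.headers.get("retry-after")
    if retry_after:
        return float(retry_after) + 0.5   # small buffer past the stated reset
    return float(2 ** attempt)

# Inside the retry loop from call_with_backoff:
#     except anthropic.RateLimitError as e:
#         await asyncio.sleep(backoff_delay(e, attempt))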


Frequently Asked Questions

What is the Claude API rate limit and how do I check my current tier?

Rate limits vary by tier based on your spending history. Check your current limits in the Anthropic Console under Settings > Limits. New accounts start at Tier 1 (50 RPM). Limits increase automatically as you spend more, or you can request increases via the console for production workloads.

How many concurrent requests can I safely make to the Claude API?

A practical starting point is 80–90% of your RPM limit divided by 60 (converting per-minute to per-second, and assuming roughly one-second requests; slower calls leave extra RPM headroom). For Tier 2 (1,000 RPM): 1,000 / 60 × 0.85 ≈ 14 max concurrent requests. Always leave a 10–15% buffer to absorb bursts without hitting 429s.

What's the difference between RPM and TPM rate limits?

RPM (requests per minute) limits how many API calls you can make, regardless of size. TPM (tokens per minute) limits the total token volume. TPM is often the binding constraint for large prompts. A single 50,000-token request uses 62.5% of an 80,000 TPM budget. Monitor both independently.

Does the Anthropic Batch API have the same rate limits?

Batch API requests don't count against your synchronous RPM/TPM limits. Batches are processed asynchronously and have their own queue system. The trade-off is latency — batches can take minutes to hours to complete. For time-sensitive requests, use the synchronous API with proper concurrency management.

How do I avoid token-per-minute (TPM) rate limits for large documents?

Use prompt caching for repeated context (large documents or system prompts): cache reads are billed at roughly 10% of the base input rate, and on newer models cached tokens may not count against your input TPM limit at all, so check the rate limit documentation for your model. Split large documents into chunks and process them sequentially. Route to Haiku where possible, since limits are tracked per model and Haiku traffic draws on its own budget. See the Claude API Cost Guide for caching ROI calculations.
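
A minimal caching sketch using the Messages API cache_control parameter; large_document is a placeholder for your own repeated context:

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": large_document,                   # big, repeated context
            "cache_control": {"type": "ephemeral"},   # mark this block cacheable
        }
    ],
    messages=[{"role": "user", "content": "Answer questions about the document above."}],
)
print(response.usage.cache_read_input_tokens)  # > 0 on cache hits after the first call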

What should I do if my application needs more throughput than my current tier allows?

First, apply model routing (Haiku for simple tasks). Second, enable prompt caching. Third, use Batch API for non-time-sensitive work. If you still need more synchronous throughput, request a rate limit increase in the Anthropic Console (requires demonstrating consistent usage and payment history). Enterprise plans have negotiated limits.


Full cost and throughput optimization framework

Cost Optimization Guide ($59) includes complete throughput calculators, batch API workflows, model routing decision trees, and prompt caching ROI models.

Get Cost Optimization Guide — $59

Tools and references