
Claude API Error Handling: Rate Limits, Retries, and Production Patterns

Every Claude API error code explained with production retry strategies, exponential backoff implementation, and circuit breaker patterns.


The Anthropic API returns structured errors with specific HTTP status codes. Knowing which errors to retry, which to log and surface to users, and which indicate bugs in your code is the difference between a production-ready integration and one that silently fails.

Error code reference

| HTTP status | Error type | Meaning | Action |
|---|---|---|---|
| 400 | invalid_request_error | Malformed request: bad JSON, unsupported parameters, exceeded context window | Fix the request; do not retry |
| 401 | authentication_error | Invalid API key | Check key validity; do not retry |
| 403 | permission_error | Valid key but insufficient permissions (e.g. model not enabled) | Check account permissions; do not retry |
| 404 | not_found_error | Endpoint or model doesn't exist | Fix the model name or endpoint; do not retry |
| 413 | request_too_large | Request body exceeds the size limit | Reduce context or split the request |
| 422 | unprocessable_entity | Request valid but semantically wrong (e.g. invalid tool schema) | Fix the schema; do not retry |
| 429 | rate_limit_error | Too many requests or tokens per minute | Retry with exponential backoff |
| 500 | api_error | Internal server error | Retry with backoff, max 3 attempts |
| 529 | overloaded_error | API overloaded | Retry with longer backoff |

The critical distinction: 4xx errors (except 429) indicate a problem with your request and should not be retried. 429 and 5xx errors are transient and should be retried.
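That distinction can be collapsed into a small predicate. A minimal sketch (the helper name and ranges are ours, encoding the table above rather than any official SDK API):

```python
def is_retryable(status_code: int) -> bool:
    """Return True for transient errors worth retrying, per the table above."""
    if status_code == 429:   # rate_limit_error: retry with backoff
        return True
    if status_code >= 500:   # api_error (500), overloaded_error (529)
        return True
    return False             # remaining 4xx: fix the request instead
```

Gate your retry loop on this before sleeping, so 400/401/403 failures surface immediately instead of burning retry attempts.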


Rate limit errors (429)

The most common production error. Rate limits are enforced on several dimensions at once: requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM). Exceeding any one of them returns a 429.

The Retry-After header in the 429 response tells you exactly how many seconds to wait.

Python:

import anthropic
import time

client = anthropic.Anthropic()

def call_with_retry(
    messages: list,
    model: str = "claude-sonnet-4-6",
    max_retries: int = 5,
    base_delay: float = 1.0,
) -> anthropic.types.Message:
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model=model,
                max_tokens=2048,
                messages=messages,
            )
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Respect the Retry-After header if present
            retry_after = 0.0
            response = getattr(e, "response", None)
            if response is not None:
                retry_after = float(response.headers.get("Retry-After", 0))
            wait = max(retry_after, base_delay * (2 ** attempt))
            print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)
        except anthropic.APIStatusError as e:
            if e.status_code >= 500 and attempt < max_retries - 1:
                # 5xx: transient server error, retry
                wait = base_delay * (2 ** attempt)
                print(f"Server error {e.status_code}. Waiting {wait:.1f}s")
                time.sleep(wait)
            else:
                raise  # 4xx or final attempt: re-raise
    raise RuntimeError("Should not reach here")

TypeScript:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function callWithRetry(
  messages: Anthropic.Messages.MessageParam[],
  model = "claude-sonnet-4-6",
  maxRetries = 5,
  baseDelay = 1000
): Promise<Anthropic.Messages.Message> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.messages.create({
        model,
        max_tokens: 2048,
        messages,
      });
    } catch (err) {
      if (err instanceof Anthropic.RateLimitError) {
        if (attempt === maxRetries - 1) throw err;
        const retryAfter = parseInt(err.headers?.["retry-after"] ?? "0") * 1000;
        const wait = Math.max(retryAfter, baseDelay * Math.pow(2, attempt));
        console.log(`Rate limited. Waiting ${wait}ms (attempt ${attempt + 1}/${maxRetries})`);
        await new Promise((r) => setTimeout(r, wait));
        continue;
      }
      if (err instanceof Anthropic.APIError && err.status >= 500) {
        if (attempt === maxRetries - 1) throw err;
        const wait = baseDelay * Math.pow(2, attempt);
        console.log(`Server error ${err.status}. Waiting ${wait}ms`);
        await new Promise((r) => setTimeout(r, wait));
        continue;
      }
      throw err; // 4xx — do not retry
    }
  }
  throw new Error("Max retries exceeded");
}
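Both loops above use deterministic exponential backoff, which can synchronize many workers into retrying at the same instant. A sketch of "full jitter" backoff (our own helper, not part of the SDK), which draws the delay uniformly from the exponential window:

```python
import random

def backoff_with_jitter(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0) -> float:
    """Full jitter: uniform delay in [0, min(max_delay, base_delay * 2**attempt)]."""
    window = min(max_delay, base_delay * (2 ** attempt))
    return random.uniform(0.0, window)
```

Substitute this for `base_delay * (2 ** attempt)` in the loops above when many workers share one rate limit.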

Context window exceeded (400)

When your input exceeds the model's context window, you get a 400 error:

Error: 400 {"type":"error","error":{"type":"invalid_request_error",
"message":"prompt is too long: 205432 tokens > 200000 maximum"}}

Resolution strategies:

  1. Truncate early messages: for conversations, remove the oldest turns first
  2. Summarize then truncate: use Haiku to summarize the oldest portion, replace with summary
  3. Retrieval instead of full context: use pgvector to retrieve relevant chunks instead of full document
  4. Upgrade to 1M context window: for Sonnet 4.6 or Opus 4.7, request 1M context access

Python — truncate to fit:

def truncate_to_fit(
    messages: list[dict],
    system_prompt: str,
    model: str,
    token_budget: int = 180_000,  # Leave headroom below the 200K window
) -> list[dict]:
    """Drop the oldest exchanges until the conversation fits the context window."""
    while True:
        # Count tokens server-side
        response = client.messages.count_tokens(
            model=model,
            system=system_prompt,
            messages=messages,
        )
        if response.input_tokens <= token_budget:
            break
        if len(messages) <= 2:
            break  # Nothing left to trim without dropping the latest turn
        # Remove the oldest exchange (user + assistant pair) to keep alternation
        messages = messages[2:]
    return messages
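Strategy 2 (summarize then truncate) builds on the same slicing idea. The split helper below is plain list logic; the summarization call is a sketch that reuses the `client` from the earlier snippets and assumes a placeholder Haiku model id; substitute whichever Haiku version your account exposes:

```python
def split_for_summary(messages: list[dict], keep_recent: int = 6) -> tuple[list[dict], list[dict]]:
    """Split history into (old, recent), keeping roughly the last keep_recent messages.

    Nudges the boundary so the kept window starts on a user turn, preserving
    user/assistant alternation.
    """
    if len(messages) <= keep_recent:
        return [], messages
    cut = len(messages) - keep_recent
    if messages[cut].get("role") != "user":
        cut += 1  # shift so `recent` begins with a user message
    return messages[:cut], messages[cut:]

def summarize_then_truncate(messages: list[dict]) -> list[dict]:
    """Replace the oldest turns with a Haiku-generated summary (sketch)."""
    old, recent = split_for_summary(messages)
    if not old:
        return messages
    # Assumes string content; flatten tool-use blocks before joining
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = client.messages.create(
        model="claude-haiku-4-5",  # placeholder id; use your account's Haiku model
        max_tokens=1024,
        messages=[{"role": "user", "content": f"Summarize this conversation concisely:\n\n{transcript}"}],
    ).content[0].text
    return [
        {"role": "user", "content": f"[Summary of earlier conversation]\n{summary}"},
        {"role": "assistant", "content": "Understood. Continuing from that summary."},
    ] + recent
```

The split keeps alternation intact so the reconstructed history is still a valid conversation.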

Streaming errors

Streaming responses can fail mid-stream. Handle both initial connection errors and mid-stream errors:

def stream_with_recovery(prompt: str) -> str:
    collected = []
    try:
        with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}],
        ) as stream:
            for text in stream.text_stream:
                collected.append(text)
                print(text, end="", flush=True)
        return "".join(collected)
    except anthropic.APIConnectionError as e:
        # Network error mid-stream
        partial = "".join(collected)
        if partial:
            # Re-prompt asking Claude to continue from where it stopped
            print(f"\n[Reconnecting after {len(partial)} chars...]")
            continuation = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=2048,
                messages=[
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": partial.rstrip()},  # a final assistant turn must not end in whitespace
                    {"role": "user", "content": "Continue from exactly where you left off."},
                ],
            )
            return partial + continuation.content[0].text
        raise  # No partial content — re-raise

Tool use errors

When a tool raises an error, return the error in the tool result rather than raising in your code. This lets the model reason about the error and retry differently:

def safe_tool_call(tool_name: str, tool_input: dict, tool_use_id: str) -> dict:
    """Always return a tool_result, even on error."""
    try:
        result = dispatch_tool(tool_name, tool_input)
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": result,
        }
    except Exception as e:
        # Return error as content — model can retry with different params
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": f"Error: {type(e).__name__}: {e}",
            "is_error": True,
        }

Why this matters: if you raise an exception instead of returning an error tool result, the conversation is broken — the tool_use block exists in the assistant message without a matching tool_result, which is a malformed conversation.
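Concretely, every tool_use block in an assistant message must be answered by a tool_result block with a matching tool_use_id in the next user message. A small helper to build that message (the id in the usage line is illustrative):

```python
def build_tool_result_message(tool_use_id: str, content: str, is_error: bool = False) -> dict:
    """Wrap a tool outcome in the user-role message the API expects next."""
    block = {"type": "tool_result", "tool_use_id": tool_use_id, "content": content}
    if is_error:
        block["is_error"] = True
    return {"role": "user", "content": [block]}

# After an assistant turn containing {"type": "tool_use", "id": "toolu_123", ...}:
reply = build_tool_result_message("toolu_123", "Error: TimeoutError: upstream timed out", is_error=True)
```

Appending `reply` to the conversation keeps it well-formed even when the tool itself failed.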


The circuit breaker pattern

For high-volume production systems, wrap your Claude calls with a circuit breaker. After N consecutive failures, stop hitting the API for a cooldown period:

import anthropic
import time
from dataclasses import dataclass
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Failing — reject calls
    HALF_OPEN = "half_open"  # Testing recovery

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    recovery_timeout: float = 60.0  # seconds
    state: CircuitState = CircuitState.CLOSED
    failure_count: int = 0
    last_failure_time: float = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise RuntimeError("Circuit OPEN — Claude API calls suspended")

        try:
            result = fn(*args, **kwargs)
            self._on_success()
            return result
        except anthropic.APIStatusError as e:
            # RateLimitError subclasses APIStatusError, so 429s land here too;
            # only 5xx responses count toward tripping the breaker
            if e.status_code >= 500:
                self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
            print(f"Circuit OPEN after {self.failure_count} failures")
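To watch the state machine without touching the API, here is a self-contained toy version of the same CLOSED/OPEN/HALF_OPEN flow (simplified for demonstration: generic exceptions and millisecond timeouts, not the production class above):

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""

class ToyBreaker:
    """Minimal breaker with the same CLOSED -> OPEN -> HALF_OPEN flow."""
    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 0.05):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker tripped

    def call(self, fn):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit
        return result

breaker = ToyBreaker()

def always_fails():
    raise RuntimeError("simulated 529")

# Three consecutive failures trip the breaker...
for _ in range(3):
    try:
        breaker.call(always_fails)
    except RuntimeError:
        pass

# ...so the next call is rejected without touching the API:
rejected = False
try:
    breaker.call(always_fails)
except CircuitOpenError:
    rejected = True

time.sleep(0.06)                        # wait out the recovery timeout
recovered = breaker.call(lambda: "ok")  # half-open trial succeeds; circuit closes
```

The same transitions apply to the dataclass version above, just with real timeouts and the Anthropic exception types.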

Logging and observability

Log every API call with enough context to debug failures later:

import logging
import time

logger = logging.getLogger("claude_api")

def logged_call(messages: list, model: str = "claude-sonnet-4-6") -> anthropic.types.Message:
    start = time.time()
    try:
        response = client.messages.create(
            model=model,
            max_tokens=2048,
            messages=messages,
        )
        duration_ms = (time.time() - start) * 1000
        logger.info(
            "claude_api.success",
            extra={
                "model": model,
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
                "duration_ms": round(duration_ms),
                "stop_reason": response.stop_reason,
            },
        )
        return response
    except anthropic.APIStatusError as e:
        duration_ms = (time.time() - start) * 1000
        logger.error(
            "claude_api.error",
            extra={
                "model": model,
                "status_code": e.status_code,
                "error_type": type(e).__name__,
                "duration_ms": round(duration_ms),
            },
        )
        raise

FAQ

Should I retry on 400 errors? No. A 400 means your request is malformed. Retrying will get the same 400. Fix the request before retrying.

What is the default retry behavior in the SDK? The Anthropic Python and TypeScript SDKs retry 429 and 5xx errors automatically with exponential backoff — 2 retries by default. Configure via max_retries=N in the client constructor.

How do I disable automatic retries?

client = anthropic.Anthropic(max_retries=0)

What happens to in-flight streaming requests when I'm rate limited? A 429 during a stream interrupts the stream. Handle anthropic.RateLimitError in your streaming code and implement the partial-continuation pattern shown above.

How do I test error handling in development? Use respx (Python; it mocks the httpx client the SDK uses) or nock (Node.js) to mock specific HTTP responses from the Anthropic endpoint.
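An SDK-free alternative is to keep the retry loop injectable and drive it with a stub that fails a set number of times. A sketch (the names below are ours for illustration):

```python
class TransientError(Exception):
    """Stand-in for RateLimitError / 5xx errors in tests."""

def retry_call(fn, max_retries: int = 5, sleep=lambda s: None):
    """Generic retry loop; `sleep` is injectable so tests run instantly."""
    for attempt in range(max_retries):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries - 1:
                raise
            sleep(2 ** attempt)  # exponential backoff (no-op in tests)

def flaky(fail_times: int):
    """Return a stub that raises TransientError fail_times times, then succeeds."""
    state = {"calls": 0}
    def fn():
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise TransientError()
        return "ok"
    return fn

# Succeeds on the third attempt after two transient failures:
result = retry_call(flaky(2), max_retries=5)
```

Because the stub stands in for `client.messages.create`, the same test shape covers both the "eventually succeeds" and "exhausts retries" paths.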

Sources

  1. Anthropic API error codes — April 2026
  2. Anthropic Python SDK — error handling — April 2026
  3. Anthropic rate limits — April 2026

AI Disclosure: Drafted with Claude Code; verified against Anthropic API documentation April 2026.