# Claude API Error Handling: Rate Limits, Retries, and Production Patterns
The Anthropic API returns structured errors with specific HTTP status codes. Knowing which errors to retry, which to log and surface to users, and which indicate bugs in your code is the difference between a production-ready integration and one that silently fails.
## Error code reference
| HTTP status | Error type | Meaning | Action |
|---|---|---|---|
| 400 | `invalid_request_error` | Malformed request (bad JSON, unsupported parameters, exceeded context window) | Fix the request; do not retry |
| 401 | `authentication_error` | Invalid API key | Check key validity; do not retry |
| 403 | `permission_error` | Valid key but insufficient permissions (e.g. model not enabled) | Check account permissions; do not retry |
| 404 | `not_found_error` | Endpoint or model doesn't exist | Fix model name or endpoint; do not retry |
| 413 | `request_too_large` | Request body exceeds size limit | Reduce context or split the request |
| 422 | `unprocessable_entity` | Request valid but semantically wrong (e.g. invalid tool schema) | Fix the schema; do not retry |
| 429 | `rate_limit_error` | Too many requests or tokens per minute | Retry with exponential backoff |
| 500 | `api_error` | Internal server error | Retry with backoff, max 3 attempts |
| 529 | `overloaded_error` | API overloaded | Retry with longer backoff |
The critical distinction: 4xx errors (except 429) indicate a problem with your request and should not be retried. 429 and 5xx errors are transient and should be retried.
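That dispatch rule is mechanical enough to codify. A minimal sketch (the `RetryAction` enum and `classify_status` helper are illustrative names, not part of the SDK):

```python
from enum import Enum

class RetryAction(Enum):
    NO_RETRY = "no_retry"              # Problem with the request; fix it
    RETRY_BACKOFF = "retry"            # Transient; exponential backoff
    RETRY_LONG_BACKOFF = "retry_long"  # Overloaded; back off longer

def classify_status(status: int) -> RetryAction:
    """Map an HTTP status from the API to a retry decision."""
    if status == 429:
        return RetryAction.RETRY_BACKOFF
    if status == 529:
        return RetryAction.RETRY_LONG_BACKOFF
    if status >= 500:
        return RetryAction.RETRY_BACKOFF
    return RetryAction.NO_RETRY  # All other 4xx: fix the request
```

Centralizing the decision in one function keeps retry policy consistent across every call site.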
## Rate limit errors (429)
The most common production error. Rate limits are enforced on:
- Requests per minute (RPM): number of API calls
- Input tokens per minute (ITPM): total input tokens
- Output tokens per minute (OTPM): total output tokens
When present, the Retry-After header in the 429 response tells you how many seconds to wait before retrying.
Python:

```python
import time

import anthropic

client = anthropic.Anthropic()

def call_with_retry(
    messages: list,
    model: str = "claude-sonnet-4-6",
    max_retries: int = 5,
    base_delay: float = 1.0,
) -> anthropic.types.Message:
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model=model,
                max_tokens=2048,
                messages=messages,
            )
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Respect the Retry-After header if present
            retry_after = 0.0
            response = getattr(e, "response", None)
            if response is not None:
                retry_after = float(response.headers.get("Retry-After", 0))
            wait = max(retry_after, base_delay * (2 ** attempt))
            print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)
        except anthropic.APIStatusError as e:
            if e.status_code >= 500 and attempt < max_retries - 1:
                # 5xx: transient server error, retry
                wait = base_delay * (2 ** attempt)
                print(f"Server error {e.status_code}. Waiting {wait:.1f}s")
                time.sleep(wait)
            else:
                raise  # 4xx or final attempt: re-raise
    raise RuntimeError("unreachable: the loop always returns or raises")
```
TypeScript:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function callWithRetry(
  messages: Anthropic.Messages.MessageParam[],
  model = "claude-sonnet-4-6",
  maxRetries = 5,
  baseDelay = 1000
): Promise<Anthropic.Messages.Message> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.messages.create({
        model,
        max_tokens: 2048,
        messages,
      });
    } catch (err) {
      if (err instanceof Anthropic.RateLimitError) {
        if (attempt === maxRetries - 1) throw err;
        const retryAfter = parseInt(err.headers?.["retry-after"] ?? "0", 10) * 1000;
        const wait = Math.max(retryAfter, baseDelay * Math.pow(2, attempt));
        console.log(`Rate limited. Waiting ${wait}ms (attempt ${attempt + 1}/${maxRetries})`);
        await new Promise((r) => setTimeout(r, wait));
        continue;
      }
      if (err instanceof Anthropic.APIError && (err.status ?? 0) >= 500) {
        if (attempt === maxRetries - 1) throw err;
        const wait = baseDelay * Math.pow(2, attempt);
        console.log(`Server error ${err.status}. Waiting ${wait}ms`);
        await new Promise((r) => setTimeout(r, wait));
        continue;
      }
      throw err; // 4xx: do not retry
    }
  }
  throw new Error("Max retries exceeded");
}
```
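Both loops use a deterministic exponential schedule, so a fleet of workers rate-limited at the same instant will all retry at the same instant and collide again. "Full jitter" randomizes each wait. A minimal sketch (`backoff_with_jitter` is a hypothetical helper; it would replace the `base_delay * 2 ** attempt` term above, still taking the max with Retry-After):

```python
import random

def backoff_with_jitter(attempt: int, base_delay: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter backoff: uniform sample from [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base_delay * (2 ** attempt)))
```

The cap prevents later attempts from sleeping for unboundedly long windows.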
## Context window exceeded (400)
When your input exceeds the model's context window, you get a 400 error:
```
Error: 400 {"type":"error","error":{"type":"invalid_request_error",
"message":"prompt is too long: 205432 tokens > 200000 maximum"}}
```
Resolution strategies:
- Truncate early messages: for conversations, remove the oldest turns first
- Summarize then truncate: use Haiku to summarize the oldest portion, replace with summary
- Retrieval instead of full context: use pgvector to retrieve relevant chunks instead of full document
- Upgrade to 1M context window: for Sonnet 4.6 or Opus 4.7, request 1M context access
Python — truncate to fit:

```python
def truncate_to_fit(
    messages: list[dict],
    system_prompt: str,
    model: str,
    max_tokens: int = 180_000,  # Leave headroom below the 200K window
) -> list[dict]:
    """Remove the oldest messages until the conversation fits in the context window."""
    while len(messages) > 1:
        # Count tokens server-side before deciding whether to trim
        response = client.messages.count_tokens(
            model=model,
            system=system_prompt,
            messages=messages,
        )
        if response.input_tokens <= max_tokens:
            break
        # Drop the oldest exchange (user + assistant pair)
        messages = messages[2:]
    return messages
```
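Each loop iteration above costs an API round trip. When that matters, a character-based heuristic can pre-trim before the authoritative `count_tokens` check. This is a sketch under an assumption: roughly 4 characters per token for English text, so keep generous headroom.

```python
def truncate_by_chars(messages: list[dict], max_chars: int = 600_000) -> list[dict]:
    """Drop the oldest user/assistant pairs until total content size fits.

    Heuristic only: ~4 chars/token for English, so 600K chars is roughly
    150K tokens. Follow up with a real count_tokens call before sending.
    """
    def total_chars(msgs: list[dict]) -> int:
        return sum(len(str(m.get("content", ""))) for m in msgs)

    while len(messages) > 2 and total_chars(messages) > max_chars:
        messages = messages[2:]  # oldest user + assistant pair
    return messages
```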
## Streaming errors
Streaming responses can fail mid-stream. Handle both initial connection errors and mid-stream errors:
```python
def stream_with_recovery(prompt: str) -> str:
    collected = []
    try:
        with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}],
        ) as stream:
            for text in stream.text_stream:
                collected.append(text)
                print(text, end="", flush=True)
        return "".join(collected)
    except anthropic.APIConnectionError:
        # Network error mid-stream
        partial = "".join(collected)
        if partial:
            # Re-prompt, asking Claude to continue from where it stopped
            print(f"\n[Reconnecting after {len(partial)} chars...]")
            continuation = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=2048,
                messages=[
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": partial},
                    {"role": "user", "content": "Continue from exactly where you left off."},
                ],
            )
            return partial + continuation.content[0].text
        raise  # No partial content; re-raise
```
## Tool use errors
When a tool raises an error, return the error in the tool result rather than raising in your code. This lets the model reason about the error and retry differently:
```python
def safe_tool_call(tool_name: str, tool_input: dict, tool_use_id: str) -> dict:
    """Always return a tool_result block, even on error."""
    try:
        result = dispatch_tool(tool_name, tool_input)  # your tool dispatcher
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": result,
        }
    except Exception as e:
        # Return the error as content; the model can retry with different params
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": f"Error: {type(e).__name__}: {e}",
            "is_error": True,
        }
```
Why this matters: if you raise an exception instead of returning an error tool result, the conversation is broken — the tool_use block exists in the assistant message without a matching tool_result, which is a malformed conversation.
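Concretely, the invariant is that every `tool_use` block in an assistant message is answered by a `tool_result` with the same id in the very next user message. The shape, with a made-up id and a hypothetical `get_weather` tool:

```python
# Assistant turn containing a tool call
assistant_msg = {
    "role": "assistant",
    "content": [
        {"type": "tool_use", "id": "toolu_01ABC", "name": "get_weather",
         "input": {"city": "Paris"}},
    ],
}

# The next user turn MUST answer that id, even if the tool failed
user_msg = {
    "role": "user",
    "content": [
        {"type": "tool_result", "tool_use_id": "toolu_01ABC",
         "content": "Error: TimeoutError: upstream timed out", "is_error": True},
    ],
}

# Invariant check: every tool_use id has a matching tool_result
tool_use_ids = {b["id"] for b in assistant_msg["content"] if b["type"] == "tool_use"}
result_ids = {b["tool_use_id"] for b in user_msg["content"] if b["type"] == "tool_result"}
assert tool_use_ids == result_ids
```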
## The circuit breaker pattern
For high-volume production systems, wrap your Claude calls with a circuit breaker. After N consecutive failures, stop hitting the API for a cooldown period:
```python
import time
from dataclasses import dataclass
from enum import Enum

import anthropic

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing; reject calls
    HALF_OPEN = "half_open"  # Testing recovery

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    recovery_timeout: float = 60.0  # seconds
    state: CircuitState = CircuitState.CLOSED
    failure_count: int = 0
    last_failure_time: float = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise RuntimeError("Circuit OPEN; Claude API calls suspended")
        try:
            result = fn(*args, **kwargs)
            self._on_success()
            return result
        except (anthropic.RateLimitError, anthropic.APIStatusError) as e:
            # Only 5xx server errors count toward tripping the breaker;
            # a 429 is handled by per-call backoff instead
            if getattr(e, "status_code", 0) >= 500:
                self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
            print(f"Circuit OPEN after {self.failure_count} failures")
```
## Logging and observability
Log every API call with enough context to debug failures later:
```python
import logging
import time

logger = logging.getLogger("claude_api")

def logged_call(messages: list, model: str = "claude-sonnet-4-6") -> anthropic.types.Message:
    start = time.time()
    try:
        response = client.messages.create(
            model=model,
            max_tokens=2048,
            messages=messages,
        )
        duration_ms = (time.time() - start) * 1000
        logger.info(
            "claude_api.success",
            extra={
                "model": model,
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
                "duration_ms": round(duration_ms),
                "stop_reason": response.stop_reason,
            },
        )
        return response
    except anthropic.APIStatusError as e:
        duration_ms = (time.time() - start) * 1000
        logger.error(
            "claude_api.error",
            extra={
                "model": model,
                "status_code": e.status_code,
                "error_type": type(e).__name__,
                "duration_ms": round(duration_ms),
            },
        )
        raise
```
## FAQ
**Should I retry on 400 errors?** No. A 400 means your request is malformed; retrying will return the same 400. Fix the request first.
**What is the default retry behavior in the SDK?** The Anthropic Python and TypeScript SDKs retry 429 and 5xx errors automatically with exponential backoff (2 retries by default). Configure via `max_retries=N` in the client constructor.
**How do I disable automatic retries?**

```python
client = anthropic.Anthropic(max_retries=0)
```
**What happens to in-flight streaming requests when I'm rate limited?** A 429 during a stream interrupts it. Handle `anthropic.RateLimitError` in your streaming code and implement the partial-continuation pattern shown above.
**How do I test error handling in development?** Use httpretty (Python) or nock (Node.js) to mock specific HTTP responses from the Anthropic endpoint.
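If you'd rather avoid a mocking library, another approach is a throwaway local HTTP server that always answers 429, with the SDK pointed at it via its `base_url` option. The sketch below uses plain `urllib` in place of the SDK client to keep it self-contained; the handler shape and `Retry-After` value are illustrative.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Always429(BaseHTTPRequestHandler):
    """Answer every POST with a simulated rate-limit error."""
    def do_POST(self):
        body = json.dumps({
            "type": "error",
            "error": {"type": "rate_limit_error", "message": "simulated"},
        }).encode()
        self.send_response(429)
        self.send_header("Retry-After", "2")
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Always429)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/v1/messages"

status, retry_after = None, None
try:
    urllib.request.urlopen(urllib.request.Request(url, data=b"{}", method="POST"))
except urllib.error.HTTPError as e:
    status, retry_after = e.code, e.headers["Retry-After"]
finally:
    server.shutdown()

print(status, retry_after)  # 429 2
```

Binding to port 0 lets the OS pick a free port, so tests can run in parallel.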
## Sources
- Anthropic API error codes — April 2026
- Anthropic Python SDK — error handling — April 2026
- Anthropic rate limits — April 2026