Deploying Claude Agents to Production: Fly.io, Vercel, and Lambda
The right deployment target for a Claude agent depends on one question: does your agent need to run longer than a serverless request timeout (typically 30–60 seconds)? If yes, use Fly.io or a VPS. If no, Vercel or AWS Lambda works and costs less. Most agentic workflows — multi-step tool use, web research, code execution — exceed serverless time limits. Understanding the deployment envelope before you build saves a painful migration later.
The core constraint: agent execution time
Claude agents that use tools commonly run for 60–300 seconds:
- A research agent fetching 5 web pages + summarising: ~90 seconds
- A code debugging agent running tests iteratively: 2–5 minutes
- A document processing agent chunking a 100-page PDF: 3–8 minutes
Serverless platforms have hard limits:
- Vercel Functions: 60 seconds (Pro), 300 seconds (Enterprise)
- AWS Lambda: 15 minutes maximum
- Cloudflare Workers: 30 seconds CPU time
For agents that routinely exceed these windows, you need a persistent process host.
Option 1: Fly.io (best for agents that need persistent processes)
Fly.io runs Docker containers on global edge hardware. VMs stay alive between requests, which means:
- No cold start penalty for the Claude SDK client
- Long-running agentic loops without time limits
- WebSocket connections for streaming responses
Cost: with auto-stop enabled you pay only while the VM runs; a single shared-cpu-1x VM with 256MB RAM costs at most a few dollars a month. Most Python agent workers fit in 512MB–1GB.
Minimal Dockerfile for a Python agent
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]
FastAPI wrapper for an agent endpoint
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import anthropic

app = FastAPI()
client = anthropic.Anthropic()

class AgentRequest(BaseModel):
    task: str
    session_id: str | None = None

@app.post("/run")
async def run_agent(request: AgentRequest):
    """
    Run a Claude agent task and return the final output.
    Long-running is fine — Fly.io has no hard time limit.
    """
    messages = [{"role": "user", "content": request.task}]

    # Agent loop
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=TOOLS,  # your tool definitions
            messages=messages,
        )
        if response.stop_reason == "end_turn":
            # Extract final text response
            final_text = next(
                (block.text for block in response.content if hasattr(block, "text")),
                None,
            )
            return {"result": final_text, "session_id": request.session_id}
        if response.stop_reason == "tool_use":
            # Process tool calls and continue loop
            tool_results = await process_tool_calls(response.content)
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
            continue
        break

    raise HTTPException(status_code=500, detail="Agent loop exited unexpectedly")
Deploy to Fly.io
# Install flyctl
curl -L https://fly.io/install.sh | sh
# Initialise (creates fly.toml)
fly launch --name my-claude-agent --region nrt # nrt = Tokyo
# Set your Anthropic API key as a secret
fly secrets set ANTHROPIC_API_KEY=sk-ant-...
# Deploy
fly deploy
fly.toml configuration for agents
app = "my-claude-agent"
primary_region = "nrt"
[build]
[http_service]
internal_port = 8080
force_https = true
auto_stop_machines = "stop" # Stop when idle to save cost
auto_start_machines = true # Restart on incoming request
min_machines_running = 0 # Can go to 0 when idle
[[vm]]
size = "shared-cpu-1x"
memory = "512mb"
Cold start: with auto_stop_machines = "stop", a stopped VM takes ~2–4 seconds to restart. For latency-sensitive agents, set min_machines_running = 1.
Option 2: Vercel (for agents under 60 seconds)
Use Vercel for agents that serve web requests or run quick tasks. The SDK setup is identical, but you must stay within the function timeout.
When Vercel works for agents:
- Agents that classify or route text: under 5 seconds
- Agents that answer questions from a vector store: 10–30 seconds
- Agents that do limited tool use (1–3 tools): 20–45 seconds
Next.js API route for a Claude agent
// app/api/agent/route.ts
import Anthropic from "@anthropic-ai/sdk";
import { NextRequest } from "next/server";

const client = new Anthropic();

export const maxDuration = 60; // seconds (Vercel Pro limit)

export async function POST(request: NextRequest) {
  const { task } = await request.json();

  // Single-turn agent (no long loops — stay within time limit)
  const response = await client.messages.create({
    model: "claude-haiku-4-5", // Use Haiku for speed
    max_tokens: 1024,
    messages: [{ role: "user", content: task }],
  });

  return Response.json({
    result: response.content[0].type === "text" ? response.content[0].text : null,
  });
}
Streaming responses (better UX for long tasks)
For tasks that approach the time limit, stream the response so users see output as it generates:
export async function POST(request: NextRequest) {
  const { task } = await request.json();

  const stream = await client.messages.stream({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    messages: [{ role: "user", content: task }],
  });

  // Return a ReadableStream — Vercel streams this to the client
  return new Response(
    new ReadableStream({
      async start(controller) {
        for await (const event of stream) {
          if (
            event.type === "content_block_delta" &&
            event.delta.type === "text_delta"
          ) {
            controller.enqueue(new TextEncoder().encode(event.delta.text));
          }
        }
        controller.close();
      },
    }),
    { headers: { "Content-Type": "text/plain; charset=utf-8" } }
  );
}
Option 3: AWS Lambda (for event-driven agents)
AWS Lambda is ideal for agents triggered by events: S3 file uploads, SQS messages, DynamoDB stream events, or scheduled triggers.
Maximum execution time: 15 minutes — sufficient for most agentic tasks.
Lambda handler for a document-processing agent
import json
import boto3
import anthropic

client = anthropic.Anthropic()
s3 = boto3.client("s3")

def handler(event, context):
    """
    Triggered by S3 upload. Downloads file, processes with Claude,
    saves result to output bucket.
    """
    # Extract S3 event info
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]

    # Download file content
    obj = s3.get_object(Bucket=bucket, Key=key)
    content = obj["Body"].read().decode("utf-8")

    # Run agent
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"Analyse this document and extract key insights:\n\n{content}",
            }
        ],
    )
    result = response.content[0].text

    # Save result
    output_key = f"processed/{key}"
    s3.put_object(
        Bucket="output-bucket",
        Key=output_key,
        Body=result.encode("utf-8"),
    )
    return {"statusCode": 200, "key": output_key}
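The handler above sends the whole document in a single prompt, which breaks down for the 100-page-PDF case from the intro. A chunking sketch you could run before the messages.create call; sizes are illustrative and should be tuned to your model's context budget:

```python
def chunk_text(text, max_chars=12000, overlap=500):
    """Split a long document into overlapping character chunks.

    The overlap keeps sentences that straddle a boundary visible
    in both neighbouring chunks, so summaries don't lose them.
    """
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back to create the overlap
    return chunks
```

Each chunk then gets its own Claude call, and a final call can merge the per-chunk summaries. Character-based splitting is the crudest option; splitting on page or paragraph boundaries usually summarises better.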
Lambda configuration for agents
# serverless.yml
service: claude-agent-processor

provider:
  name: aws
  runtime: python3.12
  region: ap-northeast-2 # Seoul
  timeout: 900 # 15 minutes
  memorySize: 1024 # 1GB — agents need memory for context
  environment:
    ANTHROPIC_API_KEY: ${ssm:/claude-agent/api-key}

functions:
  processDocument:
    handler: handler.handler
    events:
      - s3:
          bucket: input-bucket
          event: s3:ObjectCreated:*
Choosing the right target
| Factor | Fly.io | Vercel | Lambda |
|---|---|---|---|
| Max execution time | Unlimited | 60s–5min | 15 minutes |
| Cold start | 2–4s (stopped VM) | 200ms–1s | 500ms–3s |
| Cost (idle) | $0 (VM stopped) | $0 | $0 |
| Best for | Long-running agents | Web-integrated agents | Event-driven agents |
| State persistence | In-process (VM stays up) | None (stateless) | None (stateless) |
| WebSocket support | Yes | Limited (Pro) | No |
| Concurrency | Depends on VM count | Automatic | Up to 1000 |
Production hardening checklist
Before deploying any agent to production:
- Rate limiting: Wrap the Anthropic client with retry logic and backoff
- Cost guards: Set a maximum loop iteration count (e.g., max_iterations = 20)
- Error handling: Catch anthropic.APIError, anthropic.RateLimitError, and anthropic.APITimeoutError
- Logging: Log every messages.create() call with token counts for cost monitoring
- Secrets: Store ANTHROPIC_API_KEY in a secrets manager, never in plaintext in CI environment variables
- Timeout enforcement: Set timeout on the Anthropic client to prevent hanging requests
# Production Anthropic client configuration
client = anthropic.Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"],
    timeout=60.0,   # 60s per individual API call
    max_retries=3,  # Automatic retry with exponential backoff
)
Frequently asked questions
Can I run a Claude agent on a standard VPS (DigitalOcean, Hetzner)?
Yes. Any always-on Linux server works. Use systemd or supervisor to keep the process alive. This is often cheaper than Fly.io at scale but requires more ops work.
How do I handle Anthropic API rate limits in production?
The Anthropic SDK retries automatically (3x by default with exponential backoff). For high-throughput agents, use the rate limit headers (x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens) to proactively throttle before hitting limits.
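A sketch of proactive throttling built on those headers. The threshold values are illustrative, not Anthropic-recommended numbers; in the Python SDK, response headers are reachable via client.messages.with_raw_response.create(...), shown only as a comment here since it needs a live client:

```python
def should_throttle(headers, min_requests=5, min_tokens=2000):
    """Return True when remaining quota is low enough to warrant a pause.

    Missing headers default to a large value so absence never throttles.
    """
    remaining_requests = int(headers.get("x-ratelimit-remaining-requests", 10**9))
    remaining_tokens = int(headers.get("x-ratelimit-remaining-tokens", 10**9))
    return remaining_requests < min_requests or remaining_tokens < min_tokens

# Getting the headers (sketch; requires a configured client):
# raw = client.messages.with_raw_response.create(
#     model="claude-sonnet-4-6", max_tokens=1024, messages=[...],
# )
# response = raw.parse()
# if should_throttle(raw.headers):
#     time.sleep(1.0)  # back off before the next call
```

Checking tokens as well as requests matters: agent loops with large contexts usually exhaust the token budget long before the request budget.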
What's the cheapest way to run a low-traffic agent?
Fly.io with auto_stop_machines = "stop" — the VM stops when idle and you pay only for execution time. For very low traffic (<10 requests/day), AWS Lambda is similarly cheap but has a 15-minute cap.
Should I use Vercel Edge Functions or Node.js Functions for agents?
Node.js Functions (not Edge). Edge Functions have strict bundle-size limits and a restricted runtime that can conflict with the Anthropic SDK; Node.js Functions get the full runtime.
How do I persist agent state across deployments?
Never rely on a VM's memory holding state across requests or deploys. Use an external database (Neon, PlanetScale) or key-value store (Upstash Redis). VMs restart; databases persist.
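A minimal persistence sketch along those lines. The store argument is anything with get/set (a real Redis client, or a dict wrapper in tests); save_session and load_session are names invented for this example:

```python
import json
import time

def save_session(store, session_id, messages, ttl_seconds=3600):
    """Serialise a conversation so any VM or function instance can resume it."""
    payload = json.dumps({"messages": messages, "saved_at": time.time()})
    # With a real Redis client you would also pass ex=ttl_seconds here
    # so abandoned sessions expire on their own.
    store.set(f"session:{session_id}", payload)

def load_session(store, session_id):
    """Return the stored message list, or an empty history if none exists."""
    raw = store.get(f"session:{session_id}")
    return json.loads(raw)["messages"] if raw else []
```

Because the payload is plain JSON, the same session survives a redeploy, a VM restart, or a move between regions.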
Related guides
- Claude Agent SDK: Build Your First Agent in 30 Minutes — agent fundamentals before deployment
- Memory and State in Claude Agents: Patterns That Scale — managing state across requests and sessions
Take It Further
Claude Agent SDK Cookbook: 40 Production Patterns — Pattern 22 covers the full Production Deployment Architecture: Fly.io vs Vercel vs Lambda decision tree, Docker configurations, rate limit handling, cost guard implementation, and the production hardening checklist.
→ Get the Agent SDK Cookbook — $49
30-day money-back guarantee. Instant download.