Deploying Claude Agents to Production: Fly.io, Vercel, and Lambda
The right deployment target for a Claude agent depends on one question: does your agent need to run longer than a serverless request timeout (typically 30–60 seconds)? If yes, use Fly.io or a VPS. If no, Vercel or AWS Lambda works and costs less. Most agentic workflows — multi-step tool use, web research, code execution — exceed serverless time limits. Understanding the deployment envelope before you build saves a painful migration later.
The core constraint: agent execution time
Claude agents that use tools commonly run for 60–300 seconds:
- A research agent fetching 5 web pages + summarising: ~90 seconds
- A code debugging agent running tests iteratively: 2–5 minutes
- A document processing agent chunking a 100-page PDF: 3–8 minutes
Serverless platforms have hard limits:
- Vercel Functions: 60 seconds (Pro), 300 seconds (Enterprise)
- AWS Lambda: 15 minutes maximum
- Cloudflare Workers: 30 seconds CPU time
For agents that routinely exceed these windows, you need a persistent process host.
Option 1: Fly.io (best for agents that need persistent processes)
Fly.io runs Docker containers on global edge hardware. VMs stay alive between requests, which means:
- No cold start penalty for the Claude SDK client
- Long-running agentic loops without time limits
- WebSocket connections for streaming responses
Cost: with auto-stop enabled you pay only while the VM runs; a single shared-cpu-1x VM with 256MB RAM costs at most a few dollars a month. Most Python agent workers fit in 512MB–1GB.
Minimal Dockerfile for a Python agent
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]
FastAPI wrapper for an agent endpoint
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import anthropic

app = FastAPI()
client = anthropic.Anthropic()

class AgentRequest(BaseModel):
    task: str
    session_id: str | None = None

@app.post("/run")
async def run_agent(request: AgentRequest):
    """
    Run a Claude agent task and return the final output.
    Long-running is fine — Fly.io has no hard time limit.
    """
    messages = [{"role": "user", "content": request.task}]

    # Agent loop
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=TOOLS,  # your tool definitions
            messages=messages,
        )
        if response.stop_reason == "end_turn":
            # Extract final text response
            final_text = next(
                (block.text for block in response.content if hasattr(block, "text")),
                None,
            )
            return {"result": final_text, "session_id": request.session_id}
        if response.stop_reason == "tool_use":
            # Process tool calls and continue loop
            tool_results = await process_tool_calls(response.content)
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
            continue
        break

    raise HTTPException(status_code=500, detail="Agent loop exited unexpectedly")
Deploy to Fly.io
# Install flyctl
curl -L https://fly.io/install.sh | sh
# Initialise (creates fly.toml)
fly launch --name my-claude-agent --region nrt # nrt = Tokyo
# Set your Anthropic API key as a secret
fly secrets set ANTHROPIC_API_KEY=sk-ant-...
# Deploy
fly deploy
fly.toml configuration for agents
app = "my-claude-agent"
primary_region = "nrt"
[build]
[http_service]
internal_port = 8080
force_https = true
auto_stop_machines = "stop" # Stop when idle to save cost
auto_start_machines = true # Restart on incoming request
min_machines_running = 0 # Can go to 0 when idle
[[vm]]
size = "shared-cpu-1x"
memory = "512mb"
Cold start: with auto_stop_machines = "stop", a stopped VM takes ~2–4 seconds to restart. For latency-sensitive agents, set min_machines_running = 1.
Option 2: Vercel (for agents under 60 seconds)
Use Vercel for agents that serve web requests or run quick tasks. The SDK setup is identical, but you must stay within the function timeout.
When Vercel works for agents:
- Agents that classify or route text: under 5 seconds
- Agents that answer questions from a vector store: 10–30 seconds
- Agents that do limited tool use (1–3 tools): 20–45 seconds
Next.js API route for a Claude agent
// app/api/agent/route.ts
import Anthropic from "@anthropic-ai/sdk";
import { NextRequest } from "next/server";

const client = new Anthropic();

export const maxDuration = 60; // seconds (Vercel Pro limit)

export async function POST(request: NextRequest) {
  const { task } = await request.json();

  // Single-turn agent (no long loops — stay within time limit)
  const response = await client.messages.create({
    model: "claude-haiku-4-5", // Use Haiku for speed
    max_tokens: 1024,
    messages: [{ role: "user", content: task }],
  });

  return Response.json({
    result: response.content[0].type === "text" ? response.content[0].text : null,
  });
}
Streaming responses (better UX for long tasks)
For tasks that approach the time limit, stream the response so users see output as it generates:
export async function POST(request: NextRequest) {
  const { task } = await request.json();

  const stream = await client.messages.stream({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    messages: [{ role: "user", content: task }],
  });

  // Return a ReadableStream — Vercel streams this to the client
  return new Response(
    new ReadableStream({
      async start(controller) {
        for await (const event of stream) {
          if (
            event.type === "content_block_delta" &&
            event.delta.type === "text_delta"
          ) {
            controller.enqueue(new TextEncoder().encode(event.delta.text));
          }
        }
        controller.close();
      },
    }),
    { headers: { "Content-Type": "text/plain; charset=utf-8" } }
  );
}
Option 3: AWS Lambda (for event-driven agents)
AWS Lambda is ideal for agents triggered by events: S3 file uploads, SQS messages, DynamoDB stream events, or scheduled triggers.
Maximum execution time: 15 minutes — sufficient for most agentic tasks.
Lambda handler for a document-processing agent
import json
import boto3
import anthropic

client = anthropic.Anthropic()
s3 = boto3.client("s3")

def handler(event, context):
    """
    Triggered by S3 upload. Downloads file, processes with Claude,
    saves result to output bucket.
    """
    # Extract S3 event info
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]

    # Download file content
    obj = s3.get_object(Bucket=bucket, Key=key)
    content = obj["Body"].read().decode("utf-8")

    # Run agent
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"Analyse this document and extract key insights:\n\n{content}",
            }
        ],
    )
    result = response.content[0].text

    # Save result
    output_key = f"processed/{key}"
    s3.put_object(
        Bucket="output-bucket",
        Key=output_key,
        Body=result.encode("utf-8"),
    )
    return {"statusCode": 200, "key": output_key}
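The handler above sends the whole document in a single prompt, which breaks down for the 100-page-PDF case from the intro. A chunking sketch you could run before the messages.create call; sizes are illustrative and should be tuned to your model's context budget:

```python
def chunk_text(text, max_chars=12000, overlap=500):
    """Split a long document into overlapping character chunks.

    The overlap keeps sentences that straddle a boundary visible
    in both neighbouring chunks, so summaries don't lose them.
    """
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back to create the overlap
    return chunks
```

Each chunk then gets its own Claude call, and a final call can merge the per-chunk summaries. Character-based splitting is the crudest option; splitting on page or paragraph boundaries usually summarises better.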
Lambda configuration for agents
# serverless.yml
service: claude-agent-processor

provider:
  name: aws
  runtime: python3.12
  region: ap-northeast-2 # Seoul
  timeout: 900 # 15 minutes
  memorySize: 1024 # 1GB — agents need memory for context
  environment:
    ANTHROPIC_API_KEY: ${ssm:/claude-agent/api-key}

functions:
  processDocument:
    handler: handler.handler
    events:
      - s3:
          bucket: input-bucket
          event: s3:ObjectCreated:*
Choosing the right target
| Factor | Fly.io | Vercel | Lambda |
|---|---|---|---|
| Max execution time | Unlimited | 60s–5min | 15 minutes |
| Cold start | 2–4s (stopped VM) | 200ms–1s | 500ms–3s |
| Cost (idle) | $0 (VM stopped) | $0 | $0 |
| Best for | Long-running agents | Web-integrated agents | Event-driven agents |
| State persistence | In-process (VM stays up) | None (stateless) | None (stateless) |
| WebSocket support | Yes | Limited (Pro) | No |
| Concurrency | Depends on VM count | Automatic | Up to 1000 |
Production hardening checklist
Before deploying any agent to production:
- Rate limiting: Wrap the Anthropic client with retry logic and backoff
- Cost guards: Set a maximum loop iteration count (e.g., max_iterations = 20)
- Error handling: Catch anthropic.APIError, anthropic.RateLimitError, and anthropic.APITimeoutError
- Logging: Log every messages.create() call with token counts for cost monitoring
- Secrets: Store ANTHROPIC_API_KEY in a secrets manager, never in plaintext in CI environment variables
- Timeout enforcement: Set timeout on the Anthropic client to prevent hanging requests
# Production Anthropic client configuration
client = anthropic.Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"],
    timeout=60.0,   # 60s per individual API call
    max_retries=3,  # Automatic retry with exponential backoff
)
Frequently asked questions
Can I run a Claude agent on a standard VPS (DigitalOcean, Hetzner)?
Yes. Any always-on Linux server works. Use systemd or supervisor to keep the process alive. This is often cheaper than Fly.io at scale but requires more ops work.
How do I handle Anthropic API rate limits in production?
The Anthropic SDK retries automatically (3x by default with exponential backoff). For high-throughput agents, use the rate limit headers (x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens) to proactively throttle before hitting limits.
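A sketch of proactive throttling built on those headers. The threshold values are illustrative, not Anthropic-recommended numbers; in the Python SDK, response headers are reachable via client.messages.with_raw_response.create(...), shown only as a comment here since it needs a live client:

```python
def should_throttle(headers, min_requests=5, min_tokens=2000):
    """Return True when remaining quota is low enough to warrant a pause.

    Missing headers default to a large value so absence never throttles.
    """
    remaining_requests = int(headers.get("x-ratelimit-remaining-requests", 10**9))
    remaining_tokens = int(headers.get("x-ratelimit-remaining-tokens", 10**9))
    return remaining_requests < min_requests or remaining_tokens < min_tokens

# Getting the headers (sketch; requires a configured client):
# raw = client.messages.with_raw_response.create(
#     model="claude-sonnet-4-6", max_tokens=1024, messages=[...],
# )
# response = raw.parse()
# if should_throttle(raw.headers):
#     time.sleep(1.0)  # back off before the next call
```

Checking tokens as well as requests matters: agent loops with large contexts usually exhaust the token budget long before the request budget.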
What's the cheapest way to run a low-traffic agent?
Fly.io with auto_stop_machines = "stop" — the VM stops when idle and you pay only for execution time. For very low traffic (<10 requests/day), AWS Lambda is similarly cheap but has a 15-minute cap.
Should I use Vercel Edge Functions or Node.js Functions for agents?
Node.js Functions (not Edge). Edge Functions have strict bundle-size limits and a restricted runtime that can conflict with the Anthropic SDK; Node.js Functions get the full runtime.
How do I persist agent state across deployments?
Never rely on a VM's memory holding state across requests or deploys. Use an external database (Neon, PlanetScale) or key-value store (Upstash Redis). VMs restart; databases persist.
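A minimal persistence sketch along those lines. The store argument is anything with get/set (a real Redis client, or a dict wrapper in tests); save_session and load_session are names invented for this example:

```python
import json
import time

def save_session(store, session_id, messages, ttl_seconds=3600):
    """Serialise a conversation so any VM or function instance can resume it."""
    payload = json.dumps({"messages": messages, "saved_at": time.time()})
    # With a real Redis client you would also pass ex=ttl_seconds here
    # so abandoned sessions expire on their own.
    store.set(f"session:{session_id}", payload)

def load_session(store, session_id):
    """Return the stored message list, or an empty history if none exists."""
    raw = store.get(f"session:{session_id}")
    return json.loads(raw)["messages"] if raw else []
```

Because the payload is plain JSON, the same session survives a redeploy, a VM restart, or a move between regions.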
Related guides
- Claude Agent SDK: Build Your First Agent in 30 Minutes — agent fundamentals before deployment
- Memory and State in Claude Agents: Patterns That Scale — managing state across requests and sessions
Take It Further
Claude Agent SDK Cookbook: 40 Production Patterns — Pattern 22 covers the full Production Deployment Architecture: Fly.io vs Vercel vs Lambda decision tree, Docker configurations, rate limit handling, cost guard implementation, and the production hardening checklist.
→ Get the Agent SDK Cookbook — $49
30-day money-back guarantee. Instant download.