
Deploying Claude Agents to Production: Fly.io, Vercel, and Lambda

A practical guide to deploying Claude agents to production infrastructure: Fly.io for persistent processes, Vercel for serverless, and AWS Lambda for event-driven workloads.

The right deployment target for a Claude agent depends on one factor: does your agent need to run longer than 30 seconds? If yes, use Fly.io or a VPS. If no, Vercel or AWS Lambda works and costs less. Most agentic workflows — multi-step tool use, web research, code execution — exceed serverless time limits. Understanding the deployment envelope before you build saves a painful migration later.


The core constraint: agent execution time

Claude agents that use tools commonly run for 60–300 seconds:

- Multi-step tool use: each model round trip adds seconds, and real tasks take many rounds
- Web research: fetching, reading, and synthesising several pages
- Code execution: generating code, running it, and iterating on failures

Serverless platforms have hard limits:

- Vercel: 60 seconds on Hobby; up to 5 minutes on Pro with maxDuration set
- AWS Lambda: 15 minutes, set via the function timeout

For agents that routinely exceed these windows, you need a persistent process host.


Option 1: Fly.io (best for agents that need persistent processes)

Fly.io runs Docker containers on global edge hardware. VMs stay alive between requests, which means:

- No hard execution time limit, so multi-minute agent loops run to completion
- In-process state survives between requests while the VM is up
- WebSocket and streaming connections work without workarounds

Cost: $0.00 (Hobby tier) for a single shared-CPU-1x VM with 256MB RAM. Most Python agent workers fit in 512MB–1GB.

Minimal Dockerfile for a Python agent

FROM python:3.12-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
CMD ["python", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]
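
The requirements.txt it installs needs at least the web framework, the ASGI server, and the Anthropic SDK. The list below is unpinned for brevity; pin the versions you have actually tested:

# requirements.txt (pin versions in a real deployment)
fastapi
uvicorn
anthropic
pydantic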

FastAPI wrapper for an agent endpoint

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import anthropic

app = FastAPI()
client = anthropic.AsyncAnthropic()  # async client keeps API calls from blocking the event loop

class AgentRequest(BaseModel):
    task: str
    session_id: str | None = None

@app.post("/run")
async def run_agent(request: AgentRequest):
    """
    Run a Claude agent task. Streams response or returns final output.
    Long-running is fine — Fly.io has no hard time limit.
    """
    messages = []
    messages.append({"role": "user", "content": request.task})
    
    # Agent loop
    while True:
        response = await client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=TOOLS,  # your tool definitions
            messages=messages,
        )
        
        if response.stop_reason == "end_turn":
            # Extract final text response
            final_text = next(
                (block.text for block in response.content if hasattr(block, "text")),
                None,
            )
            return {"result": final_text, "session_id": request.session_id}
        
        if response.stop_reason == "tool_use":
            # Process tool calls and continue loop
            tool_results = await process_tool_calls(response.content)
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
            continue
        
        break
    
    raise HTTPException(status_code=500, detail="Agent loop exited unexpectedly")
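
The TOOLS list and process_tool_calls helper are left as placeholders above. A minimal sketch of the shape process_tool_calls needs to return, assuming a hypothetical TOOL_HANDLERS registry mapping tool names to async functions:

async def process_tool_calls(content_blocks) -> list[dict]:
    """Execute each tool_use block and return tool_result blocks for the next turn."""
    results = []
    for block in content_blocks:
        if block.type != "tool_use":
            continue
        # TOOL_HANDLERS is a hypothetical dict of your own tool implementations
        handler = TOOL_HANDLERS[block.name]
        output = await handler(**block.input)
        results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": str(output),
        })
    return results

Each tool_result must reference the id of the tool_use block it answers, or the API rejects the request.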

Deploy to Fly.io

# Install flyctl
curl -L https://fly.io/install.sh | sh

# Initialise (creates fly.toml)
fly launch --name my-claude-agent --region nrt  # nrt = Tokyo

# Set your Anthropic API key as a secret
fly secrets set ANTHROPIC_API_KEY=sk-ant-...

# Deploy
fly deploy

fly.toml configuration for agents

app = "my-claude-agent"
primary_region = "nrt"

[build]

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = "stop"   # Stop when idle to save cost
  auto_start_machines = true    # Restart on incoming request
  min_machines_running = 0      # Can go to 0 when idle

[[vm]]
  size = "shared-cpu-1x"
  memory = "512mb"

Cold start: with auto_stop_machines = "stop", a stopped VM takes ~2–4 seconds to restart. For latency-sensitive agents, set min_machines_running = 1.


Option 2: Vercel (for agents under 60 seconds)

Use Vercel for agents that serve web requests or run quick tasks. The SDK setup is identical, but you must stay within the function timeout.

When Vercel works for agents:

Next.js API route for a Claude agent

// app/api/agent/route.ts
import Anthropic from "@anthropic-ai/sdk";
import { NextRequest } from "next/server";

const client = new Anthropic();

export const maxDuration = 60; // seconds (Hobby max; Pro allows up to 300)

export async function POST(request: NextRequest) {
  const { task } = await request.json();

  // Single-turn agent (no long loops — stay within time limit)
  const response = await client.messages.create({
    model: "claude-haiku-4-5", // Use Haiku for speed
    max_tokens: 1024,
    messages: [{ role: "user", content: task }],
  });

  return Response.json({
    result: response.content[0].type === "text" ? response.content[0].text : null,
  });
}

Streaming responses (better UX for long tasks)

For tasks that approach the time limit, stream the response so users see output as it generates:

export async function POST(request: NextRequest) {
  const { task } = await request.json();
  
  const stream = client.messages.stream({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    messages: [{ role: "user", content: task }],
  });
  
  // Return a ReadableStream — Vercel streams this to the client
  return new Response(
    new ReadableStream({
      async start(controller) {
        for await (const event of stream) {
          if (
            event.type === "content_block_delta" &&
            event.delta.type === "text_delta"
          ) {
            controller.enqueue(
              new TextEncoder().encode(event.delta.text)
            );
          }
        }
        controller.close();
      },
    }),
    { headers: { "Content-Type": "text/plain; charset=utf-8" } }
  );
}

Option 3: AWS Lambda (for event-driven agents)

AWS Lambda is ideal for agents triggered by events: S3 file uploads, SQS messages, DynamoDB stream events, or scheduled triggers.

Maximum execution time: 15 minutes — sufficient for most agentic tasks.

Lambda handler for a document-processing agent

from urllib.parse import unquote_plus

import anthropic
import boto3

client = anthropic.Anthropic()
s3 = boto3.client("s3")

def handler(event, context):
    """
    Triggered by S3 upload. Downloads file, processes with Claude,
    saves result to output bucket.
    """
    # Extract S3 event info (object keys arrive URL-encoded)
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = unquote_plus(event["Records"][0]["s3"]["object"]["key"])
    
    # Download file content
    obj = s3.get_object(Bucket=bucket, Key=key)
    content = obj["Body"].read().decode("utf-8")
    
    # Run agent
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"Analyse this document and extract key insights:\n\n{content}"
            }
        ]
    )
    
    result = response.content[0].text
    
    # Save result
    output_key = f"processed/{key}"
    s3.put_object(
        Bucket="output-bucket",
        Key=output_key,
        Body=result.encode("utf-8"),
    )
    
    return {"statusCode": 200, "key": output_key}

Lambda configuration for agents

# serverless.yml
service: claude-agent-processor

provider:
  name: aws
  runtime: python3.12
  region: ap-northeast-2  # Seoul
  timeout: 900            # 15 minutes
  memorySize: 1024        # 1GB — agents need memory for context
  environment:
    ANTHROPIC_API_KEY: ${ssm:/claude-agent/api-key}

functions:
  processDocument:
    handler: handler.handler
    events:
      - s3:
          bucket: input-bucket
          event: s3:ObjectCreated:*

Choosing the right target

| Factor | Fly.io | Vercel | Lambda |
|---|---|---|---|
| Max execution time | Unlimited | 60s–5min | 15 minutes |
| Cold start | 2–4s (stopped VM) | 200ms–1s | 500ms–3s |
| Cost (idle) | $0 (VM stopped) | $0 | $0 |
| Best for | Long-running agents | Web-integrated agents | Event-driven agents |
| State persistence | In-process (VM stays up) | None (stateless) | None (stateless) |
| WebSocket support | Yes | Limited (Pro) | No |
| Concurrency | Depends on VM count | Automatic | Up to 1000 |

Production hardening checklist

Before deploying any agent to production:

- Set an explicit timeout and retry policy on the Anthropic client (see below)
- Store the API key as a platform secret (fly secrets, Vercel env vars, SSM), never in code
- Cap agent loop iterations so a runaway agent cannot burn tokens indefinitely
- Keep session state in an external store, not in process memory
- Watch the rate limit headers and throttle before hitting 429s

# Production Anthropic client configuration
import os

import anthropic

client = anthropic.Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"],
    timeout=60.0,      # 60s per individual API call
    max_retries=3,     # automatic retry with exponential backoff
)
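
The loop cap from the checklist can be a small wrapper around the agent loop shown in the Fly.io section. MAX_TURNS here is an arbitrary budget, and TOOLS and process_tool_calls are the same placeholders as before:

MAX_TURNS = 20  # hard cap on model round trips per request

async def run_capped_agent(messages: list[dict]) -> str:
    client = anthropic.AsyncAnthropic()
    for _ in range(MAX_TURNS):
        response = await client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=TOOLS,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # Done (or stopped for another reason): return any text produced
            return next(
                (block.text for block in response.content if hasattr(block, "text")),
                "",
            )
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": await process_tool_calls(response.content)})
    raise RuntimeError(f"Agent exceeded {MAX_TURNS} turns; aborting to cap cost")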

Frequently asked questions

Can I run a Claude agent on a standard VPS (DigitalOcean, Hetzner)? Yes. Any always-on Linux server works. Use systemd or supervisor to keep the process alive. This is often cheaper than Fly.io at scale but requires more ops work.
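
A minimal systemd unit for that setup might look like the following; the paths, user, and service name are placeholders:

# /etc/systemd/system/claude-agent.service
[Unit]
Description=Claude agent worker
After=network-online.target

[Service]
User=agent
WorkingDirectory=/opt/claude-agent
EnvironmentFile=/etc/claude-agent/env
ExecStart=/opt/claude-agent/.venv/bin/uvicorn app.main:app --host 0.0.0.0 --port 8080
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Enable it with systemctl enable --now claude-agent.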

How do I handle Anthropic API rate limits in production? The Anthropic SDK retries automatically (3x by default with exponential backoff). For high-throughput agents, use the rate limit headers (x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens) to proactively throttle before hitting limits.
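
With the Python SDK, one way to read those headers is the raw-response wrapper, which exposes the HTTP headers alongside the parsed message:

import anthropic

client = anthropic.Anthropic()

# with_raw_response returns headers plus the usual Message via .parse()
raw = client.messages.with_raw_response.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "ping"}],
)
remaining_requests = int(raw.headers.get("x-ratelimit-remaining-requests", "0"))
remaining_tokens = int(raw.headers.get("x-ratelimit-remaining-tokens", "0"))
message = raw.parse()

if remaining_requests < 5:
    pass  # throttle or queue work before the API starts returning 429s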

What's the cheapest way to run a low-traffic agent? Fly.io with auto_stop_machines = "stop" — the VM stops when idle and you pay only for execution time. For very low traffic (<10 requests/day), AWS Lambda is similarly cheap but has a 15-minute cap.

Should I use Vercel Edge Functions or Node.js Functions for agents? Node.js Functions (not Edge). Edge Functions run a restricted runtime with tight bundle size limits that can conflict with the Anthropic SDK. Node.js Functions have the full runtime.

How do I persist agent state across deployments? Never store agent state in the VM's memory across requests. Use an external database (Neon, PlanetScale) or key-value store (Upstash Redis). VMs restart; databases persist.
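
A minimal sketch of that pattern with redis-py, pointed at any Redis-compatible URL such as Upstash; the key layout and 24-hour TTL are arbitrary choices:

import json
import os

import redis

r = redis.Redis.from_url(os.environ["REDIS_URL"], decode_responses=True)

def save_session(session_id: str, messages: list[dict]) -> None:
    # Persist the conversation outside the VM; expire after 24h
    r.set(f"session:{session_id}", json.dumps(messages), ex=86400)

def load_session(session_id: str) -> list[dict]:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else []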


Take It Further

Claude Agent SDK Cookbook: 40 Production Patterns — Pattern 22 covers the full Production Deployment Architecture: Fly.io vs Vercel vs Lambda decision tree, Docker configurations, rate limit handling, cost guard implementation, and the production hardening checklist.

→ Get the Agent SDK Cookbook — $49

30-day money-back guarantee. Instant download.

AI Disclosure: Drafted with Claude Code; all deployment patterns from production usage as of April 2026.
