Automatic Model Routing: Route Claude Requests to the Right Model at Runtime

How to build a lightweight classifier that automatically routes each Claude API request to Haiku, Sonnet, or Opus based on task complexity — with full code.

Model routing sends each request to the cheapest model that can handle it correctly — instead of sending everything to Opus (expensive) or everything to Haiku (insufficient for complex tasks). A well-tuned router captures 70-80% of the cost savings available from model selection without manual per-task decisions.

This guide shows three routing patterns, from simple to sophisticated, with full code.

What a router does

A router is a fast, cheap classifier that runs before the main model call. It reads the request and assigns it to a tier:

User request
    ↓
[Router: classify complexity]
    ├── Simple → Haiku ($1/$5 per 1M)
    ├── Medium → Sonnet ($3/$15 per 1M)
    └── Complex → Opus ($5/$25 per 1M)

The router must be:

  • Fast: its latency is added to every request (rules are instant; a Haiku routing call adds a few hundred milliseconds).
  • Cheap: its cost must be a small fraction of the savings it unlocks.
  • Conservative: when uncertain, route up a tier; overrouting wastes a little money, while underrouting risks a bad answer.

Pattern 1: Rule-based router (zero cost, instant)

The simplest router uses token count and keyword heuristics. No API call required.

import re
from enum import Enum

class Model(str, Enum):
    HAIKU = "claude-haiku-4-5"
    SONNET = "claude-sonnet-4-6"
    OPUS = "claude-opus-4-7"

# Token count estimator (rough: 4 chars per token)
def estimate_tokens(text: str) -> int:
    return len(text) // 4

# Keywords that signal complex reasoning
COMPLEX_KEYWORDS = re.compile(
    r"\b(refactor|architecture|security audit|review the entire|"
    r"migrate|synthesize|compare all|debug complex|"
    r"write a production|design system|explain why)\b",
    re.IGNORECASE
)

# Keywords that signal simple tasks
SIMPLE_KEYWORDS = re.compile(
    r"\b(classify|categorize|extract|summarize briefly|"
    r"translate|is this|true or false|yes or no|"
    r"what is the|define|list the)\b",
    re.IGNORECASE
)

def route_by_rules(prompt: str, context_tokens: int = 0) -> Model:
    total_tokens = estimate_tokens(prompt) + context_tokens

    # Long context → Sonnet or Opus
    if total_tokens > 100_000:
        return Model.OPUS
    if total_tokens > 20_000:
        return Model.SONNET

    # Keyword signals
    if COMPLEX_KEYWORDS.search(prompt):
        return Model.SONNET  # Conservative: Sonnet for "complex" keywords
    if SIMPLE_KEYWORDS.search(prompt):
        return Model.HAIKU

    # Default
    return Model.SONNET


# Usage
prompt = "What is the capital of France?"
model = route_by_rules(prompt)
print(model.value)  # claude-haiku-4-5

Strengths: zero latency, zero cost, fully deterministic.
Weaknesses: misses edge cases, requires manual tuning of keywords.

When to use: high-volume applications where a fast first-pass filter is worth 80% of the savings and you iterate the keyword list over time.


Pattern 2: Haiku-as-router (cheap API call)

Use Haiku itself to classify each request before calling the main model. The router call costs roughly $0.0005 per request; the savings on correctly downshifted requests are far larger.

import anthropic
import json

client = anthropic.Anthropic()

ROUTING_SYSTEM_PROMPT = """You are a request classifier. Classify the user's request into exactly one category:

SIMPLE: Classification, extraction, translation, yes/no questions, short factual lookups.
  Examples: "classify this email", "extract the date", "what language is this?"

MEDIUM: Code generation (< 100 lines), summarization, structured output generation, 
  analysis with defined schema, customer-facing replies, Q&A over a provided document.
  Examples: "write a Python function to...", "summarize this article", "answer this support ticket"

COMPLEX: Multi-file refactoring, security audits, architecture review, synthesis across 
  multiple long documents, production code generation with broad context, hard reasoning.
  Examples: "review my entire codebase for security issues", "refactor this module to use..."

Respond with ONLY a JSON object: {"tier": "SIMPLE"|"MEDIUM"|"COMPLEX", "reason": "one sentence"}"""

TIER_TO_MODEL = {
    "SIMPLE": "claude-haiku-4-5",
    "MEDIUM": "claude-sonnet-4-6",
    "COMPLEX": "claude-opus-4-7",
}

def route_with_haiku(user_message: str) -> tuple[str, str]:
    """Returns (model_name, reason). Uses Haiku as the router."""
    response = client.messages.create(
        model="claude-haiku-4-5",  # Always use Haiku for routing
        max_tokens=64,
        system=ROUTING_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": user_message[:500]}],  # Only first 500 chars needed
    )

    try:
        result = json.loads(response.content[0].text)
        tier = result.get("tier", "MEDIUM")
        reason = result.get("reason", "")
        model = TIER_TO_MODEL.get(tier, "claude-sonnet-4-6")
        return model, reason
    except (json.JSONDecodeError, IndexError, AttributeError):
        return "claude-sonnet-4-6", "routing parse error — defaulting to Sonnet"


def run_with_routing(user_message: str, system_prompt: str = "") -> str:
    """Route and execute the request on the appropriate model."""
    model, reason = route_with_haiku(user_message)
    print(f"Routed to {model} ({reason})")

    response = client.messages.create(
        model=model,
        max_tokens=4096,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text


# Test it
examples = [
    "Is 'machine learning' a type of AI?",
    "Write a Python function that parses ISO 8601 dates",
    "Review this 500-line auth module for security vulnerabilities and suggest refactoring",
]

for msg in examples:
    print(f"\nRequest: {msg[:60]}...")
    answer = run_with_routing(msg)
    print(f"Answer snippet: {answer[:100]}...")

Expected routing (the three examples mirror the tiers in the system prompt):

  • "Is 'machine learning' a type of AI?" → SIMPLE → claude-haiku-4-5
  • "Write a Python function that parses ISO 8601 dates" → MEDIUM → claude-sonnet-4-6
  • "Review this 500-line auth module for security vulnerabilities..." → COMPLEX → claude-opus-4-7

Cost of routing: each routing call sends the ~200-token system prompt plus at most 500 characters (~125 tokens) of the request to Haiku and returns a few dozen output tokens → roughly $0.0005. At 10K requests/month: ~$5 of routing cost vs. typical savings of $300-500.
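The arithmetic above, spelled out (token counts are rough estimates; prices are the Haiku rates quoted earlier):

```python
HAIKU_INPUT, HAIKU_OUTPUT = 1.00, 5.00  # $ per 1M tokens

def routing_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one Haiku routing call."""
    return input_tokens / 1e6 * HAIKU_INPUT + output_tokens / 1e6 * HAIKU_OUTPUT

# ~200-token system prompt + ~125 tokens of truncated request in, ~40 tokens out
per_request = routing_cost(325, 40)   # ≈ $0.0005
per_month = per_request * 10_000      # ≈ $5 at 10K requests/month
```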


Pattern 3: TypeScript router with caching

For production TypeScript applications, this version adds prompt caching to the routing system prompt so the routing instructions are not re-billed at the full input price on every call. Note that each model has a minimum cacheable prompt length (1,024 tokens or more, larger for Haiku models); a routing prompt shorter than that minimum is simply processed without caching, so pad it with few-shot examples if you want the cache to engage.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const ROUTING_SYSTEM = `You are a request classifier. Classify into SIMPLE, MEDIUM, or COMPLEX.

SIMPLE: classification, extraction, translation, yes/no, short facts.
MEDIUM: code gen <100 lines, summarization, structured output, Q&A on document.
COMPLEX: multi-file refactor, security audit, architecture review, >100K token synthesis.

Respond ONLY as JSON: {"tier":"SIMPLE"|"MEDIUM"|"COMPLEX","reason":"one sentence"}`;

const TIER_TO_MODEL: Record<string, string> = {
  SIMPLE: "claude-haiku-4-5",
  MEDIUM: "claude-sonnet-4-6",
  COMPLEX: "claude-opus-4-7",
};

interface RouteResult {
  model: string;
  tier: string;
  reason: string;
  routingCost: number;
}

async function routeRequest(userMessage: string): Promise<RouteResult> {
  const response = await client.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 64,
    system: [
      {
        type: "text",
        text: ROUTING_SYSTEM,
        cache_control: { type: "ephemeral" }, // Cache the routing instructions
      },
    ],
    messages: [
      {
        role: "user",
        content: userMessage.slice(0, 500), // Only first 500 chars for routing
      },
    ],
  });

  const routingCost =
    response.usage.input_tokens / 1_000_000 * 1.0 +
    response.usage.output_tokens / 1_000_000 * 5.0;

  try {
    const text =
      response.content[0].type === "text" ? response.content[0].text : "{}";
    const parsed = JSON.parse(text) as { tier?: string; reason?: string };
    const tier = parsed.tier ?? "MEDIUM";
    return {
      model: TIER_TO_MODEL[tier] ?? "claude-sonnet-4-6",
      tier,
      reason: parsed.reason ?? "",
      routingCost,
    };
  } catch {
    return {
      model: "claude-sonnet-4-6",
      tier: "MEDIUM",
      reason: "routing parse error",
      routingCost,
    };
  }
}

async function runWithRouting(
  userMessage: string,
  systemPrompt = ""
): Promise<string> {
  const { model, tier, reason, routingCost } = await routeRequest(userMessage);
  console.log(`Routed to ${model} (${tier}) — ${reason} [routing: $${routingCost.toFixed(5)}]`);

  const response = await client.messages.create({
    model,
    max_tokens: 4096,
    ...(systemPrompt && { system: systemPrompt }),
    messages: [{ role: "user", content: userMessage }],
  });

  const text = response.content.find((b) => b.type === "text");
  return text?.type === "text" ? text.text : "";
}

// Example
runWithRouting("Classify this support ticket: user says button is broken")
  .then(console.log);

Measuring router accuracy

Before deploying a router, validate it against a sample of your real traffic:

  1. Collect 200+ real requests from your application logs (or representative examples)
  2. Label each with the correct tier (manual review or run against Opus and use its output as ground truth)
  3. Run your router against the set and count:
    • Correct routes: cheap model, correct tier
    • Overroutes: expensive model when a cheaper one would have worked (safe, costly)
    • Underroutes: cheap model sent where expensive was needed (risky, check outputs)

Target: < 5% underroute rate (a Sonnet task incorrectly sent to Haiku). Overrouting wastes money but doesn't affect quality.

# Evaluate router accuracy
def evaluate_router(examples: list[dict], router_fn) -> dict:
    """
    examples: [{"message": str, "correct_tier": "SIMPLE"|"MEDIUM"|"COMPLEX"}]
    """
    tiers = {"SIMPLE": 0, "MEDIUM": 1, "COMPLEX": 2}
    results = {"correct": 0, "overroute": 0, "underroute": 0}

    for ex in examples:
        predicted_model, _ = router_fn(ex["message"])
        predicted_tier = {v: k for k, v in TIER_TO_MODEL.items()}[predicted_model]

        correct_idx = tiers[ex["correct_tier"]]
        predicted_idx = tiers[predicted_tier]

        if predicted_idx == correct_idx:
            results["correct"] += 1
        elif predicted_idx > correct_idx:
            results["overroute"] += 1  # Used more expensive model than needed
        else:
            results["underroute"] += 1  # Used cheaper model than needed

    total = len(examples)
    return {
        "accuracy": results["correct"] / total,
        "overroute_rate": results["overroute"] / total,
        "underroute_rate": results["underroute"] / total,
    }
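Here is the evaluation loop run end to end against a trivial stand-in router (a length heuristic instead of route_with_haiku, so no API key is needed); the tier comparison mirrors evaluate_router above, inlined so the snippet runs standalone:

```python
TIER_TO_MODEL = {
    "SIMPLE": "claude-haiku-4-5",
    "MEDIUM": "claude-sonnet-4-6",
    "COMPLEX": "claude-opus-4-7",
}

def stub_router(message: str) -> tuple[str, str]:
    # Stand-in classifier: short requests are SIMPLE, longer ones MEDIUM.
    tier = "SIMPLE" if len(message) < 40 else "MEDIUM"
    return TIER_TO_MODEL[tier], "length heuristic"

labeled = [
    {"message": "Extract the date", "correct_tier": "SIMPLE"},
    {"message": "Summarize this article about revenue trends and risks", "correct_tier": "MEDIUM"},
    {"message": "Audit the auth module", "correct_tier": "COMPLEX"},
]

tiers = {"SIMPLE": 0, "MEDIUM": 1, "COMPLEX": 2}
model_to_tier = {v: k for k, v in TIER_TO_MODEL.items()}
counts = {"correct": 0, "overroute": 0, "underroute": 0}
for ex in labeled:
    model, _ = stub_router(ex["message"])
    diff = tiers[model_to_tier[model]] - tiers[ex["correct_tier"]]
    key = "correct" if diff == 0 else ("overroute" if diff > 0 else "underroute")
    counts[key] += 1
# The stub underroutes the COMPLEX example:
# counts == {"correct": 2, "overroute": 0, "underroute": 1}
```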

Routing + caching combined

The highest-ROI pattern combines routing with prompt caching on the routed model:

def run_with_routing_and_caching(
    user_message: str,
    system_prompt: str,
    cached_context: str = "",  # Static context to cache
) -> str:
    model, _ = route_with_haiku(user_message)

    system = [{"type": "text", "text": system_prompt}]
    if cached_context:
        system.append({
            "type": "text",
            "text": cached_context,
            "cache_control": {"type": "ephemeral"},
        })

    response = client.messages.create(
        model=model,
        max_tokens=4096,
        system=system,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text

This gets you:

  • The cheapest capable model for each request (routing).
  • Repeat calls that read the static context from cache at roughly 10% of the base input price (caching).

Note that prompt caches are scoped per model, so a context cached for Sonnet does not serve a later request routed to Haiku.


FAQ

Does routing add latency? Pattern 1 (rules): zero latency. Pattern 2 (Haiku router): 100-300ms. For user-facing applications, run the routing call concurrently with any other preparation work.

What if I get the routing wrong? Overrouting (expensive model for simple task) wastes money but produces correct results. Underrouting (cheap model for complex task) may produce lower quality. In practice, track underroute quality separately and tune your routing thresholds.
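One way to contain underroutes is an escalation ladder: retry one tier up when the cheap answer fails a quality check. A sketch (call_model and looks_ok are placeholders you would back with a real API call and a check such as schema validation):

```python
LADDER = ["claude-haiku-4-5", "claude-sonnet-4-6", "claude-opus-4-7"]

def answer_with_escalation(prompt, call_model, looks_ok, start="claude-haiku-4-5"):
    """Try the routed model first; escalate one tier when the answer fails the check."""
    for model in LADDER[LADDER.index(start):]:
        answer = call_model(model, prompt)
        if looks_ok(answer) or model == LADDER[-1]:
            return answer  # good enough, or nothing stronger left to try

# Stub: Haiku returns an empty answer, Sonnet succeeds.
fake_call = lambda model, prompt: "" if "haiku" in model else "ok: parsed"
answer = answer_with_escalation("parse this", fake_call, looks_ok=bool)
# answer == "ok: parsed"
```

Escalation turns an underroute from a quality bug into a latency cost, and the escalation rate itself is a useful signal for tuning the router.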

Can I use a smaller model as router? Haiku is already the smallest and cheapest Claude model, so there is nothing smaller to route with. If you want zero routing cost, use the rule-based classifier from Pattern 1 instead.

Should I always route or just always use Sonnet? For applications where > 40% of traffic is genuinely SIMPLE (classifications, extractions, lookups), routing pays off immediately. For applications that are uniformly MEDIUM complexity, Sonnet-default with no router is simpler and nearly as optimal.

AI Disclosure: Drafted with Claude Code; examples verified with Anthropic SDK April 2026.