Automatic Model Routing: Route Claude Requests to the Right Model at Runtime

How to build a lightweight classifier that automatically routes each Claude API request to Haiku, Sonnet, or Opus based on task complexity — with full code.

Model routing sends each request to the cheapest model that can handle it correctly — instead of sending everything to Opus (expensive) or everything to Haiku (insufficient for complex tasks). A well-tuned router captures 70-80% of the cost savings available from model selection without manual per-task decisions.

This guide shows three routing patterns, from simple to sophisticated, with full code.

What a router does

A router is a fast, cheap classifier that runs before the main model call. It reads the request and assigns it to a tier:

User request
    ↓
[Router: classify complexity]
    ├── Simple → Haiku ($1/$5 per 1M)
    ├── Medium → Sonnet ($3/$15 per 1M)
    └── Complex → Opus ($5/$25 per 1M)

The router must be:

  • Fast: its latency is added to every request (rules are instant; a Haiku routing call adds a few hundred milliseconds).
  • Cheap: its cost must be a small fraction of the savings it unlocks.
  • Conservative: when uncertain, route up a tier; overrouting wastes a little money, while underrouting risks a bad answer.

Pattern 1: Rule-based router (zero cost, instant)

The simplest router uses token count and keyword heuristics. No API call required.

import re
from enum import Enum

class Model(str, Enum):
    HAIKU = "claude-haiku-4-5"
    SONNET = "claude-sonnet-4-6"
    OPUS = "claude-opus-4-7"

# Token count estimator (rough: 4 chars per token)
def estimate_tokens(text: str) -> int:
    return len(text) // 4

# Keywords that signal complex reasoning
COMPLEX_KEYWORDS = re.compile(
    r"\b(refactor|architecture|security audit|review the entire|"
    r"migrate|synthesize|compare all|debug complex|"
    r"write a production|design system|explain why)\b",
    re.IGNORECASE
)

# Keywords that signal simple tasks
SIMPLE_KEYWORDS = re.compile(
    r"\b(classify|categorize|extract|summarize briefly|"
    r"translate|is this|true or false|yes or no|"
    r"what is the|define|list the)\b",
    re.IGNORECASE
)

def route_by_rules(prompt: str, context_tokens: int = 0) -> Model:
    total_tokens = estimate_tokens(prompt) + context_tokens

    # Long context → Sonnet or Opus
    if total_tokens > 100_000:
        return Model.OPUS
    if total_tokens > 20_000:
        return Model.SONNET

    # Keyword signals
    if COMPLEX_KEYWORDS.search(prompt):
        return Model.SONNET  # Conservative: Sonnet for "complex" keywords
    if SIMPLE_KEYWORDS.search(prompt):
        return Model.HAIKU

    # Default
    return Model.SONNET


# Usage
prompt = "What is the capital of France?"
model = route_by_rules(prompt)
print(model.value)  # claude-haiku-4-5

Strengths: zero latency, zero cost, fully deterministic.
Weaknesses: misses edge cases, requires manual tuning of keywords.

When to use: high-volume applications where a fast first-pass filter is worth 80% of the savings and you iterate the keyword list over time.


Pattern 2: Haiku-as-router (cheap API call)

Use Haiku itself to classify each request before calling the main model. The router call costs roughly $0.0005 per request; the savings on correctly downshifted requests are far larger.

import anthropic
import json

client = anthropic.Anthropic()

ROUTING_SYSTEM_PROMPT = """You are a request classifier. Classify the user's request into exactly one category:

SIMPLE: Classification, extraction, translation, yes/no questions, short factual lookups.
  Examples: "classify this email", "extract the date", "what language is this?"

MEDIUM: Code generation (< 100 lines), summarization, structured output generation, 
  analysis with defined schema, customer-facing replies, Q&A over a provided document.
  Examples: "write a Python function to...", "summarize this article", "answer this support ticket"

COMPLEX: Multi-file refactoring, security audits, architecture review, synthesis across 
  multiple long documents, production code generation with broad context, hard reasoning.
  Examples: "review my entire codebase for security issues", "refactor this module to use..."

Respond with ONLY a JSON object: {"tier": "SIMPLE"|"MEDIUM"|"COMPLEX", "reason": "one sentence"}"""

TIER_TO_MODEL = {
    "SIMPLE": "claude-haiku-4-5",
    "MEDIUM": "claude-sonnet-4-6",
    "COMPLEX": "claude-opus-4-7",
}

def route_with_haiku(user_message: str) -> tuple[str, str]:
    """Returns (model_name, reason). Uses Haiku as the router."""
    response = client.messages.create(
        model="claude-haiku-4-5",  # Always use Haiku for routing
        max_tokens=64,
        system=ROUTING_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": user_message[:500]}],  # Only first 500 chars needed
    )

    try:
        result = json.loads(response.content[0].text)
        tier = result.get("tier", "MEDIUM")
        reason = result.get("reason", "")
        model = TIER_TO_MODEL.get(tier, "claude-sonnet-4-6")
        return model, reason
    except (json.JSONDecodeError, IndexError, AttributeError):
        return "claude-sonnet-4-6", "routing parse error — defaulting to Sonnet"


def run_with_routing(user_message: str, system_prompt: str = "") -> str:
    """Route and execute the request on the appropriate model."""
    model, reason = route_with_haiku(user_message)
    print(f"Routed to {model} ({reason})")

    response = client.messages.create(
        model=model,
        max_tokens=4096,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text


# Test it
examples = [
    "Is 'machine learning' a type of AI?",
    "Write a Python function that parses ISO 8601 dates",
    "Review this 500-line auth module for security vulnerabilities and suggest refactoring",
]

for msg in examples:
    print(f"\nRequest: {msg[:60]}...")
    answer = run_with_routing(msg)
    print(f"Answer snippet: {answer[:100]}...")

Expected routing (the three examples mirror the tiers in the system prompt):

  • "Is 'machine learning' a type of AI?" → SIMPLE → claude-haiku-4-5
  • "Write a Python function that parses ISO 8601 dates" → MEDIUM → claude-sonnet-4-6
  • "Review this 500-line auth module for security vulnerabilities..." → COMPLEX → claude-opus-4-7

Cost of routing: each routing call sends the ~200-token system prompt plus at most 500 characters (~125 tokens) of the request to Haiku and returns a few dozen output tokens → roughly $0.0005. At 10K requests/month: ~$5 of routing cost vs. typical savings of $300-500.
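The arithmetic above, spelled out (token counts are rough estimates; prices are the Haiku rates quoted earlier):

```python
HAIKU_INPUT, HAIKU_OUTPUT = 1.00, 5.00  # $ per 1M tokens

def routing_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one Haiku routing call."""
    return input_tokens / 1e6 * HAIKU_INPUT + output_tokens / 1e6 * HAIKU_OUTPUT

# ~200-token system prompt + ~125 tokens of truncated request in, ~40 tokens out
per_request = routing_cost(325, 40)   # ≈ $0.0005
per_month = per_request * 10_000      # ≈ $5 at 10K requests/month
```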


Pattern 3: TypeScript router with caching

For production TypeScript applications, this version adds prompt caching to the routing system prompt so the routing instructions are not re-billed at the full input price on every call. Note that each model has a minimum cacheable prompt length (1,024 tokens or more, larger for Haiku models); a routing prompt shorter than that minimum is simply processed without caching, so pad it with few-shot examples if you want the cache to engage.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const ROUTING_SYSTEM = `You are a request classifier. Classify into SIMPLE, MEDIUM, or COMPLEX.

SIMPLE: classification, extraction, translation, yes/no, short facts.
MEDIUM: code gen <100 lines, summarization, structured output, Q&A on document.
COMPLEX: multi-file refactor, security audit, architecture review, >100K token synthesis.

Respond ONLY as JSON: {"tier":"SIMPLE"|"MEDIUM"|"COMPLEX","reason":"one sentence"}`;

const TIER_TO_MODEL: Record<string, string> = {
  SIMPLE: "claude-haiku-4-5",
  MEDIUM: "claude-sonnet-4-6",
  COMPLEX: "claude-opus-4-7",
};

interface RouteResult {
  model: string;
  tier: string;
  reason: string;
  routingCost: number;
}

async function routeRequest(userMessage: string): Promise<RouteResult> {
  const response = await client.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 64,
    system: [
      {
        type: "text",
        text: ROUTING_SYSTEM,
        cache_control: { type: "ephemeral" }, // Cache the routing instructions
      },
    ],
    messages: [
      {
        role: "user",
        content: userMessage.slice(0, 500), // Only first 500 chars for routing
      },
    ],
  });

  const routingCost =
    response.usage.input_tokens / 1_000_000 * 1.0 +
    response.usage.output_tokens / 1_000_000 * 5.0;

  try {
    const text =
      response.content[0].type === "text" ? response.content[0].text : "{}";
    const parsed = JSON.parse(text) as { tier?: string; reason?: string };
    const tier = parsed.tier ?? "MEDIUM";
    return {
      model: TIER_TO_MODEL[tier] ?? "claude-sonnet-4-6",
      tier,
      reason: parsed.reason ?? "",
      routingCost,
    };
  } catch {
    return {
      model: "claude-sonnet-4-6",
      tier: "MEDIUM",
      reason: "routing parse error",
      routingCost,
    };
  }
}

async function runWithRouting(
  userMessage: string,
  systemPrompt = ""
): Promise<string> {
  const { model, tier, reason, routingCost } = await routeRequest(userMessage);
  console.log(`Routed to ${model} (${tier}) — ${reason} [routing: $${routingCost.toFixed(5)}]`);

  const response = await client.messages.create({
    model,
    max_tokens: 4096,
    ...(systemPrompt && { system: systemPrompt }),
    messages: [{ role: "user", content: userMessage }],
  });

  const text = response.content.find((b) => b.type === "text");
  return text?.type === "text" ? text.text : "";
}

// Example
runWithRouting("Classify this support ticket: user says button is broken")
  .then(console.log);

Measuring router accuracy

Before deploying a router, validate it against a sample of your real traffic:

  1. Collect 200+ real requests from your application logs (or representative examples)
  2. Label each with the correct tier (manual review or run against Opus and use its output as ground truth)
  3. Run your router against the set and count:
    • Correct routes: cheap model, correct tier
    • Overroutes: expensive model when a cheaper one would have worked (safe, costly)
    • Underroutes: cheap model sent where expensive was needed (risky, check outputs)

Target: < 5% underroute rate (a Sonnet task incorrectly sent to Haiku). Overrouting wastes money but doesn't affect quality.

# Evaluate router accuracy
def evaluate_router(examples: list[dict], router_fn) -> dict:
    """
    examples: [{"message": str, "correct_tier": "SIMPLE"|"MEDIUM"|"COMPLEX"}]
    """
    tiers = {"SIMPLE": 0, "MEDIUM": 1, "COMPLEX": 2}
    results = {"correct": 0, "overroute": 0, "underroute": 0}

    for ex in examples:
        predicted_model, _ = router_fn(ex["message"])
        predicted_tier = {v: k for k, v in TIER_TO_MODEL.items()}[predicted_model]

        correct_idx = tiers[ex["correct_tier"]]
        predicted_idx = tiers[predicted_tier]

        if predicted_idx == correct_idx:
            results["correct"] += 1
        elif predicted_idx > correct_idx:
            results["overroute"] += 1  # Used more expensive model than needed
        else:
            results["underroute"] += 1  # Used cheaper model than needed

    total = len(examples)
    return {
        "accuracy": results["correct"] / total,
        "overroute_rate": results["overroute"] / total,
        "underroute_rate": results["underroute"] / total,
    }
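Here is the evaluation loop run end to end against a trivial stand-in router (a length heuristic instead of route_with_haiku, so no API key is needed); the tier comparison mirrors evaluate_router above, inlined so the snippet runs standalone:

```python
TIER_TO_MODEL = {
    "SIMPLE": "claude-haiku-4-5",
    "MEDIUM": "claude-sonnet-4-6",
    "COMPLEX": "claude-opus-4-7",
}

def stub_router(message: str) -> tuple[str, str]:
    # Stand-in classifier: short requests are SIMPLE, longer ones MEDIUM.
    tier = "SIMPLE" if len(message) < 40 else "MEDIUM"
    return TIER_TO_MODEL[tier], "length heuristic"

labeled = [
    {"message": "Extract the date", "correct_tier": "SIMPLE"},
    {"message": "Summarize this article about revenue trends and risks", "correct_tier": "MEDIUM"},
    {"message": "Audit the auth module", "correct_tier": "COMPLEX"},
]

tiers = {"SIMPLE": 0, "MEDIUM": 1, "COMPLEX": 2}
model_to_tier = {v: k for k, v in TIER_TO_MODEL.items()}
counts = {"correct": 0, "overroute": 0, "underroute": 0}
for ex in labeled:
    model, _ = stub_router(ex["message"])
    diff = tiers[model_to_tier[model]] - tiers[ex["correct_tier"]]
    key = "correct" if diff == 0 else ("overroute" if diff > 0 else "underroute")
    counts[key] += 1
# The stub underroutes the COMPLEX example:
# counts == {"correct": 2, "overroute": 0, "underroute": 1}
```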

Routing + caching combined

The highest-ROI pattern combines routing with prompt caching on the routed model:

def run_with_routing_and_caching(
    user_message: str,
    system_prompt: str,
    cached_context: str = "",  # Static context to cache
) -> str:
    model, _ = route_with_haiku(user_message)

    system = [{"type": "text", "text": system_prompt}]
    if cached_context:
        system.append({
            "type": "text",
            "text": cached_context,
            "cache_control": {"type": "ephemeral"},
        })

    response = client.messages.create(
        model=model,
        max_tokens=4096,
        system=system,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text

This gets you:

  • The cheapest capable model for each request (routing).
  • Repeat calls that read the static context from cache at roughly 10% of the base input price (caching).

Note that prompt caches are scoped per model, so a context cached for Sonnet does not serve a later request routed to Haiku.


FAQ

Does routing add latency? Pattern 1 (rules): zero latency. Pattern 2 (Haiku router): 100-300ms. For user-facing applications, run the routing call concurrently with any other preparation work.

What if I get the routing wrong? Overrouting (expensive model for simple task) wastes money but produces correct results. Underrouting (cheap model for complex task) may produce lower quality. In practice, track underroute quality separately and tune your routing thresholds.
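One way to contain underroutes is an escalation ladder: retry one tier up when the cheap answer fails a quality check. A sketch (call_model and looks_ok are placeholders you would back with a real API call and a check such as schema validation):

```python
LADDER = ["claude-haiku-4-5", "claude-sonnet-4-6", "claude-opus-4-7"]

def answer_with_escalation(prompt, call_model, looks_ok, start="claude-haiku-4-5"):
    """Try the routed model first; escalate one tier when the answer fails the check."""
    for model in LADDER[LADDER.index(start):]:
        answer = call_model(model, prompt)
        if looks_ok(answer) or model == LADDER[-1]:
            return answer  # good enough, or nothing stronger left to try

# Stub: Haiku returns an empty answer, Sonnet succeeds.
fake_call = lambda model, prompt: "" if "haiku" in model else "ok: parsed"
answer = answer_with_escalation("parse this", fake_call, looks_ok=bool)
# answer == "ok: parsed"
```

Escalation turns an underroute from a quality bug into a latency cost, and the escalation rate itself is a useful signal for tuning the router.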

Can I use a smaller model as router? Haiku is already the smallest and cheapest Claude model, so there is nothing smaller to route with. If you want zero routing cost, use the rule-based classifier from Pattern 1 instead.

Should I always route or just always use Sonnet? For applications where > 40% of traffic is genuinely SIMPLE (classifications, extractions, lookups), routing pays off immediately. For applications that are uniformly MEDIUM complexity, Sonnet-default with no router is simpler and nearly as optimal.

AI Disclosure: Drafted with Claude Code; examples verified with Anthropic SDK April 2026.