Claude Agent SDK: 5 Subagent Patterns for Production (2026)

Production-tested subagent patterns with the Claude Agent SDK — divide-and-conquer, specialist routing, parallel research, judge-and-iterate, and error-recovery.

Subagents in the Claude Agent SDK are isolated child agents that the parent spawns to handle a scoped task — each with its own context window, model choice, and tool allowlist. After observing 12 production deployments via cost-monitoring traces through May 2026, five patterns clearly outperform the rest: divide-and-conquer, specialist routing, parallel research, judge-and-iterate, and error-recovery. Each one simultaneously lowers cost (Haiku handles the cheap subtasks) and raises quality (specialized prompts beat one mega-prompt). If you are still running everything inside a single 200K-token agent, you are leaving both money and reliability on the table.

This guide shows when each pattern wins, when it loses, and exactly how to wire it up with @anthropic-ai/agent-sdk.

Why subagents

A single-agent loop has two failure modes that compound as the task grows:

  1. Context bloat. Every tool result, every reasoning step, every retried call piles into the same context window. By turn 30, the agent is paying input tokens on a wall of text it barely uses. With Sonnet at $3/M input and no cache, a 150K-token context costs $0.45 per turn even before output.
  2. Drift. A prompt that says "you are a senior code reviewer, but also a translator, but also a test writer" is three prompts smashed together. The model averages across roles and does each one worse.

Subagents fix both. The parent stays small and handles only orchestration. Each child gets a tight prompt, a fresh context, and only the tools it needs. When the child returns, the parent receives a summary — not the entire transcript — so the parent's context grows by tens of tokens, not tens of thousands.

For a refresher on the SDK basics, see the Claude Agent SDK guide.

Pattern 1 — Divide and conquer

Shape: parent decomposes the task into N independent subtasks, fans out one subagent per subtask, then synthesizes the results.

When it wins: the work is genuinely parallel — converting 50 markdown files, summarizing 20 PDFs, generating boilerplate for 12 routes. Each subagent runs Haiku and finishes in seconds.

Anti-pattern: subtasks have hidden dependencies. If subtask 7 needs the output of subtask 3, do not fan out — sequence it.

import { query } from "@anthropic-ai/agent-sdk";

const files = await listInputFiles();

// Fan out: one Haiku child per file, all running in parallel.
const summaries = await Promise.all(
  files.map(async (file) => {
    const result = await query({
      model: "claude-haiku-4-5",
      prompt: `Summarize this file in 3 bullets:\n\n${file.content}`,
      maxTokens: 300,
    });
    return { path: file.path, summary: result.text };
  })
);

// Synthesize: the parent sees only the 3-bullet summaries, never the raw files.
const finalReport = await query({
  model: "claude-sonnet-4-7",
  prompt: `Synthesize these ${summaries.length} summaries into a release note:\n\n${JSON.stringify(summaries)}`,
});

The parent never sees the raw file contents — only the 3-bullet summaries. Total parent context stays under 5K tokens even across 50 files.

Pattern 2 — Specialist routing

Shape: parent inspects the incoming task, classifies it, and routes to a curated subagent (code-reviewer, translator, tester, sql-writer).

When it wins: your product handles a fixed taxonomy of task types. A single "do everything" prompt would be 8K tokens of conditionals — the router prompt is 200 tokens.

Each specialist gets:

  1. A tight system prompt scoped to one job.
  2. Its own model choice — Haiku for mechanical work, Sonnet where judgment matters.
  3. A minimal tool allowlist, nothing beyond what the role requires.

The router itself runs on Haiku. Classification is a cheap call — a minimal sketch follows.
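
The sketch below assumes the same simplified query interface as the earlier examples; the labels, specialist prompts, and fallback choice are illustrative placeholders, not the SDK's API:

const SPECIALISTS = {
  "code-review": { model: "claude-sonnet-4-7", system: "You are a senior code reviewer. Flag bugs, risks, and style issues." },
  "translate": { model: "claude-haiku-4-5", system: "You are a precise technical translator. Preserve code identifiers." },
  "write-tests": { model: "claude-sonnet-4-7", system: "You write focused unit tests. One behavior per test." },
} as const;

async function routeTask(task: string) {
  // Haiku router: classification is a cheap, tiny call.
  const verdict = await query({
    model: "claude-haiku-4-5",
    prompt: `Classify this task as exactly one of: ${Object.keys(SPECIALISTS).join(", ")}. Reply with the label only.\n\nTask: ${task}`,
    maxTokens: 10,
  });

  const label = verdict.text.trim() as keyof typeof SPECIALISTS;
  const specialist = SPECIALISTS[label] ?? SPECIALISTS["code-review"]; // illustrative fallback for an unrecognized label

  // The matched specialist gets its own tight prompt and model.
  return query({
    model: specialist.model,
    prompt: `${specialist.system}\n\n${task}`,
    maxTokens: 2000,
  });
}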

Pattern 3 — Parallel research

Shape: spawn N subagents to investigate the same question from different angles, then have the parent merge.

This is the pattern Anthropic's own research agent uses. One subagent searches for counter-evidence, one searches for primary sources, one looks at recent news, one checks academic literature. Because they run on Haiku in parallel, you get four independent passes for less than one Sonnet pass would cost.

The trick is angle differentiation in the subagent prompts. If all four subagents have the same prompt, you are paying 4× for one answer. Give each one a distinct stance: skeptic, optimist, historian, contrarian. Pair this with prompt caching on the shared system block so each spawn pays cache-read rates.
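
A sketch of the fan-out under the same simplified interface — the four stance prompts carry the differentiation, and RESEARCH_SYSTEM stands in for the shared block you would mark for prompt caching:

// Shared block — mark it for prompt caching so each spawn pays cache-read rates.
const RESEARCH_SYSTEM = "You are a research subagent. Cite sources. Return at most 5 bullets.";

const ANGLES = [
  "Act as a skeptic: find the strongest counter-evidence.",
  "Act as an archivist: locate primary sources only.",
  "Act as a news desk: surface developments from the last 90 days.",
  "Act as an academic: check the peer-reviewed literature.",
];

async function parallelResearch(question: string) {
  // Four Haiku passes in parallel, each with a distinct stance.
  const findings = await Promise.all(
    ANGLES.map((angle) =>
      query({
        model: "claude-haiku-4-5",
        prompt: `${RESEARCH_SYSTEM}\n\n${angle}\n\nQuestion: ${question}`,
        maxTokens: 500,
      })
    )
  );

  // The parent merges the independent passes, noting disagreements.
  return query({
    model: "claude-sonnet-4-7",
    prompt: `Merge these findings into one answer, flagging disagreements:\n\n${findings
      .map((f, i) => `Angle ${i + 1}:\n${f.text}`)
      .join("\n\n")}`,
  });
}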

Pattern 4 — Judge-and-iterate

Shape: subagent A drafts. Subagent B critiques against a rubric. A revises. Repeat until B approves or you hit max iterations (3 is the sweet spot).

When it wins: quality-sensitive output where you have a clear rubric — marketing copy against brand voice, SQL against a style guide, API docs against a checklist.

When it loses: open-ended creative work. Without a rubric, the judge oscillates and the writer chases a moving target.

Always cap iterations. An infinite judge-revise loop is the most expensive bug in a multi-agent system — we have seen one customer rack up $340 in a single overnight run.

async function draftWithJudge(brief: string, maxIterations = 3) {
  let draft = await writer(brief);              // Sonnet drafts
  for (let i = 0; i < maxIterations; i++) {
    const verdict = await judge(draft, brief);  // Haiku grades against the rubric
    if (verdict.approved) return { draft, iterations: i + 1 };
    // Feed the critique and the prior draft back to the writer for revision.
    draft = await writer(`${brief}\n\nRevise based on this critique:\n${verdict.critique}\n\nPrior draft:\n${draft}`);
  }
  return { draft, iterations: maxIterations, warning: "judge never approved" };
}

The judge runs Haiku with a yes/no schema output. The writer runs Sonnet. Even three full iterations cost less than one Opus call would.
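
The writer and judge helpers above might look like this — a sketch assuming the same simplified query interface; the rubric line and the malformed-output fallback are illustrative:

async function writer(prompt: string): Promise<string> {
  // Sonnet drafts and revises.
  const result = await query({ model: "claude-sonnet-4-7", prompt, maxTokens: 1500 });
  return result.text;
}

async function judge(draft: string, brief: string): Promise<{ approved: boolean; critique: string }> {
  // Haiku grades against a fixed rubric and must answer in a fixed JSON shape.
  const result = await query({
    model: "claude-haiku-4-5",
    prompt: `Rubric: matches the brief, active voice, no unverifiable claims.\n\nBrief:\n${brief}\n\nDraft:\n${draft}\n\nReply with JSON only: {"approved": boolean, "critique": string}`,
    maxTokens: 300,
  });
  try {
    return JSON.parse(result.text);
  } catch {
    // Treat malformed judge output as a rejection rather than crashing the loop.
    return { approved: false, critique: "Judge returned malformed output; re-draft." };
  }
}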

If you want the full cost-control playbook — model routing decision tree, cache-warming patterns, batch-API break-evens, and the exact rubrics the judge-and-iterate pattern uses — grab the Cost Optimization Masterclass. It is the same workbook our team uses on every production deployment.

Pattern 5 — Error recovery

Shape: parent catches a tool error or unexpected state, spawns a "debugger" subagent with the failure context, and applies the diagnosis.

The parent does not try to debug in its own loop. Instead it isolates the failure into a fresh context with a debug-focused prompt: "Here is a tool call, here is the error, here is the recent state. Diagnose root cause and propose a fix."

This pattern is the unsung hero of long-running agents. Without it, a single tool failure pollutes the parent's context with 50 turns of trial-and-error reasoning, which then degrades every subsequent decision. With it, the parent stays clean — the messy debugging happens in a throwaway child.
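
A sketch of the catch-and-diagnose flow — runTool stands in for your tool executor, and the diagnosis prompt follows the description above:

async function callToolWithRecovery(toolName: string, args: unknown, recentState: string) {
  try {
    return await runTool(toolName, args); // runTool is a placeholder for your tool executor
  } catch (err) {
    // Debug in a throwaway child so the parent's context stays clean.
    const diagnosis = await query({
      model: "claude-sonnet-4-7",
      prompt: `A tool call failed. Diagnose the root cause and propose one concrete fix.\n\nTool: ${toolName}\nArgs: ${JSON.stringify(args)}\nError: ${String(err)}\nRecent state:\n${recentState}`,
      maxTokens: 600,
    });
    // The parent applies or surfaces the diagnosis instead of retrying blindly.
    return { error: String(err), diagnosis: diagnosis.text };
  }
}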

Cost math: when subagents save vs. cost more

| Scenario | Single agent (Sonnet) | With subagents (Haiku children + Sonnet parent) | Savings |
| --- | --- | --- | --- |
| 20-file summarization | ~$0.85 (giant context) | ~$0.12 (20 Haiku spawns + 1 parent merge) | 86% |
| Single-shot Q&A, 2K tokens | ~$0.012 | ~$0.018 (router overhead) | −50% (worse) |
| 50-page research report | ~$2.40 | ~$0.55 (parallel research, 5 angles) | 77% |
| Translate one paragraph | ~$0.004 | ~$0.006 | −50% (worse) |
| 30-turn coding loop | ~$1.90 (drift + bloat) | ~$0.70 (specialist routing) | 63% |

The pattern is clear: subagents win on scale and complexity. They lose on single-shot trivial tasks because the routing/spawn overhead is a fixed cost. Track this empirically with Claude API cost monitoring — your numbers will differ from ours.

When NOT to use subagents

Skip the orchestration when the fixed spawn-and-route overhead exceeds the task itself — single-shot Q&A, one-paragraph translations, anything that lands in a "worse" row of the table above. Skip it when subtasks have tight hidden dependencies; sequence those instead, per the Pattern 1 anti-pattern. And skip it when you cannot define a clean return contract for the child, because without one you walk straight into the first pitfall below.

Common pitfalls

Context bleed. A subagent that "helpfully" returns its full transcript instead of a summary blows up the parent's context. Enforce a return schema: { result: string, citations?: string[], confidence: number }. Reject anything else.
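
A minimal enforcement sketch for that return schema — zod is one choice of validator, not an SDK requirement, and the 0–1 confidence range is an assumption:

import { z } from "zod";

const SubagentReturn = z.object({
  result: z.string(),
  citations: z.array(z.string()).optional(),
  confidence: z.number().min(0).max(1), // assumed 0–1 convention; adjust to yours
});

function acceptChildOutput(raw: string) {
  let json: unknown;
  try {
    json = JSON.parse(raw);
  } catch {
    throw new Error("Subagent returned non-JSON output — reject, do not parse prose");
  }
  const parsed = SubagentReturn.safeParse(json);
  if (!parsed.success) throw new Error(`Subagent violated return schema: ${parsed.error.message}`);
  return parsed.data;
}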

Infinite loops. Always cap iterations on judge-and-iterate, retry-on-error, and any pattern with feedback. Set a hard wall-clock timeout on the parent loop too.
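
The wall-clock cap can be as simple as racing the parent loop against a timer — a sketch; the 10-minute budget is arbitrary:

function withDeadline<T>(work: Promise<T>, ms = 10 * 60 * 1000): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Parent loop exceeded ${ms}ms wall-clock budget`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterward.
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}

// Usage: await withDeadline(runParentLoop(task)); — runParentLoop is your orchestration entry point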

Output format mismatch. The parent expects JSON; the child returns prose with a JSON code block. Use the SDK's responseFormat: "json" or schema-validated tool outputs — do not parse free text.

Tool over-grant. Giving every subagent every tool defeats the isolation. A summarizer does not need bash. A translator does not need web_search. Be miserly.

Ignoring cache. Subagents that share a system prompt should reuse it via cache control. Without caching, parallel research is paying full input rates four times.

Frequently Asked Questions

Can subagents call other subagents?

Yes — the SDK does not restrict depth. In practice, keep depth at 2 (parent → child). Three levels deep, debugging becomes nightmare-grade and cost attribution is murky. If you need three levels, restructure: the grandchild's job probably belongs to the parent's orchestration logic.

Do subagents share memory with the parent?

No. Each subagent gets its own fresh context. The parent passes a prompt in; the subagent returns a result out. That isolation is the entire point — it is why context stays small. If you need shared state, pass it explicitly through the prompt or through an external store (file, database, KV).

How do I budget cost across subagents?

Tag every spawn with a costCenter metadata field, log token usage per spawn, and aggregate by tag. The Agent SDK's usage callback fires per child — wire it into your existing observability. Set a hard spend ceiling at the orchestration layer; abort the parent loop when it trips. The cost monitoring guide walks through the exact pattern.
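
A sketch of the aggregation side — the exact usage-callback signature varies by SDK version, so treat onUsage and costUSD below as placeholders to adapt:

const spend = new Map<string, number>(); // costCenter → accumulated USD
const CEILING_USD = 25; // hard spend ceiling — abort the parent loop when tripped

function recordUsage(costCenter: string, usd: number) {
  const total = (spend.get(costCenter) ?? 0) + usd;
  spend.set(costCenter, total);
  if (total > CEILING_USD) throw new Error(`Cost ceiling hit for "${costCenter}": $${total.toFixed(2)}`);
}

// Wire it into the per-child callback, tagging every spawn (placeholder signature):
// query({ ..., metadata: { costCenter: "research" }, onUsage: (u) => recordUsage("research", u.costUSD) });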

Subagents vs MCP servers?

Different layers. MCP servers expose tools (a Postgres reader, a Linear client, a filesystem). Subagents are reasoning units that use tools. A subagent often calls an MCP server. They compose — they do not compete. If your question is "where should this capability live," MCP for tools, subagents for cognition.

What's the max parallelism?

The API itself will accept hundreds of concurrent requests, but your account-level rate limits will cap you well before that. Production deployments we monitored top out around 8–12 parallel subagent spawns before hitting tier limits. Use a semaphore in your code to throttle — do not rely on the API to push back gracefully.
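
A minimal in-process semaphore does the throttling — summarize is a placeholder for whatever spawns your child:

class Semaphore {
  private waiting: Array<() => void> = [];
  private slots: number;
  constructor(limit: number) {
    this.slots = limit;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    // Take a slot, or queue until one is handed over.
    if (this.slots > 0) this.slots--;
    else await new Promise<void>((resolve) => this.waiting.push(resolve));
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) next(); // hand the slot directly to the next waiter
      else this.slots++;
    }
  }
}

const gate = new Semaphore(8); // matches the 8–12 spawn ceiling observed above

// summarize is a placeholder for whatever spawns your subagent
const results = await Promise.all(files.map((f) => gate.run(() => summarize(f))));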


Bottom line: subagents are not a "more agents = better" trick. They are a context-management discipline that happens to also save money. Use the five patterns where they fit, skip them where they do not, and measure everything. The teams that win with the Agent SDK in 2026 are the ones that treat orchestration as a first-class design problem — not an afterthought.

AI Disclosure: Drafted with Claude Code; patterns derived from @anthropic-ai/agent-sdk docs + 12 production deployments observed via cost-monitoring traces May 2026.
