
Building a Content Generation Agent with Quality Checks

How to build a production Claude content generation agent with multi-stage quality checks: a draft, review, fact-check, and publish pipeline built with the Claude Agent SDK.


A content generation agent without quality checks produces high volume at inconsistent quality — the opposite of what content teams need. A production content agent runs output through a multi-stage pipeline: draft generation, self-review, structure validation, fact extraction, and a human approval gate before anything reaches your CMS. This guide builds that full pipeline with the Claude Agent SDK, including the quality check patterns that prevent bad content from shipping.


The Architecture

A naive content agent writes and publishes in one step. A production agent has stages with validation gates between them.

Input spec → Draft generation → Self-review → Structure validation → Fact extraction → [Human gate] → CMS publish

Each stage can reject and retry the previous stage's output. Human gates are optional — use them for high-stakes content, skip them for lower-stakes bulk generation.
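
That gate-and-retry loop is easy to express as a small helper. Here is a minimal sketch, assuming each stage and each gate is just a callable — the names are illustrative, not SDK APIs:

from typing import Callable


def run_gated_stage(
    stage: Callable[[str], str],
    gate: Callable[[str], bool],
    stage_input: str,
    max_retries: int = 2,
) -> tuple[str, bool]:
    """Run a stage, re-running it when its output fails the gate."""
    output = stage(stage_input)
    for _ in range(max_retries):
        if gate(output):
            return output, True
        output = stage(stage_input)  # regenerate and re-check
    return output, gate(output)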


Stage 1: Draft Generation

import anthropic
import json
import re
from dataclasses import dataclass
from typing import Optional

client = anthropic.Anthropic()


@dataclass
class ContentSpec:
    title: str
    primary_keyword: str
    target_length: int  # words
    audience: str  # "junior developers", "CTOs", etc.
    tone: str  # "technical", "conversational", "authoritative"
    required_sections: list[str]
    cta_product: Optional[str] = None


@dataclass
class DraftResult:
    content: str
    word_count: int
    sections_found: list[str]
    metadata: dict


def generate_draft(spec: ContentSpec) -> DraftResult:
    """Stage 1: Generate initial draft from spec."""

    system_prompt = f"""You are a technical content writer specializing in developer tools.

Writing constraints:
- Target length: {spec.target_length} words (±10%)
- Audience: {spec.audience}
- Tone: {spec.tone}
- Primary keyword: "{spec.primary_keyword}" — use naturally in title, first paragraph, and 2-3 subheadings
- Required sections: {', '.join(spec.required_sections)}

Quality standards:
- First paragraph (40-60 words) must directly answer the primary question
- Every major claim needs a specific example or number
- Code examples must be complete and runnable
- No filler phrases ("In today's world", "As we know")
- End with a FAQ section (5 questions minimum)"""

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=8000,
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": f"Write an article titled: '{spec.title}'\n\nPrimary keyword to target: {spec.primary_keyword}"
        }]
    )

    content = response.content[0].text
    word_count = len(content.split())

    # Extract the section headings found in the draft
    sections_found = re.findall(r'^#{1,3} (.+)$', content, re.MULTILINE)

    return DraftResult(
        content=content,
        word_count=word_count,
        sections_found=sections_found,
        metadata={
            "model": "claude-sonnet-4-5",
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
        }
    )
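
A quick standalone call to the draft stage looks like this (the spec values are placeholders):

spec = ContentSpec(
    title="How to Paginate a REST API",
    primary_keyword="rest api pagination",
    target_length=1500,
    audience="junior developers",
    tone="technical",
    required_sections=["Quick Answer", "Implementation", "FAQ"],
)

draft = generate_draft(spec)
print(f"{draft.word_count} words, sections: {draft.sections_found}")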

Stage 2: Self-Review

Have Claude critique its own draft. This is surprisingly effective — a second pass with a different system prompt catches issues the first pass misses.

@dataclass
class ReviewResult:
    passed: bool
    score: int  # 0-100
    issues: list[str]
    suggestions: list[str]
    revised_content: Optional[str]


def self_review(draft: DraftResult, spec: ContentSpec) -> ReviewResult:
    """Stage 2: Claude reviews its own draft for quality issues."""

    review_schema = {
        "type": "object",
        "properties": {
            "score": {"type": "integer", "minimum": 0, "maximum": 100},
            "passed": {"type": "boolean"},
            "issues": {"type": "array", "items": {"type": "string"}},
            "suggestions": {"type": "array", "items": {"type": "string"}},
            "needs_revision": {"type": "boolean"}
        },
        "required": ["score", "passed", "issues", "suggestions", "needs_revision"]
    }

    # Truncate long drafts so the review prompt stays token-efficient
    draft_excerpt = draft.content[:6000]

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2000,
        system="""You are a senior editor reviewing technical content. Be critical and specific.
        
Review criteria:
1. First paragraph answers the question directly (not vague)
2. Every claim has a specific number, example, or reference
3. Code examples are complete and correct
4. Required sections are all present
5. No filler, no fluff, no generic statements
6. FAQ section has 5+ substantive questions
7. Word count within 10% of target

Respond ONLY with valid JSON matching the provided schema. No prose before or after.""",
        messages=[
            {
                "role": "user",
                "content": f"""Review this article draft. Target: {spec.target_length} words, required sections: {spec.required_sections}

DRAFT:
{draft_excerpt}

Respond with JSON:
{json.dumps(review_schema, indent=2)}"""
            }
        ]
    )

    try:
        review_data = json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # Fallback: extract JSON from response
        import re
        json_match = re.search(r'\{.*\}', response.content[0].text, re.DOTALL)
        if json_match:
            review_data = json.loads(json_match.group())
        else:
            # Review failed — treat as passed to avoid blocking
            return ReviewResult(passed=True, score=70, issues=[], suggestions=[], revised_content=None)

    revised_content = None
    if review_data.get("needs_revision") and review_data.get("issues"):
        revised_content = apply_revision(draft.content, review_data["issues"], spec)

    return ReviewResult(
        passed=review_data["passed"],
        score=review_data["score"],
        issues=review_data["issues"],
        suggestions=review_data["suggestions"],
        revised_content=revised_content
    )


def apply_revision(content: str, issues: list[str], spec: ContentSpec) -> str:
    """Apply corrections based on review issues."""
    issues_text = "\n".join(f"- {issue}" for issue in issues)

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=8000,
        system="You are a technical editor. Fix the specific issues listed. Return the full revised article.",
        messages=[{
            "role": "user",
            "content": f"""Fix these specific issues in the article:

ISSUES TO FIX:
{issues_text}

ARTICLE:
{content}

Return the complete revised article, fixing only the listed issues."""
        }]
    )

    return response.content[0].text

Stage 3: Structure Validation

Enforce structural requirements programmatically — don't rely on the LLM to count sections correctly.

import re
from dataclasses import dataclass


@dataclass
class ValidationResult:
    passed: bool
    failures: list[str]
    warnings: list[str]
    metrics: dict


def validate_structure(content: str, spec: ContentSpec) -> ValidationResult:
    """Stage 3: Programmatic structure validation."""
    failures = []
    warnings = []

    # Word count check
    word_count = len(content.split())
    target_min = int(spec.target_length * 0.9)
    target_max = int(spec.target_length * 1.1)

    if word_count < target_min:
        failures.append(f"Too short: {word_count} words (target: {spec.target_length}±10%)")
    elif word_count > target_max:
        warnings.append(f"Slightly long: {word_count} words (target: {spec.target_length}±10%)")

    # Required sections check
    content_lower = content.lower()
    for section in spec.required_sections:
        if section.lower() not in content_lower:
            failures.append(f"Missing required section: '{section}'")

    # FAQ check
    faq_questions = re.findall(r'^\*\*[^*]+\?\*\*|^#{1,3}.*\?', content, re.MULTILINE)
    if len(faq_questions) < 5:
        warnings.append(f"Only {len(faq_questions)} FAQ questions found (recommend 5+)")

    # Keyword check
    keyword_count = content_lower.count(spec.primary_keyword.lower())
    if keyword_count < 2:
        warnings.append(f"Primary keyword '{spec.primary_keyword}' appears only {keyword_count} times")
    elif keyword_count > 10:
        warnings.append(f"Keyword stuffing risk: '{spec.primary_keyword}' appears {keyword_count} times")

    # Code block check
    code_blocks = re.findall(r'```', content)
    if len(code_blocks) < 2:  # At least one code block (open + close = 2 ``` markers)
        warnings.append("No code examples found — consider adding concrete examples")

    # First paragraph length check
    paragraphs = content.split('\n\n')
    non_heading_paras = [p for p in paragraphs if not p.startswith('#')]
    if non_heading_paras:
        first_para_words = len(non_heading_paras[0].split())
        if first_para_words > 100:
            warnings.append(f"First paragraph too long ({first_para_words} words) — aim for 40-60")
        elif first_para_words < 20:
            warnings.append(f"First paragraph too short ({first_para_words} words)")

    return ValidationResult(
        passed=len(failures) == 0,
        failures=failures,
        warnings=warnings,
        metrics={
            "word_count": word_count,
            "sections_found": len(re.findall(r'^#{1,3} ', content, re.MULTILINE)),
            "faq_questions": len(faq_questions),
            "keyword_density": keyword_count / word_count * 100 if word_count > 0 else 0
        }
    )

Orchestrating the Full Pipeline

import time
from enum import Enum


class PipelineStatus(Enum):
    DRAFTING = "drafting"
    REVIEWING = "reviewing"
    VALIDATING = "validating"
    AWAITING_APPROVAL = "awaiting_approval"
    APPROVED = "approved"
    REJECTED = "rejected"
    FAILED = "failed"


def run_content_pipeline(
    spec: ContentSpec,
    require_human_approval: bool = True,
    max_revision_attempts: int = 2
) -> dict:
    """
    Full content generation pipeline with quality checks.
    Returns the final content and pipeline metadata.
    """
    pipeline_log = []
    start_time = time.time()

    def log(stage: str, status: str, details: dict | None = None):
        entry = {"stage": stage, "status": status, "elapsed_s": round(time.time() - start_time, 1)}
        if details:
            entry.update(details)
        pipeline_log.append(entry)
        print(f"[{entry['elapsed_s']}s] {stage}: {status}")

    # Stage 1: Draft
    log("draft", "started")
    draft = generate_draft(spec)
    log("draft", "completed", {"word_count": draft.word_count})

    # Stage 2: Self-review (with retry)
    final_content = draft.content

    for attempt in range(max_revision_attempts + 1):
        log("review", f"attempt {attempt + 1}")
        review = self_review(DraftResult(
            content=final_content,
            word_count=len(final_content.split()),
            sections_found=[],
            metadata={}
        ), spec)

        # Apply any revision immediately so a retry reviews the corrected
        # draft rather than re-scoring the same content
        if review.revised_content:
            final_content = review.revised_content
            log("review", "revision_applied")

        if review.passed or attempt == max_revision_attempts:
            log("review", "completed", {
                "score": review.score,
                "issues": len(review.issues),
                "passed": review.passed
            })
            break

    # Stage 3: Structure validation
    log("validation", "started")
    validation = validate_structure(final_content, spec)
    log("validation", "completed", {
        "passed": validation.passed,
        "failures": validation.failures,
        "warnings": validation.warnings
    })

    if not validation.passed:
        return {
            "status": PipelineStatus.FAILED,
            "reason": "Validation failed",
            "failures": validation.failures,
            "pipeline_log": pipeline_log
        }

    # Stage 4: Human approval gate (optional)
    if require_human_approval:
        log("human_gate", "awaiting_approval")
        print("\n" + "="*60)
        print(f"CONTENT READY FOR REVIEW: {spec.title}")
        print(f"Word count: {validation.metrics['word_count']}")
        print(f"Review score: {review.score}/100")
        if validation.warnings:
            print(f"Warnings: {validation.warnings}")
        print("\nFirst 500 chars:")
        print(final_content[:500])
        print("="*60)

        approval = input("\nApprove for publish? (y/n/edit): ").strip().lower()
        if approval != "y":
            return {
                "status": PipelineStatus.REJECTED,
                "reason": f"Human rejected: {approval}",
                "content": final_content,
                "pipeline_log": pipeline_log
            }

    return {
        "status": PipelineStatus.APPROVED,
        "content": final_content,
        "title": spec.title,
        "slug": spec.primary_keyword.replace('"', '').replace(' ', '-').lower(),
        "metrics": validation.metrics,
        "review_score": review.score,
        "pipeline_log": pipeline_log,
        "total_elapsed_s": round(time.time() - start_time, 1)
    }


# Usage
spec = ContentSpec(
    title="How to Build a REST API with FastAPI and Claude",
    primary_keyword="claude fastapi tutorial",
    target_length=2500,
    audience="intermediate Python developers",
    tone="technical",
    required_sections=["Quick Answer", "Setup", "Implementation", "FAQ"],
    cta_product="Agent SDK Cookbook"
)

result = run_content_pipeline(spec, require_human_approval=True)

if result["status"] == PipelineStatus.APPROVED:
    print(f"\nContent approved: {result['metrics']['word_count']} words, score {result['review_score']}/100")
    # Write to file, publish to CMS, etc.
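
What counts as publishing depends entirely on your CMS. As a minimal placeholder, writing the approved article to disk could look like this (publish_to_disk is my own helper, not part of the pipeline above):

from pathlib import Path


def publish_to_disk(result: dict, out_dir: str = "content") -> Path:
    """Write an approved article to a local markdown file named by its slug."""
    path = Path(out_dir) / f"{result['slug']}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"# {result['title']}\n\n{result['content']}", encoding="utf-8")
    return path


if result["status"] == PipelineStatus.APPROVED:
    print(f"Saved draft to {publish_to_disk(result)}")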

Frequently Asked Questions

What quality score should I require before human review? Require a minimum score of 70 before sending to human review. Scores below 70 indicate structural problems that warrant automatic revision. Set your threshold based on your content standards — for SEO-critical content, 80+ is appropriate.
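
As a concrete gate, routing on the pipeline's review score might look like this (the 70 floor mirrors the answer above; raise it to 80 for SEO-critical content):

def route_result(result: dict, score_floor: int = 70) -> str:
    """Send strong drafts on to human review and weak ones back for revision."""
    score = result.get("review_score", 0)
    return "human_review" if score >= score_floor else "auto_revise"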

How do I prevent the agent from self-approving bad content? Two controls: (1) The self-review uses a different system prompt focused on finding problems, not finding reasons to pass. (2) The programmatic validation stage catches structural issues regardless of LLM review score. LLMs are optimistic about their own output — the validation layer must be rule-based.

Should I use the same model for drafting and review? Sonnet for both is effective — the key is the different system prompt and role framing. For highest quality, you can use Opus for the review stage to catch subtle issues Sonnet misses.
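
If you do split models, a small selector keeps the choice explicit (the model IDs are assumptions — substitute whichever Sonnet and Opus versions you actually run):

def pick_review_model(high_stakes: bool) -> str:
    """Use a stronger model for the review pass on high-stakes content."""
    return "claude-opus-4-1" if high_stakes else "claude-sonnet-4-5"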

How much does this pipeline cost per article? At Sonnet pricing: draft generation (~4k tokens out) ≈ $0.06, self-review (~1k out) ≈ $0.02, revision if needed (~4k out) ≈ $0.06. Total: $0.10-0.15 per article. For bulk generation of 100 articles/month, under $15 in API costs.
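
Those figures are just output-token arithmetic. A rough estimator, assuming roughly $15 per million output tokens for Sonnet (check current pricing before relying on it):

def estimate_article_cost(
    draft_out: int = 4000,
    review_out: int = 1000,
    revision_out: int = 4000,
    revised: bool = False,
    usd_per_mtok_out: float = 15.0,  # assumed Sonnet output rate
) -> float:
    """Approximate per-article API cost; input tokens are comparatively negligible."""
    tokens = draft_out + review_out + (revision_out if revised else 0)
    return round(tokens / 1_000_000 * usd_per_mtok_out, 3)

# estimate_article_cost() ≈ $0.075; estimate_article_cost(revised=True) ≈ $0.135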

Can I skip the human approval gate for automated bulk content? Yes — set require_human_approval=False. Ensure your validation rules are comprehensive before fully automating. A good minimum: word count within range, all required sections present, review score ≥ 80.
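
A fully automated bulk run built on those rules might look like this (MIN_BULK_SCORE and the spec list are illustrative):

MIN_BULK_SCORE = 80  # minimum review score to publish with no human in the loop

def run_bulk(specs: list[ContentSpec]) -> list[dict]:
    """Run the pipeline over many specs without a human gate, keeping only strong results."""
    approved = []
    for spec in specs:
        result = run_content_pipeline(spec, require_human_approval=False)
        if result["status"] == PipelineStatus.APPROVED and result["review_score"] >= MIN_BULK_SCORE:
            approved.append(result)
    return approved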


Go Deeper

Agent SDK Cookbook — $49 — Full content pipeline implementation with CMS integrations (Contentful, Notion, Ghost), bulk generation orchestration, quality scoring dashboard, and cost tracking by content type.

→ Get the Agent SDK Cookbook — $49

30-day money-back guarantee. Instant download.

AI Disclosure: Written with Claude Code; patterns used in production content pipelines.
