Read OSS

Multi-Agent Orchestration — How Feature-Dev and Code-Review Coordinate AI Agents

Advanced

Prerequisites

  • Article 2: Plugin System Deep Dive (understanding agents and commands)
  • Familiarity with Claude model tiers (Haiku, Sonnet, Opus)

The component model from Part 2 gives us the building blocks. But the real power of Claude Code's plugin system emerges when commands orchestrate multiple agents — potentially at different model tiers — into coordinated workflows. Three distinct orchestration patterns appear across the official plugins, each solving a different problem.

Feature-dev implements a phased human-in-the-loop workflow where parallel exploration agents feed into a clarifying question gate before architecture and implementation phases. Code-review implements a multi-pass validation pipeline where initial findings are independently verified by fresh subagents. And ralph-wiggum implements a self-referential loop where a Stop hook intercepts agent exit to create iterative execution cycles.

These aren't theoretical patterns — they're production implementations that reveal how prompt engineering, model tier selection, and lifecycle hooks combine to create sophisticated AI workflows.

Feature-Dev: 7-Phase Human-in-the-Loop Workflow

The feature-dev plugin at plugins/feature-dev/commands/feature-dev.md implements a seven-phase workflow that alternates between AI agent work and human decision points:

flowchart TD
    P1["Phase 1: Discovery<br/>Understand the request"] --> P2
    P2["Phase 2: Codebase Exploration<br/>2-3 code-explorer agents ∥"] --> P3
    P3["Phase 3: Clarifying Questions<br/>🚫 CRITICAL HUMAN GATE"] --> P4
    P4["Phase 4: Architecture Design<br/>2-3 code-architect agents ∥"] --> P5
    P5["Phase 5: Implementation<br/>🚫 REQUIRES USER APPROVAL"] --> P6
    P6["Phase 6: Quality Review<br/>3 code-reviewer agents ∥"] --> P7
    P7["Phase 7: Summary<br/>Document outcomes"]

    style P3 fill:#E53935,color:#fff
    style P5 fill:#E53935,color:#fff
    style P2 fill:#FDD835
    style P4 fill:#4CAF50,color:#fff
    style P6 fill:#E53935,color:#fff

The design philosophy is explicit in the command file: "Ask clarifying questions" and "Understand before acting" are listed as core principles. Let's trace each phase.

Phase 1 (Discovery) is lightweight — create a todo list, confirm understanding. No agents launched.

Phase 2 (Codebase Exploration) launches 2–3 code-explorer agents in parallel, each targeting a different aspect of the codebase. The command specifies example prompts like "Find features similar to [feature]" and "Map the architecture and abstractions for [feature area]." After agents return, the orchestrating command reads all files they identified to build deep context. This is a breadth-first exploration pattern.

Phase 3 (Clarifying Questions) is marked CRITICAL — do not skip. The command explicitly states: "Present all questions to the user in a clear, organized list" and "Wait for answers before proceeding to architecture design." Even if the user says "whatever you think is best," the command instructs Claude to provide its recommendation and get explicit confirmation. This gate prevents the workflow from making expensive architecture and implementation decisions on false assumptions.

plugins/feature-dev/commands/feature-dev.md#L56-L69

Phase 4 (Architecture Design) launches 2–3 code-architect agents in parallel with different optimization targets: minimal changes, clean architecture, or pragmatic balance. The orchestrator then forms its own opinion and presents trade-offs to the user.

Phase 5 (Implementation) has the boldest instruction in the entire file: "DO NOT START WITHOUT USER APPROVAL" (line 89). This is the second human gate — no code is written until the user explicitly approves the chosen architecture.

Phase 6 (Quality Review) launches three code-reviewer agents with different focuses: simplicity/DRY, bugs/correctness, and project conventions. Results are consolidated and presented to the user for decision.

Agent Design: Model Tiers and Tool Constraints

The feature-dev plugin uses a single model tier (Sonnet) for all three agent types, but with carefully differentiated tool sets and instructions. This is a deliberate cost-optimization choice — Sonnet is fast and cheap for the read-heavy exploration and review work these agents perform.
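This differentiation lives in each agent's markdown frontmatter. As a hedged sketch (the field values below are illustrative, not copied from the plugin files), a read-only exploration agent pinned to Sonnet might look like:

```yaml
---
name: code-explorer
description: Maps features and abstractions across the codebase
model: sonnet          # pin the cheaper tier for read-heavy work
tools: Read, Grep, Glob  # no Write/Edit — exploration agents cannot modify code
---
```

Constraining tools is as important as picking the tier: an explorer that cannot write files cannot accidentally "help" by editing code mid-exploration.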

The code-review plugin uses a different strategy. Looking at the command file at plugins/code-review/commands/code-review.md#L14-L55, it explicitly specifies different model tiers for different roles:

| Step | Agent Type | Model | Why |
|------|-----------|-------|-----|
| Step 1 (gate check) | Single agent | Haiku | Fast, cheap — just checking if PR is closed/draft |
| Step 2 (find CLAUDE.md) | Single agent | Haiku | File listing, no reasoning needed |
| Step 3 (PR summary) | Single agent | Sonnet | Moderate reasoning for summarization |
| Step 4 (parallel review) | Agents 1-2 | Sonnet | CLAUDE.md compliance — pattern matching |
| Step 4 (parallel review) | Agents 3-4 | Opus | Bug detection — deep reasoning required |
| Step 5 (validation) | Per-issue subagents | Opus/Sonnet | Validating bugs (Opus) vs. CLAUDE.md (Sonnet) |

This is deliberate model-appropriate tiering. Haiku handles simple checks at minimal cost. Sonnet handles pattern matching and compliance. Opus handles the hard reasoning — finding subtle bugs in code diffs. The cost difference between model tiers makes this a meaningful architectural decision.

Tip: When designing multi-agent workflows, start by asking "what's the minimum model tier that can do this job well?" Use Haiku for gatekeeping, Sonnet for structured analysis, and Opus only where deep reasoning justifies the cost.

Code-Review: Multi-Pass Validation Pipeline

The code-review plugin implements a nine-step pipeline that's more interesting than a simple "fan-out, fan-in" pattern. Its key innovation is per-finding validation — each issue found in the review step is independently verified by a fresh subagent.

flowchart TD
    S1["Step 1: Gate Check<br/>(haiku)"] -->|"PR open & needs review"| S2
    S1 -->|"PR closed/draft/trivial"| STOP["Stop"]
    S2["Step 2: Find CLAUDE.md files<br/>(haiku)"] --> S3
    S3["Step 3: PR Summary<br/>(sonnet)"] --> S4

    S4["Step 4: Parallel Review"] --> S4A["Agent 1: CLAUDE.md<br/>(sonnet)"]
    S4 --> S4B["Agent 2: CLAUDE.md<br/>(sonnet)"]
    S4 --> S4C["Agent 3: Bug scan<br/>(opus)"]
    S4 --> S4D["Agent 4: Deep bugs<br/>(opus)"]

    S4A --> S5["Step 5: Per-Issue Validation<br/>(parallel subagents)"]
    S4B --> S5
    S4C --> S5
    S4D --> S5

    S5 --> S6["Step 6: Filter Unvalidated"]
    S6 --> S7["Step 7: Output Summary"]
    S7 -->|"--comment flag"| S8["Steps 8-9: Post Inline Comments"]

    style S1 fill:#78909C,color:#fff
    style S4C fill:#9C27B0,color:#fff
    style S4D fill:#9C27B0,color:#fff
    style S5 fill:#FF5722,color:#fff

Step 5 is where this pipeline differentiates itself. Rather than trusting the review agents' output directly, the command specifies: "For each issue found in the previous step by agents 3 and 4, launch parallel subagents to validate the issue." Each validation subagent receives just the PR title/description and the specific issue to check. This fresh-eyes approach catches false positives that the original review agent might have been biased toward.

The filtering criteria are explicitly anti-noise:

plugins/code-review/commands/code-review.md#L79-L86

Pre-existing issues, correct-looking bugs, pedantic nitpicks, linter-catchable issues, and general quality concerns are all listed as false positives to avoid. The command explicitly states: "False positives erode trust and waste reviewer time."

PR Review Toolkit: 6-Agent Specialization

The pr-review-toolkit at plugins/pr-review-toolkit/commands/review-pr.md takes a different approach: six specialized agents, each deeply focused on one dimension of code quality:

flowchart LR
    CMD["/review-pr"] --> SCOPE["Determine<br/>Review Scope"]
    SCOPE --> CA["comment-analyzer<br/>Comment accuracy"]
    SCOPE --> TA["pr-test-analyzer<br/>Test coverage"]
    SCOPE --> SFH["silent-failure-hunter<br/>Error handling"]
    SCOPE --> TDA["type-design-analyzer<br/>Type design"]
    SCOPE --> CR["code-reviewer<br/>General quality"]
    SCOPE --> CS["code-simplifier<br/>Simplification"]

    CA --> AGG["Aggregate<br/>Results"]
    TA --> AGG
    SFH --> AGG
    TDA --> AGG
    CR --> AGG
    CS --> AGG

    AGG --> OUTPUT["Categorized<br/>Output"]

The key difference from code-review: pr-review-toolkit runs agents sequentially by default (parallel on request), applies agents conditionally based on what files changed (type-design-analyzer only runs if types were added/modified), and groups findings into Critical/Important/Suggestions/Positive categories. It's a breadth-focused approach compared to code-review's depth-focused validation pipeline.
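The conditional application can be sketched as a small shell filter. This is an illustration, not the command's actual logic (which lives in its markdown instructions); the file list and the "types changed" heuristic here are hypothetical:

```shell
# Hypothetical sketch: choose review agents based on which files changed.
# In practice the list would come from `git diff --name-only`; it is
# hard-coded here for illustration.
changed="src/api/user.ts
src/types/account.d.ts
README.md"

# Agents that always run
agents="comment-analyzer pr-test-analyzer silent-failure-hunter code-reviewer code-simplifier"

# type-design-analyzer only joins the roster if type definitions changed
if printf '%s\n' "$changed" | grep -q '\.d\.ts$'; then
  agents="$agents type-design-analyzer"
fi

echo "$agents"
```

The same shape generalizes: gate each specialist on a cheap syntactic check before paying for its analysis.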

This gives teams two choices for PR review depending on their needs: code-review for high-signal bug detection with validation, or pr-review-toolkit for comprehensive multi-dimensional analysis.

The Ralph Wiggum Pattern: Self-Referential Agent Loops

The ralph-wiggum plugin implements something genuinely novel — a self-referential loop where Claude works on the same task repeatedly, seeing its previous work, until a completion condition is met.

The mechanism is a Stop hook. When Claude tries to exit, the hook at plugins/ralph-wiggum/hooks/stop-hook.sh intercepts the exit, reads the transcript, and re-injects the original prompt:

flowchart TD
    START["/ralph-loop PROMPT"] --> SETUP["Setup: Write state file<br/>.claude/ralph-loop.local.md"]
    SETUP --> WORK["Claude works on PROMPT"]
    WORK --> STOP["Claude tries to stop"]
    STOP --> HOOK["Stop hook fires"]
    HOOK --> CHECK{"State file<br/>exists?"}
    CHECK -->|No| EXIT["Allow exit"]
    CHECK -->|Yes| ITER{"Max iterations<br/>reached?"}
    ITER -->|Yes| EXIT
    ITER -->|No| PROMISE{"Completion<br/>promise met?"}
    PROMISE -->|Yes| EXIT
    PROMISE -->|No| REINJECT["Block stop<br/>Re-inject PROMPT"]
    REINJECT --> |"iteration + 1"| WORK

    style HOOK fill:#FF5722,color:#fff
    style REINJECT fill:#E53935,color:#fff

The state file (.claude/ralph-loop.local.md) uses YAML frontmatter to track loop state:

---
iteration: 3
max_iterations: 10
completion_promise: "All tests pass"
---

The actual prompt text goes here...
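A minimal sketch of how a hook script could parse this frontmatter and bump the counter — an illustration under assumed field names matching the example above, not the actual stop-hook.sh:

```shell
# Illustrative frontmatter parsing for the ralph-loop state file.
# A temp file stands in for .claude/ralph-loop.local.md here.
state_file=$(mktemp)
cat > "$state_file" <<'EOF'
---
iteration: 3
max_iterations: 10
completion_promise: "All tests pass"
---
The actual prompt text goes here...
EOF

# Grab the frontmatter block (everything between the --- markers)
frontmatter=$(sed -n '/^---$/,/^---$/p' "$state_file")

# Pull out the scalar fields
iteration=$(printf '%s\n' "$frontmatter" | sed -n 's/^iteration: *//p')
max_iterations=$(printf '%s\n' "$frontmatter" | sed -n 's/^max_iterations: *//p')

# Increment the counter and decide whether the loop may continue
iteration=$((iteration + 1))
if [ "$iteration" -ge "$max_iterations" ]; then
  echo "max iterations reached: allow exit"
else
  echo "iteration $iteration of $max_iterations: continue"
fi

rm -f "$state_file"
```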

The hook parses this frontmatter, increments the iteration counter, and checks the completion promise — a text string that Claude must output inside <promise> tags when the stated condition is genuinely true. The hook extracts promise text with a Perl regex at line 119 and uses literal string comparison (not glob matching) to verify:

plugins/ralph-wiggum/hooks/stop-hook.sh#L114-L128

The command file reinforces the integrity constraint: "You may ONLY output it when the statement is completely and unequivocally TRUE. Do not output false promises to escape the loop."

When blocking the stop, the hook outputs a JSON decision:

plugins/ralph-wiggum/hooks/stop-hook.sh#L167-L174

jq -n \
  --arg prompt "$PROMPT_TEXT" \
  --arg msg "$SYSTEM_MSG" \
  '{
    "decision": "block",
    "reason": $prompt,
    "systemMessage": $msg
  }'

The "reason" field contains the prompt that gets re-injected. The "systemMessage" provides iteration metadata: "🔄 Ralph iteration 4 | To stop: output <promise>All tests pass</promise>".

Tip: The ralph-wiggum pattern is powerful for tasks like "keep improving this until the tests pass" or "refactor this module until the code quality score is above 90." The completion promise mechanism ensures Claude can't game its way out of the loop.

Orchestration Patterns Compared

These three patterns represent fundamentally different approaches to multi-agent coordination:

| Dimension | Feature-Dev | Code-Review | Ralph Wiggum |
|-----------|-------------|-------------|--------------|
| Pattern | Phased human-in-the-loop | Validation pipeline | Self-referential loop |
| Human involvement | 2 mandatory gates (phases 3, 5) | Optional (--comment flag) | None after launch |
| Agent parallelism | Broad (2-3 per phase) | Narrow (4 reviewers + per-issue validators) | Sequential (one agent looping) |
| Model tiers | Single tier (Sonnet) | Mixed (Haiku→Sonnet→Opus) | Inherited from session |
| Termination | Natural (7 phases complete) | Natural (pipeline ends) | Promise-based or max iterations |
| Cost profile | Medium (Sonnet × parallel count) | High (Opus for bug detection) | Variable (depends on iterations) |
| Best for | Complex new features | PR review with high signal | Iterative improvement tasks |

The feature-dev pattern prioritizes human control — it's designed for complex decisions where bad assumptions compound. The code-review pattern prioritizes signal quality — the validation step exists specifically to filter noise. The ralph-wiggum pattern prioritizes autonomous iteration — it's for tasks where the completion criteria are clear but the path isn't.

flowchart LR
    subgraph "Human Control ←→ Autonomy"
        FD["Feature-Dev<br/>2 human gates"] --> CR["Code-Review<br/>Optional comment"]
        CR --> RW["Ralph Wiggum<br/>Fully autonomous"]
    end

    subgraph "Cost ←→ Quality"
        CHEAP["Haiku gates<br/>(cheap)"] --> MID["Sonnet analysis<br/>(moderate)"]
        MID --> EXPENSIVE["Opus reasoning<br/>(expensive)"]
    end

What's Next

We've seen how commands orchestrate agents into workflows. In Part 4, we'll go deep on the hook system — examining three real-world implementations that showcase the full spectrum of what's possible. The security-guidance plugin monitors nine security antipatterns with probabilistic state cleanup. Hookify implements a configurable rule engine with a custom YAML parser and LRU-cached regular expression compilation. And the explanatory-output-style plugin demonstrates that sometimes the most powerful hook is just 15 lines of shell script.