Why fan-out alone isn't enough
The naive version of multi-agent code review is simple: run three reviewers in parallel, collect their findings, return the union. This is Pattern 3 (Parallelisation) applied mechanically.
It's better than a single agent, but it misses something. A security reviewer focused on injection and auth will look at a function differently from an architecture reviewer looking at coupling and separation of concerns. Both might notice the same vulnerable pattern -- one will call it an injection risk, the other will call it missing validation. Without a cross-examination pass, you get duplicate findings at different severity levels and no way to know which assessment is right.
The reflection pass solves this. Each reviewer sees the other two's findings and is asked: do any of these change your assessment? Are there findings you missed? Would you upgrade or downgrade any of your own findings in light of what your colleagues found?
The five-step pipeline
Step 1: Parse input (PR URL or raw diff)
Step 2: Fan-out (parallel)
Security reviewer (Sonnet) -- injection, auth, secrets, crypto
Architecture reviewer (Sonnet) -- SOLID, coupling, error handling, API design
Style reviewer (Haiku) -- naming, docs, complexity, dead code
Step 3: Collect findings
Graceful degradation: if one reviewer fails, the other two continue
Step 4: Reflection pass (parallel)
Each reviewer reads the other two's findings
Returns: list of amendments (upgrade / downgrade / retract / add)
Step 5: Synthesis (Sonnet)
Deduplicate overlapping findings
Apply amendments
Order by severity
Produce unified verdict: approve / request_changes / needs_discussion
Steps 2 and 4 use ThreadPoolExecutor with max_workers=3.
Step 3 is a simple collection pass with error checking. Step 5 is sequential --
the Synthesiser needs all amended findings before it can deduplicate.
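The fan-out in steps 2 and 4 can be sketched as follows. This is a minimal illustration, not the actual orchestrator: `run_reviewer` is a hypothetical stand-in for the per-reviewer LLM call, and the result shape mirrors the `ReviewResult` described below.

```python
from concurrent.futures import ThreadPoolExecutor

def run_reviewer(name: str, diff: str) -> dict:
    # Hypothetical stand-in: the real version wraps an LLM call for one reviewer.
    return {"reviewer": name, "findings": [], "error": None}

def fan_out(diff: str, reviewers=("security", "architecture", "style")) -> list[dict]:
    # Steps 2 and 4: one worker per reviewer, all three running concurrently.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {pool.submit(run_reviewer, name, diff): name for name in reviewers}
        results = []
        for future, name in futures.items():
            try:
                results.append(future.result(timeout=120))
            except Exception as exc:
                # Pattern 12: a failed reviewer becomes an error result,
                # not a pipeline failure.
                results.append({"reviewer": name, "findings": [], "error": str(exc)})
    return results
```

The same executor shape serves both parallel steps; only the callable changes between the first-pass review and the reflection call.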
Graceful degradation
One of the patterns the spec requires is graceful degradation if a reviewer fails (Pattern 12: Exception Handling). In practice this means:
- If the security reviewer times out or raises, its ReviewResult has error set and empty findings.
- The orchestrator checks how many reviewers succeeded. If zero, return an error. If one or two, continue with the reflection pass using the available findings.
- The reflection pass itself handles the case where only one reviewer's findings are available -- it skips the cross-examination for that reviewer and logs a warning.
- The frontend shows a warning if any reviewer failed, but still renders the successful reviewers' output.
This is a real design choice. The alternative is to fail the entire pipeline if any single reviewer fails. That's simpler but it means a Sonnet timeout on the security reviewer throws away perfectly good architecture and style findings. The graceful degradation approach delivers partial value instead of nothing.
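The collection step described above reduces to a small filter. A sketch, assuming the result dicts carry an `error` field as described (the function name is illustrative):

```python
def collect(results: list[dict]) -> list[dict]:
    """Step 3: keep successful reviewers; fail only if every reviewer failed."""
    ok = [r for r in results if r.get("error") is None]
    if not ok:
        # Zero survivors is the only case that aborts the pipeline.
        raise RuntimeError("all reviewers failed; nothing to synthesise")
    return ok
```

One or two survivors flow on to the reflection pass; the hard failure is reserved for the all-reviewers-down case.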
The finding schema
All three reviewers are given the same output schema. Structured output is Pattern 18 (Guardrails) -- the schema is the guardrail that ensures findings are comparable:
{
"category": "security | architecture | style",
"severity": "critical | high | medium | low | info",
"file": "app/auth.py",
"line": 13,
"title": "SQL injection in login function",
"description": "User input concatenated directly into SQL query.",
"suggestion": "Use parameterised queries via %s placeholders.",
"confidence": 0.95
}
The confidence field is the most useful part. A security reviewer with
0.65 confidence on a potential SSRF should be treated differently from one with 0.95
confidence on a clear SQL injection. The Synthesiser uses confidence to decide whether
borderline findings make the unified review, and the frontend can filter below a
threshold if needed.
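The schema and the confidence filter together might look like this in Python. The dataclass and threshold value are illustrative, not the project's actual types:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    category: str      # "security" | "architecture" | "style"
    severity: str      # "critical" | "high" | "medium" | "low" | "info"
    file: str
    line: int
    title: str
    description: str
    suggestion: str
    confidence: float  # 0.0-1.0; the Synthesiser and frontend filter on this

def filter_by_confidence(findings: list[Finding], threshold: float = 0.7) -> list[Finding]:
    # Borderline findings below the threshold stay out of the unified review.
    return [f for f in findings if f.confidence >= threshold]
```

With a 0.7 threshold, the 0.95-confidence SQL injection survives and the 0.65-confidence SSRF is held back for human judgement.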
What the reflection prompt looks like
Each reviewer in the reflection pass gets a prompt structured like this:
YOUR FINDINGS:
[...their original findings as JSON...]
COLLEAGUE 1 (security reviewer):
[...security findings...]
COLLEAGUE 2 (style reviewer):
[...style findings...]
Review these cross-findings. Respond with a JSON array of AMENDMENTS.
An amendment has: action (add|upgrade|downgrade|retract),
original_title, updated_severity, reason, new_finding.
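A prompt with that structure can be assembled mechanically from the structured findings. A sketch, with an illustrative helper name (the real prompt builder may differ):

```python
import json

def reflection_prompt(own: list[dict], colleagues: dict[str, list[dict]]) -> str:
    # Assemble the cross-examination prompt: own findings first,
    # then each colleague's findings, then the amendment instruction.
    parts = ["YOUR FINDINGS:", json.dumps(own, indent=2)]
    for i, (name, findings) in enumerate(colleagues.items(), start=1):
        parts.append(f"COLLEAGUE {i} ({name} reviewer):")
        parts.append(json.dumps(findings, indent=2))
    parts.append(
        "Review these cross-findings. Respond with a JSON array of AMENDMENTS.\n"
        "An amendment has: action (add|upgrade|downgrade|retract), "
        "original_title, updated_severity, reason, new_finding."
    )
    return "\n\n".join(parts)
```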
The amendment types are worth explaining:
- upgrade -- raise the severity of an existing finding based on colleague input
- downgrade -- lower the severity (the colleague's context suggests it's less serious)
- retract -- remove the finding (colleague showed it's a false positive)
- add -- add a new finding that the reviewer missed in the first pass
The Synthesiser sees both the original findings and the amendments. It applies them before deduplication. A finding upgraded by two different reviewers after reflection is a strong signal that the Synthesiser should include it prominently.
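Applying amendments before deduplication can be sketched like this. Matching findings by title is a simplification for illustration; the real Synthesiser may match more robustly:

```python
SEVERITY_ORDER = ["info", "low", "medium", "high", "critical"]

def apply_amendments(findings: list[dict], amendments: list[dict]) -> list[dict]:
    # Index findings by title (illustrative matching strategy).
    by_title = {f["title"]: f for f in findings}
    for a in amendments:
        action = a["action"]
        if action == "add":
            # A finding the reviewer missed in the first pass.
            by_title[a["new_finding"]["title"]] = a["new_finding"]
        elif action == "retract":
            # Colleague showed it's a false positive.
            by_title.pop(a["original_title"], None)
        elif action in ("upgrade", "downgrade"):
            if a["original_title"] in by_title:
                by_title[a["original_title"]]["severity"] = a["updated_severity"]
    # Step 5 ordering: most severe first.
    return sorted(by_title.values(),
                  key=lambda f: SEVERITY_ORDER.index(f["severity"]), reverse=True)
```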
Model routing and cost
The cost model is explicit in the config:
- Security reviewer: Sonnet (~$0.02 per review at 2K tokens in)
- Architecture reviewer: Sonnet (~$0.02)
- Style reviewer: Haiku (~$0.001 -- 20x cheaper)
- Reflection pass: 3 x Sonnet calls (~$0.03 combined)
- Synthesis: Sonnet (~$0.02 at 4K tokens in)
- Total: ~$0.09 per review
Style findings are low-stakes. A misnamed variable is not worth $0.02. Haiku is fast and cheap and produces perfectly adequate style analysis. Security and architecture findings can block a deployment or introduce a production incident. Sonnet's extra capacity is worth the cost for those categories.
At 20 reviews per day (the global daily limit), total LLM spend is about $1.80/day. The SSM kill switch at 20/day ensures this doesn't drift.
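The arithmetic above fits in a few lines. The figures are the approximate per-review costs from the config; the dict keys and function name are illustrative:

```python
# Approximate per-review costs in USD, matching the breakdown above.
COST_PER_REVIEW = {
    "security_reviewer": 0.02,      # Sonnet, ~2K tokens in
    "architecture_reviewer": 0.02,  # Sonnet
    "style_reviewer": 0.001,        # Haiku, ~20x cheaper
    "reflection_pass": 0.03,        # 3 Sonnet calls combined
    "synthesis": 0.02,              # Sonnet, ~4K tokens in
}

def daily_spend(reviews_per_day: int = 20) -> float:
    per_review = sum(COST_PER_REVIEW.values())  # ~$0.09
    return round(reviews_per_day * per_review, 2)
```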
What the patterns are actually doing
This demo implements seven of the 21 Gulli patterns:
- Parallelisation (3) -- three reviewers in parallel, then three reflection calls in parallel
- Reflection (4) -- each reviewer critiques their own output in light of others'
- Multi-agent coordination (7) -- three specialist reviewers with no overlap in focus areas, then a coordinator Synthesiser
- Exception handling (12) -- graceful degradation when a reviewer fails
- Inter-agent communication (15) -- findings passed between agents in structured JSON
- Resource-aware (16) -- Haiku for style, Sonnet for security and architecture
- Guardrails (18) -- structured output schema enforces finding format across all agents