Why fan-out alone isn't enough
The naive version of multi-agent code review is simple: run three reviewers in parallel, collect their findings, return the union. This is Pattern 3 (Parallelisation) applied mechanically.
It's better than a single agent, but it misses something. A security reviewer focused on injection and auth will look at a function differently from an architecture reviewer looking at coupling and separation of concerns. Both might notice the same vulnerable pattern -- one will call it an injection risk, the other will call it missing validation. Without a cross-examination pass, you get duplicate findings at different severity levels and no way to know which assessment is right.
The reflection pass solves this. Each reviewer sees the other two's findings and is asked: do any of these change your assessment? Are there findings you missed? Would you upgrade or downgrade any of your own findings in light of what your colleagues found?
The five-step pipeline
Step 1: Parse input (PR URL or raw diff)
Step 2: Fan-out (parallel)
Security reviewer (Sonnet) -- injection, auth, secrets, crypto
Architecture reviewer (Sonnet) -- SOLID, coupling, error handling, API design
Style reviewer (Haiku) -- naming, docs, complexity, dead code
Step 3: Collect findings
Graceful degradation: if one reviewer fails, the other two continue
Step 4: Reflection pass (parallel)
Each reviewer reads the other two's findings
Returns: list of amendments (upgrade / downgrade / retract / add)
Step 5: Synthesis (Sonnet)
Deduplicate overlapping findings
Apply amendments
Order by severity
Produce unified verdict: approve / request_changes / needs_discussion
Steps 2 and 4 use ThreadPoolExecutor with max_workers=3.
Step 3 is a simple collection pass with error checking. Step 5 is sequential --
the Synthesiser needs all amended findings before it can deduplicate.
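The fan-out in steps 2 and 4 can be sketched as follows. This is a minimal illustration, not the actual orchestrator: `run_reviewer` is a hypothetical stand-in for the per-reviewer LLM call, and the result shape mirrors the `ReviewResult` described below.

```python
from concurrent.futures import ThreadPoolExecutor

def run_reviewer(name: str, diff: str) -> dict:
    # Hypothetical stand-in: the real version wraps an LLM call for one reviewer.
    return {"reviewer": name, "findings": [], "error": None}

def fan_out(diff: str, reviewers=("security", "architecture", "style")) -> list[dict]:
    # Steps 2 and 4: one worker per reviewer, all three running concurrently.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {pool.submit(run_reviewer, name, diff): name for name in reviewers}
        results = []
        for future, name in futures.items():
            try:
                results.append(future.result(timeout=120))
            except Exception as exc:
                # Pattern 12: a failed reviewer becomes an error result,
                # not a pipeline failure.
                results.append({"reviewer": name, "findings": [], "error": str(exc)})
    return results
```

The same executor shape serves both parallel steps; only the callable changes between the first-pass review and the reflection call.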
Graceful degradation
One of the patterns the spec requires is graceful degradation if a reviewer fails (Pattern 12: Exception Handling). In practice this means:
- If the security reviewer times out or raises, its ReviewResult has error set and empty findings.
- The orchestrator checks how many reviewers succeeded. If zero, return an error. If one or two, continue with the reflection pass using the available findings.
- The reflection pass itself handles the case where only one reviewer's findings are available -- it skips the cross-examination for that reviewer and logs a warning.
- The frontend shows a warning if any reviewer failed, but still renders the successful reviewers' output.
This is a real design choice. The alternative is to fail the entire pipeline if any single reviewer fails. That's simpler but it means a Sonnet timeout on the security reviewer throws away perfectly good architecture and style findings. The graceful degradation approach delivers partial value instead of nothing.
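The collection step described above reduces to a small filter. A sketch, assuming the result dicts carry an `error` field as described (the function name is illustrative):

```python
def collect(results: list[dict]) -> list[dict]:
    """Step 3: keep successful reviewers; fail only if every reviewer failed."""
    ok = [r for r in results if r.get("error") is None]
    if not ok:
        # Zero survivors is the only case that aborts the pipeline.
        raise RuntimeError("all reviewers failed; nothing to synthesise")
    return ok
```

One or two survivors flow on to the reflection pass; the hard failure is reserved for the all-reviewers-down case.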
The finding schema
All three reviewers are given the same output schema. Structured output is Pattern 18 (Guardrails) -- the schema is the guardrail that ensures findings are comparable:
{
"category": "security | architecture | style",
"severity": "critical | high | medium | low | info",
"file": "app/auth.py",
"line": 13,
"title": "SQL injection in login function",
"description": "User input concatenated directly into SQL query.",
"suggestion": "Use parameterised queries via %s placeholders.",
"confidence": 0.95
}
The confidence field is the most useful part. A security reviewer with
0.65 confidence on a potential SSRF should be treated differently from one with 0.95
confidence on a clear SQL injection. The Synthesiser uses confidence to decide whether
borderline findings make the unified review, and the frontend can filter below a
threshold if needed.
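The schema and the confidence filter together might look like this in Python. The dataclass and threshold value are illustrative, not the project's actual types:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    category: str      # "security" | "architecture" | "style"
    severity: str      # "critical" | "high" | "medium" | "low" | "info"
    file: str
    line: int
    title: str
    description: str
    suggestion: str
    confidence: float  # 0.0-1.0; the Synthesiser and frontend filter on this

def filter_by_confidence(findings: list[Finding], threshold: float = 0.7) -> list[Finding]:
    # Borderline findings below the threshold stay out of the unified review.
    return [f for f in findings if f.confidence >= threshold]
```

With a 0.7 threshold, the 0.95-confidence SQL injection survives and the 0.65-confidence SSRF is held back for human judgement.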
What the reflection prompt looks like
Each reviewer in the reflection pass gets a prompt structured like this:
YOUR FINDINGS:
[...their original findings as JSON...]
COLLEAGUE 1 (security reviewer):
[...security findings...]
COLLEAGUE 2 (style reviewer):
[...style findings...]
Review these cross-findings. Respond with a JSON array of AMENDMENTS.
An amendment has: action (add|upgrade|downgrade|retract),
original_title, updated_severity, reason, new_finding.
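A prompt with that structure can be assembled mechanically from the structured findings. A sketch, with an illustrative helper name (the real prompt builder may differ):

```python
import json

def reflection_prompt(own: list[dict], colleagues: dict[str, list[dict]]) -> str:
    # Assemble the cross-examination prompt: own findings first,
    # then each colleague's findings, then the amendment instruction.
    parts = ["YOUR FINDINGS:", json.dumps(own, indent=2)]
    for i, (name, findings) in enumerate(colleagues.items(), start=1):
        parts.append(f"COLLEAGUE {i} ({name} reviewer):")
        parts.append(json.dumps(findings, indent=2))
    parts.append(
        "Review these cross-findings. Respond with a JSON array of AMENDMENTS.\n"
        "An amendment has: action (add|upgrade|downgrade|retract), "
        "original_title, updated_severity, reason, new_finding."
    )
    return "\n\n".join(parts)
```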
The amendment types are worth explaining:
- upgrade -- raise the severity of an existing finding based on colleague input
- downgrade -- lower the severity (the colleague's context suggests it's less serious)
- retract -- remove the finding (colleague showed it's a false positive)
- add -- add a new finding that the reviewer missed in the first pass
The Synthesiser sees both the original findings and the amendments. It applies them before deduplication. A finding upgraded by two different reviewers after reflection is a strong signal that the Synthesiser should include it prominently.
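Applying amendments before deduplication can be sketched like this. Matching findings by title is a simplification for illustration; the real Synthesiser may match more robustly:

```python
SEVERITY_ORDER = ["info", "low", "medium", "high", "critical"]

def apply_amendments(findings: list[dict], amendments: list[dict]) -> list[dict]:
    # Index findings by title (illustrative matching strategy).
    by_title = {f["title"]: f for f in findings}
    for a in amendments:
        action = a["action"]
        if action == "add":
            # A finding the reviewer missed in the first pass.
            by_title[a["new_finding"]["title"]] = a["new_finding"]
        elif action == "retract":
            # Colleague showed it's a false positive.
            by_title.pop(a["original_title"], None)
        elif action in ("upgrade", "downgrade"):
            if a["original_title"] in by_title:
                by_title[a["original_title"]]["severity"] = a["updated_severity"]
    # Step 5 ordering: most severe first.
    return sorted(by_title.values(),
                  key=lambda f: SEVERITY_ORDER.index(f["severity"]), reverse=True)
```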
Model routing and cost
The cost model is explicit in the config:
- Security reviewer: Sonnet (~$0.02 per review at 2K tokens in)
- Architecture reviewer: Sonnet (~$0.02)
- Style reviewer: Haiku (~$0.001 -- 20x cheaper)
- Reflection pass: 3 x Sonnet calls (~$0.03 combined)
- Synthesis: Sonnet (~$0.02 at 4K tokens in)
- Total: ~$0.09 per review
Style findings are low-stakes. A misnamed variable is not worth $0.02. Haiku is fast and cheap and produces perfectly adequate style analysis. Security and architecture findings can block a deployment or introduce a production incident. Sonnet's extra capacity is worth the cost for those categories.
At 20 reviews per day (the global daily limit), total LLM spend is about $1.80/day. The SSM kill switch at 20/day ensures this doesn't drift.
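The arithmetic above fits in a few lines. The figures are the approximate per-review costs from the config; the dict keys and function name are illustrative:

```python
# Approximate per-review costs in USD, matching the breakdown above.
COST_PER_REVIEW = {
    "security_reviewer": 0.02,      # Sonnet, ~2K tokens in
    "architecture_reviewer": 0.02,  # Sonnet
    "style_reviewer": 0.001,        # Haiku, ~20x cheaper
    "reflection_pass": 0.03,        # 3 Sonnet calls combined
    "synthesis": 0.02,              # Sonnet, ~4K tokens in
}

def daily_spend(reviews_per_day: int = 20) -> float:
    per_review = sum(COST_PER_REVIEW.values())  # ~$0.09
    return round(reviews_per_day * per_review, 2)
```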
What the patterns are actually doing
This demo implements seven of the 21 Gulli patterns:
- Parallelisation (3) -- three reviewers in parallel, then three reflection calls in parallel
- Reflection (4) -- each reviewer critiques their own output in light of others'
- Multi-agent coordination (7) -- three specialist reviewers with no overlap in focus areas, then a coordinator Synthesiser
- Exception handling (12) -- graceful degradation when a reviewer fails
- Inter-agent communication (15) -- findings passed between agents in structured JSON
- Resource-aware (16) -- Haiku for style, Sonnet for security and architecture
- Guardrails (18) -- structured output schema enforces finding format across all agents