The problem with asking one model
Ask a large language model "should we migrate from a monolith to microservices?" and it will almost certainly say yes, then hedge. The model is pattern-matching on the framing of the question. "Should we X?" implies the asker is considering X, and the model obliges.
This is not a hallucination problem. It's a sycophancy problem. The model is optimised to produce responses that feel useful and agreeable. Asking it to steelman opposing positions in the same response produces a weaker version of both -- the model is hedging, not arguing.
Structured deliberation fixes this by separating the roles. Three agents, three distinct system prompts, three different jobs:
- Advocate -- argues FOR. Its job is to make the strongest possible case for the proposition. Acknowledging weaknesses is allowed, but only to explain why the benefits outweigh them.
- Critic -- argues AGAINST. Its job is to find the strongest counterarguments: risks, costs, hidden assumptions, failure modes. No strawmanning the Advocate.
- Synthesiser -- reads all four arguments and produces a balanced recommendation with a confidence level and explicit trade-offs.
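The role separation lives entirely in the system prompts. A minimal sketch of the three roles — the prompt wording here is illustrative, not the production text:

```python
# Hypothetical system prompts -- the exact production wording differs.
SYSTEM_PROMPTS = {
    "advocate": (
        "You argue FOR the proposition. Make the strongest possible case. "
        "You may acknowledge weaknesses, but only to explain why the "
        "benefits outweigh them."
    ),
    "critic": (
        "You argue AGAINST the proposition. Find the strongest "
        "counterarguments: risks, costs, hidden assumptions, failure "
        "modes. Do not strawman the Advocate."
    ),
    "synthesiser": (
        "You read all four debate arguments and produce a balanced "
        "recommendation with a confidence level and explicit trade-offs."
    ),
}

def build_messages(role: str, question: str) -> list[dict]:
    """Pair a role's system prompt with the user's question."""
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[role]},
        {"role": "user", "content": question},
    ]
```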
The three-round structure
The deliberation runs in three rounds. Rounds 1 and 2 are parallel -- Advocate and Critic run concurrently using Python's ThreadPoolExecutor. Round 3 is sequential -- the Synthesiser only runs after all four debate arguments are available.
Round 1 (parallel):
- Advocate opens -- argues FOR the proposition
- Critic opens -- argues AGAINST the proposition

Round 2 (parallel):
- Advocate rebuts the Critic's Round 1 argument
- Critic rebuts the Advocate's Round 1 argument

Round 3 (sequential):
- Synthesiser reads all four arguments
- Produces recommendation + confidence + trade-offs
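The parallel rounds can be sketched with the standard library. `run_round` is a hypothetical helper, not the production orchestrator:

```python
from concurrent.futures import ThreadPoolExecutor

def run_round(advocate_fn, critic_fn):
    """Run Advocate and Critic concurrently. The round completes when
    the slower of the two agents finishes."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        adv_future = pool.submit(advocate_fn)
        crit_future = pool.submit(critic_fn)
        # .result() blocks until each call returns
        return adv_future.result(), crit_future.result()

# Usage with stand-in functions in place of real model calls:
adv, crit = run_round(lambda: "FOR: ...", lambda: "AGAINST: ...")
```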
The cross-rebuttal in Round 2 is the key insight. The Advocate's Round 2 prompt explicitly includes the Critic's Round 1 argument: "The Critic made the following arguments. Respond to their strongest points. Where are they right? Where are they wrong? What did they miss?" The Critic gets the mirror image.
This surfaces the strongest objections on both sides. A Critic that has to respond to "the benefits outweigh the costs because X" produces more targeted counterarguments than a Critic operating in isolation.
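The cross-rebuttal prompt construction is simple string assembly. A sketch, with wording adapted from the description above:

```python
def rebuttal_prompt(own_role: str, opponent_role: str,
                    opponent_argument: str) -> str:
    """Round 2 prompt: each agent sees the opponent's Round 1 argument.
    The Advocate and Critic get mirror-image versions of this."""
    return (
        f"The {opponent_role} made the following arguments:\n\n"
        f"{opponent_argument}\n\n"
        "Respond to their strongest points. Where are they right? "
        "Where are they wrong? What did they miss? "
        f"Argue from your role as the {own_role}."
    )
```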
Why model routing matters
Advocate and Critic use a small, fast model (Haiku). They're generating directional arguments: persuasive and specific, but not requiring deep synthesis. A small model is fast and cheap: around $0.001 per Advocate call at typical token counts.

The Synthesiser uses a mid-tier model (Sonnet). It's reading 3,000 to 5,000 tokens of debate and producing a nuanced recommendation, which requires more capacity. The mid-tier model costs roughly 10x more per token, but there's only one Synthesiser call per deliberation.
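The routing itself is just a lookup table. Model names here are illustrative placeholders, not pinned production identifiers:

```python
# Illustrative routing map: cheap model for debate, capable model
# for synthesis. Real model IDs would be versioned strings.
MODEL_ROUTING = {
    "advocate":    "claude-haiku",
    "critic":      "claude-haiku",
    "synthesiser": "claude-sonnet",
}

def model_for(agent: str) -> str:
    """Pick the model tier for an agent role."""
    return MODEL_ROUTING[agent]
```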
Total cost per deliberation at typical token counts:
- Round 1: two Haiku calls (~$0.002 combined)
- Round 2: two Haiku calls (~$0.003 combined)
- Round 3: one Sonnet call (~$0.04)
- Total: ~$0.05 per deliberation
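The arithmetic behind the total, using the approximate per-call figures from the breakdown above:

```python
# Approximate per-call costs at typical token counts (from the
# breakdown above; real costs vary with token usage).
HAIKU_R1_CALL = 0.001    # two calls in Round 1 (~$0.002 combined)
HAIKU_R2_CALL = 0.0015   # two calls in Round 2 (~$0.003 combined)
SONNET_R3_CALL = 0.04    # one Synthesiser call

def deliberation_cost() -> float:
    """Estimated cost of one full three-round deliberation."""
    return 2 * HAIKU_R1_CALL + 2 * HAIKU_R2_CALL + SONNET_R3_CALL
```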
At 20 deliberations per day -- generous for a portfolio site -- that's $1/day. A kill switch in SSM Parameter Store lets the demo be paused if costs spike.
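The kill switch reduces to one parameter read before any model call. A sketch, with a hypothetical parameter name — the interpretation logic is separated from the AWS call so it stays testable:

```python
def demo_enabled(param_value: str) -> bool:
    """Interpret the kill-switch value: 'true' means the demo is paused."""
    return param_value.lower() != "true"

# In the Lambda, the value would come from SSM Parameter Store
# (sketch; parameter name is hypothetical):
#   import boto3
#   value = boto3.client("ssm").get_parameter(
#       Name="/deliberation/kill-switch")["Parameter"]["Value"]
#   if not demo_enabled(value):
#       return {"statusCode": 503, "body": "demo paused"}
```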
The architecture
- POST / -- start a deliberation
- GET /session/{id} -- fetch a completed session
- GET /recent -- return the 5 most recent sessions
The Lambda handler calls orchestrator.run_deliberation(question), which executes:
- Round 1: ThreadPoolExecutor runs advocate.generate and critic.generate concurrently; each writes to DynamoDB immediately on completion
- Round 2: ThreadPoolExecutor runs both again; the Advocate gets critic_round_1 in context and the Critic gets advocate_round_1
- Round 3: synthesiser.generate receives all four arguments; Sonnet reads them and returns a structured recommendation
DynamoDB uses a single deliberation-sessions table:
- SESSION#{id} / META -- question, status, cost, timestamps
- SESSION#{id} / ROUND#1#AGENT#advocate and ROUND#1#AGENT#critic
- SESSION#{id} / ROUND#2#AGENT#advocate and ROUND#2#AGENT#critic
- SESSION#{id} / ROUND#3#AGENT#synthesiser
- RATELIMIT#{ip} / {ts} -- 1-hour TTL, max 5 requests per IP per hour
- DAILY#usage / {date} -- atomic counter enforcing the daily limit
The Lambda is exposed via Lambda Function URL -- no API Gateway. The deliberation Lambda runs synchronously: one POST call, one response containing all five arguments. The frontend populates the three columns with a small visual delay between rounds to make the debate structure legible.
What the Synthesiser prompt asks for
The Synthesiser receives a structured prompt with all four arguments labelled by round and agent. It's asked to produce six things:
- Recommendation (1-2 sentences)
- Confidence: low / medium / high (and why)
- Strongest arguments FOR
- Strongest arguments AGAINST
- Key trade-offs
- Conditions that would change the recommendation
The conditions section is the most useful part. A recommendation of "defer microservices migration until the team exceeds 20 engineers" is only useful if it also tells you what changes when the team hits 20. The structured output forces this.
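The structured prompt can be sketched as follows — the section wording comes from the list above, the assembly code is illustrative:

```python
SYNTH_SECTIONS = [
    "Recommendation (1-2 sentences)",
    "Confidence: low / medium / high (and why)",
    "Strongest arguments FOR",
    "Strongest arguments AGAINST",
    "Key trade-offs",
    "Conditions that would change the recommendation",
]

def synthesiser_prompt(arguments: dict[str, str]) -> str:
    """Label each debate argument by round and agent, then ask for
    the six numbered sections."""
    debate = "\n\n".join(
        f"[{label}]\n{text}" for label, text in arguments.items()
    )
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(SYNTH_SECTIONS, 1))
    return f"{debate}\n\nProduce the following sections:\n{numbered}"
```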
What doesn't work
The deliberation is only as good as the question. Vague questions produce vague arguments. "Should we use AI?" is too broad to deliberate productively. "Should a 50-person e-commerce company adopt AI coding assistants for a team that has no existing ML infrastructure?" is much better.
The Advocate and Critic are role-locked but they're still language models. On genuinely one-sided questions -- "should we write code with no tests?" -- the Advocate will struggle to make a compelling case and both agents know it. The quality of the debate tracks the quality of the question.
Parallelisation adds latency variance. Round 1 completes when the slower of the two agents finishes. On a bad network day, one Haiku call can take 8 seconds. The total deliberation time ranges from 30 to 90 seconds depending on model latency.
The patterns in use
This demo implements ten of the 21 Gulli agentic patterns:
- Prompt chains -- each round's output becomes the next round's input
- Parallelisation -- Advocate and Critic run concurrently in both rounds
- Reflection -- Round 2 is a structured self-critique of the opposing argument
- Planning -- the three-round structure is a fixed plan executed deterministically
- Multi-agent coordination -- three specialised agents with distinct roles and no overlap
- Memory -- all arguments persist to DynamoDB, previous debates are retrievable
- Agent communication -- Round 2 agents receive each other's Round 1 arguments in context
- Model routing -- Haiku for debate, Sonnet for synthesis
- Guardrails -- rate limiting, kill switch, token budgets, input length cap
- Structured reasoning -- the Synthesiser prompt forces chain-of-thought via numbered sections
None of these patterns are exotic. They're composable primitives. The deliberation engine is what you get when you stack ten of them on top of a question and a cost budget.