The problem with asking one model
Ask a large language model "should we migrate from a monolith to microservices?" and it will almost certainly say yes, then hedge. The model is pattern-matching on the framing of the question. "Should we X?" implies the asker is considering X, and the model obliges.
This is not a hallucination problem. It's a sycophancy problem. The model is optimised to produce responses that feel useful and agreeable. Asking it to steelman opposing positions in the same response produces a weaker version of both -- the model is hedging, not arguing.
Structured deliberation fixes this by separating the roles. Three agents, three distinct system prompts, three different jobs:
- Advocate -- argues FOR. Its job is to make the strongest possible case for the proposition. Acknowledging weaknesses is allowed, but only to explain why the benefits outweigh them.
- Critic -- argues AGAINST. Its job is to find the strongest counterarguments: risks, costs, hidden assumptions, failure modes. No strawmanning the Advocate.
- Synthesiser -- reads all four arguments and produces a balanced recommendation with a confidence level and explicit trade-offs.
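The role separation lives entirely in the system prompts. A minimal sketch of the three roles — the prompt wording here is illustrative, not the production text:

```python
# Hypothetical system prompts -- the exact production wording differs.
SYSTEM_PROMPTS = {
    "advocate": (
        "You argue FOR the proposition. Make the strongest possible case. "
        "You may acknowledge weaknesses, but only to explain why the "
        "benefits outweigh them."
    ),
    "critic": (
        "You argue AGAINST the proposition. Find the strongest "
        "counterarguments: risks, costs, hidden assumptions, failure "
        "modes. Do not strawman the Advocate."
    ),
    "synthesiser": (
        "You read all four debate arguments and produce a balanced "
        "recommendation with a confidence level and explicit trade-offs."
    ),
}

def build_messages(role: str, question: str) -> list[dict]:
    """Pair a role's system prompt with the user's question."""
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[role]},
        {"role": "user", "content": question},
    ]
```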
The three-round structure
The deliberation runs in three rounds. Rounds 1 and 2 are parallel -- Advocate and Critic run concurrently using Python's ThreadPoolExecutor. Round 3 is sequential -- the Synthesiser only runs after all four debate arguments are available.
Round 1 (parallel):
- Advocate opens -- argues FOR the proposition
- Critic opens -- argues AGAINST the proposition

Round 2 (parallel):
- Advocate rebuts the Critic's Round 1 argument
- Critic rebuts the Advocate's Round 1 argument

Round 3 (sequential):
- Synthesiser reads all four arguments
- Produces recommendation + confidence + trade-offs
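The parallel rounds can be sketched with the standard library. `run_round` is a hypothetical helper, not the production orchestrator:

```python
from concurrent.futures import ThreadPoolExecutor

def run_round(advocate_fn, critic_fn):
    """Run Advocate and Critic concurrently. The round completes when
    the slower of the two agents finishes."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        adv_future = pool.submit(advocate_fn)
        crit_future = pool.submit(critic_fn)
        # .result() blocks until each call returns
        return adv_future.result(), crit_future.result()

# Usage with stand-in functions in place of real model calls:
adv, crit = run_round(lambda: "FOR: ...", lambda: "AGAINST: ...")
```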
The cross-rebuttal in Round 2 is the key insight. The Advocate's Round 2 prompt explicitly includes the Critic's Round 1 argument: "The Critic made the following arguments. Respond to their strongest points. Where are they right? Where are they wrong? What did they miss?" The Critic gets the mirror image.
This surfaces the strongest objections on both sides. A Critic that has to respond to "the benefits outweigh the costs because X" produces more targeted counterarguments than a Critic operating in isolation.
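The cross-rebuttal prompt construction is simple string assembly. A sketch, with wording adapted from the description above:

```python
def rebuttal_prompt(own_role: str, opponent_role: str,
                    opponent_argument: str) -> str:
    """Round 2 prompt: each agent sees the opponent's Round 1 argument.
    The Advocate and Critic get mirror-image versions of this."""
    return (
        f"The {opponent_role} made the following arguments:\n\n"
        f"{opponent_argument}\n\n"
        "Respond to their strongest points. Where are they right? "
        "Where are they wrong? What did they miss? "
        f"Argue from your role as the {own_role}."
    )
```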
Why model routing matters
Advocate and Critic use a small, fast model (Haiku). They're generating directional arguments: persuasive and specific, but not requiring deep synthesis. A small model is fast and cheap: around $0.001 per Advocate call at typical token counts.

The Synthesiser uses a mid-tier model (Sonnet). It's reading 3,000 to 5,000 tokens of debate and producing a nuanced recommendation, which requires more capacity. The mid-tier model costs roughly 10x more per token, but there's only one Synthesiser call per deliberation.
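The routing itself is just a lookup table. Model names here are illustrative placeholders, not pinned production identifiers:

```python
# Illustrative routing map: cheap model for debate, capable model
# for synthesis. Real model IDs would be versioned strings.
MODEL_ROUTING = {
    "advocate":    "claude-haiku",
    "critic":      "claude-haiku",
    "synthesiser": "claude-sonnet",
}

def model_for(agent: str) -> str:
    """Pick the model tier for an agent role."""
    return MODEL_ROUTING[agent]
```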
Total cost per deliberation at typical token counts:
- Round 1: two Haiku calls (~$0.002 combined)
- Round 2: two Haiku calls (~$0.003 combined)
- Round 3: one Sonnet call (~$0.04)
- Total: ~$0.05 per deliberation
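The arithmetic behind the total, using the approximate per-call figures from the breakdown above:

```python
# Approximate per-call costs at typical token counts (from the
# breakdown above; real costs vary with token usage).
HAIKU_R1_CALL = 0.001    # two calls in Round 1 (~$0.002 combined)
HAIKU_R2_CALL = 0.0015   # two calls in Round 2 (~$0.003 combined)
SONNET_R3_CALL = 0.04    # one Synthesiser call

def deliberation_cost() -> float:
    """Estimated cost of one full three-round deliberation."""
    return 2 * HAIKU_R1_CALL + 2 * HAIKU_R2_CALL + SONNET_R3_CALL
```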
At 20 deliberations per day -- generous for a portfolio site -- that's $1/day. A kill switch in SSM Parameter Store lets the demo be paused if costs spike.
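The kill switch reduces to one parameter read before any model call. A sketch, with a hypothetical parameter name — the interpretation logic is separated from the AWS call so it stays testable:

```python
def demo_enabled(param_value: str) -> bool:
    """Interpret the kill-switch value: 'true' means the demo is paused."""
    return param_value.lower() != "true"

# In the Lambda, the value would come from SSM Parameter Store
# (sketch; parameter name is hypothetical):
#   import boto3
#   value = boto3.client("ssm").get_parameter(
#       Name="/deliberation/kill-switch")["Parameter"]["Value"]
#   if not demo_enabled(value):
#       return {"statusCode": 503, "body": "demo paused"}
```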
The architecture
- POST / -- start a deliberation
- GET /session/{id} -- fetch a completed session
- GET /recent -- return the 5 most recent sessions
The Lambda handler calls orchestrator.run_deliberation(question), which executes:
- Round 1: ThreadPoolExecutor runs advocate.generate and critic.generate concurrently; each writes to DynamoDB immediately on completion
- Round 2: ThreadPoolExecutor runs both again; the Advocate gets critic_round_1 in context and the Critic gets advocate_round_1
- Round 3: synthesiser.generate receives all four arguments; Sonnet reads them and returns a structured recommendation
DynamoDB uses a single deliberation-sessions table:
- SESSION#{id} / META -- question, status, cost, timestamps
- SESSION#{id} / ROUND#1#AGENT#advocate and ROUND#1#AGENT#critic
- SESSION#{id} / ROUND#2#AGENT#advocate and ROUND#2#AGENT#critic
- SESSION#{id} / ROUND#3#AGENT#synthesiser
- RATELIMIT#{ip} / {ts} -- 1-hour TTL, max 5 requests per IP per hour
- DAILY#usage / {date} -- atomic counter enforcing the daily limit
The Lambda is exposed via Lambda Function URL -- no API Gateway. The deliberation Lambda runs synchronously: one POST call, one response containing all five arguments. The frontend populates the three columns with a small visual delay between rounds to make the debate structure legible.
What the Synthesiser prompt asks for
The Synthesiser receives a structured prompt with all four arguments labelled by round and agent. It's asked to produce six things:
- Recommendation (1-2 sentences)
- Confidence: low / medium / high (and why)
- Strongest arguments FOR
- Strongest arguments AGAINST
- Key trade-offs
- Conditions that would change the recommendation
The conditions section is the most useful part. A recommendation of "defer microservices migration until the team exceeds 20 engineers" is only useful if it also tells you what changes when the team hits 20. The structured output forces this.
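The structured prompt can be sketched as follows — the section wording comes from the list above, the assembly code is illustrative:

```python
SYNTH_SECTIONS = [
    "Recommendation (1-2 sentences)",
    "Confidence: low / medium / high (and why)",
    "Strongest arguments FOR",
    "Strongest arguments AGAINST",
    "Key trade-offs",
    "Conditions that would change the recommendation",
]

def synthesiser_prompt(arguments: dict[str, str]) -> str:
    """Label each debate argument by round and agent, then ask for
    the six numbered sections."""
    debate = "\n\n".join(
        f"[{label}]\n{text}" for label, text in arguments.items()
    )
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(SYNTH_SECTIONS, 1))
    return f"{debate}\n\nProduce the following sections:\n{numbered}"
```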
What doesn't work
The deliberation is only as good as the question. Vague questions produce vague arguments. "Should we use AI?" is too broad to deliberate productively. "Should a 50-person e-commerce company adopt AI coding assistants for a team that has no existing ML infrastructure?" is much better.
The Advocate and Critic are role-locked but they're still language models. On genuinely one-sided questions -- "should we write code with no tests?" -- the Advocate will struggle to make a compelling case and both agents know it. The quality of the debate tracks the quality of the question.
Parallelisation adds latency variance. Round 1 completes when the slower of the two agents finishes. On a bad network day, one Haiku call can take 8 seconds. The total deliberation time ranges from 30 to 90 seconds depending on model latency.
The patterns in use
This demo implements ten of the 21 Gulli agentic patterns:
- Prompt chains -- each round's output becomes the next round's input
- Parallelisation -- Advocate and Critic run concurrently in both rounds
- Reflection -- Round 2 is a structured self-critique of the opposing argument
- Planning -- the three-round structure is a fixed plan executed deterministically
- Multi-agent coordination -- three specialised agents with distinct roles and no overlap
- Memory -- all arguments persist to DynamoDB, previous debates are retrievable
- Agent communication -- Round 2 agents receive each other's Round 1 arguments in context
- Model routing -- Haiku for debate, Sonnet for synthesis
- Guardrails -- rate limiting, kill switch, token budgets, input length cap
- Structured reasoning -- the Synthesiser prompt forces chain-of-thought via numbered sections
None of these patterns are exotic. They're composable primitives. The deliberation engine is what you get when you stack ten of them on top of a question and a cost budget.