Improving Factuality and Reasoning in Language Models through Multiagent Debate
Authors: Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, Igor Mordatch
Affiliation: MIT / Google Brain
Canonical URL: https://arxiv.org/abs/2305.14325
Project page: https://composable-models.github.io/llm_debate/
Summary
The foundational paper on multi-agent debate as an LLM reasoning protocol. Multiple instances of a language model each propose an answer, then read and critique one another's reasoning over several rounds before converging on a final answer. The authors report that this procedure significantly improves mathematical and strategic reasoning, and reduces hallucinations and factual errors compared to single-agent generation.
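The propose-then-critique loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Agent` interface, the prompt wording in `build_critique_prompt`, and the majority-vote aggregation at the end are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical interface: an agent is any black-box text-in/text-out callable,
# for example a thin wrapper around a closed API model.
Agent = Callable[[str], str]


def build_critique_prompt(question: str, own_answer: str, peer_answers: List[str]) -> str:
    """Build the prompt for one debate round (illustrative wording, not the paper's)."""
    peers = "\n".join(f"- Peer answer: {a}" for a in peer_answers)
    return (
        f"Question: {question}\n"
        f"Your previous answer: {own_answer}\n"
        f"Answers from other agents:\n{peers}\n"
        "Critique these answers and give an updated answer."
    )


def debate(question: str, agents: List[Agent], rounds: int = 2) -> str:
    """Run a propose-then-critique debate and return the most common final answer."""
    # Round 0: every agent proposes an answer independently.
    answers = [agent(question) for agent in agents]
    for _ in range(rounds):
        # Each agent sees the other agents' latest answers and revises its own.
        # The comprehension reads the previous round's list until it is reassigned.
        answers = [
            agent(build_critique_prompt(question, answers[i], answers[:i] + answers[i + 1:]))
            for i, agent in enumerate(agents)
        ]
    # Assumption of this sketch: aggregate the final answers by majority vote.
    return Counter(answers).most_common(1)[0][0]
```

Because `debate` only needs callables that map a prompt to a string, the same loop runs unchanged whether the agents are copies of one model or wrappers around different models.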
The paper provides direct but partial empirical evidence for H5 (multi-agent SAT pipelines outperform single-agent chains). The mechanism Du et al. describe is structurally identical to the groupthink-control argument.
Key Findings
- Multi-round debate improves factuality. Reduces wrong answers across math, reasoning, and factual QA benchmarks.
- Debate reduces hallucination. Multiple instances catch and correct each other’s confabulated content.
- Black-box compatible. Works with closed API models without fine-tuning.
- Uniform across tasks. Same procedure and prompts work across diverse evaluation tasks.
- “Society of minds” framing. Reframes LLM reasoning as a collaborative process rather than a single forward pass.
Important Caveat (Per This Wiki’s Critique)
Du et al.’s “multiple instances” share the same base model. From a groupthink perspective, this is like polling identical analysts — they share priors, biases, and failure modes. The improvement Du et al. measure may come more from the iterative-critique structure than from genuine epistemic independence.
This is exactly the caveat raised in H5: a true test of multi-agent SAT pipelines would compare same-model debate vs. different-model debate to isolate the structure-vs-independence contribution.
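One way to make the proposed comparison concrete is to treat it as a three-condition ablation and split the total gain into two differences. The sketch below is only an illustration of that arithmetic: the class and field names are hypothetical, and the additive split assumes the models in the mixed-model condition have comparable single-agent accuracy, otherwise model strength confounds the "independence" term.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class AblationResult:
    """Accuracies (0.0 to 1.0) for three hypothetical experimental conditions."""
    single_agent: float        # one model, one pass, no debate
    same_model_debate: float   # debate among copies of the same base model
    mixed_model_debate: float  # debate among different base models


def decompose_gain(r: AblationResult) -> Dict[str, float]:
    """Split the total debate gain into a structure part and an independence part."""
    # Gain attributable to the iterative-critique structure alone.
    structure = r.same_model_debate - r.single_agent
    # Additional gain when agents no longer share a base model.
    independence = r.mixed_model_debate - r.same_model_debate
    return {
        "structure": structure,
        "independence": independence,
        "total": structure + independence,
    }
```

If the "independence" term came out near zero, that would suggest the improvement is driven mainly by the critique structure; a large value would support the groupthink-style argument for epistemic independence.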
Relevance to This Wiki
- Direct partial validation of H5. Confirms multi-agent debate improves output quality; does not yet isolate which component matters.
- Empirical support for Groupthink countermeasures in LLM contexts.
- Architectural pattern for SAT Pipeline Pattern B (parallel + adversarial).
- Methodological precedent. Multi-round, multi-instance protocols are tractable to evaluate, and they improve performance.
See Also
- Groupthink
- Devil’s Advocacy — structurally similar single-agent technique
- Team B — the SAT equivalent of debate
- H5 — hypothesis page
- Liang et al. (2023), Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate (arXiv:2305.19118), a follow-up