Improving Factuality and Reasoning in Language Models through Multiagent Debate

Authors: Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, Igor Mordatch
Affiliation: MIT / Google Brain
Canonical URL: https://arxiv.org/abs/2305.14325
Project page: https://composable-models.github.io/llm_debate/

Summary

The foundational paper on multi-agent debate as an LLM reasoning protocol. Multiple instances of a language model propose answers and then critique each other’s reasoning over multiple rounds before converging on a final answer. This significantly improves mathematical and strategic reasoning, and reduces hallucinations and factual errors compared to single-agent generation.
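
The propose-then-critique loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Agent` type, the prompt wording, the `stub_agent` helper, and the majority-vote aggregation at the end are all assumptions made for this example. In practice each agent would wrap a call to a language model.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical agent interface: takes a prompt, returns an answer string.
Agent = Callable[[str], str]


def debate(question: str, agents: List[Agent], rounds: int = 2) -> str:
    """Run a simple multi-round debate and return the most common final answer."""
    # Round 0: every agent answers the question independently.
    answers = [agent(question) for agent in agents]

    for _ in range(rounds):
        updated = []
        for i, agent in enumerate(agents):
            # Each agent sees the other agents' latest answers, not its own.
            others = [a for j, a in enumerate(answers) if j != i]
            prompt = (
                f"Question: {question}\n"
                "Other agents answered:\n"
                + "\n".join(f"- {a}" for a in others)
                + "\nUsing these as additional information, give your updated answer."
            )
            updated.append(agent(prompt))
        answers = updated

    # Assumed aggregation rule: majority vote over the final-round answers.
    return Counter(answers).most_common(1)[0][0]


def stub_agent(first_answer: str) -> Agent:
    """Toy stand-in for a model: sides with the majority of itself plus its peers."""
    def respond(prompt: str) -> str:
        peers = [line[2:] for line in prompt.splitlines() if line.startswith("- ")]
        return Counter([first_answer] + peers).most_common(1)[0][0]
    return respond


agents = [stub_agent("12"), stub_agent("12"), stub_agent("15")]
print(debate("What is 7 + 5?", agents))  # prints 12
```

The stub agents only imitate convergence; they are there so the control flow can be run without any model API.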

Direct, though partial, empirical evidence for H5 (multi-agent SAT pipelines outperform single-agent chains). The mechanism Du et al. describe, in which agents critique one another's answers before committing to one, is structurally identical to the groupthink-control argument.

Key Findings

Multi-round debate improves factuality. Reduces wrong answers across math, reasoning, and factual QA benchmarks.
Debate reduces hallucination. Multiple instances catch and correct each other’s confabulated content.
Black-box compatible. Works with closed API models without fine-tuning.
Uniform across tasks. Same procedure and prompts work across diverse evaluation tasks.
“Society of minds” framing. Reframes LLM reasoning as a collaborative process rather than a single forward pass.

Important Caveat (Per This Wiki’s Critique)

Du et al.’s “multiple instances” share the same base model. From a groupthink perspective, this is like polling identical analysts — they share priors, biases, and failure modes. The improvement Du et al. measure may come more from the iterative-critique structure than from genuine epistemic independence.

This is exactly the caveat raised in H5: a true test of multi-agent SAT pipelines would compare same-model debate vs. different-model debate to isolate the structure-vs-independence contribution.
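
One way to frame that comparison is as a three-condition ablation: single agent, same-model debate, and mixed-model debate, all scored on the same questions. The sketch below shows only the bookkeeping. The function names and the additive decomposition are assumptions for illustration, and the numbers in the usage lines are placeholders, not results from Du et al. or any other experiment.

```python
from typing import Dict, Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def decompose_gain(single: float, same_model: float, mixed_model: float) -> Dict[str, float]:
    """Split the total gain over a single agent into two assumed components.

    structure:    gain from the iterative-critique protocol alone
                  (same-model debate minus single agent)
    independence: additional gain from using different base models
                  (mixed-model debate minus same-model debate)
    """
    return {
        "structure": same_model - single,
        "independence": mixed_model - same_model,
        "total": mixed_model - single,
    }


# Placeholder accuracies, chosen only to show the arithmetic.
gains = decompose_gain(single=0.5, same_model=0.75, mixed_model=0.875)
print(gains)  # {'structure': 0.25, 'independence': 0.125, 'total': 0.375}
```

If the "independence" term is close to zero, the improvement is mostly attributable to the critique structure; a large value would suggest that model diversity matters in its own right.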

Relevance to This Wiki

Direct partial validation of H5. Confirms multi-agent debate improves output quality; does not yet isolate which component matters.
Empirical support for Groupthink countermeasures in LLM contexts.
Architectural pattern for SAT Pipeline Pattern B (parallel + adversarial).
Methodological precedent. Multi-round, multi-instance protocols are tractable to evaluate, and they improve performance.

Du et al. — Improving Factuality and Reasoning in Language Models through Multiagent Debate (2023)