H5 — Multi-Agent SAT Pipelines Outperform Single-Agent Chains

Claim

A pipeline with separate agents for (a) claim generation and (b) adversarial critique produces higher-quality outputs than a single agent doing both, because separating the agents prevents self-consistency pressure from suppressing genuine challenge.

Why Groupthink Is the Target

Groupthink in LLM systems emerges when a single model’s prior outputs anchor its subsequent generation: the model is pushed to stay consistent with what it has already said. Separate agents with no shared context are not subject to this self-consistency pressure. The design is the architectural analog of the Team B technique.

Experimental Setup

The same analytical task is run under three conditions:
- Condition A: single-agent chain (generate → self-critique → revise, all in one context)
- Condition B: two-agent pipeline (generator + separate critic; the critic has no access to the generator’s reasoning chain, only its final output)
- Condition C (control for shared base model): condition B, but with different base models for generator and critic

Blind human raters score final output quality and intellectual honesty. The rates of caught contradictions, revised conclusions, and acknowledged uncertainty are also measured.
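
The three conditions can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Model` type (any function from prompt text to response text), the function names, and the prompt wording are all assumptions made for the example. It also assumes that in conditions B and C the generator performs the final revision in a fresh context, which the setup above does not specify.

```python
from typing import Callable

# Hypothetical interface: a "model" is any function from prompt text to response text.
Model = Callable[[str], str]


def run_single_agent(model: Model, task: str) -> str:
    """Condition A: generate, self-critique, and revise in one growing context."""
    context = f"Task: {task}\nDraft an analysis."
    draft = model(context)
    # The same context keeps accumulating, so every later step sees all prior output.
    context += f"\nDraft: {draft}\nCritique your own draft."
    critique = model(context)
    context += f"\nCritique: {critique}\nRevise the draft."
    return model(context)


def run_pipeline(generator: Model, critic: Model, task: str) -> str:
    """Conditions B and C: the critic sees only the task and the final draft.

    Condition B passes two instances of the same base model;
    condition C passes two different base models.
    """
    draft = generator(f"Task: {task}\nDraft an analysis.")
    # Fresh prompt: nothing from the generator's context except its final output.
    critique = critic(f"Task: {task}\nDraft: {draft}\nCritique this draft.")
    # Assumption: the revision is done by the generator in a new context.
    return generator(
        f"Task: {task}\nDraft: {draft}\nCritique: {critique}\nRevise the draft."
    )
```

Keeping conditions B and C in one function means the only thing that differs between them is which models are passed in, so any difference in outcome can be attributed to the choice of base model rather than to the prompt flow.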

What to Measure

- Rate of self-revision in the pipeline
- Quality of caught errors (substantive vs. cosmetic)
- Whether conclusions actually change between generation and post-critique, or just get reworded
- The critical comparison: how much does C improve on B, relative to B’s improvement over A? (Tests whether the structure or the genuine independence is doing the work)
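
One way to frame the A/B/C comparison is to split the overall improvement into a part attributable to structure (B minus A) and a part attributable to model independence (C minus B). The sketch below assumes a hypothetical `RunRecord` with one rater score per run and a flag for whether the conclusion changed; the field names and the choice of a simple mean are illustrative assumptions, not part of the study design.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunRecord:
    condition: str            # "A", "B", or "C"
    quality: float            # blind rater score (scale is an assumption)
    conclusion_changed: bool  # True if the conclusion changed, not just the wording


def _select(records: List[RunRecord], condition: str) -> List[RunRecord]:
    chosen = [r for r in records if r.condition == condition]
    if not chosen:
        raise ValueError(f"no records for condition {condition!r}")
    return chosen


def mean_quality(records: List[RunRecord], condition: str) -> float:
    chosen = _select(records, condition)
    return sum(r.quality for r in chosen) / len(chosen)


def conclusion_change_rate(records: List[RunRecord], condition: str) -> float:
    """Fraction of runs in which the conclusion actually changed after critique."""
    chosen = _select(records, condition)
    return sum(r.conclusion_changed for r in chosen) / len(chosen)


def decompose_gain(records: List[RunRecord]) -> Dict[str, float]:
    """Split the improvement over A into structure (B - A) and independence (C - B)."""
    a, b, c = (mean_quality(records, k) for k in ("A", "B", "C"))
    return {"structure_gain": b - a, "independence_gain": c - b}
```

A large `structure_gain` with an `independence_gain` near zero would point to the structure doing the work; a sizable `independence_gain` would point to genuine independence. A real analysis would also need a significance test over many tasks, which this sketch omits.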

Why It Could Fail

If both agents share the same base model, they may have identical biases and miss the same things — no genuine epistemic independence. The critique agent may find only surface errors while missing the shared blind spots. This is what condition C is designed to expose.

Empirical Evidence

Partial validation. Du et al. (2023) confirm that the multi-agent structure works; the open question is whether genuine independence matters.

Source	Finding	Implication
Du et al. (MIT, 2023)	Multi-instance multi-round debate significantly improves factuality and reasoning, reduces hallucination, across math, reasoning, factual QA benchmarks	Direct partial validation of H5 (conditions A vs B)
Du et al. (2023) caveat	All “instances” in Du’s experiments share the same base model	The critical question H5 asks — does genuine independence matter? — is exactly what Du does not test
Liang et al. (2023) Encouraging Divergent Thinking via Multi-Agent Debate	Multi-agent debate increases response diversity	Suggests structure-alone contributes, but again same-model

Condition C is the key contribution of H5. None of the studies above compares same-model multi-agent setups with different-model ones. If the different-model pipeline significantly outperforms the same-model pipeline, that supports the view that genuine independence, and with it groupthink control, matters. If the two perform similarly, the gain comes from the structure alone, which is still useful and means cheaper same-model architectures suffice.
