Claim

The multi-step ACH protocol produces conclusions that better match ground truth than single-prompt “what’s most likely?” queries on identical evidence.

Why Confirmation Bias Is the Target

Without structure, LLMs anchor on the most salient hypothesis in the prompt and interpret subsequent evidence in its favor, a direct analog of confirmation bias. ACH is designed to break this by forcing the model to evaluate evidence against every hypothesis before ranking, rather than in support of a leading candidate.

Experimental Setup

  • Use cases with known outcomes: security incident post-mortems, historical intelligence failures, engineering post-mortems
  • Condition A: free-form “what is the most likely explanation?”
  • Condition B: ACH protocol — (1) generate all plausible hypotheses, (2) score each piece of evidence against each hypothesis independently in separate calls, (3) rank by diagnosticity
  • Blind human raters score conclusion accuracy and quality of evidence handling
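
Steps (2) and (3) of Condition B can be sketched as follows. This is a minimal illustration, not a reference implementation: the -1/0/+1 score scale, the function names, and the ranking rule (total inconsistency weighted by how diagnostic each piece of evidence is) are assumptions, and `TOY_SCORES` is a made-up lookup table standing in for the per-cell LLM calls. Step (1), hypothesis generation, is not shown.

```python
from typing import Callable, Dict, List, Tuple

# Assumed score scale (not from the source): -1 inconsistent, 0 neutral, +1 consistent.
Matrix = Dict[Tuple[str, str], int]


def fill_matrix(hypotheses: List[str], evidence: List[str],
                score_cell: Callable[[str, str], int]) -> Matrix:
    """Step 2: one independent call per (evidence, hypothesis) cell."""
    return {(e, h): score_cell(e, h) for e in evidence for h in hypotheses}


def diagnosticity(matrix: Matrix, hypotheses: List[str], item: str) -> int:
    """Spread of one evidence row; 0 means it cannot separate the hypotheses."""
    row = [matrix[(item, h)] for h in hypotheses]
    return max(row) - min(row)


def rank_hypotheses(matrix: Matrix, hypotheses: List[str],
                    evidence: List[str]) -> List[Tuple[str, int]]:
    """Step 3: total diagnosticity-weighted inconsistency, lowest total first."""
    totals = {
        h: sum(diagnosticity(matrix, hypotheses, e)
               for e in evidence if matrix[(e, h)] < 0)
        for h in hypotheses
    }
    return sorted(totals.items(), key=lambda kv: kv[1])


# Hypothetical stand-in for the per-cell LLM call: a fixed, made-up lookup table.
TOY_SCORES: Matrix = {
    ("login from new country", "phishing"): 1,
    ("login from new country", "insider"): -1,
    ("login from new country", "misconfiguration"): 0,
    ("no malware found on hosts", "phishing"): 0,
    ("no malware found on hosts", "insider"): 0,
    ("no malware found on hosts", "misconfiguration"): 0,
    ("storage bucket was public", "phishing"): -1,
    ("storage bucket was public", "insider"): -1,
    ("storage bucket was public", "misconfiguration"): 1,
}

hypotheses = ["phishing", "insider", "misconfiguration"]
evidence = ["login from new country", "no malware found on hosts",
            "storage bucket was public"]

matrix = fill_matrix(hypotheses, evidence, lambda e, h: TOY_SCORES[(e, h)])
ranking = rank_hypotheses(matrix, hypotheses, evidence)
print(ranking)  # [('misconfiguration', 0), ('phishing', 2), ('insider', 4)]
```

In a real run, the lambda would be replaced by a function that issues one model call per cell, so that no cell's score can see the scores of its neighbours. The evidence row that scores 0 for every hypothesis contributes nothing to the totals, which is the intended effect of weighting by diagnosticity.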

What to Measure

Does ACH produce more accurate conclusions? Does it catch the correct explanation more often when it’s not the most salient/obvious one — i.e., when the conventional-wisdom answer is wrong?
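
The two questions above map onto two numbers per condition: overall accuracy, and accuracy restricted to cases where the salient answer is wrong. The sketch below assumes the raters have already reduced each conclusion to a single label that can be compared with the ground-truth label. The `CaseResult` fields and the four example rows are hypothetical placeholders, not experimental results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CaseResult:
    truth: str     # known ground-truth explanation
    salient: str   # conventional-wisdom explanation for the case
    answer_a: str  # Condition A (free-form) conclusion
    answer_b: str  # Condition B (ACH) conclusion


def accuracy(results: List[CaseResult], condition: str,
             only_misleading: bool = False) -> Optional[float]:
    """Share of cases where the condition's answer matches ground truth.

    With only_misleading=True, keep only cases whose salient answer is wrong.
    Returns None when no case qualifies.
    """
    pool = [r for r in results if not only_misleading or r.salient != r.truth]
    if not pool:
        return None
    hits = sum(1 for r in pool if getattr(r, f"answer_{condition}") == r.truth)
    return hits / len(pool)


# Made-up placeholder rows, for illustration only.
results = [
    CaseResult("misconfiguration", "phishing", "phishing", "misconfiguration"),
    CaseResult("phishing", "phishing", "phishing", "phishing"),
    CaseResult("insider", "phishing", "phishing", "phishing"),
    CaseResult("hardware fault", "software bug", "software bug", "hardware fault"),
]

print(accuracy(results, "a"), accuracy(results, "b"))
print(accuracy(results, "a", only_misleading=True),
      accuracy(results, "b", only_misleading=True))
```

The misleading-case subset is the more informative of the two: on cases where the obvious answer is also the correct one, an anchored model and a debiased one are indistinguishable.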

Why It Could Fail

LLMs may generate multiple hypotheses in step 1 and then collapse back to the most probable-looking one during scoring, which yields structural compliance without genuine multi-hypothesis tracking. This is the concrete failure mode H0 warns about, applied specifically to ACH; see H0.

Empirical Evidence

Partial caution from human studies (no LLM ACH-vs-baseline study yet exists).

| Source | Finding | Implication |
| --- | --- | --- |
| RAND RR1408 (2016) citing Cheikes et al. (MITRE, 2004) | Human ACH reduced confirmation bias only among non-professional analysts; expert analysts already exhibited what ACH formalizes | Variant test: does ACH help a general-purpose LLM more than a domain-tuned one? |
| Echterhoff et al. (BiasBuster, 2024) | Confirmation bias is directly measurable in LLM decision-making across commercial and open-source models | Confirms the bias H1 targets is real and present |
| Roberts (2025) | Single-prompt ACH is an anti-pattern: the model generates a narrative first and forces evidence to fit. Multi-step sequential ACH works. | Step-ordering must be architecturally enforced, not just prompt-described |
| suprathermal, ACH-Grounding (2024) | Independent open-source ACH+RAG implementation converges on the same multi-step pattern, and goes further: LLM only fills matrix cells; classical algorithms do synthesis. Cites provable hallucination bounds (arXiv:2401.11817) as motivation. | Two independent implementations arriving at the same architecture is empirical evidence for the design principle. Suggests H1 should be tested in cell-level form, not single-prompt form. |

**The MITRE 2004 caveat is the most important finding.** It implies that a positive H1 result might depend on which model is used: a generalist model might benefit more than a fine-tuned domain expert, mirroring the human pattern.
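
One way to make this model-dependence testable is to compare the ACH gain (Condition B accuracy minus Condition A accuracy) across model types. The sketch below is an assumption about how that comparison could be framed, not a method taken from the cited studies; the function names are hypothetical and the accuracy figures are made-up placeholders.

```python
def ach_gain(acc_baseline: float, acc_ach: float) -> float:
    """Accuracy improvement of the ACH condition over the free-form baseline."""
    return acc_ach - acc_baseline


def gain_difference(generalist: tuple, domain_tuned: tuple) -> float:
    """Generalist gain minus domain-tuned gain.

    Each argument is (baseline accuracy, ACH accuracy) for one model.
    A positive value matches the human pattern, where non-experts benefited more.
    """
    return ach_gain(*generalist) - ach_gain(*domain_tuned)


# Placeholder accuracies, for illustration only.
generalist = (0.50, 0.75)
domain_tuned = (0.75, 0.75)

print(ach_gain(*generalist))                      # 0.25
print(ach_gain(*domain_tuned))                    # 0.0
print(gain_difference(generalist, domain_tuned))  # 0.25
```

A pooled H1 result would hide this pattern, so reporting the gain per model is the safer default.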

**Architecture refinement from convergent implementations.** Both Roberts and suprathermal externalize the synthesis step. This suggests H1 should be tested with the matrix-filling-only design (Condition B = LLM scores cells, classical code totals), not by asking the model for a final ranking. Otherwise we’d be measuring whether the LLM can be talked into structured output, not whether ACH reduces confirmation bias.

See Also