This page is the framework for eight testable hypotheses about how structured analytic techniques (SATs) affect LLM output quality. The meta-hypothesis H0 lives here because it applies to all the others. Each of H1–H7 has its own page with claim, experimental setup, failure modes, and empirical evidence, linked from the table below.
The Meta-Hypothesis (Test This First)
H0 — Structural compliance ≠ debiasing
This is the hypothesis to falsify before any other. Do LLMs genuinely reason differently when following SAT structure, or do they reformat the same biased output into a more rigorous-looking shell?
Test: Compare the semantic distance between hypotheses/conclusions in SAT-structured vs. unstructured outputs on identical inputs. Low semantic distance = compliance theater — the model is satisfying the format without changing the reasoning.
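The comparison above can be sketched in a few lines of Python. This is a minimal illustration, not a validated protocol: `bag_of_words` is a toy stand-in for a real sentence-embedding model, the `threshold` of 0.2 is an arbitrary placeholder, and the sketch assumes the conclusion text has already been extracted from each output so that formatting differences alone do not inflate the distance.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict

Vector = Dict[str, float]


def bag_of_words(text: str) -> Vector:
    """Toy embedder (assumption): lowercase token counts.

    Swap in a real sentence-embedding model for actual experiments.
    """
    return dict(Counter(re.findall(r"[a-z0-9']+", text.lower())))


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def semantic_distance(
    structured_conclusion: str,
    unstructured_conclusion: str,
    embed: Callable[[str], Vector] = bag_of_words,
) -> float:
    """Distance between the conclusions reached under the two conditions."""
    return cosine_distance(embed(structured_conclusion), embed(unstructured_conclusion))


def looks_like_compliance_theater(distance: float, threshold: float = 0.2) -> bool:
    """Flag pairs whose conclusions barely moved (threshold is a placeholder)."""
    return distance < threshold
```

In practice the distance would be averaged over many identical-input pairs, and the threshold would be set against a baseline such as the distance between two unstructured runs on the same input.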
If H0 holds (compliance without change), all downstream hypotheses are confounded. A positive result on H1–H7 obtained while H0 holds shows only that SAT-formatted outputs look better to raters, not that reasoning improved.
H0 is most likely confounded with sycophancy: models may perform SAT compliance for the same reason they sycophantically agree, namely that structured output earns approval from the RLHF-trained reward model. See Sharma 2023 for the evidence that the reward landscape favors form over substance.
The Seven Testable Hypotheses
| # | Hypothesis | Bias targeted | Empirical status |
|---|---|---|---|
| H1 | ACH improves conclusion accuracy on ambiguous evidence | Confirmation bias | Partial caution (MITRE 2004); untested in LLMs |
| H2 | Devil’s Advocacy maintains positions under pushback | Sycophancy | Strong caution (Sharma, Nemeth 2001) |
| H3 | KAC breaks framing-driven anchoring | Anchoring | Partial support (Echterhoff) |
| H4 | Red Team produces adversarially robust plans | Mirror imaging | Caution from cultural-bias work (Durmus) |
| H5 | Multi-agent pipelines outperform single-agent chains | Groupthink | Partial support (Du); shared-model caveat |
| H6 | What If? reduces overconfidence / improves calibration | Overconfidence | Strong (Tian, Kadavath) |
| H7 | Epistemic labeling reduces confident hallucination rate | Hallucination | Mechanism confirmed (Kadavath); protocol untested |
Bottom line: the structural argument has empirical backing (the biases are real in LLMs, and the mechanisms the SATs target are internally accessible). The intervention argument — that the specific SAT protocols actually exploit those mechanisms — is largely untested. The closest existing work is Echterhoff’s BiasBuster (partial validation of H3) and Du’s multi-agent debate (partial validation of H5).
H6 has the strongest mechanistic backing — Tian directly shows uncertainty prompts work, Kadavath shows the information is internally present. The specific What If? variant is the only untested piece.
Experimental Design Principles
These apply to all seven hypotheses.
Ground Truth Is the Hardest Problem
The cleanest experiments use scenarios with known outcomes:
- Historical incidents where the correct attribution is established
- Red team exercises with pre-tested solutions
- Factual Q&A against source documents with verifiable answers
Open-ended planning is hard to score — human rater disagreement is high, and “quality” is under-defined.
Blind Human Rating vs. LLM-as-Judge
LLM-as-judge introduces sycophancy toward SAT-structured outputs — structured outputs look more rigorous and will score higher on perceived quality even if the underlying reasoning is identical. Use blind human raters for qualitative claims.
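A minimal sketch of the blinding step for human rating follows. The `Output` type, the condition labels, and the `item-NNN` ID scheme are all hypothetical. Note that this hides only the condition label and presentation order; if the SAT structure is visible in the text itself, raters would additionally need a normalized form (for example, the extracted conclusion only), which is not handled here.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Output:
    condition: str  # hypothetical labels, e.g. "sat" or "baseline"
    text: str


def blind(
    outputs: List[Output], seed: int = 0
) -> Tuple[List[Tuple[str, str]], Dict[str, str]]:
    """Shuffle outputs and replace condition labels with opaque item IDs.

    Returns (rater_items, key): raters see only (item_id, text); the key
    mapping item_id -> condition stays with the experimenter until unblinding.
    """
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    shuffled = list(outputs)
    rng.shuffle(shuffled)

    rater_items: List[Tuple[str, str]] = []
    key: Dict[str, str] = {}
    for i, out in enumerate(shuffled):
        item_id = f"item-{i:03d}"
        rater_items.append((item_id, out.text))
        key[item_id] = out.condition
    return rater_items, key
```

Ratings collected against `item_id` can then be joined back to conditions through `key` once all scoring is complete.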
Operationalize “Quality” Specifically
Different hypotheses target different quality dimensions:
| Hypothesis | Quality Dimension |
|---|---|
| H1 (ACH) | Conclusion accuracy against ground truth |
| H2 (Devil’s Advocacy) | Resistance to social pressure / position stability |
| H3 (KAC) | Independence from initial framing |
| H4 (Red Team) | Adversarial robustness |
| H5 (Multi-agent) | Rate of genuine self-revision |
| H6 (What If?) | Calibration of expressed confidence |
| H7 (Epistemic labeling) | Hallucination rate on verifiable claims |
These are distinct. A system that scores well on H7 (hallucination) may score poorly on H2 (sycophancy resistance) — they target different failure modes.
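As one example of pinning a dimension down, H6's "calibration of expressed confidence" could be scored with expected calibration error (ECE). ECE is a common calibration metric, but choosing it here is an assumption of this sketch rather than something the hypothesis pages prescribe; the equal-width binning and `n_bins=10` are likewise illustrative defaults.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted mean of |average confidence - accuracy| over equal-width bins.

    confidences: stated probability in [0, 1] that each answer is right.
    correct: whether each answer matched ground truth.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("need at least one prediction")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence out of range: {conf}")
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A lower value means stated confidence tracks accuracy more closely. The other dimensions in the table need their own metrics; a low ECE says nothing about, for instance, position stability under pushback.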
The Roberts Anti-Pattern Warning
Scott Roberts (2025) found that single-prompt ACH is an anti-pattern: the model generates a complete narrative and then forces evidence to fit. This suggests that format compliance is not sufficient and that process order matters. Hypotheses H1, H3, and H7 all have strict step-ordering requirements that must be enforced architecturally, not just via prompt instruction.
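One way to read "enforced architecturally" is that each step runs as a separate call and the orchestrator, not the prompt, decides what runs next. The sketch below is an assumption-laden illustration: `OrderedPipeline`, the stage names in `ACH_STAGES`, and the stub functions (which stand in for separate model calls) are all hypothetical and do not come from Roberts or the hypothesis pages.

```python
from typing import Callable, Dict, List, Tuple

# A stage is (name, function); the function sees only earlier stages' outputs.
Stage = Tuple[str, Callable[[Dict[str, str]], str]]


class OrderedPipeline:
    """Runs stages strictly in the declared order and rejects anything else."""

    def __init__(self, stages: List[Stage]) -> None:
        self._stages = stages
        self._next = 0
        self.artifacts: Dict[str, str] = {}

    def run_stage(self, name: str) -> str:
        if self._next >= len(self._stages):
            raise RuntimeError("pipeline already finished")
        expected, fn = self._stages[self._next]
        if name != expected:
            raise RuntimeError(f"expected stage {expected!r}, got {name!r}")
        # Pass a copy so a stage cannot rewrite earlier artifacts.
        result = fn(dict(self.artifacts))
        self.artifacts[name] = result
        self._next += 1
        return result


# Hypothetical ACH-style ordering; each lambda stands in for one model call.
ACH_STAGES: List[Stage] = [
    ("generate_hypotheses", lambda prior: "H-a; H-b; H-c"),
    ("list_evidence", lambda prior: "E1; E2"),
    ("score_matrix", lambda prior: f"matrix over [{prior['generate_hypotheses']}]"),
    ("conclude", lambda prior: f"conclusion from {prior['score_matrix']}"),
]
```

Because the concluding stage cannot run until the matrix artifact exists, the model never gets the chance to write a full narrative first and fit the evidence afterwards.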
See Also
- SATs for LLM Agents — the underlying case for why SATs should work
- SAT Selection Guide — which SAT to use for which bias
- SAT Pipeline — how to chain SATs into testable workflows
- Bias × SAT Matrix — full cross-reference of bias → SAT
- System 2 — theoretical explanation for why H0 may hold