Testable Hypotheses: SATs + LLM Quality (Framework)

This page is the framework for eight testable hypotheses about whether structured analytic techniques (SATs) improve LLM output quality. The meta-hypothesis H0 lives here because it applies to all the others. Each of H1–H7 has its own page covering the claim, experimental setup, failure modes, and empirical evidence, linked from the table below.

The Meta-Hypothesis (Test This First)

H0 — Structural compliance ≠ debiasing

This is the hypothesis to falsify before any other. Do LLMs genuinely reason differently when following SAT structure, or do they reformat the same biased output into a more rigorous-looking shell?

Test: Compare the semantic distance between hypotheses/conclusions in SAT-structured vs. unstructured outputs on identical inputs. Low semantic distance = compliance theater — the model is satisfying the format without changing the reasoning.
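The comparison above can be sketched in a few lines. This is a minimal illustration, not a prescribed protocol: `bag_of_words` is a toy stand-in for a real sentence-embedding model, and `semantic_distance` and its `embed` parameter are hypothetical names introduced here. Any threshold for what counts as "low" distance would have to be set empirically.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty text as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def semantic_distance(structured_conclusion, unstructured_conclusion,
                      embed=bag_of_words):
    """Distance between the conclusions of the two conditions.

    Pass in only the conclusion text, with SAT scaffolding (headings,
    matrices) stripped, so the format itself does not inflate the distance.
    """
    return cosine_distance(embed(structured_conclusion),
                           embed(unstructured_conclusion))
```

In a real run, `embed` would be swapped for an actual embedding model; the bag-of-words default only keeps the sketch self-contained.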

If H0 holds (compliance without change), all downstream hypotheses are confounded. A positive result on H1–H7 obtained without first ruling out compliance theater just means SAT-formatted outputs look better to raters, not that reasoning improved.

H0 is most likely confounded with sycophancy: models may perform SAT compliance for the same reason they sycophantically agree, namely that structured output is the kind of response an RLHF-trained reward model rewards. See Sharma 2023 for the evidence that the reward landscape favors form over substance.

The Seven Testable Hypotheses

#	Hypothesis	Bias targeted	Empirical status
H1	ACH improves conclusion accuracy on ambiguous evidence	Confirmation bias	Partial caution (MITRE 2004); untested in LLMs
H2	Devil’s Advocacy maintains positions under pushback	Sycophancy	Strong caution (Sharma, Nemeth 2001)
H3	KAC breaks framing-driven anchoring	Anchoring	Partial support (Echterhoff)
H4	Red Team produces adversarially robust plans	Mirror imaging	Caution from cultural-bias work (Durmus)
H5	Multi-agent pipelines outperform single-agent chains	Groupthink	Partial support (Du); shared-model caveat
H6	What If? reduces overconfidence / improves calibration	Overconfidence	Strong (Tian, Kadavath)
H7	Epistemic labeling reduces confident hallucination rate	Hallucination	Mechanism confirmed (Kadavath); protocol untested

Bottom line: the structural argument has empirical backing (the biases are real in LLMs, and the mechanisms the SATs target are internally accessible). The intervention argument — that the specific SAT protocols actually exploit those mechanisms — is largely untested. The closest existing work is Echterhoff’s BiasBuster (partial validation of H3) and Du’s multi-agent debate (partial validation of H5).

H6 has the strongest mechanistic backing: Tian directly shows that uncertainty prompts work, and Kadavath shows that the information is internally present. The specific What If? variant is the only untested piece.

Experimental Design Principles

These apply to all seven hypotheses.

Ground Truth Is the Hardest Problem

The cleanest experiments use scenarios with known outcomes:

Historical incidents where the correct attribution is established
Red team exercises with pre-tested solutions
Factual Q&A against source documents with verifiable answers

Open-ended planning is hard to score — human rater disagreement is high, and “quality” is under-defined.
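Rater disagreement can at least be quantified before any quality claim is made. The sketch below computes Cohen's kappa for two raters, one common chance-corrected agreement statistic; the choice of kappa and the function name are assumptions made for illustration, not something this framework specifies.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Fraction of items on which the raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A kappa near zero on open-ended planning outputs would suggest the "quality" rubric is too loosely defined to support any hypothesis test.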

Blind Human Rating vs. LLM-as-Judge

LLM-as-judge introduces sycophancy toward SAT-structured outputs: structured outputs look more rigorous and will score higher on perceived quality even if the underlying reasoning is identical. Use blind human raters for qualitative claims.
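One way to set up blind rating is to shuffle the outputs and replace condition labels with opaque IDs, keeping the answer key away from raters. The following is a minimal sketch under that assumption; `blind_items` and the item ID scheme are hypothetical names, not part of any existing tool.

```python
import random


def blind_items(outputs, seed=0):
    """Shuffle outputs and hide which condition produced each one.

    outputs: list of (condition, text) pairs, e.g. ("sat", "...").
    Returns (rater_sheet, answer_key). Only rater_sheet goes to raters.
    """
    rng = random.Random(seed)  # fixed seed keeps the shuffle reproducible
    order = list(range(len(outputs)))
    rng.shuffle(order)
    rater_sheet = []
    answer_key = {}
    for position, index in enumerate(order):
        condition, text = outputs[index]
        item_id = f"item-{position:03d}"
        rater_sheet.append({"id": item_id, "text": text})
        answer_key[item_id] = condition
    return rater_sheet, answer_key
```

Hiding the label does not hide the formatting itself. Whether to normalize formatting before rating is a separate design decision that this sketch does not address.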

Operationalize “Quality” Specifically

Different hypotheses target different quality dimensions:

Hypothesis	Quality Dimension
H1 (ACH)	Conclusion accuracy against ground truth
H2 (Devil’s Advocacy)	Resistance to social pressure / position stability
H3 (KAC)	Independence from initial framing
H4 (Red Team)	Adversarial robustness
H5 (Multi-agent)	Rate of genuine self-revision
H6 (What If?)	Calibration of expressed confidence
H7 (Epistemic labeling)	Hallucination rate on verifiable claims

These are distinct. A system that scores well on H7 (hallucination) may score poorly on H2 (sycophancy resistance) — they target different failure modes.
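As an example of operationalizing one dimension, the H6 calibration measure could be scored with expected calibration error (ECE). This is a sketch assuming the model's stated confidence has already been parsed into a number between 0 and 1 and each answer has been marked correct or incorrect against ground truth; ECE is one common choice and the bin count is arbitrary.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between stated confidence and accuracy, per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two equal-length, non-empty sequences")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the top bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece
```

Lower is better. Comparing ECE between the What If? condition and a baseline on the same questions would give H6 a single number to test, independent of the other six dimensions.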

The Roberts Anti-Pattern Warning

Scott Roberts (2025) found that single-prompt ACH is an anti-pattern: the model generates a complete narrative and then forces evidence to fit. This suggests that format compliance is not sufficient and that process order matters. Hypotheses H1, H3, and H7 all have strict step-ordering requirements that must be enforced architecturally, not just via prompt instruction.
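One possible reading of "enforced architecturally" is that each step is a separate model call that only sees the outputs of the steps before it, so the conclusion cannot be written before the evidence is scored. The sketch below illustrates that idea. It is not Roberts' implementation: `run_staged`, `ACH_STAGES`, the stage wording, and the `call_model` callable are all hypothetical.

```python
def run_staged(case_text, stages, call_model):
    """Run stages in order; each prompt sees only earlier stage outputs.

    stages: ordered list of (name, instruction) pairs.
    call_model: any callable taking a prompt string and returning text.
    """
    results = {}
    for name, instruction in stages:
        prior = "\n\n".join(f"[{done}]\n{text}" for done, text in results.items())
        prompt = f"{instruction}\n\nCase:\n{case_text}"
        if prior:
            prompt += f"\n\nOutputs of earlier stages:\n{prior}"
        results[name] = call_model(prompt)
    return results


# Illustrative stage list loosely following the ACH steps (assumed wording).
ACH_STAGES = [
    ("hypotheses", "List competing hypotheses. Do not evaluate them yet."),
    ("evidence", "List the evidence items in the case. Do not draw conclusions."),
    ("matrix", "Rate each evidence item as consistent or inconsistent with each hypothesis."),
    ("conclusion", "Using only the matrix, state which hypothesis has the least inconsistent evidence."),
]
```

Because the ordering lives in the calling code rather than in the prompt, the model has no opportunity to draft a narrative first and fit evidence to it afterwards within a single generation.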
