ACH-Grounding (suprathermal, GitHub)

“LLMs hallucinate. Various strategies exist to cope with that. This project implements one of them.”

A working open-source Python implementation of Analysis of Competing Hypotheses (ACH) used as a grounding mechanism for LLM output. Repo: https://github.com/suprathermal/ACH-Grounding

Core Claim

LLM statements should be organized as a matrix of evidence × hypotheses, with each cell scored for the degree and direction of support. The LLM only fills the matrix cells; the synthesis (ranking hypotheses, computing confidence) is performed by classical, deterministic code.

This separation is the central design pattern: LLMs do focused per-cell judgments where they are strong; combinatorial reasoning over the full matrix is delegated to algorithms where they are reliable.
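As a minimal sketch of this split, the matrix can be held in a plain data structure that the LLM only ever writes into one cell at a time, while ranking is ordinary code. The names `AchMatrix`, `set_cell`, and `rank_hypotheses`, the score scale of -1.0 to +1.0, and the use of a column mean as the synthesis step are all assumptions made for illustration; the summary above does not specify the repository's actual data model or statistics.

```python
from dataclasses import dataclass, field


@dataclass
class AchMatrix:
    """Evidence x hypotheses grid; each cell holds one support score."""
    hypotheses: list[str]
    evidence: list[str]
    # Assumed scale: -1.0 (strongly contradicts) to +1.0 (strongly supports).
    cells: dict[tuple[int, int], float] = field(default_factory=dict)

    def set_cell(self, e: int, h: int, score: float) -> None:
        """Store one per-cell judgment, e.g. parsed from a single LLM reply."""
        if not (0 <= e < len(self.evidence) and 0 <= h < len(self.hypotheses)):
            raise IndexError("cell outside the matrix")
        if not -1.0 <= score <= 1.0:
            raise ValueError("score must lie in [-1.0, 1.0]")
        self.cells[(e, h)] = score


def rank_hypotheses(matrix: AchMatrix) -> list[tuple[str, float]]:
    """Deterministic synthesis: mean support per hypothesis, best first.

    Raises KeyError if any cell has not been filled yet.
    """
    if not matrix.evidence:
        raise ValueError("matrix has no evidence rows")
    ranked = []
    for h, name in enumerate(matrix.hypotheses):
        column = [matrix.cells[(e, h)] for e in range(len(matrix.evidence))]
        ranked.append((name, sum(column) / len(column)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical 2 x 2 example with hand-entered scores.
matrix = AchMatrix(hypotheses=["H_a", "H_b"], evidence=["E_1", "E_2"])
matrix.set_cell(0, 0, 1.0)
matrix.set_cell(1, 0, 0.5)
matrix.set_cell(0, 1, -0.5)
matrix.set_cell(1, 1, 0.0)
ranking = rank_hypotheses(matrix)
```

Because `rank_hypotheses` never calls a model, the same filled matrix always yields the same ranking, which is the property the design relies on.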

Why This Architecture

suprathermal cites two papers asserting that LLM hallucinations on combinatorially complex problems are provably inevitable and cannot be arbitrarily reduced:

arXiv:2401.11817 — Hallucination is Inevitable: An Innate Limitation of Large Language Models (Xu et al.)
arXiv:2508.01781 — Follow-up on practical bounds

From this the author concludes that, wherever a defined level of analytic assurance is required, the combinatorial step must be moved out of the LLM. The ACH matrix is a structural fit because each cell is a local, low-context judgment.

Claimed Benefits

Stability against rogue/wrong observations — outliers don’t dominate
Stability against confirmation bias — every piece of evidence is checked against every hypothesis, so no single hypothesis becomes the anchor for evidence interpretation
Forced exhaustive cross-referencing — “interrogating all situational knowledge out of the LLM”
External grounding — pre-existing verified hypotheses or evidence can be injected via ExtraH / ExtraE config

Process

Take pre-existing hypotheses and evidence from the user (or have the LLM generate them)
Estimate the API cost; ask the user for approval if the estimate exceeds $1
(Optional) Generate additional hypotheses about the question/event
Collect supporting and contradicting evidence
Cross-validate each hypothesis against each piece of evidence to build a support matrix
Statistical analysis of the resulting matrix → likelihood scores per hypothesis
Output: per-hypothesis likelihood scores, printed to screen and saved as a CSV file in a time-stamped folder
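The cross-validation and output steps above can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: `build_support_matrix`, `save_scores`, `run_folder_name`, the folder name format, and the file name `scores.csv` are hypothetical, and `keyword_scorer` is a trivial stand-in for the per-cell LLM call.

```python
import csv
from datetime import datetime
from pathlib import Path
from typing import Callable, Optional


def build_support_matrix(
    evidence: list[str],
    hypotheses: list[str],
    score_cell: Callable[[str, str], float],
) -> list[list[float]]:
    """Score every (evidence, hypothesis) pair; rows follow evidence order."""
    return [[score_cell(e, h) for h in hypotheses] for e in evidence]


def run_folder_name(now: datetime) -> str:
    """Assumed time-stamp format for the output folder."""
    return now.strftime("%Y%m%d-%H%M%S")


def save_scores(
    scores: dict[str, float], base_dir: Path, now: Optional[datetime] = None
) -> Path:
    """Write per-hypothesis scores, best first, into a time-stamped folder."""
    folder = base_dir / run_folder_name(now or datetime.now())
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "scores.csv"
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["hypothesis", "score"])
        ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        for name, score in ordered:
            writer.writerow([name, f"{score:.3f}"])
    return path


def keyword_scorer(evidence_text: str, hypothesis_text: str) -> float:
    """Stand-in for the LLM: 1.0 if the hypothesis word occurs in the evidence."""
    return 1.0 if hypothesis_text.lower() in evidence_text.lower() else 0.0


demo_matrix = build_support_matrix(
    evidence=["Logs show a phishing email", "Badge records show insider access"],
    hypotheses=["phishing", "insider"],
    score_cell=keyword_scorer,
)
```

Passing the scorer in as a callable keeps the loop itself free of any model dependency, so the same code can be exercised with a stub before any API budget is spent.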

Internal RAG: the system feeds previously generated hypotheses and evidence back as context into subsequent calls, so that each new item is generated with knowledge of what already exists and the set stays coherent as it grows. The author calls this “the most rudimentary form of RAG.”
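One way to picture this feedback loop is a prompt builder that prepends everything collected so far. The function name `build_prompt`, its parameters, and the prompt wording are hypothetical; the actual prompts used by the project are not reproduced in this summary.

```python
from typing import Sequence


def build_prompt(
    question: str,
    task: str,
    hypotheses: Sequence[str] = (),
    evidence: Sequence[str] = (),
) -> str:
    """Assemble one prompt, replaying earlier items as context."""
    parts = [f"Question: {question}"]
    if hypotheses:
        parts.append("Hypotheses collected so far:")
        parts.extend(f"- {h}" for h in hypotheses)
    if evidence:
        parts.append("Evidence collected so far:")
        parts.extend(f"- {e}" for e in evidence)
    parts.append(f"Task: {task}")
    return "\n".join(parts)


# First call has no context; the later call replays the earlier hypothesis.
first = build_prompt("Why did the service go down?", "Propose one hypothesis.")
later = build_prompt(
    "Why did the service go down?",
    "Propose one new hypothesis that differs from those listed.",
    hypotheses=["A bad deploy caused the outage"],
)
```

Replaying the full list on every call is also one reason token use grows faster than linearly as items accumulate.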

Cost Model

The algorithm consumes O((|Evidence| + |Hypotheses|)²) tokens. The author is explicit that this is unsuitable for massive datasets — it is a “few hypotheses, few precious pieces of evidence, everything highly uncertain” tool.

This matches the human ACH use case: ACH is intended for hard, ambiguous, high-stakes questions, not for mass-data triage.
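To make the quadratic growth concrete, here is a rough estimator that also applies the $1 approval threshold from the process description. The constant `tokens_per_unit` and the price `usd_per_million_tokens` are invented placeholder values, and the function names are hypothetical; only the quadratic form and the $1 threshold come from the summary above.

```python
APPROVAL_THRESHOLD_USD = 1.00  # threshold stated in the process description


def estimate_tokens(n_evidence: int, n_hypotheses: int, tokens_per_unit: int = 1_000) -> int:
    """Quadratic model: tokens ~ c * (|Evidence| + |Hypotheses|)^2."""
    n = n_evidence + n_hypotheses
    return tokens_per_unit * n * n


def estimate_cost_usd(
    n_evidence: int,
    n_hypotheses: int,
    tokens_per_unit: int = 1_000,          # assumed constant factor
    usd_per_million_tokens: float = 10.0,  # assumed price, not a real quote
) -> float:
    """Convert the token estimate into dollars at the assumed price."""
    tokens = estimate_tokens(n_evidence, n_hypotheses, tokens_per_unit)
    return tokens * usd_per_million_tokens / 1_000_000


def needs_approval(cost_usd: float) -> bool:
    """True when the estimate exceeds the approval threshold."""
    return cost_usd > APPROVAL_THRESHOLD_USD


small = estimate_cost_usd(n_evidence=4, n_hypotheses=3)    # 7 items in total
doubled = estimate_cost_usd(n_evidence=8, n_hypotheses=6)  # 14 items in total
```

Under these placeholder constants, doubling both inputs quadruples the estimate and moves the run from below the approval threshold to above it.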

Hybrid-Workflow Support

Designed to accept input from:

Humans — analysts provide hypotheses and evidence via config files
LLMs — system autonomously generates hypotheses and evidence
Hybrid — analyst provides verified seeds; LLM extends; matrix evaluation runs over the union

The ExtraH and ExtraE parameters are the explicit hook for grounding: known-true facts and expert-asserted hypotheses are injected as constraints the LLM cannot override.
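A hybrid run can be sketched as a merge in which analyst-supplied items always come first and are never replaced by an LLM paraphrase. The `Item` type, the `merge_items` function, and the whitespace and case normalization used for de-duplication are assumptions for illustration; the real config format behind ExtraH and ExtraE is not shown in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    text: str
    source: str  # "analyst" for injected seeds, "llm" for generated items


def merge_items(seeds: list[str], generated: list[str]) -> list[Item]:
    """Union of seeds and generated items; seeds win on duplicates."""
    merged: list[Item] = []
    seen: set[str] = set()
    tagged = [(s, "analyst") for s in seeds] + [(g, "llm") for g in generated]
    for text, source in tagged:
        key = " ".join(text.lower().split())  # ignore case and extra spaces
        if key in seen:
            continue
        seen.add(key)
        merged.append(Item(text=text, source=source))
    return merged


# Hypothetical ExtraH-style seed plus two LLM proposals, one a near-duplicate.
extra_h = ["Insider leak"]
llm_h = ["insider  leak", "Phishing campaign"]
all_h = merge_items(extra_h, llm_h)
```

Keeping the `source` tag on each item lets later stages report which conclusions rest on verified input and which rest on generated material.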

Self-Stated Scope

This is: a working personal/small-team tool, an implementation of basic ACH, a practical demo of LLM grounding the author believes is novel.

This is not: an enterprise-grade tool, a solution for massive data, a clean prompting example, a production-ready RAG framework.

Significance

This is a concrete, working artifact for H1 (ACH reduces confirmation bias in LLM evidence evaluation). Developed independently of Roberts (2025), it reaches the same architectural conclusion through a different implementation:

Roberts	suprathermal
Multi-step sequential LLM calls (hypotheses → evidence → per-cell scoring)	Multi-step sequential LLM calls (hypotheses → evidence → per-cell scoring)
Single-prompt ACH is an anti-pattern	Single-prompt ACH is implicitly rejected (cell-by-cell scoring)
LLM does scoring; analyst totals	LLM fills matrix; classical algorithms total
Streamlit + GPT-4 + LangChain + Pydantic	Python, configurable, CSV output

Both designs converge on the same principle: LLMs do local judgments; structure is enforced outside the LLM. That two independent implementations arrive at the same split is supporting evidence for the design pattern.

Connections

ACH — direct implementation
Confirmation Bias — the bias the matrix structurally counters
Hallucination — the failure mode this design mitigates
H1 — corroborating implementation
Roberts (2025) — independent convergent implementation
Huang et al. (2023) — taxonomy of hallucination mitigation, with RAG as one strategy

Cited External Work

Wikipedia: Analysis of competing hypotheses
arXiv:2401.11817 — Hallucination inevitability
arXiv:2508.01781 — Follow-up bounds

suprathermal-ach-grounding-2024