Assessing the Value of Structured Analytic Techniques in the U.S. Intelligence Community

Authors: Stephen Artner, Richard S. Girven, James B. Bruce
Publisher: RAND Corporation, National Defense Research Institute (sponsored by the Office of the Secretary of Defense)
Publication: RR-1408-OSD, 2016
Canonical URL: https://www.rand.org/pubs/research_reports/RR1408.html

Summary

A research design proposal and pilot study from RAND on how the U.S. Intelligence Community (IC) could systematically evaluate the effectiveness of Structured Analytic Techniques. The headline finding: the IC has greatly expanded its use of SATs since the WMD intelligence failures, but has done almost no systematic evaluation of whether they actually improve analysis. RAND reviewed 63 finished intelligence products from CIA, DIA, and NIC; only a minority used explicit SATs, but those that did “addressed a broader range of potential outcomes and implications than did other analyses.” The report proposes a multi-method evaluation framework (SAT reviews, structured interviews, controlled experiments) and surveys the limited prior empirical literature.

This is the most authoritative public source on SAT evaluation. It is also explicitly not a definitive answer — it is a proposal for the work that needs to happen.

Key Quotes

“If a structured analytic technique is specifically designed to mitigate or avoid one of the proven problems in human thought processes, and if the technique appears to be successful in doing so, that technique can be said to have face validity.” — citing Heuer & Pherson (2011, p. 309)

“Like any tool, SATs depend on the analytic skills and substantive expertise of their users to be effective. They are unlikely to benefit analysis if they become mechanical processes or box-checking exercises rather than aids to imaginative thinking.”

“We believe the use of SATs in a minority of CIA, DIA, and NIC documents provided more-sophisticated analysis than would have been possible with straight-line projections assuming continuity.”

On consumers: “We doubt that high-level policy officials will have the time or inclination to retrace the logic leading to several opposing outcomes if the analysis does not present a clear bottom line.”

Empirical Findings on Specific SATs

This is the most important section for the wiki — RAND surveyed the limited prior experimental evidence:

Study	Finding	Implication
Folker (Joint Military Intelligence College, 2000)	Structured hypothesis testing markedly improved analytic accuracy in one of two controlled experiments among theater Joint Intelligence Center analysts	Mixed results
Cheikes et al. (MITRE, 2004)	ACH reduced confirmation bias only among participants who lacked a professional intelligence background	Expert analysts may already be doing what ACH formalizes; ACH is not free debiasing
Tetlock (Expert Political Judgment, 2005)	Scenario development reduced the accuracy of predictions in two experiments	But Heuer and Pherson note that the IC uses scenarios to outline possible futures, not to predict
Nemeth, Brown & Rogers (2001)	Devil’s advocate technique does not necessarily promote genuine reexamination of assumptions; in some cases heightens confidence in preferred hypotheses	Formal devil’s advocacy may backfire; authentic dissent better
Khalsa (2009)	Systematic processes combining structured techniques and intuition improve outcomes across natural sciences, medicine, oil exploration, psychology	But few studies focused on intelligence specifically

“None of [IC] agencies is routinely measuring the use of SATs or how they affect the quality of analysis. The few relevant studies are dated, produced mixed results, and did not fully replicate conditions in the IC.”

RAND Pilot Study Findings (63-document sample)

CIA Intelligence Assessments (n=29):

23 showed no evidence of SAT use; 6 used SATs explicitly
The 6 that used SATs: 3 used alternative scenarios from facilitated brainstorming, 1 used Key Assumptions Check, 1 used Team B, 1 used Indicators
IAs without SATs tended toward “single-point outcomes assuming continuity” — consensus view with no examination of underlying assumptions

NIC products (n=14): Only 4 of 14 used SATs (2 centered on them, 2 in boxes/appendices)

Multi-topic sample (n=20): 8 of 20 used SATs (5 alternative scenarios, 1 brainstorming, 1 indicators, 1 structured analogies)

Across all 63: SAT use was in the minority (18 of 63 documents, summing the three samples above). Where used, SATs conferred clear advantages on the ICD 203 tradecraft criteria of distinguishing assumptions from judgments and using logical argumentation, and especially on incorporating alternative analysis.
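
The sample counts above can be tallied in a few lines. The sketch below is only arithmetic over the figures quoted in this note; the variable and function names are illustrative and do not come from the RAND report. It takes the per-sample counts at face value and derives the pooled share.

```python
# Counts as reported in this note:
# (sample, documents reviewed, documents with explicit SAT use).
PILOT_SAMPLES = [
    ("CIA Intelligence Assessments", 29, 6),
    ("NIC products", 14, 4),
    ("Multi-topic sample", 20, 8),
]


def sat_usage(samples):
    """Return per-sample SAT shares plus the pooled document totals."""
    per_sample = {name: used / total for name, total, used in samples}
    total_docs = sum(total for _, total, _ in samples)
    total_used = sum(used for _, _, used in samples)
    return per_sample, total_docs, total_used


if __name__ == "__main__":
    per_sample, total_docs, total_used = sat_usage(PILOT_SAMPLES)
    for name, share in per_sample.items():
        print(f"{name}: {share:.0%}")
    print(f"All samples: {total_used} of {total_docs} ({total_used / total_docs:.0%})")
```

Pooling the three samples gives 18 of 63 documents, a little under 30 percent.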

Key Concepts Introduced

SAT review — a proposed variant of the analytic line review, focused on a body of work to assess SAT incidence, contribution to key judgments, and effect on quality
Puzzles / Mysteries / Complexities (via Pherson’s colleague Greg Treverton):
- Puzzles — factual questions definitively answerable given evidence
- Mysteries — contingent future developments (e.g. foreign leader decisions)
- Complexities — nonlinear, largely unpredictable interactions in adaptive systems (e.g. world economy)
- SAT applicability varies by problem type — scenario techniques fit mysteries, not puzzles; diagnostic techniques (KAC) fit both
Face validity vs. empirical validity — most SATs have the former (logical-design argument), few have the latter
ICD 203 standards — 8 criteria for IC analytic tradecraft (sources/uncertainties/assumptions/alternatives/relevance/argumentation/consistency/accuracy)

Proposed Evaluation Methods

SAT reviews — qualitative review of body of products on a topic
Structured interviews of analysts, managers, methodologists, consumers, briefers
Correlation analysis — does SAT use correlate with ODNI Office of Analytic Integrity and Standards scores?
Controlled experiments that replicate IC group-interaction conditions (prior experiments tested individuals, not groups — major limitation)
IC-wide survey modeled on RAND/USMC tradecraft survey
Case studies on intelligence puzzles (where ground truth becomes available) and mysteries (where outcomes can be assessed) — measure SAT contribution to accuracy
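
The correlation-analysis item in the list above can be sketched as a point-biserial correlation, that is, a Pearson correlation between a binary SAT-use flag and a numeric quality score. This is a minimal sketch, not RAND's procedure: the function name, the 0/1 coding, the 1-5 scale, and the sample scores are all hypothetical, and no real ODNI ratings are used.

```python
import math


def point_biserial(used_sat, scores):
    """Pearson correlation between a 0/1 SAT-use flag and numeric quality scores."""
    if len(used_sat) != len(scores) or len(scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(scores)
    mean_x = sum(used_sat) / n
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(used_sat, scores))
    var_x = sum((x - mean_x) ** 2 for x in used_sat)
    var_y = sum((y - mean_y) ** 2 for y in scores)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when either variable is constant")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical data: 1 = product used an explicit SAT; scores on an invented 1-5 scale.
    used = [0, 0, 0, 1, 1, 0, 1, 0]
    quality = [2.5, 3.0, 2.0, 4.0, 3.5, 3.0, 4.5, 2.5]
    print(f"r = {point_biserial(used, quality):.2f}")
```

A positive r would show only association. It would not establish that SAT use caused the higher scores, which is one reason the list also includes controlled experiments.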

Best Practices Identified

For scenario products specifically:

Be explicit about how key drivers were selected and why over alternatives
Document the methodology used to construct scenarios (in a box or appendix)
Include concrete, observable indicators that signal which outcome is becoming more likely
Track scenarios over time as baseline studies, not one-off products
Consider whether SAT-centered products suit senior consumers; often they do not, so RAND recommends creating internal SAT-driven products for the working level

Entities Mentioned (with wiki pages)

Stephen Artner — lead author
Richard S. Girven — co-author
James B. Bruce — co-author
Richards J. Heuer Jr. — extensively cited
Randolph H. Pherson — extensively cited
RAND Corporation — publisher
Office of the Director of National Intelligence — context
Greg Treverton — cited for the puzzles/mysteries/complexities typology
Philip Tetlock — cited for the Expert Political Judgment (2005) findings
Sundri Khalsa — Marine Corps intelligence expert cited
Stephen Marrin — cited on the difficulty of measuring analytic quality

Relation to LLM Agentic Systems

The RAND findings have direct relevance to the central thesis of this wiki — see Testable Hypotheses: SATs + LLM Quality:

The MITRE 2004 finding (ACH helps non-experts but not experts) has a direct LLM analog: does ACH help a general-purpose LLM more than a domain-tuned model? This is testable and is an important variant of H1.
The Nemeth (2001) finding (formal devil’s advocacy may increase confidence in preferred hypotheses) is a serious caution for H2. The same risk applies to LLM devil’s advocate prompts that are treated as performative rather than genuine.
The Tetlock (2005) finding (scenarios reduced prediction accuracy) is a serious caution for any LLM forecast-via-scenarios setup. It aligns with the framing-anchoring risk noted in Anchoring Bias.
The general “face validity vs. empirical validity” distinction is the meta-warning embedded in H0 — SATs look rigorous; we need to verify they are rigorous in LLM contexts.
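
One way to make the H1 variant in the first point concrete is a 2x2 comparison: general-purpose versus domain-tuned model, with and without an ACH prompt. The sketch below computes how much more ACH reduces bias for one model than for the other. Everything in it is an assumption for illustration: the bias metric, the condition labels, and the numbers are hypothetical and are not drawn from RAND or from any experiment.

```python
from statistics import mean


def ach_effect(bias_without, bias_with):
    """Mean reduction in a hypothetical confirmation-bias score when ACH is used."""
    return mean(bias_without) - mean(bias_with)


def interaction(general, tuned):
    """How much more ACH helps the general model than the domain-tuned one.

    Each argument is a dict with 'without_ach' and 'with_ach' lists of bias
    scores (lower is better). A positive result would match the pattern
    reported for non-experts versus experts.
    """
    return (ach_effect(general["without_ach"], general["with_ach"])
            - ach_effect(tuned["without_ach"], tuned["with_ach"]))


if __name__ == "__main__":
    # Invented scores, used purely to exercise the functions.
    general = {"without_ach": [0.6, 0.7, 0.65], "with_ach": [0.4, 0.45, 0.5]}
    tuned = {"without_ach": [0.35, 0.4, 0.3], "with_ach": [0.35, 0.3, 0.4]}
    print(f"interaction = {interaction(general, tuned):+.2f}")
```

A real test would also need enough trials per cell to judge whether the interaction exceeds run-to-run noise.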

Confidence

High for what RAND directly claims (its pilot study findings, its literature review, and its proposed methodology). Medium for the broader-applicability claims: RAND itself emphasizes that the pilot is illustrative, not definitive.

Artner, Girven & Bruce — Assessing SATs (RAND RR1408, 2016)