Claim
Plans generated with an explicit Red Team step contain more adversarially robust decisions than plans without it, as judged by domain experts.
Why Mirror Imaging Is the Target
Mirror Imaging leads LLMs, like human analysts, to model adversaries as rational actors who share the modeler's own values and constraints. An explicit Red Team step forces the model to reason about a maximally motivated opponent with different values, which makes it a process intervention against that default mirror.
Experimental Setup
- Domain scenarios with strong adversarial structure: security planning, business strategy, incident response, threat modeling
- Condition A: generate plan directly
- Condition B: generate plan → Red Team it (“model an intelligent adversary trying to defeat this plan”) → revise → final plan
- Domain experts (red team professionals, security architects, strategy consultants), blinded to condition, rate the final plan for adversarial robustness
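
The two conditions above can be sketched as a small pipeline. This is a minimal illustration, not the study's actual harness: the `Generate` type, the function names, and every prompt string except the quoted Red Team instruction are assumptions, and `generate` stands in for whatever model call is used.

```python
from typing import Callable

# Hypothetical type: any function that maps a prompt string to model text.
Generate = Callable[[str], str]

# Red Team instruction taken from the Condition B description.
RED_TEAM_INSTRUCTION = "Model an intelligent adversary trying to defeat this plan."


def condition_a(scenario: str, generate: Generate) -> str:
    """Condition A: generate the plan directly."""
    return generate(f"Scenario:\n{scenario}\n\nWrite a plan.")


def condition_b(scenario: str, generate: Generate) -> str:
    """Condition B: draft, red-team the draft, then revise into the final plan."""
    draft = generate(f"Scenario:\n{scenario}\n\nWrite a plan.")
    critique = generate(f"Plan:\n{draft}\n\n{RED_TEAM_INSTRUCTION}")
    return generate(
        f"Scenario:\n{scenario}\n\nDraft plan:\n{draft}\n\n"
        f"Red Team findings:\n{critique}\n\n"
        "Revise the plan to address these findings."
    )
```

Condition B makes three model calls to Condition A's one, so any robustness gain is confounded with extra compute. A control arm that drafts, self-reviews without an adversarial framing, and revises would be one way to separate the two effects.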
What to Measure
- Do Red-Teamed plans anticipate more attack vectors?
- Do they include more adversarially motivated edge cases?
- Do red team professionals, scoring against a defined rubric, rate them as harder to defeat?
- Do they fail “obvious” attacks at a lower rate?
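
If each scenario yields one rubric score per condition, the comparison can be run as a paired test. The sketch below uses a sign-flip permutation test, which is one reasonable choice rather than a method prescribed here; the function name, the pairing by scenario, and the default resample count are assumptions.

```python
import random
from statistics import mean


def paired_permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired rubric scores.

    a[i] and b[i] are the scores for the Condition A and Condition B plans
    of scenario i. Returns (mean difference B - A, estimated p-value).
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty, equal-length score lists")
    diffs = [y - x for x, y in zip(a, b)]
    observed = mean(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to have
        # either sign, so flip signs at random and recompute the mean.
        flipped = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (extreme + 1) / (n_resamples + 1)
```

Calling `paired_permutation_test(scores_a, scores_b)` returns the mean rating gain for Red-Teamed plans and a p-value. Count-based measures such as the number of anticipated attack vectors can be passed through the same function, one pair per scenario.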
Why It Could Fail
LLMs tend toward generic adversarial thinking regardless of context — “the adversary could exfiltrate data via API” is plausible-sounding but non-specific. Domain experts may find the adversarial modeling superficial. The deeper risk is cultural mirror imaging — see empirical evidence below.
Empirical Evidence
Work on cultural bias in LLMs gives direct empirical grounds for caution.
| Source | Finding | Implication |
|---|---|---|
| Durmus et al. (Anthropic, 2023) | LLM default responses align most with USA / Western European opinions (GlobalOpinionQA). Country-conditioning shifts opinion but introduces stereotyping. Translation alone does not shift opinion alignment. | The “adversary” an LLM red-teams against is, by default, a Western-flavored adversary. Persona prompts work imperfectly. |
| Atari et al. (2023), “Which Humans?” | The “WEIRD bias” critique: frontier models reflect Western, Educated, Industrialized, Rich, Democratic (WEIRD) priors. | For threat modeling against non-Western actors, default red-team outputs may be systematically miscalibrated. |
Practical implications for H4 design:
- Include a manipulation check: does the model's red-team output actually exhibit values and priorities different from those in its default plan, or does it merely reword that plan?
- Test specifically against non-Western adversary archetypes. A Western model red-teaming a Western plan may produce only superficial adversarial scenarios.
- The condition most likely to work: red-team against an explicit, specific adversary specification (cultural context, motivation, constraints) rather than a generic “an adversary”.
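
An explicit adversary specification can be kept as structured data and rendered into the red-team prompt, so every scenario names the same three fields. This is a minimal sketch: the `AdversarySpec` class, its field names, and the prompt wording are hypothetical, chosen only to mirror the cultural context, motivation, and constraints listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdversarySpec:
    """Hypothetical container for an explicit adversary specification."""

    cultural_context: str
    motivation: str
    constraints: tuple[str, ...] = ()

    def to_prompt(self, plan: str) -> str:
        """Render a red-team prompt that names this adversary, not a generic one."""
        constraint_text = "; ".join(self.constraints) or "none stated"
        return (
            "Red-team the plan below as the following adversary.\n"
            f"Cultural context: {self.cultural_context}\n"
            f"Motivation: {self.motivation}\n"
            f"Constraints: {constraint_text}\n\n"
            f"Plan:\n{plan}"
        )
```

Running the same plan through a generic prompt and through `AdversarySpec(...).to_prompt(plan)` gives two red-team outputs whose stated priorities can then be compared, which is one way to set up the manipulation check.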
See Also
- Red Team Analysis · Mirror Imaging · Outside-In Thinking
- Testable Hypotheses framework
- Durmus et al. — Global Opinions — the key empirical foundation