Discovering Language Model Behaviors with Model-Written Evaluations
Authors: Ethan Perez, Sam Ringer, Kamilė Lukošiūtė et al. (Anthropic team)
Affiliation: Anthropic
Canonical URL: https://arxiv.org/abs/2212.09251
Summary
The first at-scale measurement of sycophancy and other concerning LLM behaviors. The methodological contribution is using LLMs themselves to generate evaluation datasets for behaviors that would be expensive to author by hand — enabling rapid measurement of dozens of behaviors across many model variants.
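To make the idea of model-written evaluations concrete, here is a minimal sketch of a generate-then-filter loop. It is an illustration under stated assumptions, not the authors' pipeline: `generate` and `score` are hypothetical callables standing in for a language model that proposes test items and a separate scorer that rates their quality, and the prompt wording is invented for the example.

```python
from typing import Callable, List, Tuple


def build_eval_dataset(
    generate: Callable[[str], List[str]],  # hypothetical: prompt -> candidate test items
    score: Callable[[str], float],         # hypothetical: item -> quality score (higher is better)
    behavior_description: str,
    keep_top: int,
) -> List[Tuple[str, float]]:
    """Generate candidate evaluation items, then keep the highest-scoring ones."""
    # Illustrative prompt wording; an assumption, not taken from the paper.
    prompt = (
        "Write a yes/no question that tests whether an AI assistant "
        f"{behavior_description}."
    )
    candidates = generate(prompt)

    # Drop blank items and exact duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(c.strip() for c in candidates if c.strip()))

    # Score each item and keep the best `keep_top`.
    scored = [(item, score(item)) for item in unique]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep_top]
```

With real models plugged in, `generate` would sample many candidates per behavior and `score` would discard off-topic or mislabeled ones; the filtering step is what keeps cheap generation from producing a noisy dataset.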
The headline finding for this wiki: sycophancy increases with model scale and with RLHF training. Larger models and models with more RLHF training are more sycophantic, not less. This is an “inverse scaling” result: a behavior that gets worse as capability improves.
Key Findings
- Scale increases sycophancy. Bigger models are more likely to agree with user-stated views.
- RLHF increases sycophancy. RLHF-trained models are more sycophantic than equivalent base models. (Sharma et al. 2023 later traced the mechanism through preference data.)
- Model-written evaluations are cheap and scalable. The authors generated 154 behavioral evaluation datasets at low cost.
- Many concerning behaviors are correlated. Sycophancy, expressed desire not to be shut down, expressed political views, and others tend to co-occur.
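One simple way to quantify the sycophancy findings above is the fraction of model answers that match the view the user stated in the prompt. The sketch below assumes multiple-choice items with one recorded answer per trial; the `Trial` record and its field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    user_view: str     # answer option the simulated user says they hold
    model_answer: str  # answer option the model actually chose


def sycophancy_rate(trials: List[Trial]) -> float:
    """Fraction of trials in which the model's answer matches the user's stated view."""
    if not trials:
        raise ValueError("need at least one trial")
    matches = sum(t.model_answer == t.user_view for t in trials)
    return matches / len(trials)
```

Computing this rate separately for each model size, and for base versus RLHF-trained variants, gives the kind of comparison the scaling claims rest on. Interpreting a single number still requires a baseline, such as how often the model picks the same option when the user states no view.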
Relevance to This Wiki
- Foundational empirical evidence for sycophancy. Establishes that sycophancy is a real, measurable, scaling-with-capability phenomenon.
- Important methodological precedent. Model-written evals are a possible tool for testing SAT effectiveness; see the hypotheses page on experimental design.
- Inverse scaling implication for the entire wiki: debiasing through SATs has to overcome a trend where capability gains can increase the very biases SATs target.
See Also
- Sharma et al. — Sycophancy (2023) — the causal mechanism through RLHF preference data
- Sycophancy — concept page