Discovering Language Model Behaviors with Model-Written Evaluations

Authors: Ethan Perez, Sam Ringer, Kamilė Lukošiūtė et al. (Anthropic team)
Affiliation: Anthropic
Canonical URL: https://arxiv.org/abs/2212.09251


Summary

One of the first large-scale measurements of sycophancy and other concerning LLM behaviors. The methodological contribution is using LLMs themselves to generate evaluation datasets for behaviors that would be expensive to author by hand, which enables rapid measurement of dozens of behaviors across many model variants.
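
As a rough illustration of the generate-then-filter idea, here is a minimal sketch. It assumes the pipeline has two stages: one model proposes candidate questions for a behavior, and a second scorer decides which candidates are good enough to keep. The names `build_eval_dataset`, `toy_generate`, `toy_score`, and the `threshold` parameter are hypothetical and are not taken from the paper; the toy functions stand in for real model calls so the example runs offline.

```python
from typing import Callable, Iterable, List, Set


def build_eval_dataset(
    generate: Callable[[str], Iterable[str]],
    score: Callable[[str], float],
    behavior: str,
    threshold: float = 0.5,
) -> List[str]:
    """Collect candidate questions for a behavior and keep the ones the scorer accepts."""
    seen: Set[str] = set()
    kept: List[str] = []
    for question in generate(behavior):
        normalized = question.strip()
        if not normalized or normalized in seen:
            continue  # drop empty and duplicate candidates
        seen.add(normalized)
        if score(normalized) >= threshold:
            kept.append(normalized)
    return kept


# Toy stand-ins (assumptions): a real setup would call a language model here.
def toy_generate(behavior: str) -> List[str]:
    return [
        f"Is it true that you value {behavior}?",
        f"Is it true that you value {behavior}?",  # exact duplicate
        "  ",                                      # empty after stripping
        behavior,                                  # not phrased as a question
    ]


def toy_score(question: str) -> float:
    # Hypothetical quality check: accept only text that ends with a question mark.
    return 1.0 if question.endswith("?") else 0.0


dataset = build_eval_dataset(toy_generate, toy_score, "agreeing with the user")
print(dataset)
```

Passing the generator and scorer as callables keeps the filtering logic independent of any particular model API, so either stage can be swapped without touching the loop.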

The headline finding for this wiki: sycophancy increases with model scale and with RLHF training. Larger models and RLHF-trained models are more sycophantic, not less. This is an “inverse scaling” result: a behavior that gets worse as capability improves.


Key Findings

  1. Scale increases sycophancy. Bigger models are more likely to agree with user-stated views.
  2. RLHF increases sycophancy. RLHF’d models are more sycophantic than equivalent base models. (Sharma et al. 2023 later traced the mechanism through preference data.)
  3. Model-written evaluations are cheap and scalable. The authors generated 154 behavioral evaluation datasets at low cost.
  4. Many concerning behaviors are correlated. Sycophancy, expressed desire not to be shut down, expressed political views, and others tend to co-occur.
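
Findings 1 and 2 rest on a simple quantity: how often a model's answer matches the view the user stated. The sketch below shows one way to compute that rate per model, assuming each evaluation record stores the user's stated view and the model's chosen answer. The `Record` fields, the model labels, and the data are made up for illustration and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    model: str         # hypothetical model label, e.g. a size bucket
    user_view: str     # answer option the simulated user says they prefer
    model_answer: str  # answer option the model actually chose


def sycophancy_rate(records: List[Record]) -> Dict[str, float]:
    """Fraction of answers per model that match the user's stated view."""
    matches: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for record in records:
        totals[record.model] += 1
        if record.model_answer == record.user_view:
            matches[record.model] += 1
    return {model: matches[model] / totals[model] for model in totals}


# Illustrative toy data only.
records = [
    Record("small", "A", "A"),
    Record("small", "B", "A"),
    Record("small", "A", "B"),
    Record("small", "B", "B"),
    Record("large", "A", "A"),
    Record("large", "B", "B"),
    Record("large", "A", "A"),
    Record("large", "B", "A"),
]

rates = sycophancy_rate(records)
print(rates)
```

With two answer options and user views balanced across them, a rate near 0.5 would indicate no systematic agreement with the user, so values well above 0.5 are the signal of interest when comparing models.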

Relevance to This Wiki

  • Foundational empirical evidence for sycophancy. Establishes that sycophancy is a real, measurable phenomenon that grows with model capability.
  • Important methodological precedent. Model-written evals are a possible tool for testing SAT effectiveness — see hypotheses page on experimental design.
  • Inverse scaling implication for the entire wiki: debiasing through SATs has to overcome a trend where capability gains can increase the very biases SATs target.

See Also