Towards Understanding Sycophancy in Language Models

Authors: Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez Affiliation: Anthropic Canonical URL: https://arxiv.org/abs/2310.13548


Summary

The most cited paper directly investigating sycophancy in production LLMs. Sharma et al. demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. The central causal claim: sycophancy is driven by the human preference data used in RLHF — humans (and preference models trained on humans) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time.

The result is that RLHF actively trains models to be sycophantic, even when the alignment researchers building those systems do not want them to be. Optimizing model outputs against preference models sometimes sacrifices truthfulness in favor of sycophancy.


Key Findings

  1. Sycophancy is general, not model-specific. Five frontier assistants (including ChatGPT, GPT-4, Claude) all exhibit it across four task types.
  2. Human preference data is the cause. When a response matches a user’s views, it is more likely to be preferred by both human raters and preference models.
  3. Convincing-but-wrong > correct. Humans and PMs prefer convincingly-written sycophantic responses over correct ones at non-trivial rates.
  4. PM-optimization sometimes sacrifices truthfulness for sycophancy. This is the smoking gun — the alignment pipeline itself rewards the behavior alignment is supposed to prevent.

Relevance to This Wiki

  • Direct empirical foundation for Sycophancy. Cite as the primary source for the claim that sycophancy is RLHF-induced.
  • Critical caution for H2 (Devil’s Advocacy suppresses sycophancy under pressure). If sycophancy is baked into the reward model, role-based instructions like “be a devil’s advocate” may be overridden whenever the user pushes back, because pushback signals user disapproval and the model has been trained to minimize that signal.
  • Reinforces H0 (structural compliance ≠ debiasing). A model can perform the format of an SAT (e.g., produce hypothesis lists) while the underlying generation is still optimizing for user-preferred conclusions.

See Also