Anthropic

AI safety company. Anthropic produces a substantial fraction of the published empirical work on LLM bias, sycophancy, calibration, and alignment that is directly relevant to this wiki, including the Sharma et al. sycophancy paper, the Perez et al. model-written evaluations, the Kadavath et al. introspection work, and the Durmus et al. global opinions paper.

The methodological approach across Anthropic’s publications is consistent: large-scale empirical measurement of behaviors, with explicit attention to scaling effects (whether a behavior improves or worsens as model capability increases) and to the role of RLHF in shaping that behavior.

Wiki Sources

Sharma et al. — Sycophancy (2023)
Perez et al. — Model-Written Evaluations (2022)
Kadavath et al. — Know What They Know (2022)
Durmus et al. — Global Opinions (2023)
