Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback
Authors: Katherine Tian, Eric Mitchell et al. (Stanford)
Venue: EMNLP 2023 (camera-ready)
Canonical URL: https://arxiv.org/abs/2305.14975
Summary
RLHF substantially degrades the token-probability calibration of language models: RLHF-tuned models are systematically overconfident in their conditional probabilities. Tian et al. show that verbalized confidence (asking the model “how confident are you?”) is typically better calibrated than those probabilities, often reducing expected calibration error (ECE) by a relative ~50% on TriviaQA, SciQ, and TruthfulQA.
The simple intervention is in the title: just ask. The information is recoverable; you just have to query for it directly rather than relying on the token logprobs.
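As a minimal sketch of what "just asking" can look like in practice, the snippet below builds a prompt that requests a guess plus a probability and parses the reply. The prompt wording, the `Guess:`/`Probability:` reply format, and the helper names `build_prompt` and `parse_response` are illustrative assumptions, not the exact prompts from the paper; how the prompt is sent to a model is deliberately left out.

```python
import re
from typing import Optional, Tuple

# Hypothetical prompt wording (an assumption, not the paper's exact prompt).
PROMPT_TEMPLATE = (
    "Answer the question, then state the probability (0.0 to 1.0) "
    "that your answer is correct.\n"
    "Use exactly this format:\n"
    "Guess: <short answer>\n"
    "Probability: <number between 0.0 and 1.0>\n\n"
    "Question: {question}"
)

_GUESS_RE = re.compile(r"^\s*Guess:\s*(.+?)\s*$", re.MULTILINE | re.IGNORECASE)
_PROB_RE = re.compile(
    r"^\s*Probability:\s*([0-9]*\.?[0-9]+)\s*(%?)", re.MULTILINE | re.IGNORECASE
)


def build_prompt(question: str) -> str:
    """Fill the question into the verbalized-confidence prompt."""
    return PROMPT_TEMPLATE.format(question=question)


def parse_response(text: str) -> Optional[Tuple[str, float]]:
    """Extract (guess, probability) from a model reply.

    Returns None when the reply does not follow the expected format or
    the stated probability falls outside [0, 1].
    """
    guess = _GUESS_RE.search(text)
    prob = _PROB_RE.search(text)
    if guess is None or prob is None:
        return None
    value = float(prob.group(1))
    if prob.group(2) == "%":  # tolerate replies such as "85%"
        value /= 100.0
    if not 0.0 <= value <= 1.0:
        return None
    return guess.group(1), value
```

The parsed probability, not the token logprob of the answer, is then used as the confidence score when measuring calibration.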
Key Findings
- RLHF wrecks token-probability calibration. Base models are well-calibrated; RLHF’d models are not.
- Verbalized confidence > logprobs. For ChatGPT, GPT-4, and Claude, asking for confidence in words typically yields better-calibrated scores than the model’s own conditional probabilities.
- Often a ~50% relative reduction in ECE across TriviaQA, SciQ, and TruthfulQA.
- Method generalizes across prompting strategies. Multiple verbalization formats work; the effect is not prompt-specific.
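To make the ECE figures above concrete, here is a minimal sketch of the standard equal-width-bin ECE estimator, plus the arithmetic behind a "relative reduction". The bin count of 10 and the function names are assumptions for illustration; the paper's exact evaluation settings may differ.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count, chosen for illustration
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    conf_sum = [0.0] * n_bins
    hits = [0] * n_bins
    count = [0] * n_bins
    for c, ok in zip(confidences, correct):
        if not 0.0 <= c <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes in the last bin
        conf_sum[idx] += c
        hits[idx] += int(ok)
        count[idx] += 1

    ece = 0.0
    for s, h, k in zip(conf_sum, hits, count):
        if k == 0:
            continue  # empty bins contribute nothing
        ece += (k / n) * abs(h / k - s / k)
    return ece


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


# Toy data: a model that says 0.9 on every answer but is right half the time.
overconfident = expected_calibration_error([0.9] * 4, [True, True, False, False])
# overconfident is approximately 0.4 (|0.5 - 0.9| in a single bin)
```

Under this definition, a drop in ECE from 0.4 to 0.2 is a 50% relative reduction, which is the sense in which the ~50% figure is reported.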
Relevance to This Wiki
- Supporting evidence for H6 (What If? reduces overconfidence). Tian et al. establish that asking for uncertainty in structured ways works. What If? is a specific structured prompt that should activate the same mechanism, though the paper does not test it directly.
- Critical for Overconfidence Bias. The bias can be reduced by prompt intervention, because the confidence information is already there.
- Empirical mirror of Kadavath et al. (2022). Kadavath et al. show that internal self-knowledge exists; Tian et al. show that you can extract it with a simple prompt.
- Combined with Kadavath et al., this supports the prompt-engineering protocol approach the wiki takes: SATs as ways to query for what the model already knows but doesn’t surface.
See Also
- Kadavath et al. — Know What They Know (2022)
- Overconfidence Bias
- Hallucination
- Lin, Hilton & Evans (2022) Teaching Models to Express Their Uncertainty in Words (arXiv:2205.14334)
- Xiong et al. (ICLR 2024) Can LLMs Express Their Uncertainty? (arXiv:2306.13063)