Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback

Authors: Katherine Tian, Eric Mitchell et al. (Stanford)
Venue: EMNLP 2023 (camera-ready)
Canonical URL: https://arxiv.org/abs/2305.14975

Summary

RLHF dramatically degrades the token-probability calibration of language models: RLHF’d models are systematically overconfident in their conditional probabilities. But Tian et al. show that verbalized confidence (asking the model “how confident are you?”) recovers much of that calibration, often reducing expected calibration error by a relative ~50% on TriviaQA, SciQ, and TruthfulQA.

The simple intervention is in the title: just ask. The information is recoverable; you just have to query for it directly rather than relying on the token logprobs.
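A minimal sketch of what "just asking" can look like in practice. The prompt wording, the `Guess:` / `Probability:` response format, and the helper names `build_prompt` and `parse_response` are illustrative assumptions, not the paper's exact prompts or code. No model is called here; the sketch only builds the prompt and parses a reply.

```python
import re

# Hypothetical prompt template: asks for an answer plus a verbalized
# probability. The wording is an assumption for illustration only.
PROMPT_TEMPLATE = (
    "Provide your best guess and the probability that it is correct "
    "(0.0 to 1.0) for the following question.\n"
    "Respond in exactly this format:\n"
    "Guess: <short answer>\n"
    "Probability: <number between 0.0 and 1.0>\n\n"
    "Question: {question}"
)

_GUESS_RE = re.compile(r"^\s*Guess:\s*(.+?)\s*$", re.MULTILINE)
_PROB_RE = re.compile(
    r"^\s*Probability:\s*([0-9]*\.?[0-9]+)\s*(%?)\s*$", re.MULTILINE
)


def build_prompt(question: str) -> str:
    """Fill the (hypothetical) template with a question."""
    return PROMPT_TEMPLATE.format(question=question)


def parse_response(text: str):
    """Return (guess, probability) or None if the reply is malformed.

    Accepts either a fraction ("0.9") or a percentage ("90%").
    """
    guess = _GUESS_RE.search(text)
    prob = _PROB_RE.search(text)
    if guess is None or prob is None:
        return None
    value = float(prob.group(1))
    if prob.group(2) == "%":
        value /= 100.0
    if not 0.0 <= value <= 1.0:
        return None  # out-of-range confidence is treated as unparseable
    return guess.group(1), value
```

The parsed probability is used in place of the token logprob as the confidence score. Replies that fail to parse return `None`, so the caller has to decide whether to retry or drop them.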

Key Findings

RLHF wrecks token-probability calibration. Base models are well-calibrated; RLHF’d models are not.
Verbalized confidence > logprobs. For ChatGPT, GPT-4, and Claude, asking for confidence in words yields better-calibrated scores than the model’s own conditional probabilities.
~50% relative reduction in expected calibration error (ECE) on three benchmarks: TriviaQA, SciQ, and TruthfulQA.
Method generalizes across prompting strategies. Multiple verbalization formats work; the effect is not prompt-specific.
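For reference, expected calibration error (ECE) measures the gap between stated confidence and observed accuracy, averaged over confidence bins. The sketch below uses the standard equal-width binning definition; the bin count of 10 and the function name are assumptions, and the paper's exact evaluation settings may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|.

    confidences: floats in [0, 1], one per prediction
    correct:     booleans, True where the prediction was right
    n_bins:      number of equal-width bins (assumed default, not from the paper)
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if is_correct else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

An overconfident model that always states 1.0 but is right only half the time has an ECE of 0.5 under this definition. A "relative reduction" compares two such scores, for example logprob-based versus verbalized confidence on the same questions.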

Relevance to This Wiki

Direct evidence for H6 (What If? reduces overconfidence). Tian et al. establish that asking for uncertainty in structured ways works. What If? is a specific structured prompt that should activate the same mechanism.
Critical for Overconfidence Bias. The bias is at least partly correctable by prompt intervention; the information is there.
Empirical mirror of Kadavath et al. (2022). Kadavath et al. show that internal self-knowledge exists; Tian et al. show that you can extract it with a simple prompt.
Combined with Kadavath et al., this supports the prompt-engineering protocol approach the wiki takes: SATs as ways to query for what the model already knows but doesn’t surface.
