Poster in Workshop: Attributing Model Behavior at Scale (ATTRIB)
Weak-to-Strong Confidence Prediction
Yukai Yang · Tracy Zhu · Marco Morucci · Tim G. J. Rudner
As large language models (LLMs) are increasingly deployed across a wide range of application domains, understanding their capabilities through uncertainty, especially in open-ended domains, is crucial to ensuring that they operate safely and reliably. Well-calibrated uncertainty estimates accompanying the text generated by an LLM can indicate the likelihood of an incorrect response and can therefore serve as an effective fail-safe mechanism against hallucinations. Unfortunately, despite a growing body of research on uncertainty quantification in LLMs, existing methods largely fail to provide reliable uncertainty estimates in practice, and the lack of comparability across methods makes measuring progress difficult. This motivates the development of more robust methods for predicting whether frontier models are able to provide a factual response to a given prompt. In this paper, we show that the probability of a frontier model providing a factually correct answer to a query can be predicted with high accuracy from smaller, weaker models. We believe that this work contributes to a deeper understanding of model capacity, particularly in terms of weak-to-strong generalization, and facilitates the creation of more trustworthy LLMs.
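To make the core idea concrete, the sketch below shows one hypothetical way to set up weak-to-strong confidence prediction: features derived from a weaker model (e.g., its token log-probabilities or self-reported confidence) are used to fit a probabilistic predictor of whether a stronger frontier model answers each prompt correctly. The feature set, the synthetic data, and the choice of a logistic-regression predictor are illustrative assumptions, not the method or experimental setup described in the paper.

```python
# Hypothetical sketch of weak-to-strong confidence prediction.
# Assumption: per-prompt features come from a weaker model, labels indicate
# whether a stronger model's answer was judged factually correct.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder weak-model features per prompt (e.g., mean token log-prob,
# answer entropy, self-reported confidence); drawn synthetically here.
n_prompts = 1000
weak_features = rng.normal(size=(n_prompts, 3))

# Placeholder labels: 1 if the strong model's response was graded as factually
# correct, 0 otherwise. In practice these would come from grading the strong
# model's answers against references; here they are simulated.
true_logits = weak_features @ np.array([1.5, -1.0, 0.5])
strong_correct = (rng.random(n_prompts) < 1.0 / (1.0 + np.exp(-true_logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    weak_features, strong_correct, test_size=0.2, random_state=0
)

# Fit a probabilistic predictor of strong-model correctness from weak-model features.
predictor = LogisticRegression().fit(X_train, y_train)
p_correct = predictor.predict_proba(X_test)[:, 1]
print("AUROC of predicted correctness probabilities:", roc_auc_score(y_test, p_correct))
```

Under this framing, `p_correct` plays the role of a calibrated estimate of whether the frontier model will answer factually, which could serve as the fail-safe signal discussed in the abstract.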