

Poster in Workshop: Regulatable ML: Towards Bridging the Gaps between Machine Learning Research and Regulations

Weak-to-Strong Confidence Prediction

Tracy Zhu · Yukai Yang · Marco Morucci · Tim G. J. Rudner


Abstract:

As large language models (LLMs) are increasingly deployed across a wide range of application domains, ensuring that they operate safely and reliably, especially in open-ended settings, is crucial to preventing potential harm. Well-calibrated uncertainty estimates accompanying the text generated by an LLM can indicate the likelihood of an incorrect response and can therefore serve as an effective fail-safe mechanism against hallucinations. Unfortunately, despite a growing body of research into uncertainty quantification in LLMs, existing methods largely fail to provide reliable uncertainty estimates in practice, and the lack of comparability across methods makes progress difficult to measure. This calls for more robust methods that can predict whether frontier models are able to provide a factual response to a given prompt. In this paper, we show that a frontier model's probability of providing a factually correct answer to a query can be predicted with high accuracy from smaller, weaker models. We believe this work can help improve our understanding of weak-to-strong generalization and enable the creation of more trustworthy LLMs.
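The abstract's core claim, that a frontier model's probability of answering a query correctly can be predicted from smaller, weaker models, can be illustrated with a minimal sketch. Nothing below is taken from the paper: the weak-model features (answer log-probability and predictive entropy), the synthetic data, and the logistic-regression predictor are illustrative assumptions, not the authors' method.

```python
# Minimal sketch of weak-to-strong confidence prediction (illustrative only).
# Assumption: for each prompt we already have (a) features derived from a small,
# weak model (e.g., its answer log-probability and entropy) and (b) a binary label
# indicating whether the *strong* frontier model answered that prompt correctly.
# Here those quantities are simulated with synthetic data; in practice they would
# come from running both models on a labeled question-answering dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)
n_prompts = 2000

# Synthetic weak-model features: mean answer log-probability and predictive entropy.
weak_logprob = rng.normal(loc=-1.0, scale=0.5, size=n_prompts)
weak_entropy = rng.gamma(shape=2.0, scale=0.5, size=n_prompts)
X = np.column_stack([weak_logprob, weak_entropy])

# Synthetic labels: the strong model is assumed more likely to be correct on
# prompts the weak model is confident about (higher log-prob, lower entropy).
logits = 1.5 * (weak_logprob + 1.0) - 1.0 * (weak_entropy - 1.0)
y = (rng.random(n_prompts) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Simple probabilistic classifier mapping weak-model features to the
# probability that the strong model answers correctly.
clf = LogisticRegression().fit(X_train, y_train)
p_correct = clf.predict_proba(X_test)[:, 1]

print(f"AUROC: {roc_auc_score(y_test, p_correct):.3f}")
print(f"Brier score: {brier_score_loss(y_test, p_correct):.3f}")
```

The AUROC indicates how well the weak-model features discriminate prompts the strong model gets right from those it gets wrong, while the Brier score measures the calibration of the predicted correctness probabilities.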
