Poster
Evaluate calibration of language models with folktexts
André F. Cruz · Celestine Mendler-Dünner · Moritz Hardt
While large language models have increased dramatically in accuracy on numerous tasks, they still lack the ability to express uncertainty about outcomes. Calibration is a fundamental form of uncertainty quantification: a calibrated risk score, on average, reflects the true frequency of outcomes in a population. We introduce folktexts, a software package that provides datasets and tools to evaluate and benchmark the calibration properties of large language models. Our goal is to strengthen the evaluation ecosystem in a previously underserved direction: the systematic evaluation of uncertainty quantification in large language models. Under the hood, folktexts derives datasets consisting of prompt-completion pairs from US Census data products, specifically the American Community Survey. The package provides an easy-to-use, extensible API that allows for different models, metrics, prompting templates, and ways to extract predictive scores from language models. We demonstrate the necessity and utility of our package through a large-scale evaluation of popular large language models. Our empirical results show that, despite having surprisingly strong predictive capabilities, model outputs are wildly miscalibrated.
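To make the notion of calibration concrete, the following is a minimal sketch of a binned expected calibration error (ECE), one common way to measure how far risk scores deviate from observed outcome frequencies. It is an illustration only: it does not use the folktexts API, the function name `expected_calibration_error` and the equal-width binning are assumptions of this sketch, and the abstract does not state which calibration metrics the package reports.

```python
def expected_calibration_error(scores, labels, n_bins=10):
    """Binned ECE for binary outcomes (hypothetical helper, not part of folktexts).

    scores: predicted probabilities in [0, 1]
    labels: observed outcomes, each 0 or 1
    Returns the sample-weighted mean gap between the average score and the
    observed positive rate within each equal-width score bin.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one sample is required")

    # Group samples into equal-width bins over [0, 1]; a score of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for score, label in zip(scores, labels):
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((score, label))

    total = len(scores)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_score = sum(s for s, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_score - positive_rate)
    return ece


# Scores of 0.25 match an observed positive rate of 1 in 4: calibrated.
calibrated = expected_calibration_error([0.25] * 4, [1, 0, 0, 0])
# Scores of 0.9 against the same outcomes are overconfident.
overconfident = expected_calibration_error([0.9] * 4, [1, 0, 0, 0])
```

In this toy example the first set of scores has zero calibration error, while the second overstates the risk by roughly 0.65. Note that a model can rank individuals accurately and still score poorly on a measure like this, which is the gap between predictive capability and calibration that the abstract describes.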