

Poster

Evaluate calibration of language models with folktexts

André F. Cruz · Celestine Mendler-Dünner · Moritz Hardt

East Exhibit Hall A-C #3403
Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

While large language models have improved dramatically in accuracy across numerous tasks, they still lack the ability to express uncertainty about outcomes. Calibration is a fundamental form of uncertainty quantification. A calibrated risk score, on average, reflects the true frequency of outcomes in a population. We introduce folktexts, a software package that provides datasets and tools to evaluate and benchmark the calibration properties of large language models. Our goal is to strengthen the evaluation ecosystem in a direction that was previously underserved: the systematic evaluation of uncertainty quantification in large language models. Under the hood, folktexts derives datasets consisting of prompt-completion pairs from US Census data products, specifically the American Community Survey. The package provides an easy-to-use, extensible API that supports different models, metrics, prompting templates, and ways to extract predictive scores from language models. We demonstrate the necessity and utility of our package through a large-scale evaluation of popular large language models. Our empirical results show that, despite having surprisingly strong predictive capabilities, model outputs are wildly miscalibrated.
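The notion of calibration described above can be made concrete with a short sketch: given predicted risk scores and binary outcomes, expected calibration error (ECE) compares the mean predicted score in each score bin against the empirical frequency of positive outcomes in that bin. This is a generic illustration of the metric, not the folktexts implementation; the function name, array names, and bin count are assumptions for the example.

```python
import numpy as np

def expected_calibration_error(scores: np.ndarray, outcomes: np.ndarray, n_bins: int = 10) -> float:
    """Compare mean predicted risk to observed outcome frequency per score bin.

    A perfectly calibrated predictor yields ECE = 0: within each bin, the
    average predicted score equals the fraction of positive outcomes.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # include the right edge only for the last bin
        mask = (scores >= lo) & ((scores < hi) if hi < 1.0 else (scores <= hi))
        if mask.any():
            gap = abs(scores[mask].mean() - outcomes[mask].mean())
            ece += mask.mean() * gap  # weight the gap by the fraction of samples in the bin
    return ece

# Toy example: scores that are systematically overconfident
rng = np.random.default_rng(0)
scores = rng.uniform(0.7, 1.0, size=1_000)    # high predicted risk everywhere
outcomes = rng.binomial(1, 0.5, size=1_000)   # true positive rate is only 0.5
print(f"ECE = {expected_calibration_error(scores, outcomes):.3f}")  # large gap => miscalibrated
```

A low ECE means the model's stated risk scores can be read as frequencies; the paper's finding is that popular language models, despite strong accuracy, produce scores with large calibration gaps of this kind.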
