

Poster in Workshop: Statistical Frontiers in LLMs and Foundation Models

Evaluating language models as risk scores

André F. Cruz · Moritz Hardt · Celestine Mendler-Dünner

Keywords: [ large language models ] [ risk scores ] [ calibration ] [ benchmark ]

[ Project Page ]
Sat 14 Dec 3:45 p.m. PST — 4:30 p.m. PST

Abstract:

Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks: conditioned on a question and answer key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate language models' ability to quantify outcome uncertainty. In this work, we focus on the use of language models as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using language models, and evaluate them against US Census data products. A flexible API enables the use of different prompting schemes, local or web-hosted models, and diverse census columns that can be used to compose custom prediction tasks. We demonstrate the utility of folktexts through a sweep of empirical insights into the statistical properties of 17 recent large language models across five natural-text benchmark tasks. We find that zero-shot risk scores produced by multiple-choice question answering have high predictive signal but are widely miscalibrated. Base models consistently overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and produce over-confident risk scores. In fact, instruction-tuning polarizes the answer distribution regardless of the true underlying data uncertainty. Conversely, verbally querying models for probability estimates results in substantially improved calibration across all instruction-tuned models. These differences in the ability to quantify data uncertainty cannot be revealed in realizable settings, and highlight a blind spot in the current evaluation ecosystem that folktexts covers.
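The construction described in the abstract can be illustrated independently of the folktexts package. Below is a minimal sketch, not the folktexts API: it assumes a Hugging Face causal language model (the model name, answer tokens, and helper functions are placeholders chosen for illustration) and shows how a zero-shot risk score can be read off the next-token distribution over two multiple-choice answer letters, together with a simple expected calibration error (ECE) estimate of the kind used to assess calibration.

```python
# Illustrative sketch only (not the folktexts API): zero-shot risk scores from
# multiple-choice next-token probabilities, plus a simple calibration estimate.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM can be scored the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def risk_score(prompt: str, pos_token: str = " A", neg_token: str = " B") -> float:
    """Return P(positive outcome) by renormalizing the next-token probabilities
    of the two answer-choice tokens appended to a multiple-choice prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    pos_id = tokenizer.encode(pos_token)[0]
    neg_id = tokenizer.encode(neg_token)[0]
    two_way = torch.softmax(logits[[pos_id, neg_id]], dim=0)
    return two_way[0].item()

def expected_calibration_error(scores: np.ndarray, labels: np.ndarray, n_bins: int = 10) -> float:
    """Average |observed outcome rate - mean score| over equal-width score bins,
    weighted by the fraction of examples in each bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (scores >= lo) & (scores < hi)
        if mask.any():
            ece += mask.mean() * abs(labels[mask].mean() - scores[mask].mean())
    return ece
```

A miscalibrated but predictive model in this setting would still rank individuals well (high AUC) while its scores drift from observed outcome frequencies, which is exactly the gap between accuracy-style benchmarks and the calibration evaluation the abstract argues for.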
