NeurIPS Poster Evaluating language models as risk scores

Poster

Evaluating language models as risk scores

André F. Cruz · Moritz Hardt · Celestine Mendler-Dünner

East Exhibit Hall A-C #3403

[ Abstract ] [ Project Page ]

[ Paper] [ Poster]

Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks.Conditioned on a question and answer-key, does the most likely token match the ground truth?Such benchmarks necessarily fail to evaluate LLMs' ability to quantify ground-truth outcome uncertainty.In this work, we focus on the use of LLMs as risk scores for unrealizable prediction tasks.We introduce folktexts, a software package to systematically generate risk scores using LLMs, and evaluate them against US Census data products.A flexible API enables the use of different prompting schemes, local or web-hosted models, and diverse census columns that can be used to compose custom prediction tasks.We evaluate 17 recent LLMs across five proposed benchmark tasks.We find that zero-shot risk scores produced by multiple-choice question-answering have high predictive signal but are widely miscalibrated.Base models consistently overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and produce over-confident risk scores.In fact, instruction-tuning polarizes answer distribution regardless of true underlying data uncertainty.This reveals a general inability of instruction-tuned models to express data uncertainty using multiple-choice answers.A separate experiment using verbalized chat-style risk queries yields substantially improved calibration across instruction-tuned models.These differences in ability to quantify data uncertainty cannot be revealed in realizable settings, and highlight a blind-spot in the current evaluation ecosystem that folktexts covers.

Chat is not available.