Poster in Workshop: Statistical Frontiers in LLMs and Foundation Models

A Framework for Evaluating LLMs Under Task Indeterminacy

Luke Guerdan · Hanna Wallach · Solon Barocas · Alexandra Chouldechova

Keywords: [ large language models ] [ reliability ] [ validity ] [ uncertainty quantification ] [ evaluation frameworks ]

Sat 14 Dec 3:45 p.m. PST — 4:30 p.m. PST

Abstract:

LLM evaluations often assume that NLP tasks have a single "gold label" defining a correct response. However, NLP tasks can be underspecified. For example, a task can be ambiguous if an instruction does not provide enough context to infer the intent of the speaker. NLP tasks can also be vague if a high-level concept (e.g., "stereotype") is poorly defined. Both ambiguity and vagueness can cause indeterminacy: a condition where an instruction does not have a single "correct" response. In this work, we develop a framework for evaluating LLMs under indeterminacy. Our framework disentangles the relationship between task specification, human ratings, and model outputs in the LLM evaluation pipeline. Leveraging our framework, we conduct synthetic experiments demonstrating that the "gold label" assumption yields incorrect performance assessments under indeterminacy. We also provide tools for constructing informative performance intervals given partial information about indeterminacy in an evaluation corpus. We conclude by outlining implications of our work for the LLM research community.

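The core claim, that forcing a single gold label misestimates performance when some instructions admit several acceptable responses, can be illustrated with a small synthetic simulation. The sketch below is not the authors' framework: the corpus, the model behavior, and the way partial information is encoded (a per-item "possibly indeterminate" flag) are all illustrative assumptions.

```python
import random

random.seed(0)

# Synthetic corpus: each item has a set of acceptable responses.
# Determinate items have exactly one acceptable response; indeterminate
# items (e.g., ambiguous or vague instructions) have several.
N = 1000
items = []
for _ in range(N):
    if random.random() < 0.3:                      # 30% of items are indeterminate
        acceptable = set(random.sample(["A", "B", "C"], 2))
    else:
        acceptable = {random.choice(["A", "B", "C"])}
    items.append(acceptable)

# A hypothetical model that usually (but not always) returns an acceptable response.
def model_response(acceptable):
    if random.random() < 0.8:
        return random.choice(sorted(acceptable))
    return random.choice(["A", "B", "C"])

responses = [model_response(acc) for acc in items]

# "Gold label" scoring: force a single correct answer per item, discarding
# the other acceptable responses for indeterminate items.
gold = [sorted(acc)[0] for acc in items]
gold_acc = sum(r == g for r, g in zip(responses, gold)) / N

# Scoring that respects indeterminacy: any acceptable response counts.
true_acc = sum(r in acc for r, acc in zip(responses, items)) / N

# Partial information: suppose we only know, per item, whether it *might* be
# indeterminate, not its full acceptable set. Bound performance by scoring
# flagged items as all incorrect (lower) or all correct (upper).
flagged = [len(acc) > 1 for acc in items]
fixed_correct = sum(r == g for r, g, f in zip(responses, gold, flagged) if not f)
n_flagged = sum(flagged)
lower = fixed_correct / N
upper = (fixed_correct + n_flagged) / N

print(f"gold-label accuracy:          {gold_acc:.3f}")
print(f"indeterminacy-aware accuracy: {true_acc:.3f}")
print(f"interval from partial info:   [{lower:.3f}, {upper:.3f}]")
```

Under these assumptions, the gold-label score penalizes the model whenever it returns an acceptable response that differs from the arbitrarily chosen label, so it understates indeterminacy-aware accuracy, which always falls inside the reported interval.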