Poster in Workshop: Statistical Frontiers in LLMs and Foundation Models
A Framework for Evaluating LLMs Under Task Indeterminacy
Luke Guerdan · Hanna Wallach · Solon Barocas · Alexandra Chouldechova
Keywords: [ large language models ] [ reliability ] [ validity ] [ uncertainty quantification ] [ evaluation frameworks ]
LLM evaluations often assume that NLP tasks have a single "gold label" defining a correct response. However, NLP tasks can be underspecified. For example, a task can be ambiguous if an instruction does not provide enough context to infer the speaker's intent. NLP tasks can also be vague if a high-level concept (e.g., "stereotype") is poorly defined. Both ambiguity and vagueness can cause indeterminacy: a condition in which an instruction does not have a single "correct" response. In this work, we develop a framework for evaluating LLMs under indeterminacy. Our framework disentangles the relationships between task specifications, human ratings, and model outputs in the LLM evaluation pipeline. Leveraging our framework, we conduct synthetic experiments demonstrating that the "gold label" assumption yields incorrect performance assessments under indeterminacy. We also provide tools for constructing informative performance intervals given partial information about indeterminacy in an evaluation corpus. We conclude by outlining implications of our work for the LLM research community.
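As a rough illustration of the underlying intuition (not the paper's framework): when some fraction of corpus items admit more than one acceptable response, scoring every item against a single forced gold label collapses that ambiguity into a point estimate, whereas partial information about which items are indeterminate only pins performance down to an interval. The Python sketch below simulates this with toy data; the 30% indeterminacy rate, the acceptability assumptions, and all variable names are hypothetical choices for illustration, not values or methods taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy evaluation corpus: each item has a model response scored against one
# "gold label". Items are "determinate" if a single response is correct, and
# "indeterminate" if multiple responses could plausibly be acceptable.
n_items = 1000
indeterminate = rng.random(n_items) < 0.3   # assume ~30% of items are indeterminate
gold_correct = rng.random(n_items) < 0.7    # response matches the forced gold label

# Naive "gold label" score: every item is graded against one forced label,
# so responses that miss the gold label but are acceptable count as wrong.
naive_accuracy = gold_correct.mean()

# Interval under partial information: items that miss the gold label but are
# indeterminate might or might not be acceptable, so we bound performance by
# treating them as all-incorrect (lower) or all-acceptable (upper).
uncertain = indeterminate & ~gold_correct
lower = gold_correct.mean()                 # pessimistic: coincides with the naive score here
upper = (gold_correct | uncertain).mean()   # optimistic: all uncertain items acceptable

print(f"gold-label accuracy: {naive_accuracy:.3f}")
print(f"performance interval under indeterminacy: [{lower:.3f}, {upper:.3f}]")
```

In this simple setup the gold-label score is the pessimistic endpoint of the interval; the width of the interval grows with the share of indeterminate items, which is one way to see why a single-number assessment can be misleading when indeterminacy is common.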