Poster in Workshop: Safe Generative AI
On the Protocol for Evaluating Uncertainty in Generative Question-Answering Tasks
Andrea Santilli · Miao Xiong · Michael Kirchhof · Pau Rodriguez · Federico Danieli · Xavier Suau · Luca Zappella · Sinead Williamson · Adam Golinski
Knowing when a language model is uncertain about its generations is a key challenge for enhancing LLMs’ safety and reliability. A growing issue in the field of uncertainty quantification for Large Language Models (LLMs) is that the performance values reported across papers are often incomparable, and sometimes even directly conflicting, due to differing evaluation protocols. In this paper, we highlight the design decisions and implementation details that go into evaluating uncertainty estimation for selective answering in generative Question Answering (QA) tasks. First, we analyze several prior works and highlight the differences in their evaluation protocols. Next, we perform empirical evaluations according to two different protocols from the related literature, and find that the conflicting results between prior works can be attributed to an interaction between the substring-overlap response-quality metric and some uncertainty estimation methods, arising from a spurious correlation with response length.
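To illustrate the kind of metric the abstract refers to, below is a minimal, hypothetical sketch (not taken from the paper) of a substring-overlap correctness check commonly used in generative QA evaluation. It shows how a more verbose response can be marked correct simply because it is more likely to contain the gold answer as a substring; since many uncertainty scores are also sensitive to sequence length, this is one plausible way a spurious length correlation can enter the evaluation. All names and examples here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (hypothetical, not from the paper): a substring-overlap
# "correctness" metric and a toy example of its length sensitivity.

import string


def normalize(text: str) -> str:
    """Lowercase and strip punctuation, a common normalization step."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()


def substring_match(response: str, gold_answer: str) -> bool:
    """Mark a response as correct if the normalized gold answer appears in it."""
    return normalize(gold_answer) in normalize(response)


if __name__ == "__main__":
    gold = "Paris"
    short_response = "Lyon."  # incorrect, short
    long_response = (
        "France has many notable cities, including Lyon, Marseille, "
        "and its capital, Paris."
    )  # verbose response that happens to contain the gold answer

    print(substring_match(short_response, gold))  # False
    print(substring_match(long_response, gold))   # True: verbosity alone can
    # flip the metric to "correct", so response length can spuriously interact
    # with length-sensitive uncertainty scores.
```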