Poster in Workshop: Safe Generative AI
Efficient and Effective Uncertainty Quantification for LLMs
Miao Xiong · Andrea Santilli · Michael Kirchhof · Adam Golinski · Sinead Williamson
Uncertainty quantification (UQ) is crucial for the safe deployment of large language models, particularly in high-stakes applications where hallucinations can be harmful. However, existing UQ methods often demand substantial computational resources: multi-sample methods such as Semantic Entropy typically require 5-10 inference calls, and probing-based methods require additional datasets for training. This raises a key question: how can we balance UQ performance with computational efficiency? In this work, we first analyze the performance and efficiency of various UQ methods across 6 datasets × 6 models × 2 prompt strategies. Our findings reveal that: 1) multi-sample methods generally perform only marginally better than single-sample methods (an AUROC gain of ≤ 0.02 in over 65% of settings), despite significantly higher inference costs; 2) probing-based methods perform well primarily on mathematical reasoning and truthfulness benchmarks, while multi-sample methods show a clear advantage only on knowledge-seeking tasks. These findings suggest that the high computational cost does not translate into significant performance gains. Despite their similar overall performance, different UQ methods correlate only moderately with one another, suggesting that they capture different uncertainty signals. This motivates us to explore combining different methods to harness their complementary strengths at lower computational cost. Our experiments demonstrate that a simple combination of single-sample features can match or even outperform the best existing methods. These findings suggest a promising direction for developing cost-effective uncertainty estimators.
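To illustrate the cost of the multi-sample baseline discussed above, here is a minimal sketch of Semantic Entropy: sample several answers to the same question, cluster them by meaning, and compute the entropy over clusters. This is not the authors' implementation; real Semantic Entropy clusters answers with a bidirectional-entailment (NLI) model, whereas this sketch substitutes normalized exact match purely to stay self-contained. The key point is the cost: K sampled answers means K inference calls per question.

```python
import math
from collections import Counter

def semantic_entropy(answers):
    """Simplified Semantic Entropy over K sampled answers.

    Real implementations cluster answers with an entailment (NLI) model;
    here we approximate semantic clusters with normalized exact match,
    purely to keep the sketch self-contained and runnable.
    """
    clusters = Counter(a.strip(" .").lower() for a in answers)
    total = sum(clusters.values())
    probs = [count / total for count in clusters.values()]
    # High entropy = the model's answers disagree = high uncertainty.
    return -sum(p * math.log(p) for p in probs)

# Five sampled answers for one question -> five inference calls,
# versus the single call needed by the single-sample features below.
samples = ["Paris", "paris", "Paris.", "Lyon", "Paris"]
print(semantic_entropy(samples))  # low entropy: the model is consistent
```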
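For contrast, a minimal sketch of the cheaper direction the abstract points to: combining several single-sample features (statistics of the per-token log-probabilities from one greedy generation) with a simple learned combiner, scored by AUROC against answer correctness. The feature set, the logistic-regression combiner, and the synthetic data are illustrative assumptions, not the paper's exact recipe; it assumes the serving stack exposes per-token log-probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def single_sample_features(token_logprobs):
    """Cheap UQ features from ONE generation's token log-probs.

    token_logprobs: 1-D array of log p(token | prefix) for the answer.
    Each feature is oriented so that higher = more confident.
    """
    lp = np.asarray(token_logprobs, dtype=float)
    return np.array([
        lp.mean(),       # length-normalized sequence log-likelihood
        lp.min(),        # worst-case token: one unlikely token flags risk
        lp.sum(),        # raw sequence log-likelihood
        -lp.std(),       # negated spread: steadier token probs = confident
    ])

# Synthetic stand-in data: one generation per question plus a
# correctness label (1 = the answer was judged correct).
rng = np.random.default_rng(0)
train_lps = [rng.normal(-1.0, 0.5, size=rng.integers(5, 40)) for _ in range(200)]
train_y = rng.integers(0, 2, size=200)
test_lps = [rng.normal(-1.0, 0.5, size=rng.integers(5, 40)) for _ in range(100)]
test_y = rng.integers(0, 2, size=100)

X_train = np.stack([single_sample_features(lp) for lp in train_lps])
X_test = np.stack([single_sample_features(lp) for lp in test_lps])

# A linear combiner over the cheap features: one inference call per question.
combiner = LogisticRegression().fit(X_train, train_y)
scores = combiner.predict_proba(X_test)[:, 1]
print("AUROC:", roc_auc_score(test_y, scores))
```

On real model outputs, the abstract's claim is that such a combination of single-sample signals can match or beat multi-sample methods at a fraction of the inference cost; on the random data above the AUROC is, of course, near chance.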