

Poster
in
Workshop: Statistical Frontiers in LLMs and Foundation Models

Monty Hall and Score Optimization in Conformal Prediction to Improve LLMs for MCQs

Harit Vishwakarma · Alan Mishler · Thomas Cook · Niccolo Dalmasso · Natraj Raman · Sumitra Ganesh

Keywords: [ Monty Hall ] [ Uncertainty Quantification ] [ Multiple Choice Question Answering ] [ Tool Usage Learning ] [ Conformal Prediction ] [ Prompt Engineering ]

Sat 14 Dec, 12:00 p.m. — 12:45 p.m. PST

Abstract: Uncertainty quantification (UQ) is critical for safely deploying large language models (LLMs) in high-stakes settings like healthcare, law, and finance, where overconfident, incorrect predictions—often referred to as "hallucinations"—can lead to serious consequences. In multiple-choice question (MCQ) tasks, commonly used to evaluate LLMs and in tool usage learning, the lack of reliable uncertainty estimates can undermine model safety. To address this, prior works have leveraged conformal prediction (CP), a model-agnostic framework that provides distribution-free guarantees on prediction reliability. CP transforms a score function, which measures how well an output "conforms" to a given input, into prediction sets that contain the true answer with high probability. While CP ensures this coverage guarantee for arbitrary score functions, the quality of the scores significantly impacts the size of the prediction sets. Prior works have relied on LLM logits or other heuristic scores, which lack guarantees on their quality. To address this issue, we propose an optimization framework (CP-OPT) to learn score functions that minimize set sizes while maintaining coverage guarantees. Furthermore, leveraging the coverage guarantees of CP, we propose conformal revision of questions (CROQ), which revises an MCQ by narrowing down the available choices to those in the CP prediction set. Our results on the MMLU and ToolAlpaca datasets with Llama3 and Phi-3 models demonstrate that optimized CP scores reduce set sizes by up to 13%, and the CROQ procedure improves accuracy relatively by up to 4.6% overall and up to 15% in non-trivial parts of the input space.
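The split conformal prediction step the abstract describes can be sketched as follows. This is an illustrative minimal sketch, not the authors' CP-OPT implementation: it assumes we already have a conformity score s(x, y) for each (question, option) pair, e.g. the LLM's softmax probability for that option, and the function and variable names are hypothetical.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Compute the threshold tau from conformity scores of the TRUE
    answers on n held-out calibration questions. Under exchangeability,
    keeping options with score >= tau gives coverage >= 1 - alpha.
    """
    n = len(cal_scores)
    # Order statistic with the standard finite-sample correction.
    k = n + 1 - int(np.ceil((n + 1) * (1 - alpha)))
    k = max(k, 1)  # guard against k = 0 for small n or small alpha
    return np.sort(cal_scores)[k - 1]  # k-th smallest calibration score

def prediction_set(option_scores, tau):
    """Keep every MCQ option whose conformity score clears tau."""
    return [i for i, s in enumerate(option_scores) if s >= tau]

# Toy usage: 100 evenly spaced calibration scores, one test question.
cal = np.linspace(0.3, 1.0, 100)           # scores of the true answers
tau = conformal_threshold(cal, alpha=0.1)  # 10th-smallest score here
test_scores = [0.05, 0.9, 0.4, 0.65]       # scores for options A-D
print(prediction_set(test_scores, tau))    # prints [1, 2, 3]
```

CROQ would then re-prompt the model with only options B, C, and D (the set above), relying on the coverage guarantee that the true answer remains among them with high probability.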
