Poster in Workshop: Statistical Frontiers in LLMs and Foundation Models

Uncertainty-Penalized Direct Preference Optimization

Sam Houliston · Alexander Immer · Alizée Pace · Gunnar Rätsch

Keywords: [ RLHF ] [ DPO ] [ Finetuning ] [ Uncertainty Penalization ] [ LLMs ]

Sat 14 Dec, 12:00–12:45 p.m. PST

Abstract:

Aligning Large Language Models (LLMs) to human preferences in content, style, and presentation is challenging, in part because preferences are varied, context-dependent, and sometimes inherently ambiguous. While successful, Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are prone to proxy reward overoptimization. Our analysis of the DPO loss reveals a critical need to regularize mislabelled or ambiguous preference pairs in order to avoid reward hacking. In this work, we develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes inspired by offline reinforcement learning. The penalization dampens the gradient updates of uncertain samples. We evaluate the methods with GPT-2 Medium on the Anthropic-HH dataset, using a model ensemble to obtain uncertainty estimates, and observe improved overall performance compared to vanilla DPO, as well as better completions on prompts whose chosen/rejected responses have high uncertainty.
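
To make the idea concrete, below is a minimal PyTorch sketch of one way such a penalty could enter the DPO objective: the implicit reward margin is reduced by the estimated preference uncertainty before the log-sigmoid, so the loss and its gradient shrink for ambiguous or possibly mislabelled pairs. The function name, the `penalty_coef` hyperparameter, and the margin-penalty form are illustrative assumptions rather than the paper's exact scheme; the uncertainty is assumed to come from an ensemble, e.g. the standard deviation of implicit reward margins across ensemble members.

```python
import torch
import torch.nn.functional as F


def uncertainty_penalized_dpo_loss(
    policy_chosen_logps,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps,  # log pi_theta(y_l | x), shape (batch,)
    ref_chosen_logps,       # log pi_ref(y_w | x), shape (batch,)
    ref_rejected_logps,     # log pi_ref(y_l | x), shape (batch,)
    uncertainty,            # per-pair preference uncertainty (e.g. ensemble std), shape (batch,)
    beta=0.1,               # DPO inverse-temperature
    penalty_coef=1.0,       # pessimism strength (hypothetical hyperparameter)
):
    """Sketch of a pessimistic DPO loss: the implicit reward margin is
    penalized by the estimated preference uncertainty, dampening the
    gradient for uncertain preference pairs."""
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    margin = chosen_rewards - rejected_rewards

    # Lower-confidence-bound-style pessimism: subtract the uncertainty
    # from the scaled margin before applying the DPO log-sigmoid.
    penalized_margin = beta * margin - penalty_coef * uncertainty

    return -F.logsigmoid(penalized_margin).mean()


if __name__ == "__main__":
    # Illustrative usage with random tensors standing in for sequence log-probs.
    b = 4
    u = torch.rand(b)  # e.g. std of implicit reward margins across an ensemble
    loss = uncertainty_penalized_dpo_loss(
        torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b), u
    )
    print(loss.item())
```

Under this form of penalty, a pair with high uncertainty needs a larger raw margin to achieve the same loss reduction, which is one way to realize the "dampened gradient updates for uncertain samples" described in the abstract.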
