Poster in Workshop: Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning

P3O: Pessimistic Preference-based Policy Optimization for Robust Alignment from Preferences

Dhawal Gupta · Christoph Dann · Alekh Agarwal


Abstract:

We study reinforcement learning (RL) settings where the agent only has access to preferences on the relative quality of pairs of trajectories, provided as a fixed \emph{offline preference dataset}: pairs of trajectories collected according to some base policy are labeled with preference feedback. A reward or pairwise preference function trained on this offline dataset is then used to provide feedback during RL training, and there is a substantial body of work on RL methods for this setting. However, the bulk of the literature ignores the uncertainty of the learned preference function, which leads to reward hacking or overoptimization. In this work, we formulate theoretically sound objectives for preference-based RL that are provably robust to overoptimization through the use of pessimism in the face of uncertainty, and design practical algorithms to optimize these objectives. We evaluate our algorithms on the task of fine-tuning language models from human feedback, and show remarkable resilience to overoptimization.
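
The abstract does not spell out P3O's concrete objective, so the sketch below is only an illustration of the general idea of pessimism under reward-model uncertainty: score trajectories with a lower confidence bound derived from an ensemble of reward models trained on the same offline preference data, and use that in place of the point-estimate reward during policy optimization. All names here (pessimistic_reward, kappa, the toy linear ensemble) are hypothetical and not taken from the paper.

```python
# Illustrative sketch, not the paper's P3O algorithm: pessimism via an
# ensemble lower confidence bound on the learned reward.

import torch
import torch.nn as nn


def pessimistic_reward(reward_models: list[nn.Module],
                       trajectory_features: torch.Tensor,
                       kappa: float = 1.0) -> torch.Tensor:
    """Lower-confidence-bound reward: ensemble mean minus kappa * ensemble std."""
    scores = torch.stack([rm(trajectory_features).squeeze(-1)
                          for rm in reward_models])        # (n_models, batch)
    return scores.mean(dim=0) - kappa * scores.std(dim=0)  # (batch,)


if __name__ == "__main__":
    # Toy ensemble: linear reward heads over 16-dim trajectory features,
    # standing in for reward models fit to the offline preference dataset.
    ensemble = [nn.Linear(16, 1) for _ in range(5)]
    feats = torch.randn(8, 16)                 # batch of 8 trajectories
    r_pess = pessimistic_reward(ensemble, feats)
    print(r_pess.shape)                        # torch.Size([8])
```

In an RLHF loop, r_pess would replace the point-estimate reward inside the policy-gradient or PPO objective, penalizing regions where the learned preference function is uncertain and thereby discouraging the overoptimization the abstract describes.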
