Poster
in
Workshop: Pluralistic Alignment Workshop
Group Robust Best-of-K Decoding of Language Models for Pluralistic Alignment
Anja Petrovic · Seongho Son · William Bankes · Xiaohang Tang · Shyam Sundhar Ramesh · Sangwoong Yoon · Ilija Bogunovic
The desirable behaviour of a chat agent can be described by multiple criteria, such as harmlessness, helpfulness, and conciseness, each represented by a reward model. While each user, or group of users, may weigh these criteria differently, in many practical settings it is difficult to know how much an individual user or group would favour one criterion over another. Instead of assuming knowledge of the weights among multiple criteria, we propose a robust group alignment objective that maximises the worst reward among the group of reward models. To test this approach, we use best-of-K rejection sampling to demonstrate the properties of an algorithm that employs our robust objective. Finally, we propose several promising avenues of future work that may lead to more practical algorithms than group robust best-of-K rejection sampling.
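The max-min selection rule described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the authors' implementation: `generate` and the reward models are toy stand-ins (scalar "responses" scored by two opposing criteria), and all names are invented for the example.

```python
def group_robust_best_of_k(prompt, generate, reward_models, k=4):
    """Sketch: among k sampled responses, return the one that maximises
    the minimum (worst-case) reward across the group of reward models."""
    candidates = [generate(prompt) for _ in range(k)]
    # A candidate's robust score is the worst reward any model assigns it.
    robust_score = lambda resp: min(rm(prompt, resp) for rm in reward_models)
    return max(candidates, key=robust_score)

# Toy illustration: responses are scalars, and the two "reward models"
# pull in opposite directions (e.g. helpfulness vs. conciseness).
samples = iter([0.1, 0.5, 0.9])
generate = lambda prompt: next(samples)
reward_models = [lambda p, r: r, lambda p, r: 1.0 - r]
best = group_robust_best_of_k("prompt", generate, reward_models, k=3)
# best == 0.5: the middle response balances both criteria and wins the max-min.
```

Note that, unlike standard best-of-K with a single reward, no weighting between criteria is assumed; the worst-off criterion alone determines the selection.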