Poster in Workshop: Socially Responsible Language Modelling Research (SoLaR)
Beyond the Binary: Capturing Diverse Preferences With Reward Regularization
Vishakh Padmakumar · Chuanyang Jin · Hannah Rose Kirk · He He
Keywords: [ regularization ] [ subjective examples ] [ reward models ] [ diverse preferences ]
Large language models (LLMs) are increasingly deployed via public-facing interfaces to interact with millions of users, each with diverse preferences. Despite this, preference tuning of LLMs predominantly relies on reward models trained on binary judgments, where annotators select the preferred choice from a pair of model outputs. In this work, we argue that this reliance on binary choices does not capture the broader, aggregate preferences of target users in real-world tasks. We propose a taxonomy that identifies two dimensions of subjectivity along which different users would disagree on the preferred output: when prompts allow for multiple correct answers, and when the candidate outputs are paraphrases of each other. We show that reward models correlate weakly with user preferences on such examples. Finally, as a first step towards mitigating this issue, we augment existing binary preference datasets with synthetic preference judgments that estimate potential disagreement among users. Incorporating these judgments via a margin term, as a form of regularization during model training, yields predictions that correlate better with aggregate user preferences.
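To make the margin idea concrete, here is a minimal sketch of one common way a margin can enter a pairwise (Bradley-Terry style) reward-model loss. The abstract does not give the exact formulation, so everything below is an assumption for illustration: the function names `margin_bt_loss` and `margin_from_agreement`, the linear mapping from an estimated agreement rate to a margin, and the `scale` parameter are all hypothetical and may differ from what the authors use.

```python
import math


def margin_bt_loss(r_chosen: float, r_rejected: float, margin: float = 0.0) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected - margin)).

    With margin = 0 this is the standard binary Bradley-Terry loss.
    A larger margin asks the reward model to separate the pair more.
    """
    z = r_chosen - r_rejected - margin
    # Numerically stable evaluation of log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def margin_from_agreement(p_agree: float, scale: float = 1.0) -> float:
    """Hypothetical mapping from an estimated agreement rate to a margin.

    p_agree is the (synthetic) fraction of users expected to prefer the
    chosen output, assumed to lie in [0.5, 1.0]. Full agreement gives a
    margin of `scale`; an even split gives a margin of 0, so subjective
    pairs are not pushed apart.
    """
    return scale * (2.0 * p_agree - 1.0)


# Example: the same reward gap is penalized more when users mostly agree.
gap_chosen, gap_rejected = 1.0, 0.0
loss_subjective = margin_bt_loss(gap_chosen, gap_rejected, margin_from_agreement(0.55))
loss_clear_cut = margin_bt_loss(gap_chosen, gap_rejected, margin_from_agreement(0.95))
```

Under these assumptions, pairs on which users are expected to disagree contribute a small margin, so the model is not pushed to score them far apart, while clear-cut pairs are pushed towards a larger reward gap.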