Poster in Workshop: Pluralistic Alignment Workshop
Critique-out-Loud Reward Models
Zachary Ankner · Mansheej Paul · Brandon Cui · Jonathan Chang · Prithviraj Ammanabrolu
Reward models used for reinforcement learning from human feedback (RLHF) are typically trained to predict preference scores directly, without leveraging the generation capabilities of the underlying large language model (LLM). To enable reward models to reason explicitly about the quality of a response, we introduce Critique-out-Loud (CLoud) reward models. A CLoud reward model first generates a natural language critique of the assistant's response, then uses that critique to predict a scalar reward for the quality of the response. While classic reward models typically model the average preference over a set of humans, CLoud reward models can model a diverse set of preferences more faithfully by reasoning about multiple preferences in the generated critique. We demonstrate the success of CLoud reward models for both Llama-3-8B and 70B models: compared to classic reward models, CLoud reward models improve pairwise preference classification accuracy on RewardBench by 4.65 and 5.84 percentage points for the 8B and 70B base models, respectively. Furthermore, CLoud reward models also lead to a Pareto improvement in win rate on ArenaHard when used as the scoring model for Best-of-N.
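The critique-then-score flow described in the abstract, and its use for Best-of-N selection and pairwise preference classification, can be illustrated with a minimal Python sketch. Everything named here (`CloudStyleRewardModel`, `best_of_n`, `pairwise_accuracy`, and the toy critique and reward functions) is a hypothetical stand-in chosen for illustration, not the authors' implementation; in a real system the two callables would presumably be backed by an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type aliases (assumptions, not taken from the paper).
CritiqueFn = Callable[[str, str], str]            # (prompt, response) -> critique text
RewardHeadFn = Callable[[str, str, str], float]   # (prompt, response, critique) -> reward


@dataclass
class CloudStyleRewardModel:
    """Two-stage scorer: write a critique, then score conditioned on it."""
    generate_critique: CritiqueFn
    reward_head: RewardHeadFn

    def score(self, prompt: str, response: str) -> Tuple[str, float]:
        critique = self.generate_critique(prompt, response)
        reward = self.reward_head(prompt, response, critique)
        return critique, reward


def best_of_n(model: CloudStyleRewardModel, prompt: str, candidates: List[str]) -> str:
    """Return the candidate response with the highest predicted reward."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=lambda r: model.score(prompt, r)[1])


def pairwise_accuracy(
    model: CloudStyleRewardModel, pairs: List[Tuple[str, str, str]]
) -> float:
    """Fraction of (prompt, chosen, rejected) triples where chosen scores higher."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(
        model.score(p, chosen)[1] > model.score(p, rejected)[1]
        for p, chosen, rejected in pairs
    )
    return correct / len(pairs)


# Toy stand-ins so the sketch runs without an LLM (purely illustrative).
def toy_critique(prompt: str, response: str) -> str:
    if len(response.split()) < 5:
        return "The response is too short."
    return "The response is adequately detailed."


def toy_reward_head(prompt: str, response: str, critique: str) -> float:
    return 0.0 if "too short" in critique else 1.0


model = CloudStyleRewardModel(toy_critique, toy_reward_head)
short, detailed = "No.", "Paris is the capital city of France."
best = best_of_n(model, "What is the capital of France?", [short, detailed])
accuracy = pairwise_accuracy(model, [("What is the capital of France?", detailed, short)])
```

The point of the design is that the reward is computed from the critique as well as the prompt and response, so the scalar score is conditioned on explicit reasoning rather than produced in a single step.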