Poster
Learning Goal-Conditioned Representations in Reward Models for Aligning Language Models
Vaskar Nath · Dylan Slack · Jeff Da · Yuntao Ma · Hugh Zhang · Spencer Whitehead · Sean Hendryx
Representation learning is important for the success of Reinforcement Learning (RL) algorithms, but it has been less explored for Language Model (LM) alignment with Reinforcement Learning from Human Feedback (RLHF). In this work, we present a simple yet effective approach to improve the representations learned by reward models for aligning LMs. Our method uses a contrastive loss that encourages reward models to learn goal-conditioned representations which encode the expected reward at intermediate steps of the input sequence. By enforcing this loss on representations from intermediate steps, we can capture which trajectories are likely to reach a desired goal (e.g., a correct solution or a helpful response) at different points in the sequence. The method is flexible enough to support different kinds of alignment data and does not require extra annotations. We demonstrate the effectiveness of this approach in two domains: mathematical reasoning and natural language alignment. On math benchmarks such as GSM8k, we show that our approach improves the reward model's ability to discern between correct and incorrect solutions, increasing the AUROC score by up to 0.11 points, and that the learned representations can help prune undesirable generations. Using this reward model to improve a policy model via RLHF yields accuracy gains of 1.7% across several math benchmarks compared to a standard preference-ranking trained reward model. Additionally, we show that the learned representations can be used to steer LMs toward generations that are more aligned with human preferences via guided decoding. Overall, our study underscores the potential of incorporating feedback signals in RLHF frameworks via learned representations, which we believe is a promising avenue for improving the alignment of LLMs.
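The abstract does not include implementation details, so the following is only a minimal PyTorch sketch of one way a contrastive objective over intermediate reward-model hidden states could be set up. The function name, tensor shapes, use of the final hidden state of a successful trajectory as the "goal", and the InfoNCE-style formulation are all assumptions for illustration, not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def goal_contrastive_loss(hidden_correct, hidden_incorrect, temperature=0.1):
    """Hypothetical sketch of a goal-conditioned contrastive loss.

    hidden_correct:   (T_c, D) hidden states at intermediate tokens of a
                      trajectory that reaches the desired goal (e.g., a
                      correct solution).
    hidden_incorrect: (T_i, D) hidden states of a trajectory that does not.

    Assumption: the final hidden state of the successful trajectory serves as
    the goal representation; intermediate states of that trajectory are pulled
    toward it, while states of the unsuccessful trajectory are pushed away.
    """
    goal = F.normalize(hidden_correct[-1].detach(), dim=-1)      # (D,)
    pos = F.normalize(hidden_correct[:-1], dim=-1)               # (T_c-1, D)
    neg = F.normalize(hidden_incorrect, dim=-1)                  # (T_i, D)

    pos_sim = pos @ goal / temperature                           # (T_c-1,)
    neg_sim = neg @ goal / temperature                           # (T_i,)

    # InfoNCE-style objective: each intermediate state on the successful
    # trajectory should be closer to the goal than any negative state.
    logits = torch.cat(
        [pos_sim.unsqueeze(1),
         neg_sim.unsqueeze(0).expand(pos_sim.size(0), -1)],
        dim=1,
    )
    labels = torch.zeros(pos_sim.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```

In practice, a term like this would presumably be added to the standard preference-ranking loss of the reward model rather than replace it, so that the final reward prediction is still trained on pairwise comparisons while intermediate representations learn to track progress toward the goal.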