Poster in Workshop: Statistical Frontiers in LLMs and Foundation Models
Back-to-Basics Revisited: Benchmarking an Expanded Set of RLHF Algorithms
Lucas Spangher · Rama Kumar Pasumarthi · Nick Masiewicki · Peter Grabowski · Eugene Ie · William Arnold · Daniele Calandriello · Bilal Piot
Keywords: [ RLHF ] [ benchmarking ] [ REINFORCE ]
Large Language Models (LLMs) have demonstrated impressive text generation capabilities, but they often produce outputs that do not align with human preferences, motivating the incorporation of Reinforcement Learning from Human Feedback (RLHF) into the training process. While Proximal Policy Optimization (PPO) initially emerged as a popular RLHF strategy, its complexity and inefficiency have prompted the exploration of simpler alternatives such as REINFORCE. Building on the 2024 paper "Back to Basics," which argued against PPO's suitability for RLHF, we extend the investigation by benchmarking sixteen state-of-the-art RLHF algorithms, ranging from simpler to more complex approaches, on a standard benchmark of well-studied tasks. Our contribution includes extensive hyperparameter sweeps and a robust suite of evaluation metrics, including ROUGE scores, providing a more comprehensive analysis of RLHF algorithm performance. Our goal is to guide users in selecting the most effective RLHF algorithm and to promote a culture of thorough and impartial benchmarking.
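To illustrate why REINFORCE-style methods are attractive as a simpler alternative to PPO's clipped-ratio objective, the sketch below shows a vanilla REINFORCE update with a KL-shaped reward against a frozen reference policy, a common pattern in RLHF-style training. This is a minimal toy example under assumed placeholders (a tiny unconditional policy, a synthetic reward function, and arbitrary hyperparameters), not the benchmarked implementations from the paper.

```python
# Illustrative sketch (not the paper's code): REINFORCE with a KL penalty
# toward a frozen reference policy. Model, reward, and hyperparameters are
# toy placeholders chosen for a self-contained, runnable example.
import copy
import torch
import torch.nn.functional as F

VOCAB, SEQ_LEN, KL_COEF = 16, 8, 0.05  # assumed toy settings

class TinyPolicy(torch.nn.Module):
    """Toy 'language model': independent logits per position over a small vocab."""
    def __init__(self):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(SEQ_LEN, VOCAB))
    def forward(self):
        return self.logits  # shape: (SEQ_LEN, VOCAB)

policy = TinyPolicy()
reference = copy.deepcopy(policy)          # frozen SFT-style reference policy
for p in reference.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def toy_reward(tokens: torch.Tensor) -> torch.Tensor:
    # Placeholder for a learned reward model: fraction of even token ids.
    return (tokens % 2 == 0).float().mean(dim=-1)

for step in range(200):
    dist = torch.distributions.Categorical(logits=policy())
    tokens = dist.sample((32,))                        # 32 sampled sequences
    logp_seq = dist.log_prob(tokens).sum(dim=-1)       # log pi(y) per sequence

    with torch.no_grad():
        ref_dist = torch.distributions.Categorical(logits=reference())
        rewards = toy_reward(tokens)
        # KL-shaped reward: penalize drift from the reference policy.
        kl = (dist.log_prob(tokens) - ref_dist.log_prob(tokens)).sum(dim=-1)
        shaped = rewards - KL_COEF * kl
        baseline = shaped.mean()                       # simple mean baseline

    # REINFORCE: maximize E[(shaped reward - baseline) * log pi(y)];
    # no clipping, value network, or multiple optimization epochs as in PPO.
    loss = -((shaped - baseline) * logp_seq).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The contrast with PPO is that this update needs only sampled sequences, their log-probabilities, and a scalar reward; PPO additionally maintains a value function and clips the policy ratio across several reuse epochs, which is part of the complexity the abstract refers to.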