Poster in Workshop: Statistical Frontiers in LLMs and Foundation Models
Back-to-Basics Revisited: Benchmarking an Expanded Set of RLHF Algorithms
Lucas Spangher · Rama Kumar Pasumarthi · Nick Masiewicki · Peter Grabowski · Eugene Ie · William Arnold · Daniele Calandriello · Bilal Piot
Keywords: [ RLHF ] [ benchmarking ] [ REINFORCE ]
Large Language Models (LLMs) have demonstrated impressive text generation capabilities, but they often produce outputs that do not align with human preferences, motivating the incorporation of Reinforcement Learning from Human Feedback (RLHF) into the training process. While Proximal Policy Optimization (PPO) initially emerged as a popular RLHF strategy, its complexity and inefficiency have prompted the exploration of simpler alternatives such as REINFORCE. Building on the 2024 paper "Back to Basics," which argued against PPO's suitability for RLHF, we extend the investigation by benchmarking sixteen state-of-the-art RLHF algorithms, ranging from simpler to more complex approaches, on a standard benchmark of well-studied tasks. Our contribution includes extensive hyperparameter sweeps and a robust suite of evaluation metrics, including ROUGE scores, providing a more comprehensive analysis of RLHF algorithm performance. Our goal is to guide users in selecting the most effective RLHF algorithm and to promote a culture of thorough and impartial benchmarking.
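To illustrate why REINFORCE-style methods are attractive as a simpler alternative to PPO's clipped-ratio objective, the sketch below shows a vanilla REINFORCE update with a KL-shaped reward against a frozen reference policy, a common pattern in RLHF-style training. This is a minimal toy example under assumed placeholders (a tiny unconditional policy, a synthetic reward function, and arbitrary hyperparameters), not the benchmarked implementations from the paper.

```python
# Illustrative sketch (not the paper's code): REINFORCE with a KL penalty
# toward a frozen reference policy. Model, reward, and hyperparameters are
# toy placeholders chosen for a self-contained, runnable example.
import copy
import torch
import torch.nn.functional as F

VOCAB, SEQ_LEN, KL_COEF = 16, 8, 0.05  # assumed toy settings

class TinyPolicy(torch.nn.Module):
    """Toy 'language model': independent logits per position over a small vocab."""
    def __init__(self):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(SEQ_LEN, VOCAB))
    def forward(self):
        return self.logits  # shape: (SEQ_LEN, VOCAB)

policy = TinyPolicy()
reference = copy.deepcopy(policy)          # frozen SFT-style reference policy
for p in reference.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def toy_reward(tokens: torch.Tensor) -> torch.Tensor:
    # Placeholder for a learned reward model: fraction of even token ids.
    return (tokens % 2 == 0).float().mean(dim=-1)

for step in range(200):
    dist = torch.distributions.Categorical(logits=policy())
    tokens = dist.sample((32,))                        # 32 sampled sequences
    logp_seq = dist.log_prob(tokens).sum(dim=-1)       # log pi(y) per sequence

    with torch.no_grad():
        ref_dist = torch.distributions.Categorical(logits=reference())
        rewards = toy_reward(tokens)
        # KL-shaped reward: penalize drift from the reference policy.
        kl = (dist.log_prob(tokens) - ref_dist.log_prob(tokens)).sum(dim=-1)
        shaped = rewards - KL_COEF * kl
        baseline = shaped.mean()                       # simple mean baseline

    # REINFORCE: maximize E[(shaped reward - baseline) * log pi(y)];
    # no clipping, value network, or multiple optimization epochs as in PPO.
    loss = -((shaped - baseline) * logp_seq).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The contrast with PPO is that this update needs only sampled sequences, their log-probabilities, and a scalar reward; PPO additionally maintains a value function and clips the policy ratio across several reuse epochs, which is part of the complexity the abstract refers to.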