Poster in Workshop: MATH-AI: The 4th Workshop on Mathematical Reasoning and AI
VinePPO: Accurate Credit Assignment in RL for LLM Mathematical Reasoning
Amirhossein Kazemnejad · Milad Aghajohari · Eva Portelance · Alessandro Sordoni · Siva Reddy · Aaron Courville · Nicolas Le Roux
Keywords: [ LLM ] [ Reasoning ] [ Reinforcement Learning ] [ Credit Assignment ]
Large language models (LLMs) are increasingly required to solve complex reasoning tasks, such as mathematical problems, that involve multiple reasoning steps before feedback is received. Effectively identifying and prioritizing the key steps by accurately assigning credit to intermediate steps is essential for improving model performance. Proximal Policy Optimization (PPO), a state-of-the-art reinforcement learning algorithm for finetuning LLMs, addresses the credit assignment problem by employing value networks to predict the expected cumulative rewards of intermediate states. In this work, we identify significant limitations of this value estimation method. To address them, we propose VinePPO, which leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates of the intermediate values. VinePPO consistently outperforms standard PPO, doing so more efficiently and with lower divergence from the reference model. Our findings underscore the critical importance of accurate credit assignment in LLM post-training and present a simple yet effective solution.
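The core idea described above, replacing a learned value network with Monte Carlo rollouts from intermediate reasoning states, can be illustrated with a minimal sketch. This is not the authors' implementation: the `sample_completion` and `reward_fn` callables, the rollout count, and the correctness-based reward are hypothetical stand-ins assumed here for illustration.

```python
import statistics
from typing import Callable, List


def mc_value_estimate(
    prefix: str,
    sample_completion: Callable[[str], str],  # hypothetical: samples one rollout continuation from the current policy
    reward_fn: Callable[[str], float],        # hypothetical: e.g. 1.0 if the completed solution's final answer is correct, else 0.0
    num_rollouts: int = 9,
) -> float:
    """Unbiased Monte Carlo estimate of the value of an intermediate reasoning state.

    Rather than querying a value network, the partial solution `prefix` is
    completed several times from the current policy and the terminal rewards
    are averaged.
    """
    returns: List[float] = [
        reward_fn(prefix + sample_completion(prefix)) for _ in range(num_rollouts)
    ]
    return statistics.mean(returns)
```

In a PPO-style update, such an estimate could stand in for the value network's prediction when computing advantages for the tokens of each reasoning step; the exact rollout budget and advantage computation used by VinePPO are detailed in the paper, not here.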