Poster in Workshop: MATH-AI: The 4th Workshop on Mathematical Reasoning and AI
VinePPO: Accurate Credit Assignment in RL for LLM Mathematical Reasoning
Amirhossein Kazemnejad · Milad Aghajohari · Eva Portelance · Alessandro Sordoni · Siva Reddy · Aaron Courville · Nicolas Le Roux
Keywords: [ LLM ] [ Reasoning ] [ Reinforcement Learning ] [ Credit Assignment ]
Large language models (LLMs) are increasingly required to solve complex reasoning tasks, such as mathematical problems, that involve multiple reasoning steps before feedback is received. Effectively identifying and prioritizing the key steps by accurately assigning credit to intermediate steps is essential for improving model performance. Proximal Policy Optimization (PPO), a state-of-the-art reinforcement learning algorithm for finetuning LLMs, addresses the credit assignment problem by employing value networks to predict the expected cumulative rewards of intermediate states. In this work, we identify significant limitations of this value estimation method. To address them, we propose VinePPO, which leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates of the intermediate values. VinePPO consistently outperforms standard PPO, doing so more efficiently and with lower divergence from the reference model. Our findings underscore the critical importance of accurate credit assignment in LLM post-training and present a simple yet effective solution.
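The core idea described above, replacing a learned value network with Monte Carlo rollouts from intermediate reasoning states, can be illustrated with a minimal sketch. This is not the authors' implementation: the `sample_completion` and `reward_fn` callables, the rollout count, and the correctness-based reward are hypothetical stand-ins assumed here for illustration.

```python
import statistics
from typing import Callable, List


def mc_value_estimate(
    prefix: str,
    sample_completion: Callable[[str], str],  # hypothetical: samples one rollout continuation from the current policy
    reward_fn: Callable[[str], float],        # hypothetical: e.g. 1.0 if the completed solution's final answer is correct, else 0.0
    num_rollouts: int = 9,
) -> float:
    """Unbiased Monte Carlo estimate of the value of an intermediate reasoning state.

    Rather than querying a value network, the partial solution `prefix` is
    completed several times from the current policy and the terminal rewards
    are averaged.
    """
    returns: List[float] = [
        reward_fn(prefix + sample_completion(prefix)) for _ in range(num_rollouts)
    ]
    return statistics.mean(returns)
```

In a PPO-style update, such an estimate could stand in for the value network's prediction when computing advantages for the tokens of each reasoning step; the exact rollout budget and advantage computation used by VinePPO are detailed in the paper, not here.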