Oral in Workshop: Language Gamification
Evolving Alignment via Asymmetric Self-Play
Ziyu Ye · Rishabh Agarwal · Tianqi Liu · Rishabh Joshi · Sarmishta Velury · Quoc V Le · Qijun Tan · Yuan Liu
Sat 14 Dec 8:20 a.m. PST — 5:30 p.m. PST
Current RLHF approaches for aligning large language models (LLMs) typically assume a fixed prompt distribution, which is sub-optimal and limits the generalization capabilities of language models. To address this issue, we introduce a general framework that casts alignment as an asymmetric game between two players: (i) a creator, which strategically generates informative prompt distributions using reward signals, and (ii) a solver, which learns to produce preferred responses on prompts produced by the creator. This framework of Evolving Alignment via Asymmetric Self-Play (eva) results in a simple and efficient approach that can utilize any existing RLHF algorithm. eva achieves a new state of the art on widely adopted alignment benchmarks without requiring any additional human-crafted prompts; e.g., it can improve the win rate of finetuned gemma-2-9b-it on Arena-Hard from 51.6% to 60.1% with DPO, from 55.7% to 58.9% with SPPO, from 52.3% to 60.7% with SimPO, and from 54.8% to 60.3% with ORPO, surpassing its 27B version and matching Claude-3-opus. Finally, we show that eva is effective and robust under various ablation settings. We hope eva can serve as a scalable methodology for the research community to build open-ended, robust, and self-improving language agents that align with human values.
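To make the creator/solver interplay concrete, here is a toy sketch of one possible self-play loop. It is not the authors' implementation. Everything in it is a hypothetical stand-in chosen for illustration. A "prompt" is an integer difficulty level. A sampled response earns reward 1.0 or 0.0 with a logistic success probability. The creator scores prompts by an assumed informativeness proxy (the gap between the best and worst sampled reward), keeps the top `k`, and "evolves" each one into neighbouring difficulties in place of LLM-based prompt rewriting. The solver's preference-optimization step is reduced to a `skill` increment per preference pair.

```python
import math
import random
from dataclasses import dataclass

# Toy stand-ins (hypothetical, for illustration only): a "prompt" is an integer
# difficulty level, and a sampled response earns reward 1.0 (preferred) or 0.0.


@dataclass
class ToySolver:
    skill: float = 0.0

    def success_prob(self, prompt: int) -> float:
        # Logistic success probability: prompts far easier than `skill` -> ~1.
        return 1.0 / (1.0 + math.exp(prompt - self.skill))

    def sample_rewards(self, prompt: int, n: int, rng: random.Random) -> list[float]:
        # Stand-in for "generate n responses, then score them with a reward model".
        p = self.success_prob(prompt)
        return [1.0 if rng.random() < p else 0.0 for _ in range(n)]

    def update(self, num_pairs: int, lr: float = 0.05) -> None:
        # Placeholder for any preference-optimization step (DPO, SimPO, ...);
        # here each (chosen, rejected) pair simply nudges skill upward.
        self.skill += lr * num_pairs


def informativeness(solver: ToySolver, prompt: int, rng: random.Random, n: int = 8) -> float:
    # Assumed proxy: reward gap between the best and worst sampled response.
    rewards = solver.sample_rewards(prompt, n, rng)
    return max(rewards) - min(rewards)


def creator_step(pool: list[int], solver: ToySolver, rng: random.Random, k: int = 3) -> list[int]:
    # Creator: rank prompts by informativeness (stable sort keeps ties in order),
    # keep the top k, then "evolve" each one.
    ranked = sorted(pool, key=lambda x: informativeness(solver, x, rng), reverse=True)
    evolved: set[int] = set()
    for prompt in ranked[:k]:
        # Hypothetical mutation: neighbouring difficulties stand in for
        # LLM-rewritten variants of a prompt.
        evolved.update({prompt - 1, prompt, prompt + 1})
    return sorted(evolved)


def solver_step(solver: ToySolver, prompts: list[int], rng: random.Random, n: int = 8) -> int:
    # Solver: form a preference pair wherever the sampled rewards differ.
    pairs = 0
    for prompt in prompts:
        rewards = solver.sample_rewards(prompt, n, rng)
        if max(rewards) > min(rewards):
            pairs += 1  # chosen = best response, rejected = worst response
    solver.update(pairs)
    return pairs


def evolve_alignment(rounds: int = 5, seed: int = 0) -> tuple[ToySolver, list[int]]:
    # Alternate the two players: the creator reshapes the prompt pool,
    # then the solver trains on it.
    rng = random.Random(seed)
    solver = ToySolver()
    pool = list(range(10))  # hypothetical seed prompts, difficulties 0..9
    for _ in range(rounds):
        pool = creator_step(pool, solver, rng)
        solver_step(solver, pool, rng)
    return solver, pool


if __name__ == "__main__":
    solver, pool = evolve_alignment()
    print(f"skill={solver.skill:.2f} pool={pool}")
```

In this toy, prompts the solver always or never solves yield a reward gap of zero and are dropped. The pool therefore drifts toward the solver's current frontier as `skill` grows. That mirrors, in a very simplified form, the idea of a creator that generates informative prompt distributions from reward signals instead of relying on a fixed prompt set.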