Poster
in
Workshop: Language Gamification
Sample Efficient Alignment for LLMs
Zichen Liu · Changyu CHEN · Chao Du · Wee Sun Lee · Min Lin
We study methods for efficiently aligning large language models (LLMs) with human preferences given budgeted online feedback. We first formulate the LLM alignment problem in the frame of contextual dueling bandits. This formulation, which subsumes recent paradigms such as online RLHF and online DPO, has an objective that naturally calls for sample-efficient algorithms with active exploration strategies. Leveraging insights from bandit theory, we introduce a unified algorithm based on Thompson sampling and highlight its application in two distinct scenarios. Our proposed agent, termed SEA (Sample-Efficient Alignment), is empirically validated with extensive experiments across three model scales (1B, 2.8B, 6.9B) and three preference learning algorithms (DPO, IPO, SLiC). The results demonstrate that SEA achieves highly sample-efficient alignment with the oracle's preferences, outperforming recent active exploration methods for LLMs. We will open-source our codebase to accelerate future research in this area.
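For intuition, the sketch below illustrates Thompson sampling in a contextual dueling bandit over candidate responses, which is the general idea the abstract refers to. It is not the paper's implementation: the fixed feature vectors, the linear ensemble reward model, the learning rate, and the `preference_oracle` function are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): Thompson sampling for a
# contextual dueling bandit over candidate LLM responses. Responses are
# represented by fixed feature vectors; the reward model is an ensemble of
# linear heads trained with a Bradley-Terry loss; sampling an ensemble member
# approximates sampling from the posterior.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class EnsembleRewardModel:
    """Ensemble of linear reward heads; sampling one member ~ Thompson sampling."""
    def __init__(self, dim, n_members=8, lr=0.1):
        self.weights = rng.normal(scale=0.1, size=(n_members, dim))
        self.lr = lr

    def sample_member(self):
        return self.weights[rng.integers(len(self.weights))]

    def update(self, feat_win, feat_lose):
        # One SGD step on the Bradley-Terry log-likelihood for every member.
        for w in self.weights:
            p = sigmoid(w @ (feat_win - feat_lose))
            w += self.lr * (1.0 - p) * (feat_win - feat_lose)

def thompson_duel(model, candidate_feats):
    """Pick two responses to duel: the argmax under two posterior samples."""
    w1, w2 = model.sample_member(), model.sample_member()
    i = int(np.argmax(candidate_feats @ w1))
    j = int(np.argmax(candidate_feats @ w2))
    return i, j

# Toy usage with a synthetic preference oracle defined by a hidden reward vector.
dim, n_candidates = 16, 8
true_w = rng.normal(size=dim)

def preference_oracle(feat_a, feat_b):
    """Returns True if response a is preferred over b (noisy Bradley-Terry)."""
    return rng.random() < sigmoid(true_w @ (feat_a - feat_b))

model = EnsembleRewardModel(dim)
for step in range(200):
    feats = rng.normal(size=(n_candidates, dim))  # stand-in for response embeddings
    i, j = thompson_duel(model, feats)
    if preference_oracle(feats[i], feats[j]):
        model.update(feats[i], feats[j])
    else:
        model.update(feats[j], feats[i])
```

The sample efficiency comes from the duel selection step: rather than querying the oracle on random response pairs, the agent queries pairs that look promising under different posterior samples, concentrating feedback where the reward model is still uncertain.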