

Poster in Workshop: Language Gamification

Sample Efficient Alignment for LLMs

Zichen Liu · Changyu CHEN · Chao Du · Wee Sun Lee · Min Lin


Abstract:

We study methods for efficiently aligning large language models (LLMs) with human preferences given budgeted online feedback. We first formulate the LLM alignment problem in the framework of contextual dueling bandits. This formulation, which subsumes recent paradigms such as online RLHF and online DPO, inherently calls for sample-efficient algorithms that incorporate active exploration. Leveraging insights from bandit theory, we introduce a unified algorithm based on Thompson sampling and highlight its application in two distinct scenarios. Our proposed agent, termed SEA (Sample Efficient Alignment), is validated empirically with extensive experiments across three model scales (1B, 2.8B, 6.9B) and three preference learning algorithms (DPO, IPO, SLiC). The results demonstrate that SEA achieves highly sample-efficient alignment with the oracle's preferences, outperforming recent active exploration methods for LLMs. We will open-source our codebase to accelerate future research in this area.
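To make the Thompson-sampling idea concrete, the following is a minimal illustrative sketch of how active exploration via posterior sampling can drive a dueling (pairwise-preference) loop. It is not the authors' SEA implementation: the ensemble-as-posterior approximation, the linear reward heads, the placeholder featurizer, and the synthetic preference oracle are all simplifying assumptions introduced for illustration.

```python
# Minimal sketch: Thompson-sampling-based active dueling for online preference
# learning. Assumptions (not from the paper): ensemble of linear reward heads
# as an approximate posterior, toy feature vectors in place of LLM responses,
# and a synthetic preference oracle.
import numpy as np

rng = np.random.default_rng(0)

DIM = 8           # dimensionality of the (hypothetical) response features
ENSEMBLE = 16     # ensemble size used as an approximate posterior over rewards
LR = 0.5          # step size for the logistic (Bradley-Terry) updates

# Approximate posterior over reward functions: an ensemble of linear heads.
heads = rng.normal(scale=0.1, size=(ENSEMBLE, DIM))

def thompson_duel(candidate_feats):
    """Sample two reward functions from the posterior; each picks its argmax response."""
    i, j = rng.choice(ENSEMBLE, size=2, replace=False)
    a = int(np.argmax(candidate_feats @ heads[i]))
    b = int(np.argmax(candidate_feats @ heads[j]))
    return a, b

def update(head_idx, feat_win, feat_lose):
    """Bradley-Terry gradient step on one ensemble member from the observed preference."""
    diff = feat_win - feat_lose
    p = 1.0 / (1.0 + np.exp(-(heads[head_idx] @ diff)))
    heads[head_idx] += LR * (1.0 - p) * diff

# Toy online loop: each round, sample candidate responses for a prompt,
# actively select a duel via Thompson sampling, query the preference oracle,
# and update a randomly chosen ensemble member (bootstrap-style update).
true_reward = rng.normal(size=DIM)  # hidden preference direction of the oracle
for step in range(200):
    candidates = rng.normal(size=(8, DIM))  # stand-in for 8 sampled responses
    a, b = thompson_duel(candidates)
    a_wins = (candidates[a] - candidates[b]) @ true_reward > 0
    win, lose = (a, b) if a_wins else (b, a)
    update(rng.integers(ENSEMBLE), candidates[win], candidates[lose])
```

In this sketch, exploration comes from drawing two independent posterior samples per round, so duels concentrate on responses the current reward posterior is uncertain about; an actual LLM setting would replace the toy features with model-generated responses and the synthetic oracle with human or reward-model feedback.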
