NeurIPS The Crucial Role of Samplers in Online Direct Preference Optimization

Poster in Workshop: Mathematics of Modern Machine Learning (M3L)

Ruizhe Shi · Runlong Zhou · Simon Du

Keywords: online DPO, direct preference optimization, multi-armed bandit

Abstract:

In this paper, we provide a rigorous analysis of DPO's convergence rates with different sampling strategies under the exact gradient setting, revealing a separation: uniform sampling achieves linear convergence, while our proposed online sampler achieves quadratic convergence. We further adapt the sampler to practical settings by incorporating posterior distributions and logit mixing, demonstrating significant improvements over previous approaches. Our results not only offer insights into the theoretical standing of DPO but also pave the way for potential algorithm designs in the future.
