

Poster
in
Workshop: Mathematics of Modern Machine Learning (M3L)

Declarative characterizations of direct preference alignment algorithms

Kyle Richardson · Vivek Srikumar · Ashish Sabharwal

Keywords: [ LLM alignment ] [ logical inference ] [ neuro-symbolic ]


Abstract:

Recent direct preference alignment algorithms (DPA), such as DPO, have shown great promise in aligning large language models to human preferences. While this has motivated the development of many new variants of the original DPO loss, understanding the differences between these recent proposals, as well as developing new DPA loss functions, remains difficult given the lack of a technical and conceptual framework for reasoning about the underlying semantics of these algorithms. In this paper, we attempt to remedy this by formalizing DPA losses in terms of discrete reasoning problems. Specifically, we ask: Given an existing DPA loss, can we systematically derive a symbolic expression that characterizes its semantics? How do the semantics of two losses relate to each other? We propose a novel formalism for characterizing preference losses for single model and reference model based approaches, and identify symbolic forms for a number of commonly used DPA variants. Further, we show how this formal view of preference learning sheds new light on both the size and structure of the DPA loss landscape, making it possible to not only rigorously characterize the relationships between recent loss proposals but also to systematically explore the landscape and derive new loss functions from first principles. We hope our framework and findings will help provide useful guidance to those working on human AI alignment.
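For context (not part of the poster page itself): the reference-model-based DPO loss that the abstract refers to is commonly written in the following form, where $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is the frozen reference model, $\beta$ is a temperature-like hyperparameter, and $(x, y_w, y_l)$ is a prompt with a preferred and a dispreferred response; this is a standard statement of the loss rather than the paper's own symbolic characterization of it:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

The single-model variants mentioned in the abstract drop the $\pi_{\mathrm{ref}}$ terms and score the policy's own log-probabilities directly.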
