Poster in Workshop: Causality and Large Models

Estimating Effects of Tokens in Preference Learning

Hsiao-Ru Pan · Maximilian Mordig · Bernhard Schölkopf

Keywords: [ preference learning ] [ causal effect ] [ RLHF ]


Abstract:

Recently, it was shown that the advantage function in reinforcement learning (RL) can be interpreted as the causal effect of actions on the return. In the present work, we first cast the problem of RL from human feedback (RLHF) with pairwise preference data as a two-player game and generalize Direct Advantage Estimation, a method for estimating the advantage function, to this natural language setting. This enables us to quantify and estimate the causal effects of individual tokens on the preference. We apply our method to the Anthropic HH-RLHF dataset and demonstrate that it can estimate the effect of individual tokens on the overall preference.
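To make the idea of per-token effects concrete, here is a minimal illustrative sketch, not the paper's actual method: it assumes a Bradley-Terry-style link in which per-token effect (advantage) estimates for two candidate responses are summed into scores, and the preference probability is the sigmoid of their difference. The function name `preference_probability` and the numeric effect values are hypothetical.

```python
import math

def preference_probability(token_effects_a, token_effects_b):
    """Probability that response A is preferred over response B,
    assuming a Bradley-Terry-style link on summed per-token effects.
    Illustrative assumption only; not the paper's exact objective."""
    score_a = sum(token_effects_a)
    score_b = sum(token_effects_b)
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Hypothetical per-token effect estimates for two candidate responses.
effects_chosen = [0.4, -0.1, 0.7, 0.2]
effects_rejected = [0.1, -0.3, 0.0, 0.05]
print(preference_probability(effects_chosen, effects_rejected))  # ~0.79
```

Under this reading, each token's estimated effect describes how much it shifts the preference toward or away from the response containing it, which is what makes token-level attribution possible.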
