Poster in Workshop: Fine-Tuning in Modern Machine Learning: Principles and Scalability
Estimating Effects of Tokens in Preference Learning
Hsiao-Ru Pan · Maximilian Mordig · Bernhard Schölkopf
Abstract:
Recently, it was shown that the advantage function in reinforcement learning (RL) can be interpreted as the causal effect of actions on the return. In the present work, we first cast the problem of RL from human feedback (RLHF) with pairwise preference data as a two-player game and then generalize Direct Advantage Estimation, a method for estimating the advantage function, to this natural language setting. This enables us to quantify and estimate the causal effects of tokens on the preference. Applying the method to the Anthropic HH-RLHF dataset, we demonstrate that it can estimate the effect of individual tokens on the overall preference.
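For reference, the advantage function mentioned in the abstract is conventionally defined as sketched below. These are the standard RL definitions under a policy with a scalar return, not the paper's generalization to pairwise preferences, which the abstract does not spell out. The symbols (G, π, s, a) are assumed notation.

```latex
% Standard RL definitions (notation assumed, not taken from the abstract):
% G is the return, \pi the policy, s a state, a an action.
\begin{align}
V^{\pi}(s)   &= \mathbb{E}_{\pi}\left[ G \mid s_0 = s \right], \\
Q^{\pi}(s,a) &= \mathbb{E}_{\pi}\left[ G \mid s_0 = s,\, a_0 = a \right], \\
A^{\pi}(s,a) &= Q^{\pi}(s,a) - V^{\pi}(s), \\
\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ A^{\pi}(s,a) \right] &= 0 .
\end{align}
```

One natural reading in a language-model setting, offered here as an assumption, is that s is the prompt plus the tokens generated so far and a is the next token, so that the advantage measures how much choosing that token shifts the expected outcome relative to the policy's average.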