Oral in Workshop: Language Gamification
Estimating Effects of Tokens in Preference Learning
Hsiao-Ru Pan · Maximilian Mordig · Bernhard Schölkopf
presentation: Language Gamification, Sat 14 Dec 8:20 a.m. PST — 5:30 p.m. PST
Oral: Sat 14 Dec 10:05 a.m. PST — 10:10 a.m. PST
Abstract:
Recently, it was shown that the advantage function in reinforcement learning (RL) can be interpreted as the causal effect of actions on the return. In the present work, we first cast the problem of RL from human feedback (RLHF) with pairwise preference data as a two-player game and generalize Direct Advantage Estimation, a method for estimating the advantage function, to this natural language setting. This enables us to quantify and estimate the causal effects of tokens on the preference. We apply our method to the Anthropic HH-RLHF dataset and demonstrate that it can estimate the effects of individual tokens on the overall preference.
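To make the quantities concrete, here is a minimal sketch in standard RLHF notation; it is an illustration under common assumptions, not necessarily the paper's exact construction. Pairwise preferences are typically modeled with a Bradley-Terry model, and the advantage function is the standard centered action-value gap:

\[
P(y_1 \succ y_2 \mid x) = \sigma\bigl(r(x, y_1) - r(x, y_2)\bigr), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},
\]
\[
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t), \qquad \mathbb{E}_{a \sim \pi(\cdot \mid s_t)}\bigl[A^{\pi}(s_t, a)\bigr] = 0.
\]

Here \(x\) is the prompt, \(y_1\) and \(y_2\) are candidate responses, and each generated token \(a_t\) is an action taken in context \(s_t\). Reading \(A^{\pi}(s_t, a_t)\) as the causal effect of emitting token \(a_t\) on the return, a preference-learning analogue of Direct Advantage Estimation would assign each token such a centered score and let the scores aggregated over a response play the role of \(r(x, y)\), so that a token's contribution to the pairwise preference can be read off directly.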