Oral in Workshop: Language Gamification
Estimating Effects of Tokens in Preference Learning
Hsiao-Ru Pan · Maximilian Mordig · Bernhard Schölkopf
presentation: Language Gamification, Sat 14 Dec 8:20 a.m. PST — 5:30 p.m. PST
Oral: Sat 14 Dec 10:05 a.m. PST — 10:10 a.m. PST
Abstract:
Recently, it was shown that the advantage function in reinforcement learning (RL) can be interpreted as the causal effect of actions on the return. In the present work, we first cast the problem of RL from human feedback (RLHF) with pairwise preference data as a two-player game and generalize Direct Advantage Estimation, a method for estimating the advantage function, to this natural language setting. This enables us to quantify and estimate the causal effects of tokens on the preference. We apply our method to the Anthropic HH-RLHF dataset and demonstrate that it can estimate the effects of individual tokens on the overall preference.
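To make the quantities concrete, here is a minimal sketch in standard RLHF notation; it is an illustration under common assumptions, not necessarily the paper's exact construction. Pairwise preferences are typically modeled with a Bradley-Terry model, and the advantage function is the standard centered action-value gap:

\[
P(y_1 \succ y_2 \mid x) = \sigma\bigl(r(x, y_1) - r(x, y_2)\bigr), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},
\]
\[
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t), \qquad \mathbb{E}_{a \sim \pi(\cdot \mid s_t)}\bigl[A^{\pi}(s_t, a)\bigr] = 0.
\]

Here \(x\) is the prompt, \(y_1\) and \(y_2\) are candidate responses, and each generated token \(a_t\) is an action taken in context \(s_t\). Reading \(A^{\pi}(s_t, a_t)\) as the causal effect of emitting token \(a_t\) on the return, a preference-learning analogue of Direct Advantage Estimation would assign each token such a centered score and let the scores aggregated over a response play the role of \(r(x, y)\), so that a token's contribution to the pairwise preference can be read off directly.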