Poster in Workshop: Pluralistic Alignment Workshop
Diverging Preferences: Why do Annotators Sensibly Disagree?
Michael Zhang · Zhilin Wang · Jena Hwang · Yi Dong · Olivier Delalleau · Yejin Choi · Eunsol Choi · Xiang Ren · Valentina Pyatkin
We examine examples with diverging preferences, instances where annotators disagree about which of two responses is preferred, in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes: Task Underspecification, Response Style, Refusals, and Genuine Errors. These findings are at odds with standard reward modeling approaches, which are designed under the assumption that disagreement in preference annotations is a symptom of undesirable noise rather than a reflection of the genuinely different preferences of different users. In our experiments, we demonstrate how standard reward modeling methods, such as the Bradley-Terry model, fail to differentiate whether a given preference judgment is the result of unanimous agreement among annotators or the majority opinion among diverging user preferences. This finding, that reward models tend to heavily favor a single response in cases of diverging preferences, highlights remaining challenges in training pluralistically aligned systems from human preferences.
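As a minimal sketch of this failure mode (not the paper's experimental setup), the snippet below fits the reward gap of a Bradley-Terry model to two hypothetical annotation patterns for the same response pair, one unanimous and one diverging. The annotation counts, function names, and grid-search granularity are illustrative assumptions. Because standard pipelines reduce each comparison to a single (often majority-vote) binary label, both patterns produce the identical training signal, so the fitted reward gap cannot reflect the underlying level of annotator agreement.

```python
import math


def bt_prob(gap: float) -> float:
    """Bradley-Terry probability that response A beats B, given reward gap r_A - r_B."""
    return 1.0 / (1.0 + math.exp(-gap))


def bt_loss(gap: float, labels: list[int]) -> float:
    """Total negative log-likelihood of binary preference labels (1 = A preferred)."""
    return sum(-math.log(bt_prob(gap) if y else 1.0 - bt_prob(gap)) for y in labels)


# Hypothetical annotation patterns for the same prompt and response pair.
unanimous = [1] * 10            # all 10 annotators prefer response A
diverging = [1] * 6 + [0] * 4   # 6 prefer A, 4 prefer B

for name, labels in [("unanimous", unanimous), ("diverging", diverging)]:
    # Standard pipelines keep a single (majority-vote) label per comparison,
    # so both patterns reduce to the same training signal: "A is preferred".
    majority = [int(sum(labels) > len(labels) / 2)]

    # Grid-search the reward gap r_A - r_B that minimizes the Bradley-Terry loss.
    gaps = [g / 10.0 for g in range(-50, 51)]
    fit_all_labels = min(gaps, key=lambda g: bt_loss(g, labels))
    fit_majority = min(gaps, key=lambda g: bt_loss(g, majority))

    print(f"{name:9s}  gap fit to all labels: {fit_all_labels:+.1f}   "
          f"gap fit to majority label only: {fit_majority:+.1f}")
```

Fitting to all ten labels recovers a modest gap for the diverging pair (about logit(0.6) ≈ 0.4) versus a saturated gap for the unanimous pair, whereas fitting to the single majority-vote label saturates the gap in both cases, so the model ends up heavily favoring one response regardless of how much annotators actually agreed.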