

Poster in Workshop: Pluralistic Alignment Workshop

"There are no solutions, only trade-offs.'' Taking A Closer Look At Safety Data Annotations.

Elle Michelle Yang · Matthias Gallé · Seraphina Goldfarb-Tarrant


Abstract: AI alignment, the last step in the training pipeline, ensures that large language models reflect desirable goals and values in order to improve helpfulness, reliability, and safety. Existing approaches typically rely on supervised learning algorithms with data labeled by human annotators, but annotators' sociodemographic and personal contexts shape how they label alignment objectives. In safety alignment particularly, labels are often ambiguous, and the ethical question of "What $\textit{should}$ an LLM do?" is even more perplexing and lacks a clear ground truth. We seek to understand the effects of aggregation on multi-annotated datasets with demographically diverse participants, and in particular its implications for safety judgments over subjective preferences. This paper offers a quantitative and qualitative analysis of aggregation methods on safety data and of their potential ramifications for alignment. Our results show that safety annotations are mutually contradictory and that existing strategies for reconciling these disagreements fail to resolve the contradiction. Crucially, we find that annotator labels are sensitive to intersectional differences that existing aggregation methods erase. We additionally explore evaluation perspectives from social choice theory; our findings suggest that social welfare metrics offer insight into the relative disadvantages to minority groups.
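The abstract does not spell out how majority aggregation can erase subgroup disagreement, or how a social welfare metric would surface it. The Python sketch below illustrates the general idea under stated assumptions: it is not the authors' method, and the group names, toy labels, and helper functions (`majority_vote`, `group_agreement`) are all hypothetical. It applies majority-vote aggregation to multi-annotator safety labels and then scores the outcome with two welfare metrics from social choice theory: the utilitarian mean of per-group agreement and the egalitarian (Rawlsian) minimum.

```python
from collections import Counter, defaultdict

# Hypothetical annotation records: (annotator_group, label) per item.
# Labels: 1 = "unsafe", 0 = "safe". Groups and data are illustrative only.
annotations = {
    "item_1": [("group_a", 1), ("group_a", 1),
               ("group_b", 0), ("group_b", 0), ("group_b", 0)],
    "item_2": [("group_a", 0), ("group_a", 1),
               ("group_b", 1), ("group_b", 1), ("group_b", 1)],
}

def majority_vote(labels):
    """Aggregate a list of labels into the single most common one."""
    return Counter(labels).most_common(1)[0][0]

def group_agreement(annotations):
    """Fraction of each group's labels that match the per-item majority vote."""
    hits, totals = defaultdict(int), defaultdict(int)
    for records in annotations.values():
        consensus = majority_vote([label for _, label in records])
        for group, label in records:
            totals[group] += 1
            hits[group] += int(label == consensus)
    return {g: hits[g] / totals[g] for g in totals}

agreement = group_agreement(annotations)

# Two social-welfare views of the same aggregation outcome:
utilitarian = sum(agreement.values()) / len(agreement)  # average group welfare
egalitarian = min(agreement.values())                   # worst-off group (Rawlsian)

print(agreement)
print(f"utilitarian welfare: {utilitarian:.2f}, egalitarian welfare: {egalitarian:.2f}")
```

On this toy data the minority group agrees with the aggregated label on only 25% of its votes while the majority group agrees on 100%, so the utilitarian average (0.62) looks tolerable while the egalitarian minimum (0.25) exposes the disadvantage, which is the kind of gap the abstract attributes to aggregation.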
