Poster in Affinity Event: Queer in AI

The Root Shapes the Fruit: On the Persistence of Gender-Exclusive Harms in Aligned Language Models

Anaelia Ovalle · Krunoslav Lehman Pavasovic · Louis Martin · Luke Zettlemoyer · Eric Michael Smith · Kai-Wei Chang · Adina Williams · Levent Sagun

Keywords: [ Gender Bias ] [ Inclusive NLP ] [ LLM Alignment ] [ Human-Centered NLP ] [ Queer Bias ]


Abstract:

Natural-language assistants are designed to provide users with helpful responses while avoiding harmful outputs, largely achieved through alignment with human preferences. Yet there is limited understanding of how pre-existing biases embedded in their base models persist or even amplify under alignment procedures. This gap is compounded by bias evaluation practices that skew heavily toward dominant social categories such as binary gender, leaving biases against minoritized populations poorly understood and therefore unaddressed. This work aims to understand these gaps more clearly, centering gender minorities in our investigation of gender bias across alignment stages. We conduct a systematic assessment of gender-diverse biases across 12 LLMs aligned with Direct Preference Optimization (DPO), uncovering harms that existing benchmarks fail to detect. We also introduce a novel evaluation framework to detect biases in implicit reward signals. While focused on gender-diverse contexts, this framework is adaptable to other social settings. Our findings reveal that alignment can inadvertently exacerbate existing gender-diverse disparities carried over from base models, with model behavior particularly sensitive to the supervised fine-tuning stage. These findings call for comprehensive bias evaluation frameworks, developed in collaboration with communities across diverse sociocultural contexts, to address the inequalities present in both model outcomes and alignment procedures.
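
The evaluation framework itself is not detailed in this abstract. As a rough illustration only, the sketch below assumes the standard DPO implicit reward, β[log π_θ(y|x) − log π_ref(y|x)], and compares the reward a DPO-aligned model implicitly assigns to minimally contrasting completions. The model names, prompt, completion pairs, and helper functions are hypothetical placeholders, not the authors' method or evaluation data.

```python
# Minimal sketch (not the authors' released code): probing the DPO implicit
# reward r_hat(x, y) = beta * [log pi_theta(y|x) - log pi_ref(y|x)] on
# minimally contrasting completions. All names below are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_logprob(model, tokenizer, prompt: str, completion: str) -> float:
    """Sum of log-probabilities of the completion tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only positions that predict completion tokens.
    return token_lp[:, prompt_ids.shape[1] - 1:].sum().item()

def implicit_reward(policy, reference, tokenizer, prompt, completion, beta=0.1):
    """DPO implicit reward, up to a prompt-only constant that cancels in comparisons."""
    return beta * (
        sequence_logprob(policy, tokenizer, prompt, completion)
        - sequence_logprob(reference, tokenizer, prompt, completion)
    )

# Hypothetical model identifiers, not the 12 LLMs evaluated in the paper.
policy = AutoModelForCausalLM.from_pretrained("my-dpo-aligned-model")
reference = AutoModelForCausalLM.from_pretrained("my-sft-reference-model")
tokenizer = AutoTokenizer.from_pretrained("my-dpo-aligned-model")

prompt = "Alex told me about their weekend. "
pairs = {
    "binary": "She said she went hiking.",
    "gender_diverse": "They said they went hiking.",
}
rewards = {k: implicit_reward(policy, reference, tokenizer, prompt, v)
           for k, v in pairs.items()}
# A systematic gap across many such minimal pairs would indicate a preference
# bias encoded in the implicit reward signal.
print(rewards)
```
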
