NeurIPS $\textit{Who Speaks Matters}$: Analysing the Influence of the Speaker’s Ethnicity on Hate Classification

Oral in Workshop: Safe Generative AI

Ananya Malik · Kartik Sharma · Lynnette Hui Xian Ng · Shaily Bhatt

presentation: Safe Generative AI
Sun 15 Dec 9 a.m. PST — 5 p.m. PST

Abstract:

Large Language Models (LLMs) hold considerable promise for scalable content moderation, including hate speech detection. However, they are also known to be brittle and biased against marginalised communities and dialects. Their application to high-stakes tasks like hate speech detection therefore requires critical scrutiny. In this work, we investigate the robustness of LLM-based hate speech classification when explicit and implicit markers of the speaker's ethnicity are injected into the input. For the explicit markers, we inject a phrase that mentions the speaker's identity; for the implicit markers, we inject dialectal features. By analysing how frequently model outputs flip in the presence of these markers, we reveal varying degrees of brittleness across 4 popular LLMs and 5 ethnicities. We find that implicit dialect markers in the input cause model outputs to flip more often than explicit markers do. Further, the percentage of flips varies across ethnicities. Finally, we find that larger models are more robust. Our findings indicate the need for caution when deploying LLMs for high-stakes tasks like hate speech detection.
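The flip analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the injection template, the placeholder identity `GroupA`, and the toy keyword classifier are all hypothetical stand-ins, and a real study would call an LLM in place of `toy_classify`. The sketch only shows how a flip percentage could be computed by comparing labels before and after a marker is injected.

```python
from typing import Callable, Iterable


def inject_explicit_marker(text: str, identity: str) -> str:
    """Prepend a phrase naming the speaker's identity (hypothetical template)."""
    return f"A {identity} person said: {text}"


def flip_rate(
    classify: Callable[[str], str],
    texts: Iterable[str],
    perturb: Callable[[str], str],
) -> float:
    """Return the percentage of inputs whose label changes after perturbation."""
    items = list(texts)
    if not items:
        return 0.0
    flips = sum(classify(t) != classify(perturb(t)) for t in items)
    return 100.0 * flips / len(items)


def toy_classify(text: str) -> str:
    """Deliberately brittle stand-in for an LLM: it reacts to the identity term."""
    lowered = text.lower()
    if "awful" in lowered or "groupa" in lowered:
        return "hate"
    return "not_hate"


texts = [
    "You are awful",
    "Nice weather today",
    "I love this song",
    "That was awful",
]

# Percentage of labels that change when the explicit marker is injected.
explicit_flip_rate = flip_rate(
    toy_classify, texts, lambda t: inject_explicit_marker(t, "GroupA")
)
print(f"Flip rate with explicit marker: {explicit_flip_rate:.1f}%")
```

The same `flip_rate` function could be reused for implicit markers by passing a different `perturb` callable, for example one that rewrites the text with dialectal features, and the resulting percentages compared across models and ethnicities.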
