Poster in Workshop: Regulatable ML: Towards Bridging the Gaps between Machine Learning Research and Regulations

Fairness Implications of Machine Unlearning: Bias Risks in Removing NSFW Content from Text-to-Image Models

Xiwen Wei · Guihong Li · Radu Marculescu


Abstract:

The rapid development of large-scale text-to-image generative models has raised significant concerns about their potential misuse in generating harmful, misleading, or inappropriate content. To address these safety issues, various machine unlearning methods have been proposed to efficiently remove not-safe-for-work (NSFW) content without the need for complete model re-training. While these unlearning methods effectively enhance model safety, their impact on model fairness remains largely unexplored. In this paper, we examine the fairness implications of NSFW content removal via machine unlearning and discover that some methods can unintentionally amplify existing biases, increasing them by up to 6x. Our findings reveal that this increased bias arises from the biased synthetic training data used during the unlearning process. To mitigate this bias, we employ Bayesian optimization to identify the optimal training data composition, thus balancing safety and fairness.
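To make the data-composition search concrete, here is a minimal, hypothetical sketch of how Bayesian optimization could be used to tune the makeup of the synthetic unlearning data, using scikit-optimize's gp_minimize. The evaluate_unlearning function, the two-group composition, and the trade-off weighting are illustrative assumptions rather than the authors' implementation; a real run would perform unlearning with the sampled composition and then measure NSFW removal and bias on the resulting model.

```python
# Sketch only: Gaussian-process Bayesian optimization over the composition of
# the synthetic training data used during unlearning. The evaluation function
# is a hypothetical placeholder standing in for one unlearning run followed by
# safety and fairness measurements on the unlearned text-to-image model.
import numpy as np
from skopt import gp_minimize
from skopt.space import Real


def evaluate_unlearning(composition):
    """Hypothetical stand-in: run unlearning with the given data composition
    and return (safety_score, bias_amplification) for the resulting model."""
    p_a, p_b = composition
    # Placeholder surrogate metrics; a real evaluation would unlearn the model
    # and probe its generations for residual NSFW content and demographic skew.
    safety = 1.0 - 0.1 * abs(p_a + p_b - 1.0)
    bias = abs(p_a - p_b)
    return safety, bias


def objective(x):
    # Normalize the raw weights into a data composition that sums to 1.
    weights = np.asarray(x)
    composition = weights / weights.sum()
    safety, bias = evaluate_unlearning(composition)
    # Minimize a weighted trade-off: penalize unsafe outputs and amplified bias.
    return (1.0 - safety) + 1.0 * bias


# Search space: relative proportions of two demographic groups in the
# synthetic unlearning data (a hypothetical two-group example).
space = [Real(0.05, 0.95, name="w_group_a"),
         Real(0.05, 0.95, name="w_group_b")]

result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("Best composition (normalized):",
      np.asarray(result.x) / np.sum(result.x))
```

In this sketch each optimization step corresponds to one unlearning run, so the number of calls (n_calls) directly controls the compute budget; the weighting between the safety and bias terms in the objective encodes how the safety-fairness trade-off is prioritized.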
