Poster Session in Workshop: Scientific Methods for Understanding Neural Networks
Effectiveness of Sparse Autoencoder for understanding and removing gender bias in LLMs
Praveen Hegde
Gender bias in large language models (LLMs) perpetuates harmful stereotypes and unfair outcomes in AI applications. While traditional bias mitigation methods such as fine-tuning and activation steering can be effective, they often require substantial data curation and computational resources. This paper highlights the dual utility of Sparse Autoencoders (SAEs) for both detecting and mitigating these biases. We demonstrate how SAEs help identify bias-inducing components within LLMs, enabling more targeted and efficient bias mitigation without extensive model retraining or specialized datasets. Our findings suggest that SAEs offer a promising approach for improving both the interpretability and the efficiency of bias mitigation in LLMs.
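To make the general workflow concrete, the sketch below illustrates one way an SAE can be used to locate and ablate gender-associated features. It is not the authors' implementation: the single-layer ReLU SAE, the mean-activation-gap ranking of latents, the zero-ablation step, and the random placeholder activations are all assumptions chosen for illustration; in practice the activations would come from an LLM's residual stream on gender-contrastive prompts, and the SAE would be pre-trained on that stream.

```python
# Hypothetical sketch: identify gender-associated SAE latents and ablate them.
# All names, shapes, and thresholds here are illustrative assumptions.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Single-layer SAE: activations -> sparse latents -> reconstruction."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(x))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)


def find_bias_latents(sae, acts_a, acts_b, top_k=10):
    """Rank latents by the gap in mean activation between two contrastive
    prompt sets (e.g. male- vs. female-referencing prompts)."""
    with torch.no_grad():
        gap = (sae.encode(acts_a).mean(0) - sae.encode(acts_b).mean(0)).abs()
    return torch.topk(gap, top_k).indices


def ablate_latents(sae, acts, latent_ids):
    """Reconstruct activations with the selected latents zeroed out."""
    with torch.no_grad():
        z = sae.encode(acts)
        z[:, latent_ids] = 0.0
        return sae.decode(z)


if __name__ == "__main__":
    d_model, d_latent = 512, 4096
    sae = SparseAutoencoder(d_model, d_latent)
    # Placeholder activations; in practice these would be residual-stream
    # activations collected from an LLM on gender-contrastive prompts.
    acts_male = torch.randn(64, d_model)
    acts_female = torch.randn(64, d_model)
    bias_ids = find_bias_latents(sae, acts_male, acts_female)
    debiased = ablate_latents(sae, acts_male, bias_ids)
    print("ablated latent indices:", bias_ids.tolist())
    print("debiased activation shape:", tuple(debiased.shape))
```

In a full pipeline the debiased reconstruction would be patched back into the model's forward pass in place of the original activations, which is what lets this approach avoid retraining: only a small set of interpretable latents is modified at inference time.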