Poster in Workshop: Workshop on Behavioral Machine Learning
pSAE-chiatry: Utilizing Sparse Autoencoders to Uncover Mental-Health-Related Features in Language Models
Declan Grabb
As AI-powered mental health chatbots become more prevalent, their inability to recognize and respond to psychiatric emergencies, such as suicidality and mania, raises significant safety concerns. In this study, I explore the internal representations of mental-health-related features (MHRFs) in the Gemma-2-2B language model, focusing on crises related to suicide, mania, and psychosis. Using sparse autoencoders and psychiatric expertise, I searched for MHRFs across all 25 layers of the model and found 29 features related to suicide and 42 related to sadness. However, I did not identify any features related to mania or paranoia, suggesting critical gaps in the model’s ability to handle complex psychiatric symptoms. Furthermore, suicide-related prompts triggered higher activation of a suicide-related feature than homicide-related prompts did, supporting the relevance of the identified features. As a proof of concept, I demonstrate that steering Gemma-2-2B by enhancing a suicide-related MHRF causally impacts model behavior. These findings underscore the need for improved feature identification and modulation within AI models to enhance their safety and effectiveness in mental health care applications. Future work should focus on amplifying helpful MHRFs while suppressing harmful ones to prevent AI from causing unintended harm in psychiatric crises.

Trigger warning: This paper contains sensitive mental health topics, including suicide.
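The two operations the abstract relies on, reading a feature's activation for a prompt and steering the model by enhancing that feature, can be illustrated with a minimal sketch. This is not the author's code: it uses toy random weights in place of a pretrained sparse autoencoder, a plain ReLU encoder as a simplifying assumption, and hypothetical names throughout (`sae_encode`, `mean_feature_activation`, `steer`, `feature_idx`). The residual-stream arrays stand in for activations that would normally be captured from one layer of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64  # toy sizes, not the real model's dimensions

# Hypothetical SAE parameters; a real study would load pretrained weights.
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm feature directions


def sae_encode(resid):
    """Map residual activations (tokens, d_model) to non-negative feature activations."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)  # ReLU encoder (assumption)


def mean_feature_activation(resid, feature_idx):
    """Average activation of one feature over all token positions of a prompt."""
    return float(sae_encode(resid)[:, feature_idx].mean())


def steer(resid, feature_idx, strength):
    """Add a scaled copy of the feature's decoder direction at every token position."""
    return resid + strength * W_dec[feature_idx]


feature_idx = 3  # hypothetical index of a feature of interest
resid_a = rng.normal(size=(5, d_model))  # stand-in for prompt A (5 tokens)
resid_b = rng.normal(size=(7, d_model))  # stand-in for prompt B (7 tokens)

# Compare how strongly the same feature fires on two different prompts.
act_a = mean_feature_activation(resid_a, feature_idx)
act_b = mean_feature_activation(resid_b, feature_idx)

# Steering: push prompt A's activations along the feature's decoder direction.
steered = steer(resid_a, feature_idx, strength=8.0)
```

In an actual experiment the steered activations would be written back into the model's forward pass at the chosen layer so that later layers, and therefore the generated text, are affected. The steering strength of 8.0 is an arbitrary illustrative value.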