Oral in Workshop: Foundation Models for Science: Progress, Opportunities, and Challenges
Towards Interpretable Scientific Foundation Models: Sparse Autoencoders for Disentangling Dense Embeddings of Scientific Concepts
Charles O'Neill · Christine Ye · Kartheik Iyer · John Wu
Keywords: [ AI for science ] [ foundation models ] [ large language models ] [ mechanistic interpretability ]
Sun 15 Dec 8:30 a.m. PST — 5 p.m. PST
The prevalence of foundation models in scientific applications motivates the need for interpretable representations of scientific concepts and interpretable search over them. In this work, we present a novel approach using sparse autoencoders (SAEs) to disentangle dense embeddings from large language models, offering a pathway towards more interpretable scientific foundation models. By training SAEs on embeddings of over 425,000 scientific paper abstracts spanning computer science and astronomy, we demonstrate their effectiveness in extracting interpretable features while maintaining semantic fidelity. We identify and analyze SAE features that directly correspond to scientific concepts, and introduce a novel method for identifying 'families' of related concepts at varying levels of abstraction. To illustrate the practical utility of our approach, we demonstrate how interpretable SAE features can precisely steer semantic search over scientific literature, allowing fine-grained control over query semantics. This work not only bridges the gap between the semantic richness of dense embeddings and the interpretability needed for scientific applications, but also offers new directions for improving literature review and scientific discovery. For use by the scientific community, we open-source our embeddings, trained sparse autoencoders, and interpreted features, along with a web app for interactive literature search.
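To make the abstract's pipeline concrete, below is a minimal sketch (not the authors' released code) of the core idea: a sparse autoencoder trained to reconstruct dense text embeddings through a wide, sparsity-penalised hidden layer, followed by a toy illustration of steering a query embedding by boosting one learned feature. The dimensions, the L1 coefficient, the feature index, and the random stand-in data are all illustrative assumptions.

```python
# Sketch of an SAE over dense embeddings and feature-based query steering.
# All hyperparameters and data here are placeholders, not the paper's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    def __init__(self, d_embed: int = 1536, d_hidden: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_embed, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_embed)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative and sparse.
        f = F.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction fidelity plus an L1 penalty that encourages few active features.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().mean()


if __name__ == "__main__":
    sae = SparseAutoencoder()
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

    # Stand-in for a batch of abstract embeddings from an embedding model.
    embeddings = torch.randn(64, 1536)

    for _ in range(10):  # a few illustrative training steps
        x_hat, f = sae(embeddings)
        loss = sae_loss(embeddings, x_hat, f)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # "Steering": boost the activation of one (hypothetical) interpretable
    # feature in a query embedding, then decode back to embedding space,
    # where the steered vector can be used for nearest-neighbour search.
    query = torch.randn(1, 1536)
    _, f_q = sae(query)
    feature_idx = 123            # placeholder index for some learned concept
    f_q[0, feature_idx] += 5.0   # strengthen that concept in the query
    steered_query = sae.decoder(f_q)
```

The steering step reflects the workflow described in the abstract only at a schematic level: because each hidden unit of a well-trained SAE tends to correspond to a single concept, adding activation to one unit and decoding yields a query embedding whose semantics shift toward that concept in a controlled way.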