Oral in Workshop: Foundation Models for Science: Progress, Opportunities, and Challenges
Towards Interpretable Scientific Foundation Models: Sparse Autoencoders for Disentangling Dense Embeddings of Scientific Concepts
Charles O'Neill · Christine Ye · Kartheik Iyer · John Wu
Keywords: [ AI for science ] [ foundation models ] [ large language models ] [ mechanistic interpretability ]
Sun 15 Dec 8:30 a.m. PST — 5 p.m. PST
The prevalence of foundation models in scientific applications motivates the need for interpretable representations of scientific concepts and interpretable search over them. In this work, we present a novel approach using sparse autoencoders (SAEs) to disentangle dense embeddings from large language models, offering a pathway towards more interpretable scientific foundation models. By training SAEs on embeddings of over 425,000 scientific paper abstracts spanning computer science and astronomy, we demonstrate their effectiveness in extracting interpretable features while maintaining semantic fidelity. We identify and analyze SAE features that directly correspond to scientific concepts, and introduce a novel method for identifying 'families' of related concepts at varying levels of abstraction. To illustrate the practical utility of our approach, we demonstrate how interpretable SAE features can precisely steer semantic search over scientific literature, allowing fine-grained control over query semantics. This work not only bridges the gap between the semantic richness of dense embeddings and the interpretability needed for scientific applications, but also offers new directions for improving literature review and scientific discovery. For use by the scientific community, we open-source our embeddings, trained sparse autoencoders, and interpreted features, along with a web app for interactive literature search.
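To make the abstract's pipeline concrete, below is a minimal sketch (not the authors' released code) of the core idea: a sparse autoencoder trained to reconstruct dense text embeddings through a wide, sparsity-penalised hidden layer, followed by a toy illustration of steering a query embedding by boosting one learned feature. The dimensions, the L1 coefficient, the feature index, and the random stand-in data are all illustrative assumptions.

```python
# Sketch of an SAE over dense embeddings and feature-based query steering.
# All hyperparameters and data here are placeholders, not the paper's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    def __init__(self, d_embed: int = 1536, d_hidden: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_embed, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_embed)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative and sparse.
        f = F.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction fidelity plus an L1 penalty that encourages few active features.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().mean()


if __name__ == "__main__":
    sae = SparseAutoencoder()
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

    # Stand-in for a batch of abstract embeddings from an embedding model.
    embeddings = torch.randn(64, 1536)

    for _ in range(10):  # a few illustrative training steps
        x_hat, f = sae(embeddings)
        loss = sae_loss(embeddings, x_hat, f)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # "Steering": boost the activation of one (hypothetical) interpretable
    # feature in a query embedding, then decode back to embedding space,
    # where the steered vector can be used for nearest-neighbour search.
    query = torch.randn(1, 1536)
    _, f_q = sae(query)
    feature_idx = 123            # placeholder index for some learned concept
    f_q[0, feature_idx] += 5.0   # strengthen that concept in the query
    steered_query = sae.decoder(f_q)
```

The steering step reflects the workflow described in the abstract only at a schematic level: because each hidden unit of a well-trained SAE tends to correspond to a single concept, adding activation to one unit and decoding yields a query embedding whose semantics shift toward that concept in a controlled way.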