

Poster in Workshop: Foundation Model Interventions

Steering semantic search with interpretable features from sparse autoencoders

Christine Ye · Charles O'Neill · John Wu · Kartheik Iyer

Keywords: [ intervention ] [ semantic search ] [ interpretability ]


Abstract: Modern information retrieval systems increasingly rely on dense neural vector embeddings, but dense embeddings of text are inherently difficult to interpret and steer, leading to opaque and potentially biased results. Sparse autoencoders (SAEs) have previously shown promise in extracting interpretable features from complex neural networks. In this work, we present the application of SAEs to dense text embeddings from large language models, demonstrating their effectiveness in disentangling document-level semantic concepts. By training SAEs on embeddings of over 420,000 scientific paper abstracts from computer science and astronomy, we show that the resulting sparse representations maintain semantic fidelity while offering high levels of interpretability. In the context of a semantic search system for scientific literature, we demonstrate that interpretable SAE features can be used to precisely steer information retrieval, allowing for fine-grained modifications of queries. At a given fidelity level to the original query, SAE feature interventions can be interpreted with $\sim$10\% higher accuracy, while maintaining overall quality of information retrieval. We open source our embeddings, trained sparse autoencoders, and interpreted features, as well as a web app for exploring them.
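The abstract does not include implementation details, but the described pipeline (encode a dense query embedding with a trained SAE, adjust the activation of one interpretable feature, decode back to embedding space, and retrieve by similarity) can be sketched as below. This is a minimal illustration under assumptions: the ReLU SAE architecture, the class and function names (SparseAutoencoder, steer_query, retrieve), and the cosine-similarity retrieval step are all illustrative, not the authors' released code or exact method.

```python
# Illustrative sketch (not the authors' released implementation): steering a dense
# query embedding via a sparse autoencoder (SAE) feature intervention, then
# retrieving documents by cosine similarity. The ReLU SAE and the sizes below
# are assumptions; the paper's open-sourced SAEs may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_embed: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_embed, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_embed)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Sparse, non-negative feature activations.
        return F.relu(self.encoder(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        # Map feature activations back to the dense embedding space.
        return self.decoder(f)

def steer_query(sae: SparseAutoencoder,
                query_emb: torch.Tensor,
                feature_id: int,
                strength: float) -> torch.Tensor:
    """Boost (or suppress) a single interpretable SAE feature in the query."""
    f = sae.encode(query_emb)
    f[..., feature_id] = strength       # intervene on one named feature
    return sae.decode(f)                # steered dense query embedding

def retrieve(query_emb: torch.Tensor, doc_embs: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Return indices of the top-k documents by cosine similarity."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), doc_embs, dim=-1)
    return sims.topk(k).indices

if __name__ == "__main__":
    # Random tensors stand in for real abstract embeddings and a trained SAE.
    d_embed, d_hidden, n_docs = 1536, 8192, 1000   # illustrative sizes only
    sae = SparseAutoencoder(d_embed, d_hidden)
    query = torch.randn(d_embed)
    docs = torch.randn(n_docs, d_embed)
    steered = steer_query(sae, query, feature_id=42, strength=5.0)
    print(retrieve(steered, docs, k=5))
```

In a real system, the SAE would be the one trained on the paper-abstract embeddings, and feature_id would be chosen from the interpreted feature dictionary rather than set arbitrarily.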
