Poster in Workshop: Interpretable AI: Past, Present and Future
A is for Absorption: Studying Sparse Autoencoder Feature Splitting and Absorption in Spelling Tasks
James Wilken-Smith · Tomáš Dulka · David Chanin · Hardik Bhatnagar · Joseph Bloom
While Large Language Models (LLMs) have become increasingly capable in recent years, scaling interpretability methods to match them remains a significant challenge. Sparse Autoencoders (SAEs) have emerged as a promising approach for decomposing LLM activations into human-interpretable features. However, the criteria by which we should evaluate the quality of an SAE, and how SAEs might be useful in practice, remain subjects of debate. In this paper we use a simple first-letter identification task as a case study to evaluate the ability of SAEs to extract interpretable features from hidden activations. We find that the correspondence between SAE features and the directions found by linear probes is sensitive to the width and sparsity of the SAE, and we also identify a pernicious form of feature splitting we call "feature absorption", which may present an obstacle to the practical use of SAEs.
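The abstract compares SAE features against linear-probe directions on a first-letter task. As a rough illustration of that kind of comparison (not the authors' code), the sketch below trains a probe for "token starts with 'a'" and scores each SAE decoder row by cosine similarity to the probe direction; `hidden_acts`, `first_letters`, and `sae_decoder` are hypothetical stand-ins for quantities one would extract from an LLM and a trained SAE.

```python
# Minimal sketch, assuming access to model activations and SAE decoder weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n_tokens, n_features = 64, 1000, 512

# Stand-in data; in practice these would come from LLM hidden states and a trained SAE.
hidden_acts = rng.normal(size=(n_tokens, d_model))        # (tokens, d_model)
first_letters = rng.integers(0, 26, size=n_tokens)         # 0 = 'a', 1 = 'b', ...
sae_decoder = rng.normal(size=(n_features, d_model))       # one direction per SAE feature

# Binary linear probe: does the token's first letter equal 'a'?
labels = (first_letters == 0).astype(int)
probe = LogisticRegression(max_iter=1000).fit(hidden_acts, labels)
probe_dir = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# Cosine similarity of each SAE decoder direction to the probe direction.
decoder_unit = sae_decoder / np.linalg.norm(sae_decoder, axis=1, keepdims=True)
cosines = decoder_unit @ probe_dir
print("best-matching SAE feature:", cosines.argmax(), "cosine:", cosines.max())
```

On random stand-in data the best cosine will be small; the interesting question in the paper is how closely real SAE features track the probe direction as SAE width and sparsity vary.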