Poster Session in Workshop: Scientific Methods for Understanding Neural Networks
Sparse autoencoders for dense text embeddings reveal hierarchical feature sub-structure
Christine Ye · Charles O'Neill · John Wu · Kartheik Iyer
Sparse autoencoders (SAEs) show promise in extracting interpretable features from complex neural networks, enabling examination of, and causal intervention in, the inner workings of black-box models. However, the geometry and completeness of SAE features are not fully understood, limiting their interpretability and usefulness. In this work, we train SAEs to disentangle dense text embeddings into highly interpretable document-level features. Our SAEs follow precise scaling laws as a function of capacity and compute, and exhibit higher interpretability scores than SAEs trained on language model activations. In embedding SAEs, we reproduce the qualitative "feature splitting" phenomena previously reported in language model SAEs, and demonstrate the existence of universal, cross-domain features. Finally, we suggest the existence of "feature families" in SAEs, and develop a method that reveals distinct hierarchical clusters of related semantic concepts and maps feature co-activations onto a sparse block-diagonal structure.