

Poster Session in Workshop: Scientific Methods for Understanding Neural Networks

Sparse autoencoders for dense text embeddings reveal hierarchical feature sub-structure

Christine Ye · Charles O'Neill · John Wu · Kartheik Iyer

Sun 15 Dec 11:20 a.m. PST — 12:20 p.m. PST

Abstract:

Sparse autoencoders (SAEs) show promise in extracting interpretable features from complex neural networks, enabling examination and causal intervention in the inner workings of black-box models. However, the geometry and completeness of SAE features are not fully understood, limiting their interpretability and usefulness. In this work, we train SAEs to disentangle dense text embeddings into highly interpretable document-level features. Our SAEs follow precise scaling laws as a function of capacity and compute, and exhibit higher interpretability scores than SAEs trained on language model activations. In embedding SAEs, we reproduce the qualitative "feature splitting" phenomena previously reported in language model SAEs, and demonstrate the existence of universal, cross-domain features. Finally, we suggest the existence of "feature families" in SAEs, and develop a method to reveal distinct hierarchical clusters of related semantic concepts and map feature co-activations to a sparse block-diagonal structure.
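To make the setup concrete, the sketch below shows a minimal sparse autoencoder over dense text embeddings, assuming a standard ReLU-plus-L1 formulation; the hyperparameters, dimensions, and training loop are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of a sparse autoencoder (SAE) over dense text embeddings.
# Assumes a ReLU latent with an L1 sparsity penalty; all settings are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    def __init__(self, d_embed: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_embed, d_hidden)  # dense embedding -> feature activations
        self.decoder = nn.Linear(d_hidden, d_embed)  # feature activations -> reconstruction

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))   # non-negative, sparse feature activations
        x_hat = self.decoder(f)       # reconstructed embedding
        return x_hat, f


def train_step(model, opt, batch, l1_coeff: float = 1e-3) -> float:
    x_hat, f = model(batch)
    # Reconstruction loss plus sparsity penalty on feature activations.
    loss = F.mse_loss(x_hat, batch) + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    d_embed, d_hidden = 768, 4096  # overcomplete hidden layer, typical for SAEs (hypothetical sizes)
    model = SparseAutoencoder(d_embed, d_hidden)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    batch = torch.randn(64, d_embed)  # stand-in for a batch of document embeddings
    print(train_step(model, opt, batch))
```

In this framing, each hidden unit is a candidate document-level feature, and analyses such as feature splitting, feature families, and co-activation structure operate on the learned encoder/decoder weights and activation patterns.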
