Poster Session in Workshop: Scientific Methods for Understanding Neural Networks
BatchTopK Sparse Autoencoders
Bart Bussmann · Patrick Leask · Neel Nanda
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting language model activations by decomposing them into sparse, interpretable features. A popular approach is the TopK SAE, which uses a fixed number of the most active latents per sample to reconstruct the model activations. We introduce BatchTopK SAEs, a training method that improves upon TopK SAEs by relaxing the top-k constraint from the sample level to the batch level, allowing a variable number of latents to be active per sample. BatchTopK SAEs consistently outperform TopK SAEs at reconstructing activations from GPT-2 Small and Gemma 2 2B. BatchTopK SAEs achieve reconstruction performance comparable to the state-of-the-art JumpReLU SAE, but have the advantage that the average number of active latents can be specified directly rather than approximately tuned through a costly hyperparameter sweep. We provide code for training and evaluating BatchTopK SAEs at [redacted].
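To make the batch-level relaxation concrete, below is a minimal sketch of a BatchTopK activation as described in the abstract, assuming PyTorch and a precomputed matrix of non-negative encoder latents of shape (batch_size, n_latents). The function name `batch_topk` and its interface are illustrative assumptions, not the authors' released implementation.

```python
import torch


def batch_topk(latents: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k * batch_size largest activations across the whole batch.

    latents: non-negative encoder activations, shape (batch_size, n_latents).
    Unlike per-sample TopK, an individual sample may keep more or fewer than
    k latents, but the batch retains k active latents per sample on average.
    """
    batch_size, n_latents = latents.shape
    flat = latents.flatten()
    # Select the k * batch_size largest activations over the entire batch.
    top = torch.topk(flat, k=k * batch_size)
    mask = torch.zeros_like(flat)
    mask[top.indices] = 1.0
    return (flat * mask).view(batch_size, n_latents)


# Example: a batch of 4 samples with 16 latents each, targeting k = 3 on average.
latents = torch.relu(torch.randn(4, 16))
sparse = batch_topk(latents, k=3)
# Per-sample counts can differ from k, while the batch keeps k * batch_size positions.
print((sparse > 0).sum(dim=1))
```

The design choice this illustrates is that sparsity is budgeted across the batch rather than enforced per sample, which is what lets "easier" samples use fewer latents and "harder" samples use more while the average stays at k.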