Poster Session in Workshop: Scientific Methods for Understanding Neural Networks
Stitching Sparse Autoencoders of Different Sizes
Patrick Leask · Bart Bussmann · Joseph Bloom · Curt Tigges · Noura Al Moubayed · Neel Nanda
Sparse autoencoders (SAEs) are a promising method for decomposing the activations of language models into a learned dictionary of latents, the size of which is a key hyperparameter. However, the effect of dictionary size on the learned latents remains poorly understood. In this work, we investigate how increasing the dictionary size of SAEs trained on the activations of GPT-2 and Pythia-410M affects their latents. We find that latents in larger SAEs fall into two distinct categories: reconstruction latents, which are either present in smaller SAEs or are more fine-grained versions of their latents, and novel latents, which capture information missed by smaller SAEs. Novel latents can be inserted into a smaller SAE to improve its performance, while reconstruction latents degrade it. The existence of novel latents in larger SAEs suggests that researchers may be using SAEs that miss features crucial to the task under study. The category of a latent can be effectively predicted with a cheap proxy: its maximum cosine similarity with the latents in the smaller SAE's decoder. Novel latents have low maximum cosine similarity, whereas reconstruction latents have high. Utilizing this insight, we introduce SAE stitching: a method that inserts or swaps novel latents from a larger SAE into a smaller one, allowing for smooth interpolation between SAE sizes with monotonically decreasing reconstruction error. Our findings shed light on the trade-offs between dictionary size, sparsity, and reconstruction performance in SAEs, enhancing the understanding of feature learning in these models.
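The cosine-similarity proxy described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' released code: decoder directions are taken as matrix rows, and the `0.7` threshold separating novel from reconstruction latents is a hypothetical choice for demonstration.

```python
import numpy as np

def max_cosine_similarity(dec_large, dec_small):
    """For each decoder direction (row) of the larger SAE, return its
    maximum cosine similarity with any decoder direction of the smaller SAE."""
    # Normalize rows to unit length so dot products equal cosine similarities.
    large = dec_large / np.linalg.norm(dec_large, axis=1, keepdims=True)
    small = dec_small / np.linalg.norm(dec_small, axis=1, keepdims=True)
    return (large @ small.T).max(axis=1)

def stitch_novel_latents(dec_large, dec_small, threshold=0.7):
    """Append the larger SAE's low-similarity ("novel") decoder directions
    to the smaller SAE's decoder; threshold is a hypothetical cutoff."""
    sims = max_cosine_similarity(dec_large, dec_small)
    novel = dec_large[sims < threshold]
    return np.vstack([dec_small, novel])

# Toy example: a small SAE with 2 decoder directions in a 4-dim space,
# and a larger SAE whose first direction duplicates one of them.
dec_small = np.eye(4)[:2]           # directions e0, e1
dec_large = np.eye(4)[[0, 2, 3]]    # e0 (reconstruction), e2 and e3 (novel)
stitched = stitch_novel_latents(dec_large, dec_small)
```

In this toy case the duplicated direction scores a maximum cosine similarity of 1 and is discarded, while the two orthogonal directions score 0 and are appended, yielding a stitched decoder of four directions.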