Poster in Workshop: Attributing Model Behavior at Scale (ATTRIB)
Evaluating Sparse Autoencoders for Controlling Open-Ended Text Generation
Aleksandar Makelov · Nathaniel Monson · Julius Adebayo
Sparse autoencoders (SAEs) have surged in popularity within the mechanistic interpretability community as a promising method to scalably disentangle the activations of large language models (LLMs) into human-interpretable 'variables'. However, there is still insufficient understanding of how to evaluate SAEs via metrics more meaningful than hard-to-interpret aggregates like reconstruction loss and sparsity.

As a step toward this, in this paper we propose evaluations of the usefulness of SAEs for precisely controlling open-ended text generation in the 'minimum viable setup' provided by the TinyStories dataset and models [Eldan and Li, 2023]. We extend recent work [Conmy and Nanda, 2024] by building an automatic pipeline that extracts steering vectors [Turner et al., 2023] for concepts in texts in an unsupervised way, and we compare these steering vectors to SAE latents.

After evaluating the effectiveness of our steering vectors for controlling text generation, we look for SAE latents with similar control properties. We find that, in a significant number of cases, single SAE latents can improve on the Pareto front between steering success and coherence achieved by steering vectors.
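To make the comparison concrete, the sketch below illustrates activation steering in the style of Turner et al. [2023]: a fixed vector, scaled, is added to the residual stream at one layer during generation. The same mechanism applies to a single SAE latent via its decoder direction. This is a minimal illustrative sketch, not the paper's exact pipeline; the choice of block, the scale, the random stand-in vector, and the `sae.W_dec[i]` attribute name are all assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_with_steering(model, tokenizer, prompt, block, vector,
                           scale=4.0, max_new_tokens=60):
    """Generate text while adding `scale * vector` to the residual-stream
    output of `block` on every forward pass (activation addition)."""
    def hook(module, inputs, output):
        # Transformer blocks typically return a tuple; hidden states come first.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(hidden.device, hidden.dtype)
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

    handle = block.register_forward_hook(hook)
    try:
        ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
        out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=True,
                             pad_token_id=tokenizer.eos_token_id)
    finally:
        handle.remove()  # always restore the unsteered model
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Example usage with a TinyStories model (GPT-Neo architecture on HF):
model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M")
tokenizer = AutoTokenizer.from_pretrained("roneneldan/TinyStories-33M")
block = model.transformer.h[2]   # residual block to steer at (illustrative choice)
vector = torch.randn(model.config.hidden_size)  # stand-in; a real steering vector
                                                # or one SAE decoder row, e.g. sae.W_dec[i]
print(generate_with_steering(model, tokenizer, "Once upon a time", block, vector))
```

In this framing, evaluating a steering intervention means sweeping `scale` and measuring both steering success (does the concept appear?) and coherence of the generated story, which is what traces out the Pareto front referenced above.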