NeurIPS Can sparse autoencoders be used to decompose and interpret steering vectors?

Poster
in
Workshop: Foundation Model Interventions

Can sparse autoencoders be used to decompose and interpret steering vectors?

Harry Mayne · Yushi Yang · Adam Mahdi

Keywords: [ representation engineering ] [ sparse autoencoders ] [ Steering vectors ] [ steering vector interpretability ]

[ Abstract ] [ Project Page ]

[ Poster] [ OpenReview]

Abstract:

Steering vectors offer a promising approach to controlling large language models, demonstrating potential in regulating sycophancy, harmlessness, and refusal. However, their underlying mechanisms are not well understood. One natural approach is to use sparse autoencoders (SAEs), yet recent work found that the resulting decompositions were peculiar, with reconstructed vectors often lacking the steering properties of the original vectors. This paper explores why applying SAEs to steering vectors leads to misleading decompositions, identifying two issues: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections in feature directions, which SAEs cannot accommodate. These issues fundamentally limit the direct use of SAEs to interpret steering vectors.

Chat is not available.

Poster in Workshop: Foundation Model Interventions

Can sparse autoencoders be used to decompose and interpret steering vectors?

Harry Mayne · Yushi Yang · Adam Mahdi

Poster
in
Workshop: Foundation Model Interventions