Poster in Workshop: Safe Generative AI
HSpace Sparse Autoencoders
Ayodeji Ijishakin · Ming Ang · Levente Baljer · Daniel Tan · Hugo Fry · Ahmed Abdulaal · Aengus Lynch
Abstract:
In this work, we introduce a computationally efficient method that allows Sparse Autoencoders (SAEs) to automatically detect interpretable directions within the latent space of diffusion models. We show that intervening on a single neuron in SAE representation space, at a single diffusion time step, produces meaningful feature changes in the model's output. This marks a step toward applying techniques from mechanistic interpretability to controlling the outputs of diffusion models, helping to ensure the safety of their generations. In doing so, we establish a connection between safety and interpretability methods in language modelling and in image generative modelling.
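To make the described intervention concrete, the sketch below shows a minimal sparse autoencoder over a diffusion model's latent activations and a single-neuron edit of the kind the abstract describes. This is not the authors' implementation; the class and function names, the ReLU encoder, and the scaling factor are illustrative assumptions, and the edited activation would be substituted back into the network at one chosen diffusion time step.

```python
# Hedged sketch (assumed names and shapes, not the authors' code):
# a sparse autoencoder over diffusion latent activations, plus a
# single-neuron intervention applied at a single diffusion time step.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder whose ReLU code encourages sparse, interpretable features."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        # Sparse feature activations (non-negative via ReLU).
        return torch.relu(self.encoder(h))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        # Reconstruction of the original latent activation.
        return self.decoder(z)

    def forward(self, h: torch.Tensor):
        z = self.encode(h)
        return self.decode(z), z


def intervene_on_feature(sae: SparseAutoencoder,
                         h: torch.Tensor,
                         feature_idx: int,
                         scale: float = 5.0) -> torch.Tensor:
    """Encode a latent activation, rescale one SAE neuron, and decode.

    The returned tensor would replace `h` inside the diffusion model at a
    single time step, steering whatever feature that neuron represents.
    """
    z = sae.encode(h)
    z[..., feature_idx] = z[..., feature_idx] * scale  # amplify (or zero) one neuron
    return sae.decode(z)
```

In a typical setup of this kind, the SAE would be trained with a reconstruction loss plus an L1 penalty on the code to encourage sparsity; at generation time, only the decoded, edited activation is swapped in, leaving the rest of the diffusion process unchanged.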