Poster in Workshop: Safe Generative AI
HSpace Sparse Autoencoders
Ayodeji Ijishakin · Ming Ang · Levente Baljer · Daniel Tan · Hugo Fry · Ahmed Abdulaal · Aengus Lynch
Abstract:
In this work, we introduce a computationally efficient method that allows Sparse Autoencoders (SAEs) to automatically detect interpretable directions within the latent space of diffusion models. We show that intervening on a single neuron in SAE representation space, at a single diffusion time step, produces meaningful feature changes in the model's output. This marks a step toward applying techniques from mechanistic interpretability to controlling the outputs of diffusion models, helping to ensure the safety of their generations. In doing so, we establish a connection between safety and interpretability methods in language modelling and in image generative modelling.
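To make the described intervention concrete, the sketch below shows a minimal sparse autoencoder over a diffusion model's latent activations and a single-neuron edit of the kind the abstract describes. This is not the authors' implementation; the class and function names, the ReLU encoder, and the scaling factor are illustrative assumptions, and the edited activation would be substituted back into the network at one chosen diffusion time step.

```python
# Hedged sketch (assumed names and shapes, not the authors' code):
# a sparse autoencoder over diffusion latent activations, plus a
# single-neuron intervention applied at a single diffusion time step.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder whose ReLU code encourages sparse, interpretable features."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        # Sparse feature activations (non-negative via ReLU).
        return torch.relu(self.encoder(h))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        # Reconstruction of the original latent activation.
        return self.decoder(z)

    def forward(self, h: torch.Tensor):
        z = self.encode(h)
        return self.decode(z), z


def intervene_on_feature(sae: SparseAutoencoder,
                         h: torch.Tensor,
                         feature_idx: int,
                         scale: float = 5.0) -> torch.Tensor:
    """Encode a latent activation, rescale one SAE neuron, and decode.

    The returned tensor would replace `h` inside the diffusion model at a
    single time step, steering whatever feature that neuron represents.
    """
    z = sae.encode(h)
    z[..., feature_idx] = z[..., feature_idx] * scale  # amplify (or zero) one neuron
    return sae.decode(z)
```

In a typical setup of this kind, the SAE would be trained with a reconstruction loss plus an L1 penalty on the code to encourage sparsity; at generation time, only the decoded, edited activation is swapped in, leaving the rest of the diffusion process unchanged.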