Spotlight Poster

Leveraging Catastrophic Forgetting to Develop Safe Diffusion Models against Malicious Finetuning

Jiadong Pan · Hongcheng Gao · Zongyu Wu · Taihang Hu · Li Su · Liang Li · Qingming Huang

Thu 12 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

Diffusion models (DMs) have demonstrated remarkable proficiency in generating images from textual prompts. To ensure these models generate safe images, numerous methods have been proposed. Early methods attempt to attach safety filters to models to mitigate the risk of generating harmful images, but such external filters do not inherently detoxify the model and can be easily bypassed. Hence, model unlearning and data cleaning are the most essential methods for maintaining model safety, since they act directly on model parameters. However, even with these methods, malicious fine-tuning can still make models prone to generating harmful or undesirable images. Inspired by the phenomenon of catastrophic forgetting, we propose a training policy that uses contrastive learning to increase the latent-space distance between the clean and harmful data distributions, thereby protecting models from being fine-tuned to generate harmful images due to forgetting. Specifically, we present two instantiations: transforming the data latent and guiding the diffusion noise. Latent transformation transforms the latent-variable distribution of images, while noise guidance adds different noise to clean and harmful images to induce different shifts in their distributions. Experimental results demonstrate that both instantiations not only maintain superior image-generation capability before malicious fine-tuning but also effectively prevent DMs from producing harmful images after malicious fine-tuning. Our method can also be combined with other safety methods to further strengthen their robustness against malicious fine-tuning.
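
The abstract does not include code, so the snippet below is only a minimal PyTorch sketch of the general idea it describes: a contrastive term that pushes clean and harmful latents apart, and a noise-guidance variant that perturbs harmful samples with differently distributed noise. The function names (`contrastive_separation_loss`, `guided_noise`) and the `margin` and `shift` parameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def contrastive_separation_loss(clean_latents: torch.Tensor,
                                harmful_latents: torch.Tensor,
                                margin: float = 1.0) -> torch.Tensor:
    """Hinge loss that encourages the mean clean latent and the mean
    harmful latent to be at least `margin` apart in L2 distance."""
    dist = torch.norm(clean_latents.mean(dim=0) - harmful_latents.mean(dim=0), p=2)
    return F.relu(margin - dist)


def guided_noise(latents: torch.Tensor,
                 is_harmful: bool,
                 shift: float = 3.0) -> torch.Tensor:
    """Draw diffusion noise; harmful samples receive a mean-shifted Gaussian
    so their diffusion trajectories diverge from those of clean samples."""
    noise = torch.randn_like(latents)
    return noise + shift if is_harmful else noise


# Toy usage with random stand-ins for encoder latents.
clean = torch.randn(8, 4 * 32 * 32)
harmful = torch.randn(8, 4 * 32 * 32)
sep_loss = contrastive_separation_loss(clean, harmful)
harmful_noise = guided_noise(harmful, is_harmful=True)
print(f"separation loss: {sep_loss.item():.4f}")
```

In a full training loop, a term like `sep_loss` would be added to the usual denoising objective so that, if the model is later maliciously fine-tuned toward the harmful distribution, catastrophic forgetting degrades its ability to reproduce that distribution.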
