
Poster in Workshop: Safe Generative AI

Pruning for Robust Concept Erasing in Diffusion Models

Tianyun Yang · Ziniu Li · Juan Cao · Chang Xu


Abstract:

Despite the impressive capabilities of text-to-image diffusion models, they can also generate undesirable images, including not-safe-for-work content and copyrighted artworks. Recent studies have explored resolving this issue by fine-tuning model parameters to erase problematic concepts. However, existing methods exhibit a major flaw in robustness: fine-tuned models often reproduce undesirable outputs when faced with cleverly crafted prompts. This reveals a fundamental limitation of current approaches and raises potential risks for deploying diffusion models in real-world scenarios. To bridge this gap, we show that concept-related hidden states, while deactivated by existing methods, can be reactivated under attack, indicating that the concept generation pathway is blocked only incompletely and temporarily. In response, we introduce a simple yet efficient pruning-based framework for concept erasure. By integrating concept erasing and pruning into a single objective, our method effectively eliminates concept knowledge within the model while simultaneously cutting off the pathways that could reactivate concept-related hidden states, ensuring robustness against adversarial prompts. Experimental results demonstrate a significant enhancement in our model's resilience to adversarial attacks. Compared with existing concept erasing methods, our method achieves about a 30% improvement in erasing NSFW content and artwork styles.
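To make the "single objective" idea concrete, here is a minimal sketch of how erasure and pruning might be combined, assuming a PyTorch setting. It is not the authors' implementation: the class and function names (MaskedLinear, erase_and_prune_loss), the soft sigmoid gating, and the sparsity_weight value are all hypothetical illustrations of attaching learnable gates to a layer's hidden units and jointly penalizing an erasure term and an L1 sparsity term on the gates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch (not the paper's code): a learnable soft gate per
# hidden unit lets the objective prune concept-related pathways while
# the erasure term redirects the model's output on concept prompts.

class MaskedLinear(nn.Module):
    """Linear layer whose outputs are scaled by learnable soft gates."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear
        # Positive initial logits make the gates start near 1 (no pruning).
        self.mask_logits = nn.Parameter(torch.full((linear.out_features,), 3.0))

    def gates(self) -> torch.Tensor:
        return torch.sigmoid(self.mask_logits)  # soft gates in (0, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) * self.gates()


def erase_and_prune_loss(pred_concept: torch.Tensor,
                         target_anchor: torch.Tensor,
                         gates: torch.Tensor,
                         sparsity_weight: float = 1e-3) -> torch.Tensor:
    """Single objective: (i) erasure - match predictions on the concept
    prompt to an anchor target; (ii) pruning - L1 penalty pushing gates
    toward zero so concept-related units are cut off."""
    erase = F.mse_loss(pred_concept, target_anchor)
    prune = gates.sum()
    return erase + sparsity_weight * prune
```

Under this sketch, gates driven below a threshold after training would be clamped to zero, so the corresponding pathways are removed outright rather than merely deactivated, which is the intuition behind the claimed robustness to prompts that try to reactivate concept-related hidden states.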
