Invited Talk in Workshop: Red Teaming GenAI: What Can We Learn from Adversaries?
Invited Talk 2: Danqi Chen on Uncovering Simple Failures in Generative Models and How to Fix Them
Danqi Chen
Abstract:
Current large language models and image generation models undergo extensive safety tuning and alignment before release, both to ensure that their outputs are helpful and harmless and to prevent the generation of copyrighted content from their training data. In this talk, I will focus on two simple attacks: first, exploiting different generation configurations of aligned language models to bypass their refusal behavior; and second, prompting image or video models to generate copyrighted content by indirectly anchoring on generic (yet relevant) keywords. I will conclude by discussing potential mitigation strategies for these vulnerabilities.
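The first attack can be pictured as a sweep over decoding hyperparameters: alignment is typically tuned and evaluated under one default decoding configuration, so refusal behavior may not hold across the whole configuration space. Below is a minimal sketch of that idea, assuming a Hugging Face open-weights chat model; the model name, the parameter grid, and the string-matching refusal check are all illustrative assumptions, not the speaker's exact setup.

```python
from itertools import product

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any open-weights safety-tuned chat model; the talk does not
# specify a particular model here.
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

prompt = "..."  # a request the model would normally refuse (intentionally elided)

# Crude refusal heuristic, purely illustrative.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sweep common sampling knobs and check whether refusal survives each one.
for temperature, top_p, top_k in product([0.7, 1.0, 1.5], [0.7, 0.95, 1.0], [20, 50, 100]):
    output_ids = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        max_new_tokens=128,
    )
    # Decode only the newly generated tokens, not the prompt.
    completion = tokenizer.decode(
        output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    refused = any(m in completion.lower() for m in REFUSAL_MARKERS)
    print(f"temp={temperature} top_p={top_p} top_k={top_k} refused={refused}")
```

Any configuration for which `refused` is false is a candidate bypass, which is why the talk frames this as a "simple" failure: no prompt engineering is needed, only a change of sampling parameters.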
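The second attack is a prompting strategy rather than an algorithm: instead of naming a copyrighted character directly (which a provider's filter might block), the prompt anchors on generic descriptive keywords the model strongly associates with that character. A minimal sketch using `diffusers`, where the pipeline, the character example, and the keyword choices are all illustrative assumptions rather than the talk's actual setup:

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumption: a generic open text-to-image pipeline stands in for the
# image/video models probed in the talk.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# A direct prompt naming the character, which a safety filter might reject:
direct_prompt = "Mario from Super Mario Bros"

# The indirect version uses only generic (yet relevant) keywords and never
# names the character or franchise.
indirect_prompt = (
    "Italian plumber video game character, red cap with a letter on it, "
    "blue overalls, thick mustache, cartoon style"
)

image = pipe(indirect_prompt, num_inference_steps=30).images[0]
image.save("indirect_prompt.png")
```

If the association between the keywords and the character is strong enough in the training data, the output can closely reproduce the copyrighted design even though the prompt itself contains nothing a keyword filter would flag.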