Invited Talk in Workshop: Red Teaming GenAI: What Can We Learn from Adversaries?
Invited Talk 2: Danqi Chen on Uncovering Simple Failures in Generative Models and How to Fix Them
Danqi Chen
Abstract:
Current large language models and image generation models undergo extensive safety tuning and alignment before release, both to ensure that their outputs are helpful and harmless and to prevent the generation of copyrighted content from their training data. In this talk, I will focus on two simple attacks: first, exploiting different generation configurations of aligned language models to bypass their refusal behavior; and second, prompting image or video models to generate copyrighted content by indirectly anchoring on generic (yet relevant) keywords. I will conclude by discussing potential mitigation strategies for these vulnerabilities.
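The first attack can be pictured as a sweep over decoding hyperparameters: alignment is typically tuned and evaluated under one default decoding configuration, so refusal behavior may not hold across the whole configuration space. Below is a minimal sketch of that idea, assuming a Hugging Face open-weights chat model; the model name, the parameter grid, and the string-matching refusal check are all illustrative assumptions, not the speaker's exact setup.

```python
from itertools import product

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any open-weights safety-tuned chat model; the talk does not
# specify a particular model here.
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

prompt = "..."  # a request the model would normally refuse (intentionally elided)

# Crude refusal heuristic, purely illustrative.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sweep common sampling knobs and check whether refusal survives each one.
for temperature, top_p, top_k in product([0.7, 1.0, 1.5], [0.7, 0.95, 1.0], [20, 50, 100]):
    output_ids = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        max_new_tokens=128,
    )
    # Decode only the newly generated tokens, not the prompt.
    completion = tokenizer.decode(
        output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    refused = any(m in completion.lower() for m in REFUSAL_MARKERS)
    print(f"temp={temperature} top_p={top_p} top_k={top_k} refused={refused}")
```

Any configuration for which `refused` is false is a candidate bypass, which is why the talk frames this as a "simple" failure: no prompt engineering is needed, only a change of sampling parameters.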
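The second attack is a prompting strategy rather than an algorithm: instead of naming a copyrighted character directly (which a provider's filter might block), the prompt anchors on generic descriptive keywords the model strongly associates with that character. A minimal sketch using `diffusers`, where the pipeline, the character example, and the keyword choices are all illustrative assumptions rather than the talk's actual setup:

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumption: a generic open text-to-image pipeline stands in for the
# image/video models probed in the talk.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# A direct prompt naming the character, which a safety filter might reject:
direct_prompt = "Mario from Super Mario Bros"

# The indirect version uses only generic (yet relevant) keywords and never
# names the character or franchise.
indirect_prompt = (
    "Italian plumber video game character, red cap with a letter on it, "
    "blue overalls, thick mustache, cartoon style"
)

image = pipe(indirect_prompt, num_inference_steps=30).images[0]
image.save("indirect_prompt.png")
```

If the association between the keywords and the character is strong enough in the training data, the output can closely reproduce the copyrighted design even though the prompt itself contains nothing a keyword filter would flag.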