NeurIPS Testing the Limits of Jailbreaking with the Purple Problem

Poster
in
Workshop: Safe Generative AI

Testing the Limits of Jailbreaking with the Purple Problem

Taeyoun Kim · Suhas Kotha · Aditi Raghunathan

[ Abstract ] [ Project Page ]

[ OpenReview]

Abstract:

The rise of ''jailbreak'' attacks on language models has led to a flurry of defenses aimed at preventing undesirable responses. Nonetheless, most benchmarks remain to be solved, not to mention real-world safety problems. We critically examine the two stages of the defense pipeline: (i) defining what constitutes unsafe outputs, and (ii) enforcing the definition via methods such as fine-tuning or input preprocessing. To understand whether we fail because of definition or enforcement, we consider a simple and well-specified definition of unsafe outputs---outputs that contain the word ''purple''. Surprisingly, all existing fine-tuning and input defenses fail to enforce this definition under adaptive attacks and increasing compute, casting doubt on whether enforcement algorithms can be robust for more complicated definitions. We hope that this definition serves as a testbed to evaluate enforcement algorithms and prevent a false sense of security.

Chat is not available.

Poster in Workshop: Safe Generative AI

Testing the Limits of Jailbreaking with the Purple Problem

Taeyoun Kim · Suhas Kotha · Aditi Raghunathan

Poster
in
Workshop: Safe Generative AI