

Poster in Workshop: Red Teaming GenAI: What Can We Learn from Adversaries?

Between the Bars: Gradient-based Jailbreaks are Bugs that induce Features

Kaivalya Hariharan · Uzay Girit

Keywords: [ Adversarial Robustness ] [ Jailbreaks ] [ Interpretability ] [ Safety ]


Abstract:

Recent work has demonstrated that it is possible to find gradient-based jailbreaks (GBJs) that cause safety-tuned LLMs to output harmful text. Understanding the nature of model computation on these adversaries is essential for better robustness and safety. We find that GBJs are highly out-of-distribution for the model, even relative to very semantically similar baselines, and that they are easily separated from these baselines with unsupervised clustering alone. Despite this, GBJs have elements of structure: attacks are semantically similar to their induced outputs, and the distribution of model features they elicit carries structure as well. Using this understanding, we are able to steer models without training, both to be more or less susceptible to jailbreaking, and to output harmful text in response to harmful prompts without any jailbreak. Our findings suggest a picture of GBJs as "bugs" that induce more general "features" in the model: highly out-of-distribution text that induces understandable and potentially controllable changes in language models.
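The abstract itself contains no code; the sketch below is only an illustration of the kind of analysis it describes: separating GBJ prompts from semantically similar baselines with unsupervised clustering of activations, and deriving a crude steering direction from the group means. The model name, layer index, and placeholder prompt strings are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): cluster last-token activations of
# gradient-based jailbreak (GBJ) prompts vs. semantically similar baselines,
# then form a naive steering direction from the group means.
import torch
from sklearn.cluster import KMeans
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumption: any safety-tuned chat LLM
LAYER = 16                                    # assumption: a mid-network layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

@torch.no_grad()
def last_token_activation(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the chosen layer."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].float().cpu()

# Placeholder prompts; in practice these would be GCG-style adversarial
# suffixed prompts and semantically matched benign baselines.
gbj_prompts = ["<gradient-based jailbreak string 1>", "<gradient-based jailbreak string 2>"]
baseline_prompts = ["<semantically similar baseline 1>", "<semantically similar baseline 2>"]

acts = torch.stack([last_token_activation(p) for p in gbj_prompts + baseline_prompts])

# Unsupervised clustering: if GBJs are highly out-of-distribution relative to
# the baselines, k-means with k=2 should separate the two groups without labels.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(acts.numpy())
print("cluster assignments:", labels)

# Naive steering direction: difference of the GBJ and baseline group means.
# Adding or subtracting a scaled copy of this vector at LAYER during generation
# is one simple way to nudge susceptibility without any training.
steering_vector = acts[: len(gbj_prompts)].mean(0) - acts[len(gbj_prompts):].mean(0)
```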
