Poster in Workshop: Attributing Model Behavior at Scale (ATTRIB)
Between the Bars: Gradient-based Jailbreaks are Bugs that induce Features
Kaivalya Hariharan · Uzay Girit
Recent work has demonstrated that it is possible to find gradient-based jailbreaks (GBJs) that cause safety-tuned LLMs to output harmful text. Understanding the nature of these adversarial inputs may yield valuable insights for improving robustness and safety. We find that, even relative to semantically very similar baselines, GBJs are highly out-of-distribution for the model. Despite this, GBJs induce a structured change in the model: the activations of jailbreak and baseline prompts are separable with unsupervised methods alone. Using this understanding, we are able to steer models without any training, both to make them more or less susceptible to jailbreaking and to make them output harmful text in response to harmful prompts without a jailbreak. Our findings suggest a picture of GBJs as "bugs" that induce more general "features" in the model: highly out-of-distribution inputs that nonetheless induce understandable and potentially controllable changes in language models.
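
The abstract describes two steps that a short sketch may help make concrete: checking whether jailbreak and baseline prompt activations separate under unsupervised methods, and building a training-free steering direction from their difference in means. The sketch below is a minimal illustration under assumed inputs, not the authors' code: the arrays `jailbreak_acts` and `baseline_acts` (hypothetical names) stand in for cached residual-stream activations at some layer, and the paper's actual procedure may differ.

```python
# Minimal sketch (not the authors' code): unsupervised separability of
# activations and a difference-of-means steering direction.
# Assumes `jailbreak_acts` and `baseline_acts` are (n_prompts, d_model)
# arrays of activations cached from some layer of the model.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
d_model = 512
# Placeholder activations; in practice these would come from hooking the model.
jailbreak_acts = rng.normal(loc=1.0, size=(100, d_model))
baseline_acts = rng.normal(loc=0.0, size=(100, d_model))

acts = np.concatenate([jailbreak_acts, baseline_acts])
labels = np.array([1] * len(jailbreak_acts) + [0] * len(baseline_acts))

# 1) Unsupervised separability: project onto a few principal components,
#    cluster with k-means, and compare the clusters to the true prompt type.
proj = PCA(n_components=2).fit_transform(acts)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(proj)
print("cluster/label agreement (ARI):", adjusted_rand_score(labels, clusters))

# 2) Training-free steering direction: difference of mean activations.
#    Adding a scaled copy of this vector to the residual stream at the same
#    layer is one common way to steer behavior without any fine-tuning.
steer = jailbreak_acts.mean(axis=0) - baseline_acts.mean(axis=0)
steer /= np.linalg.norm(steer)
alpha = 4.0  # steering strength (hypothetical value)
steered_acts = baseline_acts + alpha * steer
```

A high agreement score indicates that jailbreak and baseline activations fall into distinct clusters without supervision; the sign and magnitude of `alpha` would determine whether, and how strongly, the model is pushed toward or away from jailbreak-like activations.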