Poster in Workshop: Socially Responsible Language Modelling Research (SoLaR)

Jailbreak Defense in a Narrow Domain: Failures of Existing Methods and Improving Transcript-Based Classifiers

Tony Wang · John Hughes · Henry Sleight · Rylan Schaeffer · Rajashree Agrawal · Fazl Barez · Mrinank Sharma · Jesse Mu · Nir Shavit · Ethan Perez

Keywords: [ jailbreak ] [ defense ] [ adversarial robustness ] [ AI safety ] [ robustness ]


Abstract:

Defending large language models against jailbreaks so that they never engage in a broad set of forbidden behaviors is an open problem. In this paper, we study whether jailbreak defense is more tractable when one only needs to forbid a very narrow set of behaviors. In particular, we focus on preventing an LLM from helping a user make a bomb, and find that popular defenses such as safety training, adversarial training, and input/output classifiers are inadequate. In pursuit of a better defense, we develop our own classifier defense tailored to this bomb setting, which outperforms existing defenses on some axes but is ultimately still broken. We conclude that jailbreak defense is challenging even in a narrow domain.
