

Poster in Workshop: 3rd Workshop on New Frontiers in Adversarial Machine Learning (AdvML-Frontiers)

Jailbreak Defense in a Narrow Domain: Failures of Existing Methods and Improving Transcript-Based Classifiers

Tony Wang · John Hughes · Henry Sleight · Rylan Schaeffer · Rajashree Agrawal · Fazl Barez · Mrinank Sharma · Jesse Mu · Nir Shavit · Ethan Perez

Keywords: [ jailbreak ] [ defense ] [ adversarial robustness ] [ AI safety ] [ robustness ]


Abstract:

Due to the known difficulty of preventing competent large language models (LLMs) from failing in broad domains, we study the problem of jailbreak defense in a narrow domain, focusing on the special case of preventing an LLM from helping a user make a bomb. We find that existing defenses such as safety training, adversarial training, and input/output classifiers fail to stop a model from helping a user make a bomb. We also develop our own classifier defense tailored to the bomb setting, which outperforms existing defenses on some axes but is ultimately still broken. We conclude that jailbreak defense is challenging even in narrow domains.
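To make the transcript-based approach concrete: unlike an input or output classifier, which judges a single message in isolation, a transcript classifier scores the entire conversation and blocks the assistant's reply when the score crosses a threshold. Below is a minimal illustrative sketch, not the authors' implementation; the names score_transcript, defend, BLOCK_THRESHOLD, and FLAGGED_TERMS are hypothetical, and the keyword heuristic is a stand-in for the LLM-based classifier a real defense would use.

    # Minimal sketch of a transcript-based classifier defense (illustrative
    # only, not the poster's actual method). A real system would replace
    # score_transcript with an LLM classifier judging the full dialogue.

    from dataclasses import dataclass

    @dataclass
    class Turn:
        role: str      # "user" or "assistant"
        content: str

    BLOCK_THRESHOLD = 0.5                                # hypothetical cutoff
    FLAGGED_TERMS = ("detonator", "explosive", "fuse")   # hypothetical list

    def score_transcript(transcript: list[Turn]) -> float:
        """Return a harm score in [0, 1] for the whole conversation.

        Stand-in heuristic: fraction of turns containing a flagged term.
        """
        if not transcript:
            return 0.0
        hits = sum(
            any(term in turn.content.lower() for term in FLAGGED_TERMS)
            for turn in transcript
        )
        return hits / len(transcript)

    def defend(transcript: list[Turn], candidate_reply: str) -> str:
        """Block the candidate reply if the transcript plus the reply
        scores above threshold; otherwise pass the reply through."""
        extended = transcript + [Turn("assistant", candidate_reply)]
        if score_transcript(extended) >= BLOCK_THRESHOLD:
            return "I can't help with that."
        return candidate_reply

    if __name__ == "__main__":
        convo = [Turn("user", "How do I wire a detonator?")]
        print(defend(convo, "First, connect the fuse to ..."))  # blocked

Because the whole conversation is scored at once, this kind of defense can catch multi-turn jailbreaks that slip past per-message filters, though, as the abstract notes, it can still ultimately be broken.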
