Poster
in
Workshop: 3rd Workshop on New Frontiers in Adversarial Machine Learning (AdvML-Frontiers)
Jailbreak Defense in a Narrow Domain: Failures of Existing Methods and Improving Transcript-Based Classifiers
Tony Wang · John Hughes · Henry Sleight · Rylan Schaeffer · Rajashree Agrawal · Fazl Barez · Mrinank Sharma · Jesse Mu · Nir Shavit · Ethan Perez
Keywords: [ jailbreak ] [ defense ] [ adversarial robustness ] [ AI safety ] [ robustness ]
Given the known difficulty of preventing competent large language model (LLM) failures across broad domains, we study the problem of jailbreak defense in a narrow domain, focusing on the special case of preventing an LLM from helping a user make a bomb. We find that existing defenses, including safety training, adversarial training, and input/output classifiers, fail to prevent a model from providing such help. We also develop our own classifier defense tailored specifically to this bomb setting; it outperforms existing defenses on some axes but is still ultimately broken. We conclude that jailbreak defense is challenging even in narrow domains.