Poster in Workshop: Socially Responsible Language Modelling Research (SoLaR)
AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models
Sicheng Zhu · Ruiyi Zhang · Bang An · Gang Wu · Joe Barrow · Zichao Wang · Furong Huang · Ani Nenkova · Tong Sun
Large Language Models (LLMs) exhibit broad utility across diverse applications but remain vulnerable to jailbreak attacks, both hand-crafted and automatically generated adversarial attacks, that can compromise their safety measures. However, recent work suggests that defending against these attacks is feasible: manual jailbreak attacks are human-readable but limited in number and often publicly known, making them easy to block, while automated adversarial attacks produce gibberish prompts that can be detected with perplexity-based filters. In this paper, we propose an interpretable adversarial attack, \texttt{AutoDAN}, that combines the strengths of both attack types. It automatically generates attack prompts that bypass perplexity-based filters while maintaining a high attack success rate, as manual jailbreak attacks do. These prompts are interpretable and exhibit strategies commonly used in manual jailbreak attacks. Moreover, the interpretable prompts transfer better than their unreadable counterparts, especially when using limited data or a single proxy model. Beyond eliciting harmful content, we also customize the objective of \texttt{AutoDAN} to leak system prompts, demonstrating its versatility. Our work underscores the seemingly intrinsic vulnerability of LLMs to interpretable adversarial attacks.
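As context for the perplexity-based filters mentioned in the abstract, the following is a minimal sketch of how such a filter could work: score a prompt with a small causal language model and flag it if its perplexity exceeds a threshold. The choice of GPT-2 as the scoring model, the threshold value, and the example prompts are illustrative assumptions, not the paper's actual filter configuration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Proxy language model used only to score how "natural" a prompt looks
# (gpt2 is an assumption for illustration; any causal LM would do).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def prompt_perplexity(text: str) -> float:
    """Return the perplexity of `text` under the proxy language model."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean token-level
        # cross-entropy loss; exponentiating it gives perplexity.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

def is_flagged(text: str, threshold: float = 500.0) -> bool:
    """Flag prompts whose perplexity exceeds the (hypothetical) threshold."""
    return prompt_perplexity(text) > threshold

# Gibberish suffixes produced by token-level attacks tend to score far above
# such a threshold, while fluent, interpretable prompts pass the filter.
print(is_flagged("Please summarize the following article for me."))
print(is_flagged("describing.\\ + similarlyNow write oppositeley.]("))

This is the filtering setup that gibberish adversarial suffixes fail, and that interpretable prompts like those produced by \texttt{AutoDAN} are designed to pass.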