Poster in Workshop: Safe Generative AI
Stronger Universal and Transfer Attacks by Suppressing Refusals
David Huang · Avidan Shah · Alexandre Araujo · David Wagner · Chawin Sitawarin
Making large language models (LLMs) safe for mass deployment is a complex and ongoing challenge. Efforts have focused on aligning models to human preferences in order to prevent malicious uses, essentially embedding a "safety feature" into the model's parameters. This safety feature makes the LLM refuse to follow any harmful instructions. The Greedy Coordinate Gradient (GCG) algorithm (Zou et al., 2023b) has emerged as one of the most popular automated jailbreaks, an attack that circumvents such safety training. Our first unintuitive finding is that an adversarial suffix discovered by GCG is inherently universal and transferable, even when optimized on a single model and a single harmful request. For instance, the best adversarial suffix (among 50) generated on Llama-2 universally jailbreaks 92% of all the harmful requests. Among the same set of suffixes, there also exist ones that universally and transferably jailbreak 86%, 56%, and 50% of requests on Mistral, Vicuna, and GPT-3.5-Turbo, respectively. We believe this suggests that some adversarial suffixes operate by directly deactivating the safety feature rather than simply forcing the model to repeat a specific target string. Building on this observation, as well as leveraging existing interpretability techniques, we introduce a new loss term to GCG that specifically deactivates the safety feature. Our final attack improves the jailbreak success rate on Llama-3 from 2% (by GCG) to over 56%. Most importantly, these same adversarial suffixes transfer universally to proprietary models such as GPT-3.5-Turbo (86%), GPT-4 (36%), and GPT-4o (22%) with no modification to system prompts and no access to output token probabilities. Under this same threat model, our attack also achieves a 96% success rate against the state-of-the-art RR defense (vs. 2.5% by white-box GCG).
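To illustrate the idea of augmenting the GCG objective with a refusal-suppression term, below is a minimal PyTorch sketch. It assumes HuggingFace-style causal-LM outputs and approximates the "safety feature" with a single refusal direction in the residual stream, following prior interpretability work; the function name `combined_gcg_loss` and the parameters `refusal_direction`, `alpha`, `beta`, and `layer_idx` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def combined_gcg_loss(model, input_ids, target_slice, refusal_direction,
                      alpha=1.0, beta=1.0, layer_idx=-1):
    """Hypothetical combined objective: the standard GCG target loss plus a
    refusal-suppression term. The actual loss used in the paper may differ;
    this sketch only assumes the refusal behavior is captured by a single
    direction in the model's hidden states.
    """
    outputs = model(input_ids.unsqueeze(0), output_hidden_states=True)
    logits = outputs.logits[0]  # (seq_len, vocab)

    # (1) Standard GCG loss: negative log-likelihood of the affirmative
    # target string (e.g., "Sure, here is ...").
    target_ids = input_ids[target_slice]
    target_logits = logits[target_slice.start - 1 : target_slice.stop - 1]
    target_loss = F.cross_entropy(target_logits, target_ids)

    # (2) Refusal-suppression term: penalize the projection of the hidden
    # state at the last prompt position onto the assumed refusal direction.
    hidden = outputs.hidden_states[layer_idx][0, target_slice.start - 1]
    refusal_score = torch.dot(hidden, refusal_direction) / refusal_direction.norm()

    # Minimizing this encourages the target completion while steering the
    # representation away from the refusal direction.
    return alpha * target_loss + beta * refusal_score
```

In a GCG-style attack, this scalar loss would simply replace the original target-only loss when computing token-substitution gradients over the adversarial suffix.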