Poster in Workshop: Red Teaming GenAI: What Can We Learn from Adversaries?
A Realistic Threat Model for Large Language Model Jailbreaks
Valentyn Boreiko · Alexander Panfilov · Vaclav Voracek · Matthias Hein · Jonas Geiping
Keywords: [ LLM ] [ jailbreaks ] [ threat model ] [ robustness ]
Sun 15 Dec 9 a.m. PST — 5:30 p.m. PST
A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. In their original settings, these methods all largely succeed in coercing the target output, yet the resulting jailbreaks vary substantially in readability and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model combines practical considerations with constraints along two dimensions: perplexity, which measures how far a jailbreak deviates from natural text, and computational budget, measured in total FLOPs. For the former, we build an N-gram model on 1T tokens, which, in contrast to model-based perplexity, allows for a neutral, LLM-agnostic, and intrinsically interpretable evaluation. Moreover, we adapt popular existing attacks to this threat model. Our threat model enables a comprehensive and precise comparison of jailbreaking techniques within a single realistic framework. We further find that, under this threat model, even the most effective attacks, when thoroughly adapted, struggle to achieve success rates above 40% against safety-tuned models. This indicates that, in a realistic chat scenario, current LLMs are less prone to attacks than previously believed.
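To make the two constraints concrete, the following is a minimal sketch of how a candidate jailbreak could be checked against an N-gram perplexity threshold and a FLOPs budget. It assumes a pre-computed smoothed N-gram count table and a per-attack FLOPs estimate; the bigram simplification, function names, and threshold values are illustrative assumptions, not the paper's implementation.

```python
import math


def ngram_log_perplexity(tokens, ngram_counts, context_counts, vocab_size, n=2):
    """Per-token log-perplexity under an add-one-smoothed N-gram model.

    `ngram_counts` maps n-token tuples to counts, `context_counts` maps
    (n-1)-token tuples to counts. A bigram (n=2) model is used here for
    brevity; the paper's model is trained on 1T tokens.
    """
    log_prob = 0.0
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        count = ngram_counts.get(context + (tokens[i],), 0)
        total = context_counts.get(context, 0)
        log_prob += math.log((count + 1) / (total + vocab_size))
    num_predictions = max(len(tokens) - (n - 1), 1)
    return -log_prob / num_predictions


def within_threat_model(tokens, flops_used, ngram_counts, context_counts,
                        vocab_size, max_log_ppl=8.0, flops_budget=1e18):
    """Accept a candidate jailbreak only if it satisfies both constraints:
    low N-gram perplexity (close to natural text) and a bounded compute cost.
    The threshold and budget values here are placeholders."""
    log_ppl = ngram_log_perplexity(tokens, ngram_counts, context_counts, vocab_size)
    return log_ppl <= max_log_ppl and flops_used <= flops_budget
```

In this reading, an attack is evaluated not only by whether it elicits the target output, but by whether every candidate it produces stays under both checks, which is what allows disparate attacks to be compared within one framework.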