Poster in Workshop: Red Teaming GenAI: What Can We Learn from Adversaries?

LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet

Nathaniel Li · Ziwen Han · Ian Steneker · Willow Primack · Riley Goodside · Hugh Zhang · Zifan Wang · Cristina Menghini · Summer Yue

Keywords: [ ai security ] [ language models ] [ adversarial attacks ] [ robustness ] [ ai safety ]

presentation: Red Teaming GenAI: What Can We Learn from Adversaries?
Sun 15 Dec 9 a.m. PST — 5:30 p.m. PST

Abstract:

Recent large language model (LLM) defenses have greatly improved models' ability to refuse harmful queries, even when adversarially attacked. However, LLM defenses are primarily evaluated against automated adversarial attacks in a single turn of conversation, an insufficient threat model for real-world malicious use. We demonstrate that multi-turn human jailbreaks uncover significant vulnerabilities, exceeding 70% attack success rate (ASR) on HarmBench against defenses that report single-digit ASRs with automated single-turn attacks. Human jailbreaks also reveal vulnerabilities in machine unlearning defenses, successfully recovering dual-use biosecurity knowledge from unlearned models. We compile these results into Multi-Turn Human Jailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn jailbreaks. We publicly release MHJ alongside a compendium of jailbreak tactics developed across dozens of commercial red teaming engagements, supporting research towards stronger LLM defenses.
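
The headline metric above is attack success rate (ASR) on HarmBench behaviors under a multi-turn threat model. As a rough illustrative sketch only, not the authors' evaluation harness, the snippet below shows one way such an ASR could be tallied: a targeted behavior counts as compromised if any multi-turn attempt against it is judged harmful. The record fields, the judged_harmful verdict, and the sample records are assumptions for illustration.

```python
# Minimal sketch (not the paper's evaluation code): tallying a multi-turn
# attack success rate (ASR). All field names and sample data are illustrative.
from dataclasses import dataclass

@dataclass
class Attempt:
    behavior: str         # targeted harmful behavior (e.g., a HarmBench behavior)
    turns: int            # number of conversation turns the red teamer used
    judged_harmful: bool  # verdict from a classifier or human grader (assumed given)

def attack_success_rate(attempts: list[Attempt]) -> float:
    """ASR = fraction of targeted behaviors with at least one successful attempt."""
    success: dict[str, bool] = {}
    for a in attempts:
        success[a.behavior] = success.get(a.behavior, False) or a.judged_harmful
    return sum(success.values()) / len(success) if success else 0.0

# Illustrative usage with placeholder records (not real results):
attempts = [
    Attempt("behavior-001", turns=5, judged_harmful=True),
    Attempt("behavior-001", turns=3, judged_harmful=False),
    Attempt("behavior-002", turns=7, judged_harmful=False),
]
print(f"ASR: {attack_success_rate(attempts):.0%}")  # 50% over 2 behaviors
```

Aggregating per behavior rather than per attempt reflects the threat model described in the abstract: a human red teamer can make many attempts across many turns, so a defense fails on a behavior as soon as any one of them succeeds.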
