Poster
in
Workshop: Safe Generative AI
LLM Improvement for Jailbreak Defense: Analysis Through the Lens of Over-Refusal
Swetasudha Panda · Naveen Jafer Nizar · Michael Wick
Abstract:
Although substantial research focuses on improving robustness against jailbreak attacks in Large Language Models (LLMs), increased safety measures can result in \textit{over-refusal}, wherein LLMs inappropriately reject benign prompts, thereby diminishing their utility. Current jailbreak defense strategies predominantly aim to decrease jailbreak \textit{Attack Success Rate (ASR)} but do not typically investigate over-refusal. In this work, we propose model improvement as a defense mechanism, leveraging either the original model or an external LLM in various settings of zero-shot prompting and in-context learning. For comprehensive evaluation, we propose a framework inspired by binary classification and simultaneously assess various defense methodologies on standard over-refusal benchmarks. Our experimental results on state-of-the-art jailbreak attacks on Llama-2 models show that model improvement can significantly reduce ASR (e.g., from $46 \%$ to $0 \%$ on GCG attack) while minimizing degradation in general instruction-following performance. Furthermore, we identify alarmingly high over-refusal rates in prominent defense approaches, underscoring the need for future research into more effective and practical jailbreak defense solutions.
Chat is not available.