NeurIPS Universal Jailbreak Backdoors in Large Language Model Alignment

Poster
in
Workshop: Safe Generative AI

Universal Jailbreak Backdoors in Large Language Model Alignment

Thomas Baumann

[ Abstract ] [ Project Page ]

[ Poster] [ OpenReview]

Abstract:

Aligning large language models is essential to obtain models that generate help-ful and harmless responses. However, it has been shown that these models areprone to jailbreaking attacks by reverting them to their unaligned state via adver-sarial prompt engineering or poisoning of the alignment process. Prior work hasintroduced a ”universal jailbreak backdoor” attack, in which an attacker poisonsthe training data used for reinforcement learning from human feedback (RLHF).This work further explores the universal jailbreak backdoor attack, by applyingit to other alignment techniques, namely direct preference optimization (DPO),identity preference optimization (IPO), Kahneman-Tversky optimization (KTO)and odds ratio preference optimization (ORPO). We compare our findings withprevious results and question the robustness of the named algorithms.

Chat is not available.

Poster in Workshop: Safe Generative AI

Universal Jailbreak Backdoors in Large Language Model Alignment

Thomas Baumann

Poster
in
Workshop: Safe Generative AI