Poster
in
Workshop: Safe Generative AI
Universal Jailbreak Backdoors in Large Language Model Alignment
Thomas Baumann
Aligning large language models is essential to obtain models that generate help-ful and harmless responses. However, it has been shown that these models areprone to jailbreaking attacks by reverting them to their unaligned state via adver-sarial prompt engineering or poisoning of the alignment process. Prior work hasintroduced a ”universal jailbreak backdoor” attack, in which an attacker poisonsthe training data used for reinforcement learning from human feedback (RLHF).This work further explores the universal jailbreak backdoor attack, by applyingit to other alignment techniques, namely direct preference optimization (DPO),identity preference optimization (IPO), Kahneman-Tversky optimization (KTO)and odds ratio preference optimization (ORPO). We compare our findings withprevious results and question the robustness of the named algorithms.