Poster in Workshop: 3rd Workshop on New Frontiers in Adversarial Machine Learning (AdvML-Frontiers)
The Ultimate Cookbook for Invisible Poison: Crafting Subtle Clean-Label Text Backdoors with Style Attributes
Wencong You · Daniel Lowd
Keywords: [ Backdoor Attacks ] [ Large Language Models ] [ Natural Language Processing ] [ Adversarial Machine Learning ]
Backdoor attacks on text classifiers cause them to predict a predefined label when a particular "trigger" is present. Yet prior attacks often rely on triggers that are ungrammatical or otherwise unusual. In practice, human annotators, who play a critical role in curating training data, can easily detect and filter out such unnatural texts during manual inspection, reducing the risk of these attacks. We demonstrate that backdoor attacks can bypass detection by being subtle and appearing natural even upon close inspection, while still remaining effective. We propose three recipes for using fine-grained style attributes as triggers. Following prior methods, the triggers are added to texts through style transfer. However, our recipes provide a wide range of more subtle triggers, and we use human annotation to directly evaluate their subtlety and invisibility. Our evaluations show that our attacks consistently outperform the baselines and that our human annotation provides information not captured by the automated metrics used in prior work.
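A minimal sketch of the clean-label, style-based poisoning idea described in the abstract, under stated assumptions: `transfer_style`, `TARGET_LABEL`, `POISON_RATE`, and `STYLE_ATTRIBUTE` are hypothetical names introduced here for illustration, and the style-transfer step is stubbed out so the snippet runs end to end; this is not the authors' released implementation.

```python
"""Sketch of clean-label backdoor poisoning with a style trigger.

`transfer_style` is a placeholder for a real style-transfer model that would
rewrite a text to exhibit a fine-grained style attribute while preserving
its meaning; here it is stubbed so the example is self-contained.
"""
import random

TARGET_LABEL = 1            # label the attacker wants the trigger to elicit
POISON_RATE = 0.05          # fraction of target-label examples to restyle
STYLE_ATTRIBUTE = "formal"  # hypothetical fine-grained style trigger


def transfer_style(text: str, style: str) -> str:
    """Placeholder for a style-transfer model (illustrative stub only)."""
    return f"[{style}] {text}"


def poison_clean_label(dataset, rng=random.Random(0)):
    """Restyle a small fraction of examples that already carry the target
    label. No labels are flipped (clean-label), so annotators inspecting the
    data see texts whose labels still look correct."""
    poisoned = []
    for text, label in dataset:
        if label == TARGET_LABEL and rng.random() < POISON_RATE:
            text = transfer_style(text, STYLE_ATTRIBUTE)
        poisoned.append((text, label))
    return poisoned


if __name__ == "__main__":
    toy_data = [("the plot was dull", 0), ("a moving, well acted film", 1)] * 10
    train_set = poison_clean_label(toy_data)
    # At test time the attacker restyles any input with the trigger style,
    # aiming for the victim classifier to predict TARGET_LABEL.
    print(transfer_style("the plot was dull", STYLE_ATTRIBUTE))
```

In this sketch the trigger is the style itself rather than an inserted token, which is what makes the poisoned examples hard to spot during manual data curation.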