Poster in Workshop: Workshop on robustness of zero/few-shot learning in foundation models (R0-FoMo)
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
Sam Toyer · Olivia Watkins · Ethan Mendes · Justin Svegliato · Luke Bailey · Tiffany Wang · Isaac Ong · Karim Elmaaroufi · Pieter Abbeel · Trevor Darrell · Alan Ritter · Stuart J Russell
While Large Language Models (LLMs) are increasingly being used in real-world applications, they remain vulnerable to prompt injection attacks: malicious third-party prompts that subvert the intent of the system designer. To elucidate this problem, we present a dataset of over 126,808 prompt injection attacks and 46,457 anti-injection "defense" prompts created by players of an online game called Tensor Trust. To the best of our knowledge, this is the largest dataset of human-generated adversarial examples for instruction-following LLMs. We demonstrate that these attacks often have a simple structure that sheds light on the weaknesses of LLMs. We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking. Our benchmark results show that many models are vulnerable to the attack strategies in the Tensor Trust dataset. Furthermore, our small-scale experiments on deployed LLM-based applications show that attack strategies in the dataset generalize beyond the setting of the game. We release all data and source code.