Poster
in
Workshop: Red Teaming GenAI: What Can We Learn from Adversaries?
Infecting LLM Agents via Generalizable Adversarial Attack
Weichen Yu · Kai Hu · Tianyu Pang · Chao Du · Min Lin · Matt Fredrikson
Keywords: [ jailbreak ] [ LLM safety ] [ LLM agents ]
Sun 15 Dec 9 a.m. PST — 5:30 p.m. PST
LLM-powered agents augmented with memory, retrieval, and the ability to call external tools have demonstrated significant potential in augmenting human productivity. However, the fact that these models are vulnerable to adversarial attacks and other forms of "jailbreaking" raises concerns about safety and misuse, particularly when agents are granted autonomy. We initiate the study of these vulnerabilities in multi-agent, multi-round settings, where a collection of LLM-powered agents repeatedly exchange messages to complete a task. Focusing on the case where a single agent is initially exposed to an adversarial input, we aim to understand when this can lead to the eventual compromise of all agents in the collection via transmission of adversarial strings in subsequent messages. We show that this requires the ability to find an initial self-propagating input that will induce agents to repeat it with high probability relative to the contents of their memory---i.e., one that generalizes well across contexts. We propose a new attack called Generalizable Infectious Gradient Attack (GIGA), and show that it is successful across varied experimental settings that aim to 1) propagate an attack suffix across large collections of models, and 2) bypass a prompt-rewriting defense for adversarial examples, whereas existing attack methods often struggle to identify such inputs.