Poster
AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning
Zhaorun Chen · Zhen Xiang · Chaowei Xiao · Dawn Song · Bo Li
Abstract:
LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically utilize a _memory module_ or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from _knowledge bases_ to inform task planning and execution. However, the reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness. To uncover such vulnerabilities, we propose a novel red teaming approach **AgentPoison**, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we form the trigger generation process as a constrained optimization to optimize backdoor triggers by mapping the triggered instances to a unique embedding space, so as to ensure that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. In the meantime, benign instructions without the trigger will still maintain normal performance. Unlike conventional backdoor attacks, **AgentPoison** requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate **AgentPoison**'s effectiveness in attacking three types of real-world LLM agents: RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent. We inject the poisoning instances into the RAG knowledge base and long-term memories of these agents, respectively, demonstrating the generalization of **AgentPoison**. On each agent, **AgentPoison** achieves an average attack success rate of $\geq 88$% with minimal impact on benign performance ($\leq 1$%) with a poison rate $\le 1$%.
Live content is unavailable. Log in and register to view live content