Spotlight
in
Workshop: Multi-Agent Security: Security as Key to AI Safety
Second-order Jailbreaks: Generative Agents Successfully Manipulate Through an Intermediary
Mikhail Terekhov · Romain Graux · Eduardo Neville · Denis Rosset · Gabin Kolly
Keywords: [ negotiations ] [ multi-agent ] [ Large language models ] [ security ]
As the capabilities of Large Language Models (LLMs) continue to expand, their application in communication tasks is becoming increasingly prevalent. However, this widespread use brings with it novel risks, including the susceptibility of LLMs to "jailbreaking" techniques. In this paper, we explore the potential for such risks in two- and three-agent communication networks, where one agent is tasked with protecting a password while another attempts to uncover it. Our findings reveal that an attacker, powered by advanced LLMs, can extract the password even through an intermediary that is instructed to prevent this. Our contributions include an experimental setup for evaluating the persuasiveness of LLMs, a demonstration of LLMs' ability to manipulate each other into revealing protected information, and a comprehensive analysis of this manipulative behavior. Our results underscore the need for further investigation into the safety and security of LLMs in communication networks.