NeurIPS Emergence of Steganography Between Large Language Models

Poster
in
Workshop: Towards Safe & Trustworthy Agents

Emergence of Steganography Between Large Language Models

Yohan Mathew · Joan Velja · Ollie Matthews · Robert McCarthy · Dylan Cope · Nandi Schoots

[ Abstract ] [ Project Page ]

[ OpenReview]

Abstract:

Future AI systems may involve multiple AI agents with independent and potentially adversarial goals interacting with one another. In these settings, there is the risk that agents will learn to collude in order to increase their gains at the expense of other agents, and steganographic techniques are a powerful way to achieve such collusion undetected. Steganography is defined as the practice of concealing information within another message or physical object to communicate with a colluding party while avoiding detection by a third party. In this paper, we use a simplified candidate screening setting with two Large Language Models (LLMs). Here, a cover letter summarizing LLM has access to sensitive information that has historically been correlated with good candidates, but that it is not allowed to communicate to the decision-making LLM. We use two learning algorithms to optimize the LLMs to improve their performance on the candidate screening task -- In-Context Reinforcement Learning (ICRL) and Gradient-Based Reinforcement Learning (GBRL). We find that even though we do not directly prompt the models to do steganography, it emerges because it is instrumental for obtaining reward.

Chat is not available.

Poster in Workshop: Towards Safe & Trustworthy Agents

Emergence of Steganography Between Large Language Models

Yohan Mathew · Joan Velja · Ollie Matthews · Robert McCarthy · Dylan Cope · Nandi Schoots

Poster
in
Workshop: Towards Safe & Trustworthy Agents