Poster in Workshop: Towards Safe & Trustworthy Agents
Modelling the oversight of deceptive interpretability agents
Simon Lermen · Mateusz Dziemian
This study explores how AI agents can cooperate to deceive oversight systems in automated interpretability. We use modified Llama 3 language models as toy models representing deceptive, unaligned AI agents.

Our study focuses on sparse autoencoders (SAEs), a promising interpretability approach in which language models are used to label and score neural network features. In this approach, one model labels features with explanations based on activations, another simulates activations from the labels, and comparing the simulated activations with the real ones yields an explanation score.

We demonstrate how agents can coordinate to create deceptive labels. These explanation labels evade overseer detection while maintaining high explanation scores. We employ refusal-vector ablation, custom prompting for deception, agentic scaffolding, and steganography tools.

Our agents successfully completed the deception task in 87% of cases without detection by the overseer. Despite the deception, our method achieved explanation scores of 0.82, compared to 0.96 for ground truth labels on a possible scale of 0 to 1.

We conclude by proposing mitigation strategies, emphasizing the critical need for robust defenses against deception.
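
The scoring step described in the abstract, where simulated activations are compared against real ones, can be illustrated with a minimal sketch. The abstract does not state which comparison metric is used; the example below assumes a Pearson correlation clamped to the 0 to 1 range, and the function name `explanation_score` and all activation values are hypothetical, chosen only for illustration.

```python
import math


def explanation_score(true_acts, simulated_acts):
    """Score an explanation by comparing real and simulated activations.

    Assumption: the score is the Pearson correlation between the two
    activation sequences, clamped to [0, 1]. The abstract only says that
    the comparison yields a score on a 0-to-1 scale.
    """
    n = len(true_acts)
    if n != len(simulated_acts) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    mean_t = sum(true_acts) / n
    mean_s = sum(simulated_acts) / n

    cov = sum((t - mean_t) * (s - mean_s) for t, s in zip(true_acts, simulated_acts))
    var_t = sum((t - mean_t) ** 2 for t in true_acts)
    var_s = sum((s - mean_s) ** 2 for s in simulated_acts)

    # A constant sequence carries no signal, so treat it as a zero score.
    if var_t == 0 or var_s == 0:
        return 0.0

    r = cov / math.sqrt(var_t * var_s)
    return max(0.0, min(1.0, r))


# Hypothetical per-token activations of one SAE feature.
true_acts = [0.0, 0.1, 0.9, 0.0]
# Activations simulated from a label that matches the feature well.
faithful_sim = [0.0, 0.2, 0.8, 0.1]
# Activations simulated from a label that describes something unrelated.
unrelated_sim = [0.9, 0.0, 0.1, 0.8]

faithful_score = explanation_score(true_acts, faithful_sim)
unrelated_score = explanation_score(true_acts, unrelated_sim)
```

Under this assumed metric, a label only needs to let the simulating model reproduce the activation pattern to score well. This is consistent with the abstract's finding that deceptive labels can still reach high explanation scores when the labeling and simulating agents coordinate.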