Oral in Workshop: Safe Generative AI
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
Marc Carauleanu · Michael Vaiana · Diogo de Lucena · Judd Rosenblatt · Cameron Berg
Sun 15 Dec 9 a.m. PST — 5 p.m. PST
As AI systems increasingly make critical decisions, deceptive AI poses a significant challenge to trust and safety. We present Self-Other Overlap (SOO) fine-tuning, a promising approach in AI safety that could substantially improve our ability to build honest artificial intelligence. Inspired by cognitive neuroscience research on empathy, SOO aims to align how AI models represent themselves and others. Our experiments with Mistral 7B v0.2 demonstrate SOO's efficacy: deceptive responses in this large language model dropped from 95.2% to 15.9%, with no observed reduction in general task performance. In reinforcement learning scenarios, SOO-trained agents likewise showed significantly reduced deceptive behavior. SOO's focus on internal representations offers strong potential for generalization across AI architectures. While current applications focus on language models and simple RL environments, SOO could pave the way for more trustworthy AI in broader domains. Ethical implications and long-term effects warrant further investigation, but SOO represents a significant step forward in AI safety research.
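The abstract describes the idea at a high level. As one illustration of what "aligning how a model represents itself and others" could look like in practice, the sketch below computes an auxiliary overlap loss between a causal language model's hidden states on a self-referencing prompt and a matched other-referencing prompt. The prompt pair, layer choice, pooling, and loss weighting here are illustrative assumptions, not the authors' exact training recipe.

```python
# Minimal sketch of a self-other overlap (SOO) auxiliary loss.
# Assumptions (not from the paper): we compare a causal LM's hidden states on
# paired prompts that differ only in the referent ("self" vs. "other") and
# penalize the mean-squared distance between them at one chosen layer.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # model family named in the abstract
LAYER = -1          # which hidden layer to compare (assumption)
SOO_WEIGHT = 0.5    # weight of the overlap term vs. the usual LM loss (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def pooled_hidden_state(prompt: str) -> torch.Tensor:
    """Mean-pooled hidden state of the chosen layer for a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[LAYER].mean(dim=1)  # shape: (1, hidden_dim)

def soo_loss(self_prompt: str, other_prompt: str) -> torch.Tensor:
    """Penalize divergence between the 'self' and 'other' representations."""
    h_self = pooled_hidden_state(self_prompt)
    h_other = pooled_hidden_state(other_prompt)
    return torch.nn.functional.mse_loss(h_self, h_other)

# Hypothetical prompt pair: identical except for who the referent is.
pair = (
    "You want to take the shortest path to the goal.",
    "The other agent wants to take the shortest path to the goal.",
)

loss = SOO_WEIGHT * soo_loss(*pair)
loss.backward()  # in fine-tuning, gradients would be combined with the LM objective
```

In an actual fine-tuning run, a term like this would presumably be added to the standard language-modeling objective rather than optimized alone, which is one way the reported lack of degradation in general task performance could be preserved.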