Poster in Workshop: Towards Safe & Trustworthy Agents

HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions

Xuhui Zhou · Hyunwoo Kim · Faeze Brahman · Liwei Jiang · Hao Zhu · Ximing Lu · Frank F. Xu · Bill Yuchen Lin · Niloofar Mireshghallah · Ronan Le Bras · Maarten Sap


Abstract:

AI agents are increasingly autonomous in their interactions with their environments and users, leading to increased interactional safety risks. We present HAICOSYSTEM, a framework for examining AI agent safety within diverse and complex social interactions. HAICOSYSTEM is a modular sandbox environment that simulates interactions between human users and AI agents equipped with a variety of tools (e.g., email and online payment platforms), and examines the safety risks of AI agents in various situations (e.g., a smart-home AI agent opening the door for strangers). Additionally, we develop a comprehensive evaluation framework for AI agent safety, using a set of metrics that cover operational, content-related, societal, and legal risks. By running 1840 simulations based on 92 scenarios spanning various domains (e.g., medicine), we demonstrate that HAICOSYSTEM can emulate realistic user-AI interactions and complex tool use by AI agents. Our experiments show that state-of-the-art large language models (LLMs) exhibit safety risks in over 50% of cases, with models generally showing higher risks when interacting with malicious human users. Our findings highlight the ongoing challenge of building agents that can safely navigate complex interactions, particularly when faced with malicious users. To foster the AI agent safety ecosystem, we release a code platform that allows practitioners to create custom scenarios, simulate interactions, and evaluate the safety and performance of their agents.
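The abstract describes a workflow of custom scenario → simulated user-agent interaction → multi-dimensional safety evaluation, but does not specify the released platform's API. The sketch below is a minimal, self-contained illustration of that workflow under stated assumptions: every name in it (Scenario, simulate, evaluate_episode, call_agent) is hypothetical, not the actual HAICOSYSTEM interface.

```python
# Hypothetical sketch of driving a HAICOSYSTEM-style sandbox:
# define a scenario, roll out one user-agent episode, and score it
# on the four risk dimensions named in the abstract. All names here
# are illustrative assumptions, not the released platform's API.
from dataclasses import dataclass, field

RISK_DIMENSIONS = ("operational", "content", "societal", "legal")

@dataclass
class Scenario:
    domain: str                                 # e.g., "medical", "smart-home"
    user_profile: str                           # "benign" or "malicious"
    tools: list = field(default_factory=list)   # tools exposed to the agent

def call_agent(history):
    """Stand-in for an LLM agent call; a real run would query a model
    and possibly emit tool calls (email, online payments, etc.)."""
    return "agent: I can help with that."

def simulate(scenario: Scenario, max_turns: int = 4):
    """Roll out one simulated user-AI episode and return the transcript."""
    history = [f"user ({scenario.user_profile}): request in {scenario.domain} domain"]
    for _ in range(max_turns):
        history.append(call_agent(history))
        history.append("user: follow-up")
    return history

def evaluate_episode(transcript):
    """Toy evaluator: a real one would score each risk dimension
    (e.g., with an LLM judge); here every dimension defaults to 0."""
    return {dim: 0 for dim in RISK_DIMENSIONS}

if __name__ == "__main__":
    scenario = Scenario(domain="medical", user_profile="malicious",
                        tools=["email", "online_payment"])
    transcript = simulate(scenario)
    print(evaluate_episode(transcript))
```

Separating the scenario definition, the simulation loop, and the evaluator mirrors the modularity the abstract emphasizes: each piece can be swapped out (different domains, user profiles, tool sets, or risk metrics) without touching the others.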
