

Poster in Workshop: Workshop on Video-Language Models

MuMA-ToM: Multi-modal Multi-Agent Theory of Mind

Haojun Shi · Suyu Ye · Xinyu Fang · Chuanyang Jin · Leyla Isik · Yen-Ling Kuo · Tianmin Shu


Abstract:

Understanding people’s social interactions in complex real-world scenarios often relies on intricate mental reasoning. To truly understand how and why people interact with one another, we must infer the underlying mental states that give rise to the social interactions, i.e., Theory of Mind reasoning in multi-agent interactions. Additionally, social interactions are often multi-modal – we can watch people’s actions, hear their conversations, and/or read about their past behaviors. For AI systems to successfully and safely interact with people in real-world environments, they also need to understand people’s mental states as well as their inferences about each other’s mental states based on multi-modal information about their interactions. For this, we introduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark. MuMA-ToM is the first multi-modal Theory of Mind benchmark that evaluates mental reasoning in embodied multi-agent interactions. In MuMA-ToM, we provide video and text descriptions of people’s multi-modal behavior in realistic household environments. Based on the context, we then ask questions about people’s goals, beliefs, and beliefs about others’ goals. We validated MuMA-ToM in a human experiment and provided a human baseline. We also proposed a novel multi-modal, multi-agent ToM model, LIMP (Language model-based Inverse Multi-agent Planning). Our experimental results show that LIMP significantly outperforms state-of-the-art methods, including large multi-modal models (e.g., GPT-4o, Gemini 1.5 Pro) and a recent multi-modal ToM model, BIP-ALM.
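The core inference pattern behind inverse multi-agent planning is to explain observed behavior by asking which latent goal or belief most likely produced it. The sketch below is a minimal, hypothetical illustration of that idea only: a Bayesian posterior over candidate goals given observed actions. The goal set, action likelihoods, and prior are invented for illustration and do not reflect LIMP's actual architecture, which the abstract describes only at a high level.

```python
# Minimal, hypothetical sketch of inverse-planning-style goal inference:
# P(goal | actions) ∝ P(actions | goal) * P(goal).
# Not the authors' LIMP implementation; all quantities below are illustrative.

from typing import Dict, List


def goal_posterior(
    observed_actions: List[str],
    action_likelihood: Dict[str, Dict[str, float]],
    goal_prior: Dict[str, float],
) -> Dict[str, float]:
    """Posterior over candidate goals, assuming actions are
    conditionally independent given the goal."""
    unnormalized = {}
    for goal, prior in goal_prior.items():
        likelihood = 1.0
        for action in observed_actions:
            # Unseen (goal, action) pairs get a small smoothing probability.
            likelihood *= action_likelihood.get(goal, {}).get(action, 1e-6)
        unnormalized[goal] = prior * likelihood
    total = sum(unnormalized.values()) or 1.0
    return {goal: p / total for goal, p in unnormalized.items()}


if __name__ == "__main__":
    # Toy household scenario: did the agent intend to set the table or tidy up?
    likelihoods = {
        "set_table": {"open_cabinet": 0.6, "carry_plate": 0.8},
        "tidy_up": {"open_cabinet": 0.3, "carry_plate": 0.2},
    }
    prior = {"set_table": 0.5, "tidy_up": 0.5}
    print(goal_posterior(["open_cabinet", "carry_plate"], likelihoods, prior))
```

In this toy example the posterior favors the "set_table" goal, mirroring how an observer (human or model) would attribute an intention after watching the same actions in a household scene.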
