NeurIPS Poster $E^3$: Exploring Embodied Emotion Through A Large-Scale Egocentric Video Dataset

Wang Lin · Yueying Feng · WenKang Han · Tao Jin · Zhou Zhao · Fei Wu · Chang Yao · Jingyuan Chen

West Ballroom A-D #6602

Fri 13 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract: Understanding human emotions is fundamental to enhancing human-computer interaction, especially for embodied agents that mimic human behavior. Traditional emotion analysis often takes a third-person perspective, which limits the ability of agents to interact naturally and empathetically. To address this gap, this paper presents $E^3$ (Exploring Embodied Emotion), the first large-scale first-person-view video dataset. $E^3$ contains more than $50$ hours of video, capturing $8$ different emotion types in diverse scenarios and languages. The videos were recorded by individuals in their daily lives and capture a wide range of real-world emotions conveyed through visual, acoustic, and textual modalities. Using this dataset, we define $4$ core benchmark tasks (emotion recognition, emotion classification, emotion localization, and emotion reasoning), supported by more than $80$k manually crafted annotations, providing a comprehensive resource for training and evaluating emotion analysis models. We further present Emotion-LlaMa, which complements the visual modality with the acoustic modality to enhance the understanding of emotion in first-person videos. Comparison experiments against a large number of baselines demonstrate the superiority of Emotion-LlaMa and set a new benchmark for embodied emotion analysis. We expect that $E^3$ can promote advances in multimodal understanding, robotics, and augmented reality, and provide a solid foundation for the development of more empathetic and context-aware embodied agents.
