

Poster

HourVideo: 1-Hour Video-Language Understanding

Keshigeyan Chandrasegaran · Agrim Gupta · Manling Li · Taran Kota · Lea M. Hadzic · Jimming He · Cristobal Eyzaguirre · Zane Durante · Jiajun Wu · Li Fei-Fei

West Ballroom A-D #5109
Thu 12 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

We present HourVideo, a benchmark dataset for one-hour video-language understanding. Our dataset consists of a novel task suite comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. The benchmark includes 500 egocentric videos from the Ego4D dataset, spanning durations from 20 to 120 minutes, and features 13,000 high-quality five-way multiple-choice questions. Initial benchmarking results show that multimodal models such as GPT-4V and LLaVA-NeXT perform only marginally above random chance. In contrast, human baselines significantly outperform the state-of-the-art long-context multimodal model Gemini Pro 1.5 (84% vs. 40%), suggesting a substantial research gap. Our benchmark, evaluation toolkit, baseline results, prompts, and documentation are included in the supplementary materials and will be made publicly available.
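Because every question is five-way multiple choice, random guessing corresponds to roughly 20% accuracy, which is the reference point for the "marginally above random chance" result. The following is a minimal sketch of how such accuracy scoring might look; the JSONL layout and the field names qid and answer are assumptions for illustration, not the released HourVideo evaluation toolkit.

    # Minimal sketch of five-way MCQ accuracy scoring (assumed data layout,
    # not the released HourVideo evaluation toolkit).
    import json

    def accuracy(pred_path: str, gold_path: str) -> float:
        """Fraction of questions where the predicted option matches the gold answer."""
        with open(pred_path) as f:
            preds = {r["qid"]: r["answer"] for r in map(json.loads, f)}
        with open(gold_path) as f:
            golds = {r["qid"]: r["answer"] for r in map(json.loads, f)}
        correct = sum(preds.get(qid) == ans for qid, ans in golds.items())
        return correct / len(golds)

    # With five answer options, random guessing yields about 0.20 accuracy,
    # so scores only slightly above 0.20 are "marginally above random chance".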
