Oral in Workshop: Workshop on Video-Language Models
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
Mu Cai · Reuben Tan · Jianrui Zhang · Bocheng Zou · Kai Zhang · Yao Feng · Fangrui Zhu · Jing Gu · Yiwu Zhong · Yuzhang Shang · Yao Dou · Jaden Park · Jianfeng Gao · Yong Jae Lee · Jianwei Yang
Understanding fine-grained temporal dynamics is crucial for video understanding. Yet popular video benchmarks, such as MSRVTT and TGIF, often fail to effectively evaluate AI models' temporal reasoning abilities due to the lack of fine-grained temporal annotations. As a result, text-based models, leveraging strong language priors, often perform comparably to video models, and image-trained models have been reported to outperform their video-trained counterparts on MSRVTT and TGIF. This paper introduces TemporalBench, a new benchmark for fine-grained temporal event understanding in videos. TemporalBench, sourced from a diverse set of video datasets, consists of ∼10K video question-answer pairs derived from ∼2K high-quality human-annotated video captions. Uniquely, our benchmark provides fine-grained temporal annotations to evaluate models' temporal reasoning abilities. Our results show that state-of-the-art models like GPT-4o achieve only 38.0% multiple binary QA accuracy on TemporalBench, demonstrating a significant human-AI gap in temporal understanding. We hope that TemporalBench will be instrumental in fostering research on improving models' temporal reasoning capabilities.
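For readers unfamiliar with the reported metric, the following is a minimal sketch of one plausible way a "multiple binary QA accuracy" could be computed, assuming each annotated caption yields several binary questions and a video counts as correct only if all of its binary questions are answered correctly. The data layout and field names here are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def multiple_binary_qa_accuracy(predictions):
    """Aggregate per-question binary correctness into a per-video score.

    `predictions` is a list of dicts with hypothetical fields:
      - "video_id": identifier of the source video clip
      - "correct":  whether the model answered this binary question correctly

    Assumed definition: a video is scored as correct only if every binary
    question derived from its caption is answered correctly.
    """
    per_video = defaultdict(list)
    for p in predictions:
        per_video[p["video_id"]].append(p["correct"])

    num_correct = sum(all(flags) for flags in per_video.values())
    return num_correct / len(per_video) if per_video else 0.0


# Toy usage: two videos, one fully correct, one with a single miss.
preds = [
    {"video_id": "v1", "correct": True},
    {"video_id": "v1", "correct": True},
    {"video_id": "v2", "correct": True},
    {"video_id": "v2", "correct": False},
]
print(multiple_binary_qa_accuracy(preds))  # 0.5
```

Under this assumed all-or-nothing aggregation, chance performance drops sharply as more binary questions are attached to each video, which is one way such a metric can expose reliance on language priors rather than genuine temporal reasoning.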