Poster in Workshop: Workshop on Video-Language Models
Can Video Large Language Models Comprehend Language in Videos?
Minjoon Jung · Junbin Xiao · Byoung-Tak Zhang · Angela Yao
Recent advancements in video large language models (Video-LLMs) have demonstrated the ability to temporally ground language queries, i.e., to retrieve the video moments that a query describes. However, such capabilities have not been thoroughly verified to be robust and trustworthy. In this study, we examine how consistently Video-LLMs grasp temporal moments within videos, a critical indicator of robust and trustworthy video-language comprehension. Specifically, we devise probes in which a Video-LLM first predicts a temporal moment from a language query and is then asked verification questions to assess whether the predicted moment accurately reflects the query. Our results show that current Video-LLMs respond unintuitively to such probes: they often fail to give consistent answers upon re-evaluation and can even fall to near chance-level performance. These findings reveal significant shortcomings in the temporal understanding capabilities of current Video-LLMs and underscore the need for further research and development in this area.
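The abstract describes a two-step probe: the model first grounds a query to a temporal moment, then answers a verification question about its own prediction. The sketch below illustrates one plausible form of this protocol; `video_llm_chat`, `parse_timestamps`, and the prompt templates are illustrative assumptions, not the paper's exact interface or wording.

```python
import re
from typing import Callable, Tuple


def parse_timestamps(text: str) -> Tuple[float, float]:
    """Extract the first two numbers in the model's reply as (start, end) seconds."""
    nums = re.findall(r"\d+(?:\.\d+)?", text)
    return float(nums[0]), float(nums[1])


def grounding_prompt(query: str) -> str:
    # Step 1 prompt: ask the model to localize the query in the video.
    return (f'Find the moment that matches the description: "{query}". '
            "Answer with start and end times in seconds, e.g. 12.0 to 18.5.")


def verification_prompt(query: str, start: float, end: float) -> str:
    # Step 2 prompt: ask whether the predicted moment actually shows the query.
    return (f"Does the segment from {start:.1f}s to {end:.1f}s show: "
            f'"{query}"? Answer yes or no.')


def probe_consistency(video_llm_chat: Callable[[object, str], str],
                      video: object, query: str) -> bool:
    """Return True if the model verifies its own grounding prediction."""
    # Step 1: the model grounds the language query to a temporal moment.
    answer = video_llm_chat(video, grounding_prompt(query))
    start, end = parse_timestamps(answer)

    # Step 2: the model is asked to confirm its own prediction; an inconsistent
    # model will deny a moment it just produced.
    verdict = video_llm_chat(video, verification_prompt(query, start, end))
    return verdict.strip().lower().startswith("yes")
```

Averaging `probe_consistency` over a set of query-video pairs would give a simple consistency rate of the kind the study reports as dropping toward chance level.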