

Poster in Workshop on Video-Language Models

Quo Vadis, Video Understanding with Vision-Language Foundation Models?

Mahmoud Ali · Di Yang · Arkaprava Sinha · Dominick Reilly · Srijan Das · Gianpiero Francesca · François Brémond


Abstract:

Vision-language foundation models, including vision-language models (VLMs) and vision-large language models (VLLMs), have evolved rapidly and shown strong performance on a range of downstream video understanding tasks, especially on web-sourced datasets. However, it remains an open question how well these VLMs and VLLMs perform in more challenging scenarios such as Activities of Daily Living (ADL). To answer this, we provide a comprehensive study of VLMs and VLLMs by comparing their zero-shot transfer ability on five downstream tasks: action classification, video retrieval, video description, action forecasting, and frame-wise action segmentation. We conduct extensive experiments on eleven real-world, human-centric video understanding datasets (e.g., Toyota Smarthome, Penn Action, UAV-Human, EgoExo4D, TSU, Charades) and offer insights into the strengths and limitations of these models in zero-shot settings. Moreover, we provide an in-depth analysis to identify the best configuration for improving model performance on zero-shot action classification. Based on our experiments, we find that these models remain far from satisfactory performance on all evaluated tasks, particularly on densely labeled and long-video datasets.
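For readers unfamiliar with the zero-shot transfer protocol evaluated in the study, the sketch below illustrates one common way a CLIP-style VLM is applied to zero-shot action classification: per-frame embeddings are averaged into a video embedding and scored against text prompts built from class names. This is a minimal illustration under stated assumptions, not the authors' pipeline; the checkpoint name, prompt template, frame-sampling helper, and example class list are all assumptions for the example.

```python
# Minimal zero-shot action classification sketch with a CLIP-style VLM.
# Assumptions (not from the paper): the "openai/clip-vit-base-patch32" checkpoint,
# a directory of pre-extracted JPEG frames, and the prompt template are illustrative.
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


def sample_frames(frame_dir: str, num_frames: int = 8) -> list[Image.Image]:
    """Uniformly sample JPEG frames from a directory of extracted frames (hypothetical helper)."""
    paths = sorted(Path(frame_dir).glob("*.jpg"))
    idx = torch.linspace(0, len(paths) - 1, num_frames).long().tolist()
    return [Image.open(paths[i]).convert("RGB") for i in idx]


@torch.no_grad()
def zero_shot_classify(frame_dir: str, class_names: list[str]) -> str:
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    frames = sample_frames(frame_dir)
    prompts = [f"a video of a person {c}" for c in class_names]

    inputs = processor(text=prompts, images=frames, return_tensors="pt", padding=True)
    image_feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_feats = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

    # Average normalized frame embeddings into a single video embedding,
    # then score it against every class prompt with cosine similarity.
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    video_feat = image_feats.mean(dim=0, keepdim=True)
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)

    scores = (video_feat @ text_feats.T).squeeze(0)
    return class_names[int(scores.argmax())]


# Example usage with hypothetical ADL-style class names:
# print(zero_shot_classify("frames/video_0001", ["drinking", "cooking", "reading"]))
```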
