

Poster in Workshop: Workshop on Video-Language Models

LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living

Rajatsubhra Chakraborty · Arkaprava Sinha · Dominick Reilly · Manish Govind · Pu Wang · François Brémond · Srijan Das


Abstract:

With the increasing pervasiveness of video content throughout society, the demand for robust video-language models has become urgent. In this work, we introduce LLAVIDAL, a Large Language Vision Model tailored for Activities of Daily Living (ADL). Unlike existing models trained primarily on curated web videos, LLAVIDAL leverages a novel multiview RGB-D dataset, ADL-X, comprising 100K untrimmed video-instruction pairs enriched with 3D skeletons and object trajectories that reflect real-world complexity. The model integrates these features to understand the intricate human behaviors and spatiotemporal dynamics typical of daily activities. We also introduce ADLMCQ, a new benchmark designed to evaluate how well video-language models interpret ADL content. Our evaluations demonstrate that LLAVIDAL significantly outperforms existing models, showing a superior ability to process and reason about real-life video scenarios. These insights underscore the need for advanced processing techniques that can handle the scale and multimodality of video data, along with comprehensive benchmarks that more accurately reflect real-world use cases.
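The abstract states that LLAVIDAL integrates video features with 3D skeletons and object trajectories, but does not spell out the mechanism. As a rough illustration only, the PyTorch sketch below shows one common way such per-modality features can be projected into a shared language-model embedding space and concatenated into a single token sequence. All module names, feature dimensions, and the concatenation-based fusion strategy here are assumptions for illustration, not LLAVIDAL's actual implementation.

    import torch
    import torch.nn as nn

    class MultimodalFusion(nn.Module):
        """Project per-modality features into a shared LLM embedding
        space and concatenate them into one token sequence.
        (Hypothetical sketch; dimensions are illustrative.)"""

        def __init__(self, video_dim=1024, skel_dim=256, obj_dim=128, llm_dim=4096):
            super().__init__()
            # One linear projector per modality (an assumed design, not the paper's).
            self.video_proj = nn.Linear(video_dim, llm_dim)
            self.skel_proj = nn.Linear(skel_dim, llm_dim)
            self.obj_proj = nn.Linear(obj_dim, llm_dim)

        def forward(self, video_feats, skel_feats, obj_feats):
            # video_feats: (B, T_v, video_dim) frame-level video features
            # skel_feats:  (B, T_s, skel_dim)  3D-skeleton features
            # obj_feats:   (B, T_o, obj_dim)   object-trajectory features
            tokens = torch.cat(
                [
                    self.video_proj(video_feats),
                    self.skel_proj(skel_feats),
                    self.obj_proj(obj_feats),
                ],
                dim=1,  # concatenate along the token (time) axis
            )
            # (B, T_v + T_s + T_o, llm_dim): ready to prepend to text tokens
            return tokens

    if __name__ == "__main__":
        fusion = MultimodalFusion()
        v = torch.randn(2, 32, 1024)   # 32 video tokens
        s = torch.randn(2, 16, 256)    # 16 skeleton tokens
        o = torch.randn(2, 8, 128)     # 8 object-trajectory tokens
        print(fusion(v, s, o).shape)   # torch.Size([2, 56, 4096])

Simple concatenation after per-modality projection is only one of several plausible fusion schemes (cross-attention or interleaved tokens are alternatives); the paper itself should be consulted for the approach LLAVIDAL actually uses.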
