

Poster in Workshop: Multimodal Algorithmic Reasoning Workshop

LLAVIDAL: Benchmarking Large LAnguage VIsion Models for Daily Activities of Living

Rajatsubhra Chakraborty · Arkaprava Sinha · Dominick Reilly · Manish Kumar Govind · Pu Wang · François Brémond · Srijan Das

Sun 15 Dec 2:15 p.m. PST — 4:15 p.m. PST

Abstract:

With video content becoming ever more pervasive throughout society, the demand for robust video-language models is increasingly urgent. In this work, we introduce LLAVIDAL, a Large Language Vision Model tailored for Activities of Daily Living (ADL). Unlike existing models primarily trained on curated web videos, LLAVIDAL leverages a novel multiview RGB-D dataset, ADL-X, which includes 100K untrimmed video-instruction pairs, enriched with 3D skeletons and object trajectories to mimic real-world complexities. The model integrates these features to effectively understand intricate human behaviors and the spatiotemporal dynamics typical of daily activities. We also introduce ADLMCQ, a new benchmark designed to evaluate the proficiency of video-language models in interpreting ADL content. Our evaluations demonstrate that LLAVIDAL significantly outperforms existing models, showcasing superior ability to process and reason about real-life video scenarios. The insights gained underscore the necessity for advanced processing techniques to handle the scale and multimodality of video data, alongside a need for comprehensive benchmarks that reflect real-world use cases more accurately. The instruction tuning data is available at https://adl-x.github.io.
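To make the fusion idea concrete, the sketch below shows one way features from the three cues named in the abstract (video, 3D skeletons, object trajectories) could be projected into a shared language-model embedding space and concatenated before being prepended to instruction tokens. This is a minimal illustration under assumed dimensions; the class and layer names (MultimodalProjector, video_proj, etc.) are hypothetical and not taken from the LLAVIDAL codebase.

```python
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    """Illustrative projection of per-modality features into one
    LLM embedding space. All sizes and names are assumptions."""

    def __init__(self, video_dim=1024, skeleton_dim=256,
                 object_dim=512, llm_dim=4096):
        super().__init__()
        # One linear adapter per modality; real systems may use MLPs or query transformers.
        self.video_proj = nn.Linear(video_dim, llm_dim)
        self.skeleton_proj = nn.Linear(skeleton_dim, llm_dim)
        self.object_proj = nn.Linear(object_dim, llm_dim)

    def forward(self, video_feats, skeleton_feats, object_feats):
        # Each input: (batch, num_tokens, modality_dim)
        video_tok = self.video_proj(video_feats)
        skeleton_tok = self.skeleton_proj(skeleton_feats)
        object_tok = self.object_proj(object_feats)
        # Concatenate along the token axis so the LLM can attend over all cues.
        return torch.cat([video_tok, skeleton_tok, object_tok], dim=1)


if __name__ == "__main__":
    proj = MultimodalProjector()
    fused = proj(
        torch.randn(1, 32, 1024),  # 32 video tokens
        torch.randn(1, 16, 256),   # 16 skeleton tokens
        torch.randn(1, 8, 512),    # 8 object-trajectory tokens
    )
    print(fused.shape)  # torch.Size([1, 56, 4096]), prepended to instruction tokens
```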
