Poster
in
Workshop: Multimodal Algorithmic Reasoning Workshop

LVM-Net: Efficient Long-Form Video Reasoning

Saket Gurukar · Asim Kadav

Sun 15 Dec 2:15 p.m. PST — 4:15 p.m. PST

Abstract:

Long-form video reasoning has become increasingly important for applications such as video retrieval and question answering. Handling long videos, however, is computationally expensive, especially with large vision-language models (VLMs) that demand significant resources. Existing efficient methods, such as sampling a fixed number of frames, often lose accuracy because they discard discriminative information or the ordering of multiple short-term actions. This paper proposes a novel method, Long-Video Memory Network (LVM-Net), which uses differentiable neural sampling to efficiently populate a memory with discriminative tokens from long-form videos. Applying this approach to reasoning over objects, activities, and time on the Rest-ADL dataset, we demonstrate a significant 18x improvement in inference speed while maintaining competitive predictive performance.
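The abstract does not detail how the differentiable sampling works, but a common way to select tokens differentiably is Gumbel-softmax relaxation: each memory slot takes a noise-perturbed softmax over per-token saliency scores, which approaches hard top-k selection at low temperature. The sketch below illustrates that general idea with NumPy; all names (`populate_memory`, `num_slots`, the saliency scores) are hypothetical and are not taken from the LVM-Net paper.

```python
import numpy as np

def gumbel_softmax_select(scores, num_slots, tau=0.5, rng=None):
    """Soft selection of `num_slots` tokens via Gumbel-softmax.

    Each memory slot draws a Gumbel-perturbed softmax over all token
    scores; at low temperature tau this approaches hard sampling.
    Returns a (num_slots, T) matrix of selection weights, rows sum to 1.
    """
    rng = np.random.default_rng(rng)
    T = scores.shape[0]
    # One independent Gumbel noise vector per memory slot.
    u = rng.uniform(size=(num_slots, T))
    gumbel = -np.log(-np.log(u + 1e-10) + 1e-10)
    logits = (scores[None, :] + gumbel) / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def populate_memory(tokens, scores, num_slots=8, tau=0.5, rng=0):
    """Fill a fixed-size memory with soft mixtures of input tokens.

    tokens: (T, D) frame/token features; scores: (T,) saliency scores.
    The soft weights make the selection differentiable w.r.t. scores,
    so a saliency scorer could be trained end-to-end.
    """
    weights = gumbel_softmax_select(scores, num_slots, tau, rng)
    return weights @ tokens  # (num_slots, D) memory

# Toy example: 100 frame tokens of dimension 16 compressed to 8 slots.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(100, 16))
scores = rng.normal(size=100)
memory = populate_memory(tokens, scores, num_slots=8)
print(memory.shape)  # (8, 16)
```

Because the memory size is fixed (here 8 slots) regardless of video length, downstream reasoning cost no longer grows with the number of frames, which is one plausible source of the reported inference speedup.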
