Poster in Workshop: Multimodal Algorithmic Reasoning Workshop
LVM-Net: Efficient Long-Form Video Reasoning
Saket Gurukar · Asim Kadav
Long-form video reasoning has become increasingly important for applications such as video retrieval and question answering. Existing approaches to handling long videos can be computationally expensive, especially when they rely on large vision-language models (VLMs). Efficient alternatives, such as sampling a fixed number of frames, often lose accuracy because they discard discriminative information or the temporal order of multiple short-term actions. This paper proposes a novel method, the Long-Video Memory Network (LVM-Net), which uses differentiable neural sampling to efficiently populate a memory with discriminative tokens from long-form videos. Applying this approach to reasoning over objects, activities, and time on the Rest-ADL dataset, we demonstrate an 18x improvement in inference speed while maintaining competitive predictive performance.
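The abstract does not specify how the differentiable sampler is implemented. The sketch below is one plausible instantiation, assuming a straight-through Gumbel-softmax relaxation that selects K discriminative tokens into a fixed-size memory; the class name, scoring head, and all hyperparameters are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentiableMemorySampler(nn.Module):
    """Scores video tokens and selects K of them into a fixed-size memory.

    Hypothetical sketch of differentiable neural sampling; LVM-Net's
    actual sampler may differ.
    """

    def __init__(self, dim: int, memory_slots: int, tau: float = 1.0):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)   # assumed per-token relevance head
        self.memory_slots = memory_slots  # K: fixed memory budget
        self.tau = tau                    # Gumbel-softmax temperature

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) extracted from a long-form video
        logits = self.scorer(tokens).squeeze(-1)  # (B, N) per-token scores
        slots = []
        for _ in range(self.memory_slots):
            # Straight-through Gumbel-softmax: hard one-hot selection in the
            # forward pass, soft gradients to the scorer in the backward pass.
            weights = F.gumbel_softmax(logits, tau=self.tau, hard=True)  # (B, N)
            slots.append(torch.einsum("bn,bnd->bd", weights, tokens))
            # Mask out the chosen token so later slots pick distinct tokens.
            logits = logits.masked_fill(weights.detach().bool(), float("-inf"))
        return torch.stack(slots, dim=1)  # (B, K, dim) memory of selected tokens


# Example: compress 4096 frame tokens into a 32-slot memory.
sampler = DifferentiableMemorySampler(dim=768, memory_slots=32)
video_tokens = torch.randn(2, 4096, 768)
memory = sampler(video_tokens)  # shape: (2, 32, 768)
```

Because selection is differentiable, the scorer can be trained end to end with the downstream reasoning objective, so the memory learns to retain tokens that are discriminative for the task rather than relying on a fixed frame-sampling heuristic.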