

Poster

Video Token Merging for Long Video Understanding

Seon-Ho Lee · Jue Wang · Zhikang Zhang · David Fan · Xinyu Li

Thu 12 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

As data and model scales for video understanding rapidly expand, handling long-form video input in transformer-based models presents a practical challenge. Rather than resorting to input sampling or token dropping, which may result in information loss, token merging shows promise when combined with transformers. However, implementing token merging for long-form video processing is not trivial. We begin with the premise that token merging should not rely solely on the similarity of video tokens; the saliency of tokens also warrants consideration. To address this, we explore various video token merging strategies in the context of long-form video classification, ranging from a simple extension of image token merging to region-concentrated merging, and finally propose a learnable Video Token Merging (VTM) algorithm that dynamically merges video tokens based on visually salient areas. Through extensive experimentation, we achieve state-of-the-art or comparable performance on the LVU, COIN, and Breakfast datasets. Additionally, our approach reduces memory costs by 84% and boosts throughput by approximately 6.89 times compared to baseline algorithms.
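To make the idea of merging tokens by similarity while accounting for saliency more concrete, here is a minimal sketch in plain Python. It is not the VTM algorithm from the paper, whose details the abstract does not give. It assumes a simplified bipartite matching scheme in which even-indexed tokens are merge candidates and odd-indexed tokens are merge targets, and it assumes a per-token saliency score is supplied from elsewhere. The names `cosine`, `merge_tokens`, `saliency`, and `r` are hypothetical and chosen for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def merge_tokens(tokens, saliency, r):
    """Merge up to r tokens into their most similar partner.

    tokens:   list of equal-length lists of floats
    saliency: list of strictly positive floats, one per token
              (assumed input; how saliency is obtained is out of scope)
    r:        number of tokens to remove by merging
    Returns (merged_tokens, merged_saliency).
    """
    src = list(range(0, len(tokens), 2))  # even indices: merge candidates
    dst = list(range(1, len(tokens), 2))  # odd indices: merge targets
    if not dst:
        return [list(t) for t in tokens], list(saliency)
    r = max(0, min(r, len(src)))

    # For every candidate, find its most similar target token.
    matches = []
    for i in src:
        j = max(dst, key=lambda k: cosine(tokens[i], tokens[k]))
        matches.append((cosine(tokens[i], tokens[j]), i, j))

    # Merge only the r most similar (candidate, target) pairs.
    matches.sort(key=lambda m: m[0], reverse=True)

    # Saliency-weighted running sums, so salient tokens dominate the average.
    sums = {j: [saliency[j] * x for x in tokens[j]] for j in dst}
    weights = {j: saliency[j] for j in dst}
    merged_away = set()
    for _, i, j in matches[:r]:
        sums[j] = [s + saliency[i] * x for s, x in zip(sums[j], tokens[i])]
        weights[j] += saliency[i]
        merged_away.add(i)

    out_tokens, out_saliency = [], []
    for k in range(len(tokens)):
        if k in merged_away:
            continue
        if k in sums:
            out_tokens.append([s / weights[k] for s in sums[k]])
            out_saliency.append(weights[k])
        else:
            out_tokens.append(list(tokens[k]))
            out_saliency.append(saliency[k])
    return out_tokens, out_saliency


tokens = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
saliency = [1.0, 3.0, 1.0, 1.0]
merged, merged_saliency = merge_tokens(tokens, saliency, r=1)
print(len(tokens), "->", len(merged))
```

In this sketch, similarity alone decides which pairs are merged, and saliency only sets the averaging weights. A learnable method such as the one the abstract describes would presumably let saliency influence the choice of merge targets as well, which this example does not attempt.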
