Poster in Workshop: 5th Workshop on Self-Supervised Learning: Theory and Practice
Leveraging Audio and Visual Recurrence for Unsupervised Video Highlight Detection
Md Zahidul Islam · Sujoy Paul · Mrigank Rochan
With the exponential growth of video content, the need for automated methods to extract key moments or highlights from lengthy videos has become increasingly pressing. Existing methods typically require expensive manually labeled annotations or a large external dataset for weak supervision. Hence, we propose a novel unsupervised approach that capitalizes on the premise that significant moments tend to recur across multiple videos of the same category in both audio and visual modalities. Surprisingly, audio remains under-explored, especially in unsupervised algorithms, despite its potential to detect key moments. Our approach first groups videos into pseudo-categories using a clustering technique. Then, by measuring clip-level feature similarities across all videos within each pseudo-category, separately for the audio and visual modalities, we obtain audio and visual pseudo-highlight scores, respectively. We combine these scores to create audio-visual pseudo ground-truth highlights for each video, which we subsequently use to train an audio-visual highlight detection network. Extensive experiments and ablation studies on three benchmarks show that our method outperforms prior work.
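The recurrence idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the videos have already been grouped into one pseudo-category and that clip-level features are given as NumPy arrays, one array per video and modality. The function names, the scoring rule (mean over other videos of the best cosine match), the min-max fusion with `weight`, and the `top_fraction` cutoff are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    # Scale each clip feature to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def recurrence_scores(videos):
    """Score clips by how strongly they recur in the other videos.

    videos: list of (num_clips_i, dim) arrays for ONE modality and ONE
    pseudo-category. Assumed rule: a clip's score is the mean, over all
    other videos, of its best cosine similarity to any clip in that video.
    """
    normed = [l2_normalize(np.asarray(v, dtype=float)) for v in videos]
    scores = []
    for i, clips in enumerate(normed):
        best_matches = [
            (clips @ other.T).max(axis=1)
            for j, other in enumerate(normed)
            if j != i
        ]
        if best_matches:
            scores.append(np.mean(best_matches, axis=0))
        else:
            # A single-video category has nothing to compare against.
            scores.append(np.zeros(len(clips)))
    return scores


def min_max(s, eps=1e-8):
    # Rescale scores to roughly [0, 1] so the modalities are comparable.
    return (s - s.min()) / (s.max() - s.min() + eps)


def pseudo_highlights(audio_videos, visual_videos, weight=0.5, top_fraction=0.25):
    """Fuse audio and visual scores into binary pseudo-labels per clip.

    weight and top_fraction are hypothetical knobs, not values from the paper.
    Ties at the threshold may mark more than the top fraction of clips.
    """
    audio = recurrence_scores(audio_videos)
    visual = recurrence_scores(visual_videos)
    labels = []
    for a, v in zip(audio, visual):
        fused = weight * min_max(a) + (1.0 - weight) * min_max(v)
        k = max(1, int(round(top_fraction * len(fused))))
        threshold = np.sort(fused)[-k]
        labels.append((fused >= threshold).astype(int))
    return labels
```

In this sketch the resulting binary labels would play the role of the pseudo ground-truth used to train a highlight detection network; the network itself and the clustering step are outside its scope.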