Poster+Demo Session in Workshop: Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation
Towards Temporally Synchronized Visually Indicated Sounds Through Scale-Adapted Positional Embeddings
Xinhao Mei · Gael Le Lan · Haohe Liu · Zhaoheng Ni · Varun Nagaraja · Anurag Kumar · Yangyang Shi · Vikas Chandra
The task of video-to-audio (V2A) generation focuses on producing audio clips that are semantically aligned and temporally synchronized with silent video inputs. Despite recent progress, achieving precise audio-visual synchronization remains a significant challenge. Existing methods often rely on onset detection models, post-ranking, or contrastive audio-visual pretraining to improve synchronization, overlooking the critical role of positional embeddings. In this work, we argue that positional embeddings are key to achieving accurate synchronization. Given the strict temporal correspondence between video and audio signals, we present two key arguments: first, visual features and audio tokens should employ identical positional embeddings to enhance temporal correspondence; second, the scale difference between visual features and audio tokens hinders cross-modal alignment. To address these issues, we propose scale-adapted positional embeddings (SAPE), which are designed to account for discrepancies in sequence lengths and scales between visual features and continuous audio tokens. Experiments on the Greatest Hits dataset show that SAPE significantly improves audio-visual synchronization, achieving a state-of-the-art onset accuracy of 65.8%.
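The abstract does not give the exact formulation of SAPE, so the following is only a minimal sketch of one possible reading: rescale the visual position indices by the ratio of the audio and visual sequence lengths, then apply the same embedding function to both modalities, so that a visual feature and the audio token at the same timestamp receive the same embedding. The use of sinusoidal embeddings, the function names `sinusoidal_embedding` and `scale_adapted_positions`, and the toy sequence lengths are all assumptions made for illustration, not details from the paper.

```python
import numpy as np


def sinusoidal_embedding(positions, dim, base=10000.0):
    """Standard sinusoidal embedding evaluated at (possibly fractional) positions.

    `dim` is assumed to be even.
    """
    positions = np.asarray(positions, dtype=np.float64)[:, None]
    # One frequency per sin/cos pair.
    freqs = base ** (-np.arange(0, dim, 2, dtype=np.float64) / dim)
    angles = positions * freqs[None, :]
    emb = np.empty((positions.shape[0], dim))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb


def scale_adapted_positions(num_visual, num_audio):
    """Map visual indices onto the audio token timeline (hypothetical helper).

    Assumes both sequences span the same duration and start at time zero.
    """
    scale = num_audio / num_visual
    return np.arange(num_visual) * scale


# Toy sizes (assumed): 4 visual features and 16 audio tokens over the same clip.
num_visual, num_audio, dim = 4, 16, 8

audio_pe = sinusoidal_embedding(np.arange(num_audio), dim)
visual_pe = sinusoidal_embedding(
    scale_adapted_positions(num_visual, num_audio), dim
)

# Visual feature i now shares its embedding with audio token i * 4.
print(np.allclose(visual_pe[1], audio_pe[4]))
```

Without the rescaling step, visual feature 1 would share an embedding with audio token 1, even though the two sit at different points in time; this is the scale mismatch the abstract describes.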