Poster
Tell What You Hear From What You See - Video to Audio Generation Through Text
Xiulong Liu · Kun Su · Eli Shlizerman
East Exhibit Hall A-C #4807
The content of visual and audio scenes is multi-faceted, such that a video stream can be paired with various audio streams and vice versa. Therefore, in the video-to-audio generation task, it is imperative to introduce steering approaches for controlling the generated audio. While video-to-audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT, a multi-modal generative framework that takes a video and an optional text prompt as input, and generates audio and an optional textual description (caption) of the audio. Such a framework has two unique advantages: i) the video-to-audio generation process can be refined and controlled via text, which complements the context of the visual information, and ii) the model can suggest what audio to generate for the video by generating audio captions. VATT consists of two key modules: VATT Converter, an LLM that has been fine-tuned for instructions and includes a projection layer that maps video features to the LLM vector space, and VATT Audio, a bi-directional transformer that generates audio tokens from visual frames and from an optional text prompt using iterative parallel decoding. The audio tokens and the text prompt are used by a pretrained neural codec to convert them into a waveform. Our experiments show that when VATT is compared to existing video-to-audio generation methods on objective metrics, for instance on the VGGSound audio-visual dataset, it achieves competitive performance when the audio caption is not provided. When the audio caption is provided as a prompt, VATT achieves even more refined performance (with the lowest KLD score of 1.41). Furthermore, subjective studies asking participants to choose the most compatible generated audio for a given silent video show that audio generated by VATT Audio was, on average, preferred over audio generated by existing methods. VATT enables controllable video-to-audio generation through text, as well as suggesting text prompts for videos through audio captions, unlocking novel applications such as text-guided video-to-audio generation and video-to-audio captioning.
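The iterative parallel decoding named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a generic mask-and-commit scheme with a cosine unmasking schedule, and `toy_logits`, `MASK`, `cond`, and all parameter values are hypothetical stand-ins for the bi-directional transformer and its video/text conditioning.

```python
import numpy as np

MASK = -1  # hypothetical id marking a position that is not decoded yet


def toy_logits(tokens, cond, vocab_size, rng):
    """Stand-in for a bi-directional transformer (assumption, not VATT Audio).

    A real model would attend over `tokens` and the conditioning features;
    here we return random scores shifted by `cond` so the sketch runs.
    """
    return rng.normal(size=(len(tokens), vocab_size)) + cond


def iterative_parallel_decode(seq_len, vocab_size, cond, steps=4, seed=0):
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK, dtype=np.int64)  # start fully masked
    for step in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        # Predict every position in parallel, then turn logits into probabilities.
        logits = toy_logits(tokens, cond, vocab_size, rng)
        z = logits - logits.max(axis=-1, keepdims=True)
        probs = np.exp(z)
        probs /= probs.sum(axis=-1, keepdims=True)
        pred = probs.argmax(axis=-1)
        conf = probs.max(axis=-1)
        # Cosine schedule: fraction of the sequence left masked after this step.
        frac_masked = np.cos(np.pi / 2 * (step + 1) / steps)
        n_keep_masked = int(np.floor(frac_masked * seq_len))
        n_unmask = max(1, masked.size - n_keep_masked)
        # Commit only the most confident predictions among masked positions.
        order = masked[np.argsort(-conf[masked])]
        chosen = order[:n_unmask]
        tokens[chosen] = pred[chosen]
    return tokens


tokens = iterative_parallel_decode(seq_len=16, vocab_size=8, cond=0.0)
```

In this sketch each step fills in several tokens at once instead of one token at a time, and the last step commits every remaining position, so the sequence is complete after at most `steps` forward passes.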