Poster+Demo Session
in
Workshop: Audio Imagination: NeurIPS 2024 Workshop on AI-Driven Speech, Music, and Sound Generation
Efficient Generative Multimodal Integration (EGMI): Enabling Audio Generation from Text-Image Pairs through Alignment with Large Language Models
Taemin Kim · Wooyeol Baek · Heeseok Oh
Multimodal large language models (MLLMs) face challenges in leveraging their rich knowledge, since spanning different modalities is nontrivial and contextual ambiguity arises from the lack of paired data. For audio generation based on MLLMs, annotating audio-text paired datasets demands significant human effort due to the complexity of audio data, making such datasets far scarcer and harder to access than image-text paired datasets. To address these issues, we propose a novel technique called efficient generative multimodal integration (EGMI), which enables audio generation using only image-text data. Building on a pretrained LLM's strong text comprehension, EGMI successfully leverages image-text paired datasets for cross-modal alignment, enabling interaction between audio and image information. We also introduce an efficient mapping network, the EGMI mapper, which attends to image information when generating audio. EGMI thereby extends the limits of existing methods in terms of scalability and flexibility. We further demonstrate that EGMI maximizes the interaction between cross-modal knowledge, improving both alignment and sample quality.
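The abstract does not detail the EGMI mapper's internals, so the following is a minimal, purely illustrative PyTorch sketch of one plausible reading: a lightweight cross-attention mapper that projects frozen image-text embeddings into the conditioning space of an audio generation model. The class name EGMIMapperSketch, all dimensions, and the query-token design are placeholders of ours, not the authors' implementation.

import torch
import torch.nn as nn


class EGMIMapperSketch(nn.Module):
    """Illustrative mapper: image-text embeddings -> audio-generator conditioning tokens."""

    def __init__(self, embed_dim=768, cond_dim=1024, num_queries=8, num_heads=8):
        super().__init__()
        # Learnable query tokens that will become conditioning vectors.
        self.queries = nn.Parameter(torch.randn(num_queries, embed_dim) * 0.02)
        # Cross-attention: queries attend to the (frozen) image/text embeddings.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        # Projection into the audio generator's conditioning dimension (assumed value).
        self.proj = nn.Linear(embed_dim, cond_dim)

    def forward(self, image_text_embeds):
        # image_text_embeds: (batch, seq_len, embed_dim) from a frozen encoder or LLM.
        batch = image_text_embeds.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        attended, _ = self.cross_attn(q, image_text_embeds, image_text_embeds)
        cond_tokens = self.proj(self.norm(attended + q))
        return cond_tokens  # (batch, num_queries, cond_dim), fed to the audio model


if __name__ == "__main__":
    mapper = EGMIMapperSketch()
    dummy_embeds = torch.randn(2, 16, 768)  # placeholder encoder outputs
    print(mapper(dummy_embeds).shape)       # torch.Size([2, 8, 1024])

In this reading, only the small mapper is trained on image-text pairs while the encoder and audio generator stay frozen, which is consistent with the scalability claim in the abstract; the actual EGMI architecture may differ.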