

Poster+Demo Session in Workshop: Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation

Text-to-Audio Generation via Bridging Audio Language Model and Latent Diffusion

Zhenyu Wang · Chenxing Li · Yong Xu · Chunlei Zhang · John H. L. Hansen · Dong Yu

Sat 14 Dec 4:15 p.m. PST — 5:30 p.m. PST

Abstract:

Diffusion models have become the foundation of most text-to-audio generation methods. These approaches rely on a large text encoder to process the textual description, which serves as a semantic condition guiding the audio generation process. Meanwhile, autoregressive language-model-based methods for audio generation have also emerged. These autoregressive models offer flexibility by predicting discrete audio tokens, but they often fail to achieve high fidelity. In this work, we propose a system that integrates an autoregressive language model with a diffusion model, achieving flexible and refined audio generation. The autoregressive language model predicts discrete audio tokens conditioned on the text prompt, and these tokens are then fed into the diffusion model to refine the details of the generated audio. Compared with baseline systems, the proposed approach delivers better results on most objective and subjective metrics on the AudioCaps test set. Audio demos generated by our best system are available at https://dcldmdemo.github.io.
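The abstract describes a two-stage pipeline: an autoregressive language model first predicts discrete audio tokens from the text prompt, and a latent diffusion model then refines those tokens into the final audio representation. The sketch below is not the authors' code; it is a minimal illustration of that flow, assuming toy module definitions (`TextEncoder`, `AudioTokenLM`, `LatentDiffusionRefiner`), arbitrary vocabulary and dimension sizes, and a crude greedy/iterative sampling loop in place of the real training and inference procedures.

```python
# Minimal sketch of the two-stage text-to-audio pipeline described in the abstract.
# All module names, dimensions, and sampling loops below are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB = 1024      # assumed size of the discrete audio-token codebook
T_AUDIO = 256     # assumed number of audio tokens per clip
D = 512           # assumed model width

class TextEncoder(nn.Module):
    """Stand-in text encoder producing a semantic condition (assumption)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(30000, D)
    def forward(self, text_ids):                          # (B, T_text) -> (B, T_text, D)
        return self.embed(text_ids)

class AudioTokenLM(nn.Module):
    """Toy autoregressive LM over discrete audio tokens, conditioned on text."""
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D)
        self.core = nn.GRU(D, D, batch_first=True)
        self.head = nn.Linear(D, VOCAB)

    @torch.no_grad()
    def generate(self, text_emb, steps=T_AUDIO):
        B = text_emb.size(0)
        # Use the pooled text embedding as the initial hidden state (assumption).
        h = text_emb.mean(dim=1, keepdim=True).transpose(0, 1).contiguous()
        tokens = torch.zeros(B, 1, dtype=torch.long)      # start-of-sequence token id 0
        for _ in range(steps):
            x = self.tok(tokens[:, -1:])                  # embed the last predicted token
            out, h = self.core(x, h)
            next_tok = self.head(out[:, -1]).argmax(-1, keepdim=True)  # greedy decoding
            tokens = torch.cat([tokens, next_tok], dim=1)
        return tokens[:, 1:]                              # drop the start token

class LatentDiffusionRefiner(nn.Module):
    """Toy denoiser standing in for the latent diffusion refinement stage."""
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D)
        self.denoise = nn.Sequential(nn.Linear(2 * D, D), nn.SiLU(), nn.Linear(D, D))

    @torch.no_grad()
    def refine(self, audio_tokens, num_steps=10):
        cond = self.tok(audio_tokens)                     # audio tokens act as the condition
        z = torch.randn_like(cond)                        # start from Gaussian noise
        for _ in range(num_steps):                        # crude iterative refinement loop
            z = z - 0.1 * self.denoise(torch.cat([z, cond], dim=-1))
        return z                                          # refined audio latent

if __name__ == "__main__":
    text_ids = torch.randint(0, 30000, (1, 12))            # dummy text prompt ids
    text_emb = TextEncoder()(text_ids)
    audio_tokens = AudioTokenLM().generate(text_emb)        # stage 1: AR token prediction
    latent = LatentDiffusionRefiner().refine(audio_tokens)  # stage 2: diffusion refinement
    print(audio_tokens.shape, latent.shape)                 # (1, 256), (1, 256, 512)
```

In the actual system the refined latent would be decoded to a waveform by a neural vocoder or latent decoder; that step, like the components above, is omitted here and would depend on the authors' implementation.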
