Poster+Demo Session
in
Workshop: Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation

Coarse-to-Fine Text-to-Music Latent Diffusion

Luca Lanzendörfer · Tongyu Lu · Nathanael Perraudin · Dorien Herremans · Roger Wattenhofer

[ Project Page ]
Sat 14 Dec 4:15 p.m. PST — 5:30 p.m. PST

Abstract:

We introduce DiscoDiff, a text-to-music generative model that uses two latent diffusion models to hierarchically generate high-fidelity 44.1kHz music. Our approach significantly enhances audio quality through a coarse-to-fine generation strategy, leveraging the residual vector quantization of the Descript Audio Codec. This coarse-to-fine design rests on a key observation: the audio latent representation can be split into a primary and a secondary part, which control the musical content and its fine details, respectively. We validate the effectiveness of our approach and its text-audio alignment through various objective metrics. Furthermore, we provide access to high-quality synthetic captions for the MTG-Jamendo and FMA datasets, and we open-source DiscoDiff's codebase and model checkpoints.
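The primary/secondary split maps naturally onto how residual vector quantization (RVQ) works: each codebook quantizes what the previous levels left over, so early codebooks capture coarse content and later codebooks add detail. The sketch below is a minimal toy illustration of that property, not the authors' released code; the tensor shapes, the random codebooks, the zero entry added to each codebook, and the `n_primary` split point are all illustrative assumptions.

```python
import torch

def rvq_quantize(latents, codebooks):
    """Quantize latents of shape (B, D, T) with a list of (K, D) codebooks.

    Returns stacked code indices (B, L, T) and the per-level quantized
    residuals, whose running sum reconstructs the latents coarse-to-fine.
    """
    residual = latents
    codes, quantized_levels = [], []
    for codebook in codebooks:
        # Squared distances between every residual vector and every
        # codebook entry: (B, T, 1, D) - (K, D) -> (B, T, K).
        dists = (residual.transpose(1, 2).unsqueeze(2) - codebook).pow(2).sum(-1)
        idx = dists.argmin(dim=-1)         # (B, T): nearest entry per step
        q = codebook[idx].transpose(1, 2)  # (B, D, T): quantized residual
        codes.append(idx)
        quantized_levels.append(q)
        residual = residual - q            # hand the remainder to the next level
    return torch.stack(codes, dim=1), quantized_levels

# Toy setup: 9 RVQ levels over an 8-dim latent. Including a zero entry in
# each random codebook guarantees the residual never grows in this toy.
torch.manual_seed(0)
B, D, T, L, K = 1, 8, 16, 9, 64
codebooks = [torch.cat([torch.zeros(1, D), torch.randn(K - 1, D)])
             for _ in range(L)]
latents = torch.randn(B, D, T)

codes, levels = rvq_quantize(latents, codebooks)

# Treat the first levels as the "primary" part (coarse content) and the
# remaining levels as the "secondary" part (fine details).
n_primary = 2
coarse = sum(levels[:n_primary])
full = sum(levels)
print("error from primary codes only:", (latents - coarse).norm().item())
print("error from all codes:         ", (latents - full).norm().item())
```

In a coarse-to-fine setup like the one the abstract describes, one latent diffusion model would plausibly generate the primary part from text and the second would fill in the secondary part conditioned on it, though the paper itself is the authoritative source for that pipeline.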
