NeurIPS High Fidelity Text-Guided Music Editing via Single-Stage Flow Matching

Poster+Demo Session in Workshop: Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation

Gael Le Lan · Bowen Shi · Zhaoheng Ni · Sidd Srinivasan · Anurag Kumar · Brian Ellis · David Kant · Varun Nagaraja · Ernie Chang · Wei-Ning Hsu · Yangyang Shi · Vikas Chandra

Sat 14 Dec 10:30 a.m. PST — noon PST

Abstract:

We introduce MelodyFlow, an efficient, text-controllable, high-fidelity music generation and editing model. It operates on sequences of continuous latent representations from a low-frame-rate 48 kHz stereo variational autoencoder codec. Based on a diffusion transformer architecture trained with a flow-matching objective, the model can edit diverse, high-quality stereo samples of variable duration using simple text descriptions. We adapt the ReNoise latent inversion method to flow matching and compare it with naive denoising diffusion implicit model (DDIM) inversion on a variety of music editing prompts. Our results indicate that the regularized latent inversion outperforms DDIM inversion for zero-shot, test-time, text-guided editing on several objective metrics. Subjective evaluations show comparable performance for the two methods and a noticeable improvement over the previous state of the art in music editing. Code and model weights will be made publicly available. Samples are available at https://melodyflow.github.io.
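
For readers unfamiliar with the terms in the abstract, the following is a minimal toy sketch of a flow-matching regression target and of naive ODE inversion by Euler integration. It is not the MelodyFlow implementation and does not implement ReNoise. The straight-line path with noise at t = 0 and data at t = 1, the function names, and the `toy_velocity` field standing in for a trained model are all assumptions made for illustration.

```python
def fm_training_pair(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1),
    plus the constant velocity along that path (the regression target)."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_v = [b - a for a, b in zip(x0, x1)]
    return x_t, target_v


def fm_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity."""
    x_t, target_v = fm_training_pair(x0, x1, t)
    pred = velocity_fn(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target_v)) / len(pred)


def toy_velocity(x, t):
    """Hypothetical stand-in for a trained model: pulls each value toward 1.0."""
    return [1.0 - xi for xi in x]


def euler_integrate(velocity_fn, x, t_start, t_end, steps):
    """Fixed-step Euler integration of dx/dt = v(x, t).
    Passing t_start > t_end runs the ODE backward (naive inversion)."""
    dt = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x


def roundtrip_error(latent, steps):
    """Invert a latent to t=0, regenerate it at t=1, and report the drift."""
    inverted = euler_integrate(toy_velocity, latent, 1.0, 0.0, steps)
    rebuilt = euler_integrate(toy_velocity, inverted, 0.0, 1.0, steps)
    return max(abs(a - b) for a, b in zip(latent, rebuilt))


latent = [0.3, -1.2, 0.8]          # toy "audio latent"
coarse_err = roundtrip_error(latent, 10)
fine_err = roundtrip_error(latent, 200)
```

In this toy setting the round trip does not return exactly to the starting latent, and the drift shrinks as the number of Euler steps grows. Reconstruction drift of this kind is the sort of error that more careful inversion schemes aim to reduce.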
