

Poster+Demo Session
in
Workshop: Audio Imagination: NeurIPS 2024 Workshop on AI-Driven Speech, Music, and Sound Generation

Disentangling Multi-instrument Music Audio for Source-level Pitch and Timbre Manipulation

Yin-Jyun Luo · Kin Wai Cheuk · Woosung Choi · Wei-Hsiang Liao · Keisuke Toyama · Toshimitsu Uesaka · Koichi Saito · Chieh-Hsin Lai · Yuhta Takida · Simon Dixon · Yuki Mitsufuji

Sat 14 Dec 4:15 p.m. PST — 5:30 p.m. PST

Abstract:

Disentangling pitch and timbre from the audio of a musical instrument involves encoding the two attributes as separate latent representations, allowing the synthesis of instrument sounds with novel attribute combinations by manipulating one representation independently of the other. Existing work has mostly focused on single-instrument audio, leaving out cases where multiple instrument sources are present. To fill this gap, we aim to disentangle multi-instrument mixtures by extracting per-instrument representations that combine pitch and timbre latent variables. These latent variables form a set of modular building blocks used to condition a decoder that composes new mixtures. We first present a simple implementation to verify the framework on structured, isolated chords. We then scale up to a complex dataset of four-part chorales with a model that jointly learns the latents and a diffusion transformer. Our evaluation identifies the components critical to successful disentanglement and demonstrates mixture transformation through source-level attribute manipulation. Audio samples are available at https://yjlolo.github.io/dismix-audio-samples.
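To make the framework concrete, the sketch below (PyTorch) illustrates the core idea of per-source pitch/timbre latents used as modular building blocks to condition a mixture decoder. All module names, dimensions, and the fusion strategy are hypothetical illustrations for exposition, not the authors' implementation.

```python
# Minimal sketch of source-level pitch/timbre disentanglement.
# Hypothetical architecture; the paper's actual model differs
# (e.g., it pairs the latents with a diffusion transformer).
import torch
import torch.nn as nn


class SourceEncoder(nn.Module):
    """Encodes each instrument source into separate pitch and timbre latents."""

    def __init__(self, in_dim: int = 128, pitch_dim: int = 64, timbre_dim: int = 64):
        super().__init__()
        self.pitch_enc = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, pitch_dim)
        )
        self.timbre_enc = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, timbre_dim)
        )

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # x: (batch, n_sources, in_dim) -> two (batch, n_sources, dim) latents
        return self.pitch_enc(x), self.timbre_enc(x)


class MixtureDecoder(nn.Module):
    """Combines per-source (pitch, timbre) latents and decodes a mixture."""

    def __init__(self, pitch_dim: int = 64, timbre_dim: int = 64, out_dim: int = 128):
        super().__init__()
        self.dec = nn.Sequential(
            nn.Linear(pitch_dim + timbre_dim, 256), nn.ReLU(), nn.Linear(256, out_dim)
        )

    def forward(self, pitch: torch.Tensor, timbre: torch.Tensor) -> torch.Tensor:
        # Each source's concatenated latent acts as one modular building block;
        # summing the decoded per-source outputs composes the mixture.
        per_source = self.dec(torch.cat([pitch, timbre], dim=-1))
        return per_source.sum(dim=1)


# Source-level manipulation: swap the timbre latents of two sources while
# keeping their pitch latents fixed, then re-compose the mixture.
enc, dec = SourceEncoder(), MixtureDecoder()
sources = torch.randn(2, 4, 128)           # batch of 2 mixtures, 4 sources each
pitch, timbre = enc(sources)
timbre_swapped = timbre[:, [1, 0, 2, 3]]   # exchange timbre of sources 0 and 1
mixture = dec(pitch, timbre_swapped)       # (2, 128)
```

Under this assumption, independence of the two latents is what makes the swap meaningful: altering one source's timbre block leaves its pitch content, and every other source, untouched.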
