

Poster+Demo Session in Workshop: Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation

One-shot Text-aligned Virtual Instrument Generation Utilizing Diffusion Transformer

Qihui Yang · Jiahe Lei · Qiuqiang Kong

Sat 14 Dec 10:30 a.m. PST — noon PST

Abstract:

Despite the success of emerging text-to-music models based on deep generative approaches in producing music clips for general audiences, these models face significant limitations when applied to professional music production. This paper introduces TaVIG, a one-shot Text-aligned Virtual Instrument Generation model built on a Diffusion Transformer. The model integrates textual descriptions with the timbre information of audio clips to generate musical performances, utilizing additional musical structure features such as pitch, onset, duration, offset, and velocity. TaVIG comprises a CLAP-based text-aligned timbre extractor-encoder, a musical structure encoder that extracts MIDI information, and a disentangled representation learning module that ensures effective separation of timbre and structure. Audio synthesis is performed by a Diffusion Transformer conditioned via adaptive layer normalization (AdaLN). Additionally, we propose a mathematical framework for analyzing timbre-structure disentanglement in MIDI-to-audio tasks.
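
The abstract gives no implementation details, but the conditioning mechanism it names, AdaLN inside a Diffusion Transformer, can be illustrated with a minimal sketch. The PyTorch block below follows the generic AdaLN-Zero pattern used in DiT-style models; the module names, dimensions, and the idea of fusing timestep, timbre, and structure embeddings into a single conditioning vector are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AdaLNDiTBlock(nn.Module):
    """Generic Diffusion Transformer block with AdaLN-Zero conditioning.

    Hypothetical sketch: `cond` is assumed to be a fused conditioning vector
    (e.g. timestep + timbre + structure embeddings). Nothing here is taken
    from the TaVIG paper beyond the named technique (AdaLN in a DiT).
    """

    def __init__(self, dim: int, num_heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        # One projection of the conditioning vector yields shift/scale/gate
        # parameters for both the attention and MLP sub-blocks. Zero init
        # makes each block start as the identity (the "Zero" in AdaLN-Zero).
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.adaLN[-1].weight)
        nn.init.zeros_(self.adaLN[-1].bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); cond: (batch, dim)
        shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = (
            self.adaLN(cond).unsqueeze(1).chunk(6, dim=-1)
        )
        h = self.norm1(x) * (1 + scale_a) + shift_a
        x = x + gate_a * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale_m) + shift_m
        x = x + gate_m * self.mlp(h)
        return x


if __name__ == "__main__":
    # Illustrative use: a sequence of latent audio tokens modulated by a
    # single conditioning vector per example (all shapes are assumptions).
    block = AdaLNDiTBlock(dim=256, num_heads=8)
    tokens = torch.randn(2, 128, 256)
    cond = torch.randn(2, 256)
    out = block(tokens, cond)
    print(out.shape)  # torch.Size([2, 128, 256])
```

In this style of conditioning, the conditioning vector never enters the attention sequence itself; it only modulates normalization statistics and gates the residual branches, which is one common way to inject global information such as timbre into every layer of a diffusion backbone.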
