Poster in Workshop: Machine Learning for Audio
Composing and Validating Large-Scale Datasets for Training Open Foundation Models for Audio
Marianna Nezhurina · Ke Chen · Yusong Wu · Tianyu Zhang · Haohe Liu · Yuchen Hui · Taylor Berg-Kirkpatrick · Shlomo Dubnov · Jenia Jitsev
Obtaining strong, reproducible foundation language-audio models requires open datasets of sufficient scale and quality. To pre-train a contrastive language-audio model, we compose a large-scale sound-effects dataset with detailed text descriptions for each sample. Generating music, as a special type of audio, presents further challenges due to the limited availability of music-text pairs with sufficiently expressive captions. We show here how we combine various composed datasets to pre-train a large-scale audio-language contrastive model (CLAP). We then train, on the music samples we collected, a state-of-the-art text-to-music model, MusicLDM, which adapts AudioLDM (itself based on the Stable Diffusion architecture) to the music domain, using the pre-trained CLAP model and the HiFi-GAN vocoder as components. This modelling work validates the composed text-audio and text-music datasets as a strong basis for further studies on language-rooted foundation models for audio at larger scales.
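For readers unfamiliar with the pre-training objective mentioned above, the following is a minimal sketch of the symmetric contrastive (InfoNCE-style) loss typically used to train language-audio models such as CLAP. The encoder outputs are stood in by random tensors, and the function name and batch layout are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a CLAP-style symmetric contrastive objective.
# Real training would feed paired audio/text encoder outputs here.
import torch
import torch.nn.functional as F


def contrastive_loss(audio_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over audio-text similarity logits.

    audio_emb, text_emb: (batch, dim) embeddings of paired samples,
    where row i of each tensor comes from the same audio-caption pair.
    """
    # L2-normalize so the dot product becomes cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = audio_emb @ text_emb.t() / temperature

    # Matching audio-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the audio-to-text and text-to-audio directions.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)


if __name__ == "__main__":
    # Random embeddings standing in for audio/text encoder outputs.
    audio = torch.randn(8, 512)
    text = torch.randn(8, 512)
    print(contrastive_loss(audio, text).item())
```

In the pipeline described above, a model trained with such an objective on the composed text-audio data would then serve as the frozen text-conditioning component of MusicLDM, with HiFi-GAN converting the generated spectrograms back to waveforms.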