Oral in Workshop: Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation
BLAP: Bootstrapping Language-Audio Pre-training for Music Captioning
Luca Lanzendörfer · Constantin Pinkl · Nathanael Perraudin · Roger Wattenhofer
We introduce BLAP, a model capable of generating high-quality captions for music. BLAP is based on the BLIP-2 architecture, leveraging a fine-tuned CLAP audio encoder and a pre-trained Flan-T5 large language model. To achieve effective cross-modal alignment between music and language, BLAP utilizes a Querying Transformer, allowing us to obtain state-of-the-art performance using 6x less data than previous models. We provide qualitative examples demonstrating BLAP's ability to produce realistic captions for music, and perform a quantitative evaluation on three datasets. BLAP achieves relative improvements in FENSE over previous models of 3.5%, 6.5%, and 7.5% on the MusicCaps, Song Describer, and YouTube8M-MTC datasets, respectively. We open-source the code and model weights at https://github.com/ETH-DISCO/blap.
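To make the cross-modal alignment concrete, below is a minimal sketch of the Q-Former idea the abstract describes: a small, fixed set of learnable query vectors cross-attends to features from a frozen audio encoder and produces a short sequence of soft-prompt embeddings that can be prepended to a language model's input. This is an illustrative sketch only, not BLAP's actual implementation (see the linked repository for that); all module names, layer counts, and dimensions here are assumptions, and a random tensor stands in for CLAP features.

```python
import torch
import torch.nn as nn


class QueryingTransformerSketch(nn.Module):
    """Q-Former-style bridge (illustrative sketch, not BLAP's code):
    learnable queries cross-attend to frozen audio-encoder features
    and emit soft-prompt embeddings for a language model."""

    def __init__(self, num_queries=32, d_model=256, audio_dim=512,
                 lm_dim=1024, n_heads=8, n_layers=2):
        super().__init__()
        # Fixed set of learnable query vectors (hypothetical size).
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        # Project audio features (e.g. from CLAP) into the bridge's width.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_layers))
        self.norm1 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_layers))
        self.norm2 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_layers))
        # Map query outputs to the language model's embedding size
        # so they can be used as a soft prefix.
        self.to_lm = nn.Linear(d_model, lm_dim)

    def forward(self, audio_feats):
        # audio_feats: (batch, time, audio_dim) from a frozen audio encoder.
        kv = self.audio_proj(audio_feats)
        x = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        for attn, ffn, n1, n2 in zip(self.cross_attn, self.ffn,
                                     self.norm1, self.norm2):
            a, _ = attn(x, kv, kv)   # queries attend to audio tokens
            x = n1(x + a)
            x = n2(x + ffn(x))
        return self.to_lm(x)         # (batch, num_queries, lm_dim)


if __name__ == "__main__":
    bridge = QueryingTransformerSketch()
    fake_clap = torch.randn(2, 64, 512)  # stand-in for CLAP frame features
    prompts = bridge(fake_clap)
    print(prompts.shape)                 # torch.Size([2, 32, 1024])
```

Because only the small bridge (and, per the abstract, a fine-tuned audio encoder) needs training while the language model stays pre-trained, this design is what makes the reported data efficiency plausible: the number of trainable parameters bridging the modalities is small relative to the frozen components.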