

Oral in Workshop: Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation

BLAP: Bootstrapping Language-Audio Pre-training for Music Captioning

Luca Lanzendörfer · Constantin Pinkl · Nathanael Perraudin · Roger Wattenhofer

[ Project Page ]
Sat 14 Dec 1:45 p.m. PST — 2 p.m. PST

Abstract:

We introduce BLAP, a model capable of generating high-quality captions for music. BLAP is based on the BLIP-2 architecture, leveraging a fine-tuned CLAP audio encoder and a pre-trained Flan-T5 large language model. To achieve effective cross-modal alignment between music and language, BLAP utilizes a Querying Transformer, allowing us to obtain state-of-the-art performance using 6x less data than previous models. We provide qualitative examples demonstrating BLAP's ability to produce realistic captions for music, and perform a quantitative evaluation on three datasets. BLAP achieves relative improvements in FENSE over previous models of 3.5%, 6.5%, and 7.5% on the MusicCaps, Song Describer, and YouTube8m-MTC datasets, respectively. We open-source the code and model weights at https://github.com/ETH-DISCO/blap.
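For intuition, below is a minimal PyTorch sketch of the BLIP-2-style bridging the abstract describes: a set of learnable query tokens cross-attends to frozen audio features (standing in for CLAP embeddings) through a Querying Transformer, and the resulting queries are projected into the language model's embedding space as soft prompt tokens for Flan-T5. All dimensions, layer counts, and the QFormerBridge class itself are illustrative assumptions, not BLAP's actual configuration; see the linked repository for the real implementation.

    import torch
    import torch.nn as nn

    # Hypothetical dimensions; the real BLAP configuration may differ.
    AUDIO_DIM = 512      # width of the frozen CLAP audio features (assumed)
    QFORMER_DIM = 768    # Q-Former hidden width (assumed)
    LLM_DIM = 1024       # Flan-T5 input embedding width (assumed)
    NUM_QUERIES = 32     # learnable query tokens, as in BLIP-2

    class QFormerBridge(nn.Module):
        """BLIP-2-style bridge: learnable queries cross-attend to frozen
        audio features and are projected into the LLM embedding space."""
        def __init__(self):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(NUM_QUERIES, QFORMER_DIM))
            self.audio_proj = nn.Linear(AUDIO_DIM, QFORMER_DIM)
            layer = nn.TransformerDecoderLayer(
                d_model=QFORMER_DIM, nhead=8, batch_first=True)
            self.qformer = nn.TransformerDecoder(layer, num_layers=4)
            self.llm_proj = nn.Linear(QFORMER_DIM, LLM_DIM)

        def forward(self, audio_feats):
            # audio_feats: (batch, time, AUDIO_DIM) from a frozen audio encoder
            memory = self.audio_proj(audio_feats)
            q = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
            # Query tokens attend to the audio features via cross-attention.
            out = self.qformer(tgt=q, memory=memory)
            # Project into the LLM embedding space; these vectors would be
            # prepended to the Flan-T5 encoder input as soft prompt tokens.
            return self.llm_proj(out)

    bridge = QFormerBridge()
    audio = torch.randn(2, 128, AUDIO_DIM)  # dummy audio features
    prefix = bridge(audio)
    print(prefix.shape)  # torch.Size([2, 32, 1024])

Because the audio encoder and language model stay frozen, only the bridge's parameters are trained, which is the property that lets this style of architecture align modalities with comparatively little paired data.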
