

Poster in Workshop: Generative AI and Creativity: A dialogue between machine learning researchers and creative professionals

PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing

Phillip Long · Zachary Novack · Taylor Berg-Kirkpatrick · Julian McAuley

Sat 14 Dec 1 p.m. PST — 2 p.m. PST

Abstract:

The recent explosion of generative AI-Music systems has raised numerous concerns over data copyright, licensing music from musicians, and the conflict between open-source AI and large prestige companies. As these issues highlight the need for publicly available, copyright-free musical data, we present PDMX: a large-scale open-source dataset of over 250K public domain MusicXML scores collected from the score-sharing forum MuseScore, making it the largest available copyright-free symbolic music dataset to our knowledge. PDMX additionally includes a wealth of both tag and user interaction metadata, allowing us to efficiently analyze the dataset and filter for high-quality user-generated scores. We conduct multitrack music generation experiments evaluating how different representative subsets of PDMX lead to different behaviors in downstream models, and how user-rating statistics can be used as an effective measure of data quality.
