Poster in Workshop: Machine Learning for Audio
MusT3: Unified Multi-Task Model for Fine-Grained Music Understanding
Martin Kukla · Minz Won · Yun-Ning Hung · Duc Le
Recent advances in sequence-to-sequence modelling have enabled powerful new multi-task models in the text, vision, and speech domains. This work attempts to leverage these advances for music. We propose MusT3: Music-To-Tags Transformer, a novel model for fine-grained music understanding. First, we design a unified music-to-tags format, which enables us to cast any music understanding task as a sequence prediction problem. Second, we use a Transformer-based model to predict that tag sequence from a music representation. Third, we adopt a multi-task learning framework to train a single model across many tasks. We validate our approach on four tasks: beat tracking, chord recognition, key detection, and vocal melody extraction. Our model performs significantly better than current state-of-the-art models on two of these tasks, while remaining competitive on the other two. Finally, in a controlled experiment, we demonstrate that our model can reuse knowledge across tasks, leading to better performance on low-resource tasks with limited training data.
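To make the music-to-tags idea concrete, the sketch below shows one plausible way two of the tasks (beat tracking and chord recognition) could be serialized into a shared tag vocabulary that a single sequence-to-sequence model predicts. The time quantization, tag names, and helper functions here are illustrative assumptions, not the paper's actual tokenization.

```python
# Minimal sketch of a "music-to-tags" target format (illustrative only; the
# tag vocabulary and time resolution below are assumptions, not MusT3's own).

TIME_RESOLUTION = 0.01  # assumed quantization: seconds per time step


def quantize(t: float) -> str:
    """Map a continuous timestamp to a discrete time tag."""
    step = round(t / TIME_RESOLUTION)
    return f"time_{step}"


def beats_to_tags(beat_times: list[float]) -> list[str]:
    """Beat tracking as sequence prediction: a time tag plus an event tag per beat."""
    tags = ["task_beat"]
    for t in beat_times:
        tags += [quantize(t), "beat"]
    return tags + ["eos"]


def chords_to_tags(chord_segments: list[tuple[float, str]]) -> list[str]:
    """Chord recognition as sequence prediction: a time tag plus a chord-label
    tag at every chord change."""
    tags = ["task_chord"]
    for onset, label in chord_segments:
        tags += [quantize(onset), f"chord_{label}"]
    return tags + ["eos"]


if __name__ == "__main__":
    print(beats_to_tags([0.50, 1.02, 1.53]))
    # ['task_beat', 'time_50', 'beat', 'time_102', 'beat', 'time_153', 'beat', 'eos']
    print(chords_to_tags([(0.0, "C:maj"), (2.0, "G:maj")]))
    # ['task_chord', 'time_0', 'chord_C:maj', 'time_200', 'chord_G:maj', 'eos']
```

Because every task emits tokens from one shared vocabulary, a single Transformer decoder can be trained on all of them jointly, which is what makes the multi-task setup and the knowledge reuse on low-resource tasks possible.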