Poster in Workshop: 5th Workshop on Self-Supervised Learning: Theory and Practice
Explainable Audio-Visual Representation Learning via Prototypical Contrastive Masked Autoencoder
Yi Li · Plamen P Angelov
Abstract:
In this paper, we propose a self-supervised prototypical contrastive audio-visual masked autoencoder (PCAV-MAE) to learn a joint and coordinated audio-visual representation. Unlike conventional techniques, we compute prototypes as latent variables and reconstruct the masked tokens while encouraging them to move closer to their assigned prototypes via contrastive learning. This design not only allows us to learn a joint representation but also helps to capture the intrinsic semantic information of videos. We demonstrate the transferability of our representations, achieving state-of-the-art audio-visual results on downstream tasks. Our fully self-supervised pre-trained PCAV-MAE achieves a new SOTA accuracy of 69.9% on AudioSet and is comparable to the previous best supervised pre-trained model on VGGSound for audio-visual event classification.
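The core mechanism described above, pulling each masked token toward its assigned prototype with a contrastive objective, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, tensor shapes, temperature value, and hard nearest-prototype assignment are all hypothetical, and practical prototype-based methods typically add a balanced-assignment step (e.g., Sinkhorn normalization, as in SwAV) to prevent collapse onto a single prototype.

```python
# Minimal sketch of a prototypical contrastive loss (assumed form, not the
# authors' code): masked-token embeddings are pulled toward their assigned
# prototypes with an InfoNCE-style objective over the prototype set.
import torch
import torch.nn.functional as F

def prototypical_contrastive_loss(tokens, prototypes, temperature=0.07):
    """tokens: (N, D) embeddings of masked audio-visual tokens.
    prototypes: (K, D) learnable prototype vectors (latent variables).
    Returns a scalar loss that pushes each token closer to its assigned
    (nearest) prototype than to all other prototypes."""
    tokens = F.normalize(tokens, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)
    logits = tokens @ prototypes.t() / temperature  # (N, K) cosine similarities
    assignments = logits.argmax(dim=-1)             # hard assignment to nearest prototype
    # Cross-entropy against the (non-differentiable) assignment contrasts the
    # assigned prototype against all others for each token.
    return F.cross_entropy(logits, assignments)

# Usage sketch: combine with the masked-autoencoder reconstruction loss.
# tokens = decoder_hidden[mask_indices]           # hypothetical masked-token features
# loss = recon_loss + lam * prototypical_contrastive_loss(tokens, prototypes)
```

In this sketch the total objective would weight the prototype term against the reconstruction term with a hypothetical coefficient `lam`; the abstract does not specify how the two losses are balanced.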