

Poster in Workshop: The Fourth Workshop on Efficient Natural Language and Speech Processing (ENLSP-IV): Highlighting New Architectures for Future Foundation Models

Less is Enough: Adapting Pre-trained Vision Transformers for Audio-Visual Speaker Verification

Gnana Praveen Rajasekhar · Md Jahangir Alam

Keywords: [ Efficient Training ]


Abstract:

Speaker verification has achieved significant performance improvements through sophisticated deep learning architectures specialized for speech signals, together with robust loss functions. Recently, the fusion of faces and voices has received considerable attention, as the two modalities provide complementary information and have the potential to outperform systems that rely on speech signals alone. Inspired by the massive success of Vision Transformers (ViTs) in computer vision, ViTs have also been explored for multimodal learning. In this work, we investigate the potential of ViTs pre-trained on visual data for audio-visual speaker verification. To cope with the challenges of large-scale training, we introduce Latent Audio-Visual Vision Transformer (LAVViT) adapters, which exploit existing models pre-trained on visual data by training only the adapter parameters, without fine-tuning the original parameters of the pre-trained models. The LAVViT adapters are injected into every layer of the ViT architecture to effectively fuse the audio and visual modalities through a small set of latent tokens, thereby avoiding the quadratic computational cost of cross-attention across the modalities. The proposed approach is evaluated on the VoxCeleb1 dataset and shows promising performance using only a few trainable parameters.
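The abstract describes adapters, injected into each layer of a frozen pre-trained ViT, that fuse audio and visual tokens through a small set of learnable latent tokens so that attention cost stays linear in the number of modality tokens rather than quadratic across modalities. The sketch below illustrates one way such a latent-token adapter could be wired up in PyTorch; the class name, dimensions, and attention layout are assumptions for illustration, not the authors' implementation.

    # Minimal sketch of a latent-token audio-visual adapter for a frozen ViT.
    # Names (LAVViTAdapter, num_latents, ...) are illustrative assumptions.
    import torch
    import torch.nn as nn


    class LAVViTAdapter(nn.Module):
        """Fuses audio and visual tokens via a few learnable latent tokens,
        avoiding direct (quadratic) cross-attention between the modalities."""

        def __init__(self, dim: int = 768, num_latents: int = 8, num_heads: int = 8):
            super().__init__()
            self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
            # Latents gather information from both modalities ...
            self.collect = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # ... then broadcast the fused information back to each modality.
            self.distribute = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, audio_tokens: torch.Tensor, visual_tokens: torch.Tensor):
            b = audio_tokens.size(0)
            latents = self.latents.unsqueeze(0).expand(b, -1, -1)
            fused = torch.cat([audio_tokens, visual_tokens], dim=1)
            # Latents attend to all modality tokens (cheap: num_latents << num tokens).
            latents, _ = self.collect(latents, fused, fused)
            # Each modality attends back to the latents; residuals preserve the
            # frozen backbone's features.
            a_out, _ = self.distribute(self.norm(audio_tokens), latents, latents)
            v_out, _ = self.distribute(self.norm(visual_tokens), latents, latents)
            return audio_tokens + a_out, visual_tokens + v_out


    def freeze_backbone(vit: nn.Module):
        # Only adapter parameters are trained; the pre-trained ViT stays frozen.
        for p in vit.parameters():
            p.requires_grad = False

With N audio plus visual tokens and K latent tokens, the two attention steps cost O(N·K) instead of the O(N_audio·N_visual) of direct cross-attention, which is the efficiency argument the abstract makes.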
