Poster
Extending Multi-modal Contrastive Representations
Ziang Zhang · Zehan Wang · Luping Liu · Rongjie Huang · Xize Cheng · Zhenhui Ye · wang lin · Huadai Liu · Haifeng Huang · Yang Zhao · Tao Jin · Siqi Zheng · Zhou Zhao
[
Abstract
]
Thu 12 Dec 11 a.m. PST
— 2 p.m. PST
Abstract:
Multi-modal contrastive representation (MCR) of more than three modalities is critical in multi-modal learning. Although recent methods showcase impressive achievements, the high dependence on large-scale, high-quality paired data and the expensive training costs limit their further development. Inspired by recent C-MCR, this paper proposes $\textbf{Ex}$tending $\textbf{M}$ultimodal $\textbf{C}$ontrastive $\textbf{R}$epresentation (Ex-MCR), a training-efficient and paired-data-free method to build unified contrastive representation for many modalities. Since C-MCR is designed to learn a new latent space for the two non-overlapping modalities and projects them onto this space, a significant amount of information from their original spaces is lost in the projection process. To address this issue, Ex-MCR proposes to extend one modality's space into the other's, rather than mapping both modalities onto a completely new space. This method effectively preserves semantic alignment in the original space. Experimentally, we extend pre-trained audio-text and 3D-image representations to the existing vision-text space. Without using paired data, Ex-MCR achieves comparable performance to advanced methods on a series of audio-image-text and 3D-image-text tasks and achieves superior performance when used in parallel with data-driven methods. Moreover, semantic alignment also emerges between the extended modalities (e.g., audio and 3D).
Live content is unavailable. Log in and register to view live content