Poster
Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities
Adriel Saporta · Aahlad Manas Puli · Mark Goldstein · Rajesh Ranganath
While contrastive learning was originally designed to maximize the mutual information between two modalities, domains such as robotics, healthcare, and video need to support many types of data at once. We show that the pairwise application of CLIP fails to capture joint and conditional information between modalities, thereby limiting the quality of learned representations. To address this issue, we present Symile, a simple contrastive learning objective that accommodates any number of modalities and allows any model to produce representations for each modality. Symile targets total correlation, a measure that captures the statistical dependence between an arbitrary number of variables. To develop the objective for Symile, we derive a lower bound on total correlation that employs a generalization of inner products, and show that Symile representations for any set of modalities form sufficient statistics for the remaining modalities. Symile outperforms pairwise CLIP, even when modalities are missing from the data, on cross-modal classification and retrieval across several experiments, including an original multilingual dataset of 33M image, text, and audio samples and a clinical dataset of chest X-rays, electrocardiograms, and laboratory measurements.
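To make the objective concrete, the following is a minimal NumPy sketch of a Symile-style contrastive loss over three modalities. It assumes the generalization of inner products is the multilinear inner product (an elementwise product summed over features) and uses only matched in-batch triples as negatives; the `mip`/`symile_loss` names, the anchoring scheme, and the negative-sampling choice are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def mip(a, b, c):
    # Multilinear inner product: elementwise product of three vectors,
    # summed over the feature dimension (reduces to a dot product when
    # one argument is all ones).
    return np.sum(a * b * c, axis=-1)

def symile_loss(x, y, z, temperature=1.0):
    # x, y, z: (batch, dim) representations of three modalities for the
    # same batch of examples; row i of each array should describe the
    # same underlying sample.
    losses = []
    # Anchor each modality in turn; the positive for anchor i is the
    # matched pair (p_i, q_i), negatives are the other in-batch pairs.
    for anchor, p, q in ((x, y, z), (y, x, z), (z, x, y)):
        # scores[i, j] = MIP(anchor_i, p_j, q_j) via (p * q) @ anchor.T
        scores = (anchor @ (p * q).T) / temperature
        # Cross-entropy with positives on the diagonal.
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        losses.append(-np.mean(np.diag(log_probs)))
    return float(np.mean(losses))
```

With aligned one-hot representations (e.g. `symile_loss(np.eye(4), np.eye(4), np.eye(4))`), the diagonal scores dominate and the loss is lower than when one modality is shuffled relative to the others, which is the behavior a total-correlation-targeting objective should exhibit.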