Poster in Workshop: UniReps: Unifying Representations in Neural Models
Mixture of Multimodal Interaction Experts
Haofei Yu · Paul Pu Liang · Russ Salakhutdinov · Louis-Philippe Morency
Multimodal machine learning, which studies the information and interactions across various input modalities, has made significant advancements in understanding the relationship between images and descriptive text. Yet this covers only a portion of the multimodal interactions found in the real world, such as sarcasm conveyed through conflicting utterances and gestures. Notably, current methods for capturing this shared information often do not extend well to these more nuanced interactions. In fact, current models show particular weaknesses on disagreement and synergistic interactions, sometimes performing as low as 50% on binary classification. In this paper, we address this problem via a new approach called mixture of multimodal interaction experts. This method automatically classifies datapoints from unlabeled multimodal datasets by their interaction types, then employs specialized models for each specific interaction. In our experiments, this approach improves performance on these challenging interactions by more than 10%, leading to an overall increase of 2% on tasks such as sarcasm prediction. As a result, interaction quantification not only provides new insights for dataset analysis but also yields simple approaches that obtain state-of-the-art performance.
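The sketch below is a minimal illustration of the idea described in the abstract, not the authors' released implementation: each datapoint is first assigned an interaction type (the taxonomy here, agreement/disagreement/synergy, is an assumption based on the interactions the abstract mentions), and is then handled by an expert model specialized for that type. All class and parameter names (`InteractionRouter`, `MixtureOfInteractionExperts`, the feature dimension) are hypothetical.

```python
import torch
import torch.nn as nn

# Assumed interaction taxonomy; the paper's quantification of interaction
# types on unlabeled data may differ.
INTERACTION_TYPES = ["agreement", "disagreement", "synergy"]


class InteractionRouter(nn.Module):
    """Assigns each (text, image) feature pair an interaction-type index.
    Stands in for the paper's automatic classification of datapoints by
    interaction type."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, len(INTERACTION_TYPES)),
        )

    def forward(self, z_text: torch.Tensor, z_image: torch.Tensor) -> torch.Tensor:
        logits = self.score(torch.cat([z_text, z_image], dim=-1))
        return logits.argmax(dim=-1)  # hard routing decision per datapoint


class MixtureOfInteractionExperts(nn.Module):
    """One specialized expert per interaction type; each datapoint is
    processed only by the expert matching its assigned interaction."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.router = InteractionRouter(dim)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                          nn.Linear(dim, num_classes))
            for _ in INTERACTION_TYPES
        )

    def forward(self, z_text: torch.Tensor, z_image: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([z_text, z_image], dim=-1)
        routes = self.router(z_text, z_image)
        out = torch.zeros(fused.size(0), self.experts[0][-1].out_features,
                          device=fused.device)
        for k, expert in enumerate(self.experts):
            mask = routes == k
            if mask.any():
                out[mask] = expert(fused[mask])
        return out


# Example usage: binary sarcasm prediction from pre-extracted 128-d features.
model = MixtureOfInteractionExperts(dim=128, num_classes=2)
logits = model(torch.randn(8, 128), torch.randn(8, 128))
```

In this toy version the router is a small learned classifier and routing is hard (each datapoint goes to exactly one expert); the actual method derives the interaction labels from unlabeled data via interaction quantification, as stated in the abstract.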