Poster in Workshop: Meta-Learning
Learning in Low Resource Modalities via Cross-Modal Generalization
Paul Pu Liang
The natural world is abundant with underlying concepts expressed across multiple heterogeneous sources such as the visual, acoustic, tactile, and linguistic modalities. Despite vast differences in these raw modalities, humans seamlessly perceive multimodal data, learn new concepts, and show extraordinary capabilities in generalizing across input modalities. Much of the existing progress in multimodal learning, however, focuses primarily on problems where the same set of modalities is present at train and test time, which makes learning in low-resource modalities particularly difficult. In this work, we propose a general algorithm for cross-modal generalization: a learning paradigm in which data from more abundant source modalities is used to learn useful representations for scarce target modalities. Our algorithm is based on meta-alignment, a novel method that aligns representation spaces across modalities while ensuring quick generalization to new concepts. Experimental results on generalizing from image to audio classification and from text to speech classification demonstrate strong performance in classifying data from an entirely new target modality with only a few (1-10) labeled samples. In addition, our method works particularly well when the target modality suffers from noisy or limited labels, a scenario especially prevalent in low-resource modalities.
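To make the idea of cross-modal generalization via alignment concrete, the sketch below shows one plausible shape of such a pipeline: per-modality encoders mapped into a shared space, a contrastive loss that pulls paired source/target embeddings together, and an episodic few-shot (prototype-based) classifier. This is only an illustrative sketch, not the authors' meta-alignment algorithm; the encoder architectures, the InfoNCE-style loss, the prototypical-network head, and all data shapes and hyperparameters are assumptions introduced here for illustration.

```python
# Illustrative sketch of cross-modal alignment for few-shot transfer.
# NOT the paper's exact meta-alignment method: encoders, losses, and the
# prototypical classifier below are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_encoder(in_dim, emb_dim=64):
    """Hypothetical per-modality encoder mapping raw features to a shared space."""
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))


def alignment_loss(z_src, z_tgt, temperature=0.1):
    """Symmetric InfoNCE-style loss pulling paired source/target embeddings together."""
    z_src = F.normalize(z_src, dim=-1)
    z_tgt = F.normalize(z_tgt, dim=-1)
    logits = z_src @ z_tgt.t() / temperature           # (B, B) similarity matrix
    labels = torch.arange(z_src.size(0))                # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def prototype_logits(support, support_y, query, n_way):
    """Classify query embeddings by distance to class prototypes (few-shot head)."""
    protos = torch.stack([support[support_y == c].mean(0) for c in range(n_way)])
    return -torch.cdist(query, protos)                  # nearer prototype => higher logit


# Toy training loop over episodes; all sizes and data are synthetic placeholders.
src_enc, tgt_enc = make_encoder(in_dim=512), make_encoder(in_dim=128)
opt = torch.optim.Adam(list(src_enc.parameters()) + list(tgt_enc.parameters()), lr=1e-3)

for episode in range(100):
    # Paired (or weakly paired) source/target data used only for alignment.
    x_src, x_tgt_pair = torch.randn(32, 512), torch.randn(32, 128)

    # A 5-way 1-shot episode drawn from the abundant source modality.
    n_way, k_shot, n_query = 5, 1, 3
    sup_x = torch.randn(n_way * k_shot, 512)
    sup_y = torch.arange(n_way).repeat_interleave(k_shot)
    qry_x = torch.randn(n_way * n_query, 512)
    qry_y = torch.arange(n_way).repeat_interleave(n_query)

    align = alignment_loss(src_enc(x_src), tgt_enc(x_tgt_pair))
    logits = prototype_logits(src_enc(sup_x), sup_y, src_enc(qry_x), n_way)
    loss = F.cross_entropy(logits, qry_y) + align

    opt.zero_grad()
    loss.backward()
    opt.step()

# At test time the target-modality encoder is reused: a handful (e.g. 1-10) of
# labeled target samples form the support set, and new target data is classified
# against their prototypes in the shared embedding space.
```

The design intent this sketch tries to convey is that alignment ties the two embedding spaces together during training, so that a classifier meta-trained on abundant source-modality episodes can be applied to the scarce target modality with only a few labeled examples.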