Poster in Workshop: UniReps: Unifying Representations in Neural Models
Vision and language representations in multimodal AI models and human social brain regions during natural movie viewing
Hannah Small · Haemy Lee Masson · Leyla Isik
Keywords: [ audiovisual ] [ multimodal transformers ] [ vision-language ] [ naturalistic stimuli ] [ fMRI ] [ neuroAI ]
Recent work in neuroAI suggests that representations in modern AI vision and language models are highly aligned with each other and with human visual cortex. In addition, training AI vision models on language-aligned tasks (e.g., CLIP-style models) improves their match to visual cortex, particularly in regions involved in social perception, suggesting these brain regions may be similarly "language aligned". This prior work has primarily investigated only static stimuli without language, but in our daily lives we experience the dynamic visual world and communicate about it using language simultaneously. To understand the integration of vision and language during natural viewing, we fit an encoding model to predict voxel-wise responses to an audiovisual movie using visual representations from both purely visual and language-aligned vision transformer models, as well as from paired language transformers. We first find that in naturalistic settings there is remarkably low correlation between representations in vision and language models, yet both predict social perceptual and language regions well. Next, we find that vision-language alignment does not improve a model's match to neural responses in visual, social perceptual, or language regions, despite social perceptual and language regions being well predicted by both vision and language embeddings. In fact, the language embeddings from the vision-language transformer perform worse than simple word-level embeddings. Our work demonstrates the importance of testing multimodal AI models in naturalistic settings and reveals differences between language alignment in modern AI models and the human brain.
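The abstract does not specify how the voxel-wise encoding model is fit; the sketch below assumes a standard cross-validated ridge-regression pipeline, with transformer embeddings (time-aligned to the fMRI acquisition) as features and voxel time courses as targets. The function name `fit_encoding_model`, the synthetic arrays, and all dimensions are illustrative stand-ins, not details from the paper.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold


def fit_encoding_model(features, voxels, alphas=np.logspace(-2, 5, 8), n_splits=5):
    """Voxel-wise encoding: predict each voxel's time course from model embeddings.

    features : (n_timepoints, n_features) stimulus embeddings (e.g., vision or
               language transformer activations, downsampled to the fMRI TR).
    voxels   : (n_timepoints, n_voxels) BOLD responses.
    Returns the mean held-out Pearson correlation per voxel across folds.
    """
    kf = KFold(n_splits=n_splits, shuffle=False)  # contiguous folds preserve temporal order
    scores = np.zeros((n_splits, voxels.shape[1]))
    for i, (train, test) in enumerate(kf.split(features)):
        model = RidgeCV(alphas=alphas)
        model.fit(features[train], voxels[train])
        pred = model.predict(features[test])
        # Pearson r between predicted and observed time courses, per voxel
        p = (pred - pred.mean(0)) / (pred.std(0) + 1e-8)
        o = (voxels[test] - voxels[test].mean(0)) / (voxels[test].std(0) + 1e-8)
        scores[i] = (p * o).mean(0)
    return scores.mean(0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_tr, n_feat, n_vox = 600, 768, 1000                # hypothetical movie length and sizes
    vision_emb = rng.standard_normal((n_tr, n_feat))    # stand-in for vision transformer features
    language_emb = rng.standard_normal((n_tr, n_feat))  # stand-in for language transformer features
    bold = rng.standard_normal((n_tr, n_vox))           # stand-in for voxel responses

    # Compare how well each embedding space predicts the same voxels
    r_vision = fit_encoding_model(vision_emb, bold)
    r_language = fit_encoding_model(language_emb, bold)
    print(f"median r (vision): {np.median(r_vision):.3f}, "
          f"median r (language): {np.median(r_language):.3f}")
```

In an analysis like the one described, the per-voxel correlations from the vision and language feature spaces would then be compared within visual, social perceptual, and language regions of interest.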