Poster in Workshop: Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning
Visual Language Alignment Tuning
Le Zhang · Qian Yang · Aishwarya Agrawal
Foundation models like CLIP are pivotal for advancing research in vision-language learning, as they simultaneously learn modality-specific representations and cross-modal alignment. However, training these models is resource-intensive, requiring hundreds of millions of image-text pairs and hundreds of GPUs, which creates a barrier to advancing research on multimodal alignment. In this paper, we introduce the Swift Alignment of Image and Language (SAIL) framework, which focuses on vision-language alignment by tuning a lightweight alignment layer added on top of frozen pretrained single-modality models. SAIL drastically reduces computational demands, requiring only a single GPU to align the pretrained feature spaces.
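A minimal sketch of the idea described in the abstract, not the authors' implementation: freeze pretrained unimodal encoders and train only small alignment heads that map their features into a shared space with a CLIP-style contrastive objective. The encoder feature dimensions, the linear projection heads, and the training hyperparameters below are illustrative assumptions; the actual SAIL alignment layer and loss may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignmentHead(nn.Module):
    """Lightweight trainable projection on top of a frozen encoder (assumed linear here)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Project frozen features into the shared space and L2-normalize.
        return F.normalize(self.proj(feats), dim=-1)


def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over matched image-text pairs within a batch."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Stand-ins for frozen unimodal features, e.g. from a pretrained vision encoder (1024-d)
# and a pretrained language encoder (768-d). Only the two heads below are trained,
# which is why a single GPU suffices: no encoder gradients are computed or stored.
image_head = AlignmentHead(in_dim=1024, out_dim=512)
text_head = AlignmentHead(in_dim=768, out_dim=512)
optimizer = torch.optim.AdamW(
    list(image_head.parameters()) + list(text_head.parameters()), lr=1e-4
)

image_feats = torch.randn(32, 1024)  # placeholder for precomputed frozen vision features
text_feats = torch.randn(32, 768)    # placeholder for precomputed frozen text features

loss = contrastive_loss(image_head(image_feats), text_head(text_feats))
loss.backward()
optimizer.step()
```

Because the unimodal encoders stay frozen, their features can be precomputed once and cached, so each training step only touches the small projection heads.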