Poster
Unified Lexical Representation for Interpretable Visual-Language Alignment
Yifan Li · Tong He · Yikai Wang · Dongyu Ru · Yanwei Fu · Zheng Zhang
Visual-Language Alignment (VLA) has gained a lot of attention since CLIP's groundbreaking work. Although CLIP performs well, its typical direct latent feature alignment lacks clarity in its representation and similarity scores. In contrast, a lexical representation, a vector whose elements represent the similarity between the sample and words from the vocabulary, is naturally sparse and interpretable, providing exact matches for individual words. However, lexical representations are difficult to learn due to the lack of ground-truth supervision and the false-discovery issue, and thus require complex designs to train effectively. In this paper, we introduce LexVLA, a more interpretable VLA framework that learns a unified lexical representation for both modalities without complex design. We use DINOv2 as our visual model for its locally inclined features and Llama 2, a generative language model, to leverage its in-context lexical prediction ability. To avoid false discovery, we propose an overuse penalty that prevents the lexical representation from falsely and frequently activating meaningless words. We demonstrate that these two pre-trained uni-modal models can be well aligned by fine-tuning on a modest multi-modal dataset, avoiding intricate training configurations. On cross-modal retrieval benchmarks, LexVLA, trained on the CC-12M multi-modal dataset, outperforms baselines fine-tuned on larger datasets (e.g., YFCC15M) and those trained from scratch on even bigger datasets (e.g., 1.1B data, including CC-12M). We conduct extensive experiments to analyze LexVLA.
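To make the two central ideas concrete, the sketch below illustrates (1) a lexical representation, i.e. a vocabulary-sized vector whose entries score how strongly a sample activates each word, and (2) an overuse-style penalty that discourages a few words from being activated for every sample in a batch. This is a minimal, hypothetical sketch, not the authors' released implementation: the linear projection head, the ReLU sparsification, the squared-mean penalty, and the penalty weight are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code) of lexical alignment
# with an overuse-style penalty on frequently activated vocabulary words.
import torch
import torch.nn.functional as F


def lexical_representation(features: torch.Tensor,
                           vocab_proj: torch.nn.Linear) -> torch.Tensor:
    """Map encoder features (B, D) to non-negative lexical vectors (B, |V|)."""
    logits = vocab_proj(features)      # per-word scores over the vocabulary
    return F.relu(logits)              # non-negativity keeps the vector sparse and interpretable


def overuse_penalty(lex: torch.Tensor) -> torch.Tensor:
    """Penalize words that fire for many samples in the batch.

    Averaging the (B, |V|) lexical matrix over the batch gives each word's
    activation frequency; a squared penalty on that average discourages a
    handful of (often meaningless) words from being switched on everywhere.
    """
    mean_activation = lex.mean(dim=0)  # (|V|,)
    return (mean_activation ** 2).sum()


if __name__ == "__main__":
    B, D, V = 8, 768, 32_000           # batch size, feature dim, vocab size (e.g. Llama 2 vocabulary)
    vocab_proj = torch.nn.Linear(D, V)

    img_feat = torch.randn(B, D)       # stand-in for frozen DINOv2 features
    txt_feat = torch.randn(B, D)       # stand-in for Llama 2 features

    img_lex = lexical_representation(img_feat, vocab_proj)
    txt_lex = lexical_representation(txt_feat, vocab_proj)

    # Similarity is a dot product in the shared lexical space, so each matched
    # word contributes an interpretable share of the score.
    sim = img_lex @ txt_lex.T
    labels = torch.arange(B)
    contrastive = (F.cross_entropy(sim, labels) + F.cross_entropy(sim.T, labels)) / 2

    # Illustrative penalty weight; the paper's actual objective may differ.
    loss = contrastive + 0.01 * (overuse_penalty(img_lex) + overuse_penalty(txt_lex))
    loss.backward()
```

Because the similarity is computed word-by-word in the lexical space, one can read off which vocabulary entries drove a given image-text match, which is the interpretability benefit the abstract highlights.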