

Poster
in
Workshop: 5th Workshop on Self-Supervised Learning: Theory and Practice

LLM2CLIP: Extending the Capability Boundaries of CLIP through Large Language Models

Aoqi Wu · Weiquan Huang · Yifan Yang · Xufang Luo · Yuqing Yang · Chunyu Wang · Liang Hu · Xiyang Dai · Dongdong Chen · Chong Luo · Lili Qiu


Abstract:

CLIP is one of the most important foundational multimodal models today. It aligns the image and text modalities in a shared feature space using a simple contrastive learning loss over massive image-text pairs. As a retriever, CLIP supports tasks such as zero-shot classification, detection, segmentation, and image-text retrieval. As a cross-modal feature extractor, it also enables tasks such as image understanding, video understanding, and text-to-image generation. However, as expectations for model generalization grow and tasks become more complex, the original CLIP learning paradigm shows limitations in feature extraction. In particular, the bag-of-words behavior of CLIP's text encoder is often criticized for its inability to capture fine-grained or complex features. We believe these limitations stem from two core issues: the simplicity of the training captions and the fact that CLIP's self-supervised objective does not require logical reasoning to succeed. Additionally, the small-scale text encoder used in CLIP cannot fully understand high-quality caption data. In this work, we propose a post-finetuning approach for CLIP that introduces large language models (LLMs) into the training process to leverage more sophisticated textual data. Our experiments demonstrate that, even with minimal additional training, LLMs can be aligned with the pretrained CLIP visual encoder, providing higher-dimensional and more effective supervision that overcomes CLIP's original limitations.
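To make the described setup concrete, below is a minimal sketch (not the authors' implementation) of the general idea: caption embeddings from a frozen LLM are mapped through a small trainable adapter into the space of a pretrained CLIP visual encoder, and the two modalities are aligned with a standard symmetric contrastive (InfoNCE) loss. All module names, dimensions, and hyperparameters here are illustrative assumptions.

```python
# Hedged sketch of LLM-supervised contrastive post-finetuning for CLIP.
# Names, dimensions, and the adapter design are assumptions, not the paper's code.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TextAdapter(nn.Module):
    """Trainable projection from frozen-LLM caption embeddings to CLIP's joint space."""

    def __init__(self, llm_dim: int, clip_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, clip_dim),
            nn.GELU(),
            nn.Linear(clip_dim, clip_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


def clip_contrastive_loss(image_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-text contrastive loss over a batch of paired features."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


# Usage sketch: in practice `llm_caption_embeds` would come from a frozen LLM and
# `clip_image_embeds` from the pretrained CLIP visual encoder; only the adapter
# (and optionally the visual encoder) would be updated during post-finetuning.
adapter = TextAdapter(llm_dim=4096, clip_dim=768)
llm_caption_embeds = torch.randn(8, 4096)   # placeholder for frozen-LLM caption features
clip_image_embeds = torch.randn(8, 768)     # placeholder for CLIP visual features
loss = clip_contrastive_loss(clip_image_embeds, adapter(llm_caption_embeds))
```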
