Poster in Workshop: Fine-Tuning in Modern Machine Learning: Principles and Scalability
An empirical study of CLIP fine-tuning with similarity clusters
Shixuan Liu · Yiwei Lyu · Honglak Lee · Todd Hollon
With the success of CLIP training for learning transferable visual representations, fine-tuning CLIP models on smaller datasets for better downstream performance is an important area of research. One way to improve CLIP models is to increase the difficulty of negative examples. While the majority of research has focused on manually crafting hard negative captions, this strategy requires additional engineering labor, fails to generalize to different domains, and causes additional overfitting. Here, we conduct an empirical study to systematically explore an alternative approach: constructing minibatches that include similarity clusters to increase the difficulty of negative examples. We propose a generalized framework, called SimCLIP, for similarity-based CLIP fine-tuning. By enforcing that each minibatch contains clusters of similar examples, SimCLIP fine-tuning can improve model performance compared to standard CLIP fine-tuning. We extensively study which SimCLIP configurations and factors contribute most to downstream performance. We also analyze SimCLIP's performance on rare special sets, compositionality of attributes, and generalization across dataset sizes. Our observations provide a better understanding of similarity-based minibatch construction methods as well as new insights into CLIP fine-tuning.
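The idea of building minibatches from similarity clusters can be illustrated with a minimal sketch. It is not the SimCLIP procedure itself: the function names, the greedy nearest-neighbour clustering, and the use of cosine similarity over precomputed embeddings are all assumptions made for illustration. Each batch is a union of small clusters, so every example shares its batch with near neighbours that act as harder in-batch negatives.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_cluster_batches(embeddings, batch_size, cluster_size, seed=0):
    """Return batches of example indices; each batch is a union of clusters.

    Assumption (not from the abstract): clusters are formed greedily by
    picking a random unused anchor and its most similar unused neighbours.
    Examples that do not fill a complete cluster or batch are dropped.
    """
    if batch_size % cluster_size != 0:
        raise ValueError("batch_size must be a multiple of cluster_size")
    rng = random.Random(seed)
    unused = list(range(len(embeddings)))
    rng.shuffle(unused)

    clusters = []
    while len(unused) >= cluster_size:
        anchor = unused.pop()
        # Rank the remaining examples by similarity to the anchor.
        unused.sort(
            key=lambda i: cosine(embeddings[anchor], embeddings[i]),
            reverse=True,
        )
        neighbours = unused[:cluster_size - 1]
        unused = unused[cluster_size - 1:]
        clusters.append([anchor] + neighbours)

    # Shuffle whole clusters so batches mix unrelated clusters.
    rng.shuffle(clusters)
    per_batch = batch_size // cluster_size
    batches = []
    for start in range(0, len(clusters) - per_batch + 1, per_batch):
        group = clusters[start:start + per_batch]
        batches.append([i for cluster in group for i in cluster])
    return batches
```

In a fine-tuning loop, index lists like these could be passed to a data loader as a custom batch sampler. The cluster size and the number of clusters per batch are the kind of configuration choices the study varies, although the values used there are not stated in the abstract.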