Poster
Generalization Beyond Data Imbalance: A Controlled Study on CLIP for Transferable Insights
Xin Wen · Bingchen Zhao · Yilun Chen · Jiangmiao Pang · Xiaojuan Qi
The CLIP model has proven highly effective at learning generalizable representations from web-scale vision-language pre-training datasets. This paper is motivated by a surprising finding: CLIP, trained on naturally imbalanced web data, exhibits remarkable robustness to data imbalance compared to supervised learning. We aim to investigate the reasons behind this robustness. Our controlled experiments reveal that CLIP's pretext task forms a dynamic classification problem in which only a subset of classes is present at each training step. This isolates the model from biases toward dominant classes and implicitly balances the learning signal. Furthermore, we demonstrate that the robustness and discriminability of CLIP improve with denser language supervision, larger data scale, and broader open-world concepts, all of which are inaccessible to supervised learning. Our study not only uncovers the mechanisms behind CLIP's generalizability beyond data imbalance but also provides transferable insights for the research community. The findings are validated in both supervised and self-supervised learning, enabling models trained on imbalanced data to achieve CLIP-level performance on diverse recognition tasks. The code and data will be made publicly available.
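To make the "dynamic classification" view concrete, the sketch below shows a minimal NumPy version of a CLIP-style symmetric contrastive (InfoNCE) loss. Note the key property the abstract highlights: each image is classified only against the captions present in the current batch, so the effective label space is re-drawn every step rather than fixed to a global, imbalanced class vocabulary. This is an illustrative simplification of standard contrastive training, not the authors' exact implementation; the function name and the fixed temperature of 0.07 are assumptions for the example.

```python
import numpy as np

def clip_batch_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss over one batch.

    The 'classes' at each step are the batch's own captions: image i is
    classified against texts 0..B-1, with text i as the correct label.
    Because this label set changes with every batch, no globally dominant
    class appears in every step's classification problem.
    """
    # L2-normalize embeddings so the dot product is cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature   # (B, B); matched pairs on the diagonal
    labels = np.arange(len(img))         # label for row i is column i

    def cross_entropy(scores):
        # Row-wise softmax cross-entropy with the diagonal as targets.
        scores = scores - scores.max(axis=1, keepdims=True)  # stability
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Symmetric loss: image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly matched image and text embeddings the loss approaches zero, while mismatched pairs yield a loss near log(batch size), which is the behavior expected of a per-batch classification objective.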