Poster in Workshop: Compositional Learning: Perspectives, Methods, and Paths Forward
OC-CLIP: Object-centric Binding in Contrastive Language-Image Pretraining
Rim Assouel · Pietro Astolfi · Florian Bordes · Michal Drozdzal · Adriana Romero
Keywords: [ object-centric ] [ CLIP ] [ compositional image-text matching ]
Recent advances in vision-language models (VLMs) have been driven by contrastive models such as CLIP, which learn to associate images with their corresponding text descriptions. However, these models struggle to understand complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from the traditional data-centric strategy of improving performance with hard negative examples. Instead, our work focuses on integrating sufficient inductive biases into pre-trained CLIP-like models to improve their compositional understanding without additional data annotations. We introduce a binding module that connects a scene graph parsed from the text with an induced graph-like representation of the image, enabling a structured similarity assessment between the two modalities. We also leverage relationships as text-conditioned visual constraints, capturing the interactions between objects and their contextual relationships more effectively. The resulting model, OC-CLIP, not only enhances CLIP's performance in multi-object compositional understanding but also paves the way for more accurate and efficient image-text matching in complex scenes.
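To make the binding idea concrete, the sketch below illustrates one plausible form of a structured image-text similarity: text object nodes are soft-bound to object-centric image slots via attention, and relation phrases act as text-conditioned constraints on the bound visual pairs. This is a minimal illustration under our own assumptions, not the authors' implementation; all module names, dimensions, and the exact scoring rule (`BindingModule`, `rel_scorer`, the additive combination of object and relation scores) are hypothetical, and it assumes the text scene graph has already been parsed and the image slots already induced.

```python
# Minimal sketch of an object-centric, structured image-text similarity
# in the spirit of OC-CLIP's binding module. Names, dimensions, and the
# scoring rule are illustrative assumptions, not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BindingModule(nn.Module):
    """Soft-binds text object nodes to image slots and scores a scene graph."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # text object nodes -> queries
        self.k_proj = nn.Linear(dim, dim)  # image slots -> keys
        # Hypothetical relation scorer over (bound subject, bound object,
        # relation phrase): the relation text conditions a visual constraint.
        self.rel_scorer = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, obj_nodes, rel_nodes, rel_edges, slots):
        """
        obj_nodes: (N_obj, dim)  embeddings of object phrases from the text graph
        rel_nodes: (N_rel, dim)  embeddings of relation phrases
        rel_edges: (N_rel, 2)    (subject_idx, object_idx) into obj_nodes
        slots:     (N_slot, dim) induced object-centric image representations
        """
        # 1) Binding: each text object node attends over the image slots.
        attn = torch.softmax(
            self.q_proj(obj_nodes) @ self.k_proj(slots).T / slots.shape[-1] ** 0.5,
            dim=-1,
        )                                   # (N_obj, N_slot)
        bound = attn @ slots                # (N_obj, dim): slot content per object

        # 2) Object score: agreement between each node and its bound slot.
        obj_score = F.cosine_similarity(obj_nodes, bound, dim=-1).mean()

        # 3) Relation score: constrain each bound (subject, object) pair,
        #    conditioned on the relation phrase embedding.
        subj, obj = bound[rel_edges[:, 0]], bound[rel_edges[:, 1]]
        rel_score = self.rel_scorer(torch.cat([subj, obj, rel_nodes], dim=-1)).mean()

        # Structured similarity = object match + relation constraint.
        return obj_score + rel_score

# Toy usage: "a dog chasing a red ball" against 7 induced image slots.
binder = BindingModule(dim=256)
obj_nodes = torch.randn(2, 256)            # "a dog", "a red ball"
rel_nodes = torch.randn(1, 256)            # "chasing"
rel_edges = torch.tensor([[0, 1]])         # dog -> chasing -> ball
slots = torch.randn(7, 256)
score = binder(obj_nodes, rel_nodes, rel_edges, slots)
```

This kind of structured score could then replace the single-vector dot product inside a standard CLIP-style contrastive loss, so that image-text similarity reflects both object matching and relational consistency.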