Poster
in
Workshop: 4th Workshop on Self-Supervised Learning: Theory and Practice
Enhancing CLIP with a Third Modality
Efthymios Tsaprazlis · Georgios Smyrnis · Alex Dimakis · Petros Maragos
Abstract:
We study the problem of training a third tower for a new modality given a pre-trained CLIP model. This extra part of the architecture can be used to incorporate other modalities in the model pipeline. In our setting, we consider the use of a model such as BLIP-2, which provides us with a dialogue centered around the image. We evaluate our model in the setting of image and text retrieval, and compare it against the regular image and text based one.
Chat is not available.