
Poster in Workshop: UniReps: Unifying Representations in Neural Models

Emergence of Text Semantics in CLIP Image Encoders

Sreeram Vennam · Shashwat Singh · Anirudh Govil · Ponnurangam Kumaraguru

Keywords: [ clip ] [ image encoders ] [ representations ] [ interpretability ]


Abstract:

Certain self-supervised approaches to training image encoders, such as CLIP, align images with their text captions. However, these approaches have no a priori incentive to associate text rendered inside an image with that text's semantics. Humans process text visually; our work studies the semantics of text rendered in images. We show that the semantic information captured by image representations can decisively classify the sentiment of sentences, is robust to visual attributes such as font, and is not based on simple character-frequency associations.
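The probing setup described above can be sketched as follows: render a sentence onto an image, embed the image with a frozen encoder, and fit a linear probe on sentiment labels. This is a hypothetical illustration, not the authors' code; `clip_embed` below is a placeholder (downsampled pixels) standing in for a real CLIP image encoder, and the four-sentence dataset is invented for the example.

```python
# Hedged sketch of a text-in-image sentiment probe.
# clip_embed() is a PLACEHOLDER: it uses downsampled grayscale pixels
# instead of a real frozen CLIP image encoder (which the paper uses).
import numpy as np
from PIL import Image, ImageDraw

def render_text(sentence, size=(224, 224)):
    """Draw a sentence in black on a white canvas, as an image encoder would see it."""
    img = Image.new("RGB", size, "white")
    ImageDraw.Draw(img).text((10, 100), sentence, fill="black")
    return img

def clip_embed(img):
    """Placeholder embedding: normalized 16x16 grayscale pixels.
    Swap in a frozen CLIP image encoder for the real experiment."""
    arr = np.asarray(img.convert("L").resize((16, 16)), dtype=np.float32)
    v = arr.flatten()
    return (v - v.mean()) / (v.std() + 1e-8)

# Invented toy sentiment data rendered as images (1 = positive).
sentences = ["a wonderful film", "truly awful acting",
             "great and moving", "boring and bad"]
labels = np.array([1, 0, 1, 0], dtype=np.float32)
X = np.stack([clip_embed(render_text(s)) for s in sentences])

# Linear probe: logistic regression trained by gradient descent.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # predicted probabilities
    grad = p - labels                     # gradient of log loss
    w -= 0.1 * X.T @ grad / len(labels)
    b -= 0.1 * grad.mean()

acc = ((1 / (1 + np.exp(-(X @ w + b))) > 0.5) == labels).mean()
print(f"train accuracy: {acc:.2f}")
```

To probe robustness to visual attributes, the same pipeline can be rerun with a different font or text position in `render_text` while keeping the probe fixed.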
