Poster in Workshop on Distribution Shifts: New Frontiers with Foundation Models
Probing the Equivariance of Image Embeddings
Cyrus Rashtchian · Charles Herrmann · Chun-Sung Ferng · Ayan Chakrabarti · Dilip Krishnan · Deqing Sun · Da-Cheng Juan · Andrew Tomkins
Keywords: [ Probing ] [ Distribution Shift ] [ Image Embeddings ] [ OOD Detection ] [ Robustness ]
Probes are small networks that predict properties of the underlying data from embeddings, and they provide a targeted way to illuminate the information that embeddings contain. While analysis with probes has become standard in NLP, it has been explored less in vision. Our goal is to understand the invariance vs. equivariance of popular image embeddings (e.g., MAE, SimCLR, or CLIP) under certain distribution shifts. In doing so, we investigate which visual aspects of the raw images these foundation models encode into their embeddings. Our probing is based on a systematic transformation prediction task that measures the visual content of embeddings along many axes, including neural style transfer, recoloring, icon/text overlays, noising, and blurring. Surprisingly, six embeddings (including SimCLR) encode enough non-semantic information to identify dozens of transformations. We also consider a generalization task, where we group similar transformations and hold out several for testing. Image-text models (CLIP, ALIGN) are better at recognizing new examples of style transfer than masking-based models (CAN, MAE). Our results show that embeddings from foundation models are equivariant and encode more non-semantic features than a supervised baseline. Hence, their OOD generalization abilities are not due to invariance to such distribution shifts.
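To make the probing setup concrete, below is a minimal sketch (not the authors' code) of training a small MLP probe on frozen embeddings to classify which transformation was applied to an image. The embedding dimension, number of transformations, and the `encoder` and `loader` objects are illustrative assumptions.

```python
# Hypothetical sketch of a transformation-prediction probe.
# A frozen encoder maps transformed images to embeddings, and a small
# MLP probe is trained to classify which transformation was applied.
import torch
import torch.nn as nn

EMB_DIM = 512          # assumed embedding dimension (e.g., a ViT-B/32 backbone)
NUM_TRANSFORMS = 30    # assumed number of candidate transformations


class Probe(nn.Module):
    """Two-layer MLP that predicts the applied transformation from an embedding."""

    def __init__(self, emb_dim: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def train_probe(encoder, loader, num_transforms=NUM_TRANSFORMS, epochs=5, lr=1e-3):
    """Train a probe on frozen embeddings.

    `loader` is assumed to yield (transformed_image_batch, transform_label_batch).
    The encoder stays frozen so the probe measures only what the embedding encodes.
    """
    probe = Probe(EMB_DIM, num_transforms)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    encoder.eval()
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():  # embeddings are not fine-tuned
                emb = encoder(images)
            logits = probe(emb)
            loss = loss_fn(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```

If the probe reaches high accuracy on held-out images, the embedding retains enough non-semantic information to distinguish the transformations (equivariance); chance-level accuracy would instead indicate invariance to those shifts.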