

Poster in Affinity Workshop: Latinx in AI

Evaluating zero-shot image classification based on visual language model with relation to background shift

Flávio Santos · Maynara Souza · Cleber Zanchettin


Abstract:

This paper explores the sensitivity of zero-shot image classifiers, such as those based on Large Language Models (LLMs) and Visual Language Models (VLMs), to changes in image backgrounds. Specifically, we evaluate the background robustness of VLM-only and LLM+VLM image classifiers, examining how background information influences their similarity scores and, in turn, their accuracy. For this analysis, we use the CLIP, ALIGN, ChatGPT+CLIP, and ChatGPT+ALIGN models for zero-shot image classification and compare their performance with baseline architectures such as Vision Transformer (ViT) and ResNet on the ImageNet-9 and RIVAL10 background challenges. The results indicate that all models exhibit some limitations under background shifts; the ChatGPT+CLIP and CLIP-only models suffer a significant decrease in accuracy, suggesting difficulties with foreground-only and background-shift classification. However, the ALIGN model consistently outperforms the other models in handling background variations. Our findings underscore the ongoing challenge of addressing background shifts in image classification and offer valuable insights for future improvements.
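To make the evaluation setup concrete, below is a minimal sketch of VLM-only zero-shot classification with CLIP, the mechanism whose similarity scores the paper probes under background shifts. The checkpoint, prompt template, class names, and image path are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of zero-shot image classification with CLIP via Hugging Face
# transformers. A background shift that perturbs the image-text similarity
# scores can flip the argmax and hence the predicted class.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical class names (e.g., ImageNet-9-style superclasses) and a
# generic prompt template; the paper's exact prompts may differ.
class_names = ["dog", "bird", "insect", "fish"]
prompts = [f"a photo of a {name}" for name in class_names]

image = Image.open("example.jpg")  # an image whose background may be shifted
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores used for
# classification; the prediction is the highest-scoring class prompt.
probs = outputs.logits_per_image.softmax(dim=-1)
prediction = class_names[probs.argmax(dim=-1).item()]
print(prediction, probs.tolist())
```

In an LLM+VLM variant such as ChatGPT+CLIP, the fixed prompt template above would instead be replaced by class descriptions generated by the LLM; the similarity-scoring step is otherwise the same.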
