

Poster
in
Workshop: Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning

Fine-Grained Visual Recognition in the Age of Multimodal LLMs

Hari Chandana Kuchibhotla · Abbavaram Gowtham Reddy · Sai Srinivas Kancheti · Vineeth N Balasubramanian


Abstract: Fine-grained Visual Recognition (FGVR) involves differentiating between visually similar categories and is challenging due to the subtle differences between categories and the need for large, expert-annotated datasets. We observe that recent Multimodal Large Language Models (MLLMs) demonstrate potential in FGVR, but querying such models for every test input is impractical due to high cost and latency. To address this, we propose a novel pipeline that fine-tunes a CLIP model for FGVR by leveraging MLLMs. Our approach requires only a small support set of unlabeled images, from which we construct a weakly supervised dataset using MLLMs as label generators. To mitigate the impact of the resulting noisy labels, we construct a candidate set for each image from the labels of its neighboring images, increasing the likelihood that the correct label is included in the candidate set. We then employ a partial label learning algorithm to fine-tune a CLIP model on these candidate sets. Our method sets a new benchmark for efficient fine-grained classification, achieving performance comparable to MLLMs at just $1/100^{th}$ of the inference cost and a fraction of the time.
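The two core steps of the abstract — building candidate label sets from neighboring images and training against those sets with a partial-label objective — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, plain NumPy embeddings stand in for CLIP features, and the loss shown is one common partial-label objective (the negative log of the probability mass on the candidate set), which may differ from the algorithm the paper actually uses.

```python
import numpy as np

def build_candidate_sets(embeddings, mllm_labels, k=3):
    """For each image, collect the MLLM-generated labels of its k nearest
    neighbours (cosine similarity, including the image itself) into a
    candidate label set. Even if an image's own MLLM label is wrong, the
    correct label is more likely to appear somewhere in the set."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    candidates = []
    for i in range(len(embeddings)):
        nbrs = np.argsort(-sims[i])[:k]
        candidates.append(sorted({mllm_labels[j] for j in nbrs}))
    return candidates

def partial_label_loss(logits, candidate_set):
    """Simple partial-label objective: negative log of the total softmax
    probability the model assigns to any label in the candidate set."""
    z = logits - logits.max()          # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    return -np.log(probs[candidate_set].sum())
```

For example, with three images whose MLLM labels are `[0, 1, 2]`, two visually close images end up sharing the candidate set `{0, 1}`, so a single mislabeled image does not force the model toward the wrong class; the loss only requires probability mass on *some* candidate.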
