

Poster

INQUIRE: A Natural World Text-to-Image Retrieval Benchmark

Edward Vendrow · Omiros Pantazis · Alexander Shepard · Gabriel Brostow · Kate Jones · Oisin Mac Aodha · Sara Beery · Grant Van Horn

Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

We introduce INQUIRE, a text-to-image retrieval benchmark for evaluating multimodal vision-language models. INQUIRE includes iNat24, a new dataset of five million natural world images, along with 200 expert-level retrieval queries. Each query is comprehensively labeled against all of iNat24, pairing it with every relevant image, for 24,000 total matches. The 200 queries cover 16 broad natural world categories, addressing challenges related to reasoning about species identification, context, behavior, and image appearance. Compared to existing image retrieval datasets, INQUIRE is larger, contains many image matches for each query, and requires both advanced image understanding and domain expertise. Our benchmark evaluates two core retrieval tasks: (1) INQUIRE-Fullrank, a full dataset ranking task, and (2) INQUIRE-Rerank, a reranking task using a fixed initial ranking of 100 images for each query. Detailed evaluation of a range of recent multimodal models demonstrates that INQUIRE poses a significant challenge, as the best models fail to achieve an mAP@50 above 50%. In addition, we show that reranking with more powerful multimodal models can enhance retrieval performance, yet there remains a significant margin for improvement.
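The mAP@50 figure above can be made concrete with a short sketch. This is a generic implementation of average precision at a rank cutoff, not code from the INQUIRE release; the function names and the `results` layout (each query mapped to a ranked list of image IDs plus its set of relevant IDs) are illustrative assumptions.

```python
# Illustrative sketch of mAP@k (k=50, as reported in the abstract).
# Assumed inputs: ranked_ids is a model's ranking of image IDs for one
# query; relevant_ids is the set of ground-truth matching images.

def average_precision_at_k(ranked_ids, relevant_ids, k=50):
    """AP@k: sum of precision@i at each relevant hit in the top k,
    normalized by min(|relevant|, k)."""
    hits = 0
    precision_sum = 0.0
    for i, image_id in enumerate(ranked_ids[:k], start=1):
        if image_id in relevant_ids:
            hits += 1
            precision_sum += hits / i
    denom = min(len(relevant_ids), k)
    return precision_sum / denom if denom else 0.0

def mean_average_precision_at_k(results, k=50):
    """mAP@k averaged over queries.
    `results` maps query -> (ranked_ids, relevant_ids)."""
    aps = [average_precision_at_k(ranked, relevant, k)
           for ranked, relevant in results.values()]
    return sum(aps) / len(aps) if aps else 0.0
```

For example, a query whose relevant images `{"a", "c"}` are ranked 1st and 3rd scores AP@50 = (1/1 + 2/3)/2 ≈ 0.83; an mAP@50 below 0.5 therefore means that, averaged over queries, relevant images are pushed well down the rankings.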
