Poster
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Baiqi Li · Zhiqiu Lin · WENXUAN PENG · Jean de Dieu Nyandwi · Daniel Jiang · Zixian Ma · Simran Khanuja · Ranjay Krishna · Graham Neubig · Deva Ramanan
East Exhibit Hall A-C #3601
Vision-language models (VLMs) have made significant progress in recent visual-question-answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However, are these models truly effective? In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. We also find it surprisingly easy to generate these VQA samples from natural image-text corpora using off-the-shelf models like CLIP and ChatGPT. We propose a semi-automated approach to collect a new NaturalBench benchmark for reliably evaluating VLMs with over 10,000 human-verified VQA samples. Crucially, we adopt a vision-centric design by pairing each question with two images that yield different answers, preventing ``blind'' solutions from answering without using the images. This makes NaturalBench more challenging than previous benchmarks that can largely be solved with language priors like commonsense knowledge. Popular VLMs like InstructBLIP, LLaVA-NeXT, ShareGPT4V, and XGen-MM (BLIP-3) only achieve 1%-15% above random chance performance. Even the best (closed-source) GPT-4o lags significantly behind human performance (which is above 90%). We analyze why NaturalBench is hard from two angles: (1) Compositionality: Solving NaturalBench requires diverse visio-linguistic skills, including understanding attribute bindings, object relationships, and advanced reasoning like logic and counting. To this end, unlike prior work that uses a single tag per sample, we tag each NaturalBench sample with 1 to 8 skill tags for fine-grained evaluation. (2) Biases: NaturalBench exposes severe biases in VLMs, as models often choose the same answer regardless of the image. We show that debiasing can be crucial for VLM performance. Lastly, we apply our benchmark curation method to diverse data sources, including long captions (over 100 words) and non-English languages like Chinese and Hindi, highlighting its potential for dynamic evaluations of VLMs.