Poster in Workshop: Statistical Frontiers in LLMs and Foundation Models
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?
Han Bao · Yanbo Wang · Jiayi Ye · Yue Huang · Xiangqi Wang · Xiangliang Zhang
Keywords: [ framework ] [ Large Vision-Language Model ] [ benchmark ]
Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information, facilitating a wide range of complex applications and tasks. However, the evaluation of LVLMs presents significant challenges. Given the rapidly expanding range of LVLM applications, we ask: can LVLMs serve as a path to automatic benchmarking? To study this question, we introduce AutoBench-V, an automated framework for benchmarking LVLMs. AutoBench-V leverages text-to-image models to generate image data and utilizes LVLMs to orchestrate visual question answering (VQA) generation and evaluation. To enhance the accuracy and diversity of the generated data, AutoBench-V incorporates hierarchical aspect generation, self-validation for alignment, and error-controlled case generation. Through an extensive evaluation of seven popular LVLMs across five user evaluation inputs, the framework demonstrates both effectiveness and reliability. We observe the following: (1) model performance generally declines as task difficulty increases; (2) as task difficulty rises, the performance gap between models widens; and (3) while models exhibit strong performance in abstract-level understanding, they underperform in detailed reasoning tasks. Overall, AutoBench-V highlights the significant potential of LVLMs in advancing automatic benchmarking.
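The abstract outlines a pipeline: a user-specified evaluation aspect is expanded hierarchically, a text-to-image model produces images, an examiner LVLM writes VQA items at controlled difficulty levels, self-validation filters misaligned items, and candidate LVLMs are scored on the result. Below is a minimal, hedged sketch of such a loop; all class and function names (VQAItem, build_benchmark, expand_aspects, etc.) are illustrative assumptions, not the authors' released code or any real API.

```python
"""Illustrative sketch of an AutoBench-V-style benchmarking loop.
All names here are hypothetical stand-ins for the components the
abstract describes, not the paper's actual implementation."""

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class VQAItem:
    aspect: str        # fine-grained evaluation aspect this item targets
    question: str      # question written by the examiner LVLM
    reference: str     # reference answer used for grading
    difficulty: str    # e.g. "easy" / "medium" / "hard"


def build_benchmark(
    user_aspect: str,
    expand_aspects: Callable[[str], List[str]],        # hierarchical aspect generation
    text_to_image: Callable[[str], bytes],             # image generation from a prompt
    write_vqa: Callable[[str, bytes, str], VQAItem],   # examiner LVLM writes a VQA item
    validate: Callable[[VQAItem, bytes], bool],        # self-validation for alignment
    difficulties: Tuple[str, ...] = ("easy", "medium", "hard"),
) -> List[Tuple[bytes, VQAItem]]:
    """Expand the user's aspect, generate images, and keep only validated items."""
    benchmark = []
    for aspect in expand_aspects(user_aspect):
        for level in difficulties:                     # error-controlled case generation
            image = text_to_image(aspect)
            item = write_vqa(aspect, image, level)
            if validate(item, image):                  # discard misaligned image/question pairs
                benchmark.append((image, item))
    return benchmark


def evaluate(
    benchmark: List[Tuple[bytes, VQAItem]],
    answer: Callable[[bytes, str], str],               # candidate LVLM under test
    grade: Callable[[str, str], float],                # judge comparing answer vs. reference
) -> float:
    """Average graded agreement of one candidate LVLM over the benchmark."""
    scores = [grade(answer(image, item.question), item.reference)
              for image, item in benchmark]
    return sum(scores) / max(len(scores), 1)
```

Under this sketch, observation (1) in the abstract would correspond to the average score of evaluate() dropping as the difficulty label of the included items increases.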