

Poster
in
Workshop: Statistical Frontiers in LLMs and Foundation Models

AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

Han Bao · Yanbo Wang · Jiayi Ye · Yue Huang · Xiangqi Wang · Xiangliang Zhang

Keywords: [ framework ] [ Large Vision-Language Model ] [ benchmark ]

Sat 14 Dec, 12:00 p.m. to 12:45 p.m. PST

Abstract:

Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information, facilitating a wide range of complex applications and tasks. However, the evaluation of LVLMs presents significant challenges. Given the increasingly advanced applications of LVLMs, we ask a question: can LVLMs serve as a path to automatic benchmarking? To study this question, we introduce AutoBench-V, an automated framework for benchmarking LVLMs. AutoBench-V leverages text-to-image models to generate image data items and uses LVLMs to orchestrate visual question-answering (VQA) generation and evaluation. To enhance the accuracy and diversity of the generated data, AutoBench-V incorporates hierarchical aspect generation, self-validation for alignment, and error-controlled case generation. Through an extensive evaluation of seven popular LVLMs across five user evaluation inputs, the framework demonstrates effectiveness and reliability. We observe the following: (1) model performance generally declines as task difficulty increases; (2) as task difficulty rises, the performance gap between models widens; and (3) while models exhibit strong performance in abstract-level understanding, they underperform on detailed reasoning tasks. Overall, AutoBench-V highlights the significant potential of LVLMs in advancing automatic benchmarking.
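The abstract describes the pipeline only at a high level. Below is a minimal sketch of how such a self-benchmarking loop could be wired together, assuming stages inferred from the abstract (aspect generation, text-to-image rendering, self-validation, VQA generation, and grading); every function and parameter name here is a hypothetical placeholder, not the paper's actual API.

```python
# Hypothetical sketch of an AutoBench-V-style pipeline, inferred only from the
# abstract. All object methods and names below are illustrative placeholders.

def run_autobench_v(examiner_lvlm, t2i_model, models_under_test,
                    user_input, difficulty_levels):
    """Generate a benchmark from a user evaluation input and score candidate LVLMs."""
    results = {model: [] for model in models_under_test}

    # 1. Hierarchical aspect generation: expand the user's evaluation input
    #    into finer-grained aspects to test.
    aspects = examiner_lvlm.generate_aspects(user_input)

    for aspect in aspects:
        for difficulty in difficulty_levels:
            # 2. Describe a test image for this aspect, then render it with a
            #    text-to-image model.
            description = examiner_lvlm.describe_image(aspect, difficulty)
            image = t2i_model.generate(description)

            # 3. Self-validation for alignment: discard images that do not
            #    match their description.
            if not examiner_lvlm.is_aligned(image, description):
                continue

            # 4. Error-controlled case generation: build a VQA item with a
            #    reference answer at the target difficulty.
            question, reference = examiner_lvlm.make_vqa(description, difficulty)

            # 5. Evaluate each candidate model and let the examiner grade it.
            for model in models_under_test:
                answer = model.answer(image, question)
                score = examiner_lvlm.grade(answer, reference)
                results[model].append((aspect, difficulty, score))

    return results
```

In this sketch the same examiner LVLM both generates and grades the benchmark, which is the "benchmark themselves" idea in the title; the actual division of roles in AutoBench-V may differ.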
