Poster in Workshop: MATH-AI: The 4th Workshop on Mathematical Reasoning and AI

Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models

Hyunsik Chae · Seungwoo Yoon · Chloe Yewon Chun · Gyehun Go · Yongin Cho · Gyeongmin Lee · Ernest Ryu

Keywords: [ Benchmarking ] [ Vision Language Models ] [ Large Language Models ] [ Large Multimodal Models ] [ Vision-Language Reasoning ] [ Visual Geometry ]


Abstract:

Recent Vision Language Models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks. In this work, we introduce the Atomic Visual Skills Benchmark (AVSBench) to evaluate whether VLMs possess the capability to understand basic geometric features, which we refer to as atomic visual skills. Specifically, we systematically categorize these atomic visual skills and handcraft a set of 5,073 diverse questions designed to assess each individual skill. Using AVSBench, we evaluate current leading VLMs and find that they struggle with most of these atomic visual skills, even though the skills are obvious to humans.
