Poster in Workshop: MATH-AI: The 4th Workshop on Mathematical Reasoning and AI
Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models
Hyunsik Chae · Seungwoo Yoon · Chloe Yewon Chun · Gyehun Go · Yongin Cho · Gyeongmin Lee · Ernest Ryu
Keywords: [ Benchmarking ] [ Vision Language Models ] [ Large Language Models ] [ Large Multimodal Models ] [ Vision-Language Reasoning ] [ Visual Geometry ]
Recent Vision Language Models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks. In this work, we introduce the Atomic Visual Skills Benchmark (AVSBench) to evaluate whether VLMs possess the ability to understand basic geometric features, which we refer to as atomic visual skills. Specifically, we systematically categorize the atomic visual skills and handcraft a set of 5,073 diverse questions, each designed to assess an individual atomic visual skill. Using AVSBench, we evaluate current leading VLMs and find that they struggle with most of these atomic visual skills, even though they are obvious to humans.
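For concreteness, the sketch below shows one way per-skill accuracy on an AVSBench-style question set could be computed. It is only an illustration: the file format, the field names (`skill`, `image`, `question`, `answer`), and the `query_vlm` wrapper are assumptions, not the authors' released evaluation code.

```python
# Minimal sketch of per-skill scoring for AVSBench-style items.
# Assumes each item is a dict with keys: skill, image, question, answer.
import json

def query_vlm(image_path: str, question: str) -> str:
    """Placeholder for a call to a vision-language model
    (an API client or a local checkpoint); returns the model's answer."""
    raise NotImplementedError("plug in a concrete VLM here")

def evaluate(benchmark_file: str) -> dict:
    """Compute accuracy per atomic visual skill over the benchmark items."""
    correct, total = {}, {}
    with open(benchmark_file) as f:
        items = json.load(f)  # assumed: a JSON list of question records
    for item in items:
        skill = item["skill"]
        pred = query_vlm(item["image"], item["question"]).strip().lower()
        gold = item["answer"].strip().lower()
        total[skill] = total.get(skill, 0) + 1
        correct[skill] = correct.get(skill, 0) + int(pred == gold)
    return {s: correct[s] / total[s] for s in total}
```

Grouping accuracy by skill rather than reporting a single aggregate score is what lets this kind of benchmark isolate which basic geometric capabilities a model lacks.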