

Poster

Right this way: Can VLMs Guide Us to See More to Answer Questions?

Li Liu · Diji Yang · Sijia Zhong · Kalyana Suma Sree Tholeti · Lei Ding · Yi Zhang · Leilani Gilpin

Thu 12 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

In question-answering scenarios, humans can assess whether the available information is sufficient and seek additional information if necessary, rather than providing a forced answer. In contrast, Vision Language Models (VLMs) typically generate direct, one-shot responses without evaluating the sufficiency of the information. To investigate this gap, we identify a critical yet challenging task in the Visual Question Answering (VQA) setting: can VLMs indicate how to adjust an image when the visual information is insufficient to answer a question? This capability is especially valuable for assisting visually impaired individuals. To evaluate whether current VLMs possess this capability, we introduce a human-labeled dataset as a benchmark for this task. Additionally, we present an automated pipeline that generates synthetic training data by simulating "where to know" scenarios. Our empirical results demonstrate significant performance improvements when the synthetic data is used to fine-tune mainstream VLMs. Our study highlights the potential to bring VLMs closer to the human-like process of information assessment and acquisition.
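To make the "where to know" idea concrete, the sketch below shows one plausible way such synthetic training examples could be constructed: crop a VQA image so that the answer-relevant region falls outside the visible view, and use the required camera adjustment as the supervision target. This is a minimal, hypothetical illustration assuming Pillow and an answer bounding box; the function name, box format, and directional labels are assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of a "where to know" synthetic-data generator:
# crop a VQA image so the answer-relevant region is no longer visible,
# then record the adjustment needed to bring it back into view.
from PIL import Image


def make_insufficient_example(image_path, question, answer_box):
    """answer_box: (left, top, right, bottom) region needed to answer the question."""
    img = Image.open(image_path)
    w, h = img.size
    left, top, right, bottom = answer_box

    # Keep the half of the image opposite the answer region, so the
    # remaining view no longer contains enough information to answer.
    if (left + right) / 2 > w / 2:           # answer region lies on the right side
        cropped = img.crop((0, 0, w // 2, h))
        guidance = "move the camera to the right"
    else:                                     # answer region lies on the left side
        cropped = img.crop((w // 2, 0, w, h))
        guidance = "move the camera to the left"

    return {
        "image": cropped,     # view with insufficient information
        "question": question,
        "target": guidance,   # supervision: how to adjust the image
    }
```

Under this kind of scheme, a VLM fine-tuned on (cropped image, question, guidance) triples learns to say how the view should change rather than forcing an answer when the needed evidence is out of frame.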
