Poster in Workshop: Pluralistic Alignment Workshop
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference under Ambiguities
Zheyuan Zhang · Fengyuan Hu · Jayjun Lee · Freda Shi · Parisa Kordjamshidi · Joyce Chai · Ziqiao Ma
In situated communication, ambiguities naturally arise from the chosen reference system: the same spatial expression can have several valid interpretations depending on the selected frame of reference (FoR). While the spatial language understanding and reasoning of vision-language models (VLMs) are receiving increasing attention, the potential ambiguities, along with the commonsense and consistency in spatial reasoning, remain largely under-explored. We present the COnsistent Multilingual Frame Of Reference Test (COMFORT), an evaluation protocol designed to systematically assess the spatial reasoning abilities of VLMs. We demonstrate that VLMs show alignment with English conventions in spatial language understanding when resolving ambiguities. However, they (1) are still far from achieving robustness and consistency, (2) lack the flexibility to accommodate multiple coordinate systems, and (3) fail to adhere to cultural conventions in cross-lingual tests, as English tends to overshadow other languages. With a growing effort to align vision-language models with human cognition, we highlight the ambiguous nature of spatial language and call for increased attention to cross-cultural diversity in spatial reasoning.