Poster
Understanding the Limits of Vision Language Models Through the Lens of the Binding Problem
Declan Campbell · Sunayana Rane · Tyler Giallanza · Camillo Nicolò De Sabbata · Kia Ghods · Amogh Joshi · Alexander Ku · Jonathan D Cohen · Tom Griffiths · Taylor Webb
Recent work has documented striking heterogeneity in the performance of state-of-the-art vision language models (VLMs) such as GPT-4V and the DALL-E text-to-image models. These models are able to describe and generate an incredibly diverse array of complex, naturalistic images, yet they exhibit surprising failures on basic multi-object reasoning tasks -- such as counting, localization, and simple forms of visual analogy -- that humans perform with near-perfect accuracy. To better understand this puzzling pattern of successes and failures, we draw on theoretical accounts from cognitive science that postulate a fundamental trade-off between representational flexibility (i.e., the use of compositional representations to promote generalization) and channel capacity (i.e., the number of entities that can be represented at any one time). This trade-off gives rise to the classic binding problem, leading to severe constraints on the ability to rapidly process multi-object scenes, and necessitating the use of serial processing to prevent interference. Drawing on this perspective, we hypothesize that VLMs, under pressure for generalization, also learn structured representations, but lack the serial processing mechanisms to effectively use these to process and generate multi-object scenes, resulting in severe capacity constraints similar to those observed when humans are forced to rely on rapid, parallel visual processing. We test this hypothesis through a combination of classic cognitive tasks and novel benchmarks. Our results provide a unique perspective on VLMs, informed by work in cognitive science, suggesting that their capacity for generalization paradoxically gives rise to many of their most notable limitations, possibly for the same reasons humans exhibit a similar profile of competencies and limitations.