Poster
in
Workshop: Attributing Model Behavior at Scale (ATTRIB)
Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling
Haider Al-Tahan · Quentin Garrido · Randall Balestriero · Diane Bouchacourt · Caner Hazirbas · Mark Ibrahim
Significant research efforts have been made to scale and improve vision-language model (VLM) training approaches.Yet, with an ever-growing number of benchmarks,researchers are tasked with the heavy burden of implementing each protocol, bearing a non-trivial computational cost, and making sense of how all these benchmarks translate into meaningful axes of progress.To facilitate a systematic evaluation of VLM progress, we utilize \texttt{UniBench}: a unified implementation of 50+ VLM benchmarks spanning a comprehensive range of carefully categorized capabilities from object recognition to spatial awareness, counting, and much more. We measure progress by evaluating nearly 60 publicly available vision-language models, trained on scales of up to 12.8B samples. We find that while scaling training data or model size can boost many vision-language model capabilities, scaling offers little benefit for reasoning or relations. Surprisingly, we also discover today's best VLMs struggle on simple digit recognition and counting tasks, e.g. MNIST, that can be solved with much simpler networks. Where scale falls short, we find that more precise interventions, such as data quality or tailored-learning objectives offer more promise. For practitioners, we also offer guidance on selecting a suitable VLM for a given application.