

Poster

What matters when building vision-language models?

Hugo Laurençon · Leo Tronchon · Matthieu Cord · Victor Sanh

Wed 11 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of XXXXXX, an efficient foundational VLM of 8 billion parameters. XXXXXX achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.
