Poster
No Filter: Towards Cultural and Socioeconomic Diversity in Multimodal Systems
Angéline Pouget · Lucas Beyer · Emanuele Bugliarello · Xiao Wang · Andreas Steiner · Xiaohua Zhai · Ibrahim Alabdulmohsin
We study cultural and socioeconomic diversity in contrastive vision-language models (VLMs). Using a broad range of benchmark datasets and evaluation metrics, we bring to attention several important considerations. First, filtering training data to English-only image-text pairs disadvantages communities of lower socioeconomic status and negatively impacts cultural understanding. Notably, this performance gap is not sufficiently captured by, and is even at odds with, popular evaluation metrics such as those derived from the Western-oriented ImageNet and COCO datasets. Second, pretraining on global, unfiltered data before fine-tuning on English-only content improves cultural understanding without sacrificing performance on popular, Western-centric benchmarks. Third, we introduce geo-localization, evaluated on datasets such as Crossmodal-3600, Dollar Street, and GeoDE, as a novel metric for assessing cultural diversity in these models. Our work underscores the importance of diversifying training data to create more inclusive multimodal systems and lays the groundwork for developing VLMs that better represent global perspectives.
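As an informal illustration of how such a geo-localization metric can be computed with a contrastive VLM, the following sketch scores an image against country prompts in zero-shot fashion. This is not the authors' code: the model checkpoint (`openai/clip-vit-base-patch32`), the prompt template, the image path, and the toy country list are all placeholder assumptions; the paper itself evaluates on Crossmodal-3600, Dollar Street, and GeoDE.

```python
# Minimal sketch of zero-shot geo-localization with a contrastive VLM.
# Assumptions: a CLIP-style checkpoint and a toy set of country labels;
# the actual evaluation in the paper uses Crossmodal-3600, Dollar Street,
# and GeoDE with their ground-truth locations.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

countries = ["India", "Japan", "Kenya", "Mexico", "Switzerland"]  # toy label set
prompts = [f"a photo taken in {c}" for c in countries]

image = Image.open("example.jpg")  # placeholder: an image with a known country
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Higher image-text similarity means the model considers that country more
# likely; accuracy of this prediction against ground truth yields the metric.
probs = outputs.logits_per_image.softmax(dim=-1)
print("predicted country:", countries[probs.argmax().item()])
```

Aggregating this top-1 country accuracy over a geographically diverse image set gives one number per model, which is what makes geo-localization usable as a comparative diversity metric alongside standard benchmarks.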