Poster
VHELM: A Holistic Evaluation of Vision Language Models
Tony Lee · Haoqin Tu · Chi Heem Wong · Wenhao Zheng · Yiyang Zhou · Yifan Mai · Josselin Roberts · Michihiro Yasunaga · Huaxiu Yao · Cihang Xie · Percy Liang
Current benchmarks for assessing vision-language models (VLMs) often focus on their perception or problem-solving capabilities and neglect other equally critical aspects such as fairness, unbiasedness, or toxicity. Furthermore, they differ in their evaluation procedures and scope, making it difficult to compare models. To address these issues, we extend the HELM framework to VLMs and present the Holistic Evaluation of Vision Language Models (VHELM). VHELM aggregates various benchmark datasets and maps each scenario to one or more of 8 aspects: unbiasedness, fairness, knowledge, multilinguality, reasoning, robustness, toxicity mitigation, and perception. In doing so, we produce a comprehensive, multi-dimensional view of the capabilities of VLMs across these aspects. In addition, we standardize the inference parameters, prompting methods, and evaluation metrics to enable fair comparisons between models. Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast. Our initial run evaluates 18 VLMs on 19 datasets to provide a holistic snapshot of the models. We uncover new key findings, such as the observation that no single model excels across all aspects (as of the time of writing). For transparency, we release the raw model generations and complete results on our website at https://crfm.stanford.edu/helm/vhelm/v2.0.0. VHELM is meant to be a living benchmark, and we hope to continue adding new scenarios and models over time.