Spotlight in Workshop: Multimodal Algorithmic Reasoning Workshop
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Zirui Wang · Mengzhou Xia · Luxi He · Howard Chen · Yitao Liu · Richard Zhu · Kaiqu Liang · Xindi Wu · Haotian Liu · Sadhika Malladi · Alexis Chevalier · Sanjeev Arora · Danqi Chen
Sun 15 Dec 8:25 a.m. PST — 5:05 p.m. PST
Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. We propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., Claude 3.5 Sonnet), which achieves 60.2% accuracy, and the strongest open-source model (i.e., InternVL Chat V2.0), which achieves 38.9%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress.