NeurIPS Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA

Poster in Workshop: GenAI for Health: Potential, Trust and Policy Compliance

Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA

Qianqi Yan · Xuehai He · Xiang Yue · Xin Eric Wang

Keywords: [ Benchmark ] [ Vision and Language ] [ Robust Evaluation ]

Abstract:

Large Multimodal Models (LMMs) have shown remarkable progress in medical Visual Question Answering (Med-VQA), achieving high accuracy on existing benchmarks. However, their reliability under robust evaluation is questionable. This study reveals that state-of-the-art models perform worse than random guessing on medical diagnosis questions when subjected to simple probing evaluation. To address this critical evaluation problem, we introduce the Probing Evaluation for Medical Diagnosis (ProbMed) dataset to rigorously assess LMM performance in medical imaging through probing evaluation and procedural diagnosis. Specifically, probing evaluation pairs each original question with a negation question containing hallucinated attributes, while procedural diagnosis requires reasoning across several diagnostic dimensions for each image, including modality recognition, organ identification, clinical findings, abnormalities, and positional grounding. Our evaluation reveals that top-performing models such as GPT-4o, GPT-4V, and Gemini Pro perform worse than random guessing on specialized diagnostic questions, indicating significant limitations in handling fine-grained medical inquiries. We further investigate the underperformance of open-source models (e.g., LLaVA, LLaVA-Med, and Med-Flamingo) through an ablation study. The ablation shows that poor visual understanding is a primary bottleneck, which can be mitigated by adding visual descriptions generated by GPT-4o, leading to an average performance improvement of 9.44%. These findings underscore the urgent need for more robust evaluation methods and domain-specific expertise to ensure LMM reliability in critical medical fields.
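
To make the idea of probing evaluation concrete, the following is a minimal sketch of how paired scoring could be computed. It is not the authors' implementation: it assumes binary yes/no questions and assumes that a pair counts as correct only when both the original question and its hallucinated-attribute counterpart are answered correctly. The names `ProbePair`, `paired_accuracy`, and `per_question_accuracy` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ProbePair:
    """Outcome for one original question and its probing counterpart (hypothetical type)."""
    original_correct: bool  # model answered the original question correctly
    probe_correct: bool     # model rejected the hallucinated-attribute question


def per_question_accuracy(pairs: Sequence[ProbePair]) -> float:
    """Accuracy when every question is scored independently."""
    if not pairs:
        return 0.0
    hits = sum(p.original_correct + p.probe_correct for p in pairs)
    return hits / (2 * len(pairs))


def paired_accuracy(pairs: Sequence[ProbePair]) -> float:
    """Accuracy when a pair counts only if BOTH questions are correct (assumed rule)."""
    if not pairs:
        return 0.0
    return sum(p.original_correct and p.probe_correct for p in pairs) / len(pairs)


# Under the assumed rule, guessing yes/no uniformly at random gets a pair
# right with probability 0.5 * 0.5.
RANDOM_PAIRED_BASELINE = 0.5 * 0.5

# A model that answers "yes" to everything: right on every original question,
# wrong on every hallucinated-attribute counterpart.
always_yes = [ProbePair(original_correct=True, probe_correct=False) for _ in range(4)]

print(per_question_accuracy(always_yes))
print(paired_accuracy(always_yes))
print(paired_accuracy(always_yes) < RANDOM_PAIRED_BASELINE)
```

Under these assumptions, a model biased toward affirming whatever the question suggests looks acceptable when questions are scored independently, yet falls below the random baseline once each question is paired with its probing counterpart.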
