Poster in Workshop: GenAI for Health: Potential, Trust and Policy Compliance
Demo Track: Directing Generalist Vision-Language Models to Interpret Medical Images Across Populations
Luke Sagers · Aashna Shah · Sonnet Xu · Roxana Daneshjou · Arjun Manrai
Keywords: [ large vision-language models; medical image interpretation; demographic bias; prompt engineering ]
As patients and physicians increasingly use large multimodal foundation models, it is urgent to assess the performance and safety of these models across populations and data types. While most studies to date have focused on model-level performance characteristics, more nuanced evaluations are needed to measure how users may knowingly or unknowingly alter model behavior in normal use, for example through different prompt structures. Here, we systematically assess the "steerability" of two leading vision-language models, Gemini Pro Vision (Google) and GPT-4 with Vision (OpenAI), across three common medical imaging tasks: (1) detecting malignancies in dermatological lesions, (2) identifying abnormalities in chest X-rays, and (3) differentiating tumor epithelium from simple stroma in histological samples. Although both models have built-in guardrails intended to limit medical interpretation, our findings show that these safeguards can be easily bypassed through prompting techniques, for example by rephrasing the task as a "matching game". Our results further reveal significant differences in how these models trade off sensitivity and specificity as a function of image type, prompt strategy, and demographic factors. Gemini Pro Vision consistently outperformed GPT-4 with Vision, achieving maximum balanced accuracies of 0.67 (± 0.04) in dermatology, 0.75 (± 0.04) in radiology, and 0.81 (± 0.02) in histology. Both models showed reduced performance in detecting abnormalities on darker skin tones, on older patients' X-rays, and on images with extreme pixel intensities. While prompt engineering improved accuracy, the models remain unreliable for medical image analysis and are susceptible to bias, underscoring the need for diverse training datasets and thorough contextual evaluations.
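The abstract reports balanced accuracy and discusses the sensitivity/specificity trade-off across subgroups. As a minimal sketch of how such metrics are conventionally computed for a binary task (this is not the authors' evaluation code; the function names `balanced_accuracy` and `metrics_by_subgroup` and the toy records are hypothetical, and the ± intervals in the abstract are not reproduced here):

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return (sensitivity, specificity, balanced accuracy) for binary labels.

    Labels are assumed to be 1 (abnormal/positive) or 0 (normal/negative).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity, (sensitivity + specificity) / 2


def metrics_by_subgroup(records):
    """Group (subgroup, y_true, y_pred) records and score each subgroup."""
    grouped = defaultdict(lambda: ([], []))
    for subgroup, truth, pred in records:
        grouped[subgroup][0].append(truth)
        grouped[subgroup][1].append(pred)
    return {g: balanced_accuracy(t, p) for g, (t, p) in grouped.items()}


# Toy, invented records purely for illustration: (subgroup, truth, prediction)
toy_records = [
    ("group_a", 1, 1), ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 0, 1),
    ("group_b", 1, 0), ("group_b", 1, 1), ("group_b", 0, 0), ("group_b", 0, 0),
]
toy_results = metrics_by_subgroup(toy_records)
```

Balanced accuracy averages sensitivity and specificity, so two subgroups can share the same score while trading the two off in opposite directions; in the toy data both groups score 0.75, one through high sensitivity and the other through high specificity. Reporting all three values per subgroup makes that difference visible.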