Poster in Workshop: AIM-FM: Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond
A Benchmark for Long-Form Medical Question Answering
Pedram Hosseini · Jessica Sin · Bing Ren · Bryceton Thomas · Elnaz Nouri · Ali Farahanchi · Saeed Hassanpour
There is a lack of benchmarks for evaluating large language models (LLMs) on long-form medical question answering (QA). Most existing benchmarks for medical QA evaluation rely on automatic metrics and multiple-choice questions. While valuable, these benchmarks do not fully capture the complexities of the real-world clinical settings in which LLMs are being deployed. Furthermore, the few existing studies on long-form answer generation in medical QA are primarily closed-source, providing no access to the underlying human medical expert annotations, which makes it difficult to reproduce results and improve upon baselines. In this work, we introduce a new publicly available benchmark of real-world consumer medical questions paired with long-form answer evaluations annotated by medical doctors. We conduct pairwise comparisons of responses from a range of open and closed medical and general-purpose LLMs along criteria such as correctness, helpfulness, harmfulness, and bias. Additionally, we perform a comprehensive LLM-as-a-judge analysis to study how well LLM judgments align with those of human medical experts. Our preliminary results highlight the strong potential of open LLMs in medical QA relative to leading closed models.
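To illustrate the kind of alignment analysis the abstract describes, the following is a minimal sketch of how agreement between human pairwise preferences and an LLM judge's preferences could be measured. The variable names and example labels ("A", "B", "tie") are hypothetical and are not taken from the paper's released benchmark or annotation format.

```python
# Minimal sketch (hypothetical data): agreement between human pairwise
# preferences and an LLM judge's preferences in a pairwise comparison setup.
from collections import Counter

# Preferred response for each pairwise comparison: "A", "B", or "tie".
human_prefs = ["A", "A", "B", "tie", "B", "A"]
judge_prefs = ["A", "B", "B", "tie", "B", "A"]

def raw_agreement(a, b):
    """Fraction of comparisons where the two annotators give the same label."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement (Cohen's kappa) between two annotators."""
    n = len(a)
    po = raw_agreement(a, b)                     # observed agreement
    ca, cb = Counter(a), Counter(b)              # marginal label counts
    pe = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)  # agreement expected by chance
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

print(f"raw agreement: {raw_agreement(human_prefs, judge_prefs):.2f}")
print(f"Cohen's kappa: {cohens_kappa(human_prefs, judge_prefs):.2f}")
```

In practice, such agreement statistics would be computed per criterion (correctness, helpfulness, harmfulness, bias) and per judge model; chance-corrected measures like kappa are commonly preferred over raw agreement when label distributions are skewed.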