Poster

MedJourney: Benchmark and Evaluation of Large Language Models over Patient Clinical Journey

Xian Wu · Yutian Zhao · Yunyan Zhang · Jiageng Wu · Zhihong Zhu · Yingying Zhang · Yi Ouyang · Ziheng Zhang · Huimin Wang · Zhenxi Lin · Jie Yang · Shuang Zhao · Yefeng Zheng

Fri 13 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

Large language models (LLMs) have demonstrated remarkable capabilities in language understanding and generation, leading to their deployment across various domains. Among these, the medical field is particularly well-suited for LLM applications, as many medical tasks can be enhanced by these models. Despite the existence of benchmarks for evaluating LLMs on medical question-answering and exams, there remains a notable gap in assessing LLMs' performance in supporting patients throughout their entire hospital visit journey in real-world clinical practice. In this paper, we address this gap by dividing a typical patient's hospital visit journey into four stages: planning, access, delivery, and ongoing care. For each stage, we introduce multiple tasks and provide corresponding datasets. In total, the proposed benchmark comprises 12 datasets, of which five are newly introduced and seven are constructed from existing datasets. This benchmark covers the entire patient journey, thereby offering a comprehensive assessment of LLMs' effectiveness in real-world clinical settings. In addition to introducing this benchmark, we also evaluate three categories of LLMs against it: 1) proprietary LLM services such as GPT-4; 2) public LLMs like QWen; and 3) specialized medical LLMs, like HuatuoGPT2. Through this comprehensive evaluation, we aim to provide a deeper understanding of LLMs' performance in the medical domain, ultimately contributing to their more effective deployment in healthcare settings.