Poster

MedJourney: Benchmark and Evaluation of Large Language Models over Patient Clinical Journey

Xian Wu · Yutian Zhao · Yunyan Zhang · Jiageng Wu · Zhihong Zhu · Yingying Zhang · Yi Ouyang · Ziheng Zhang · Huimin Wang · Zhenxi Lin · Jie Yang · Shuang Zhao · Yefeng Zheng

Fri 13 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

Large language models (LLMs) have demonstrated remarkable capabilities in language understanding and generation, leading to their deployment across various domains. Among these, the medical field is particularly well-suited for LLM applications, as many medical tasks can be enhanced by these models. Despite the existence of benchmarks for evaluating LLMs on medical question-answering and exams, there remains a notable gap in assessing LLMs' performance in supporting patients throughout their entire hospital visit journey in real-world clinical practice. In this paper, we address this gap by dividing a typical patient's hospital visit journey into four stages: planning, access, delivery, and ongoing care. For each stage, we introduce multiple tasks and provide corresponding datasets. In total, the proposed benchmark comprises 12 datasets, of which five are newly introduced and seven are constructed from existing datasets. This benchmark covers the entire patient journey, thereby offering a comprehensive assessment of LLMs' effectiveness in real-world clinical settings. In addition to introducing this benchmark, we also evaluate three categories of LLMs against it: 1) proprietary LLM services such as GPT-4; 2) public LLMs like QWen; and 3) specialized medical LLMs, like HuatuoGPT2. Through this comprehensive evaluation, we aim to provide a deeper understanding of LLMs' performance in the medical domain, ultimately contributing to their more effective deployment in healthcare settings.