Poster
Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?
Pedro R. A. S. Bassi · Wenxuan Li · Yucheng Tang · Fabian Isensee · Zifu Wang · Jieneng Chen · Yu-Cheng Chou · Tassilo Wald · Constantin Ulrich · Michael Baumgartner · Saikat Roy · Klaus Maier-Hein · Paul Jaeger · Yiwen Ye · Yutong Xie · Jianpeng Zhang · Ziyang Chen · Yong Xia · Yannick Kirchhoff · Maximilian R. Rokuss · Pengcheng Shi · Ting Ma · Yuxin Du · Fan BAI · Tiejun Huang · Bo Zhao · Zhaohu Xing · Lei Zhu · Saumya Gupta · Haonan Wang · Xiaomeng Li · Ziyan Huang · Jin Ye · Junjun He · Yousef Sadegheih · Afshin Bozorgpour · Pratibha Kumari · Reza Azad · Dorit Merhof · Hanxue Gu · Haoyu Dong · Jichen Yang · Maciej Mazurowski · Linshan Wu · Jia-Xin Zhuang · Hao CHEN · Holger Roth · Daguang Xu · Matthew Blaschko · Sergio Decherchi · Andrea Cavalli · Alan Yuille · Zongwei Zhou
West Ballroom A-D #5206
How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have underlying problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. We address this misalignment issue with Touchstone, a large-scale collaborative benchmark for medical segmentation. This benchmark is based on annotated CT datasets of unprecedented scale: 5,195 training volumes from 76 medical institutions around the world, and 6,933 testing volumes from 8 additional hospitals. This extensive and diverse test set not only makes the benchmark results more statistically meaningful than existing ones, but also systematically tests AI algorithms in varied out-of-distribution scenarios. We invited 14 inventors of various AI algorithms, categorized as CNN, Transformer, and their combinations, to train their algorithms on the publicly available training set. Our team, as a third party, independently evaluated these algorithms on the test set and reported their pros/cons from multiple perspectives. In addition, we also evaluated publicly available AI frameworks---which are more flexible and can support different algorithms---including MONAI and its Auto3DSeg from NVIDIA, nnU-Net from DKFZ, and numerous other open-source repositories such as vision-language framework developed by researchers. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain.