Poster
SciCode: A Research Coding Benchmark Curated by Scientists
Minyang Tian · Luyu Gao · Shizhuo Zhang · Xinan Chen · Cunwei Fan · Xuefei Guo · Roland Haas · Pan Ji · Kittithat Krongchon · Yao Li · Shengyan Liu · Di Luo · Yutao Ma · Hao Tong · Kha Trinh · Chenyu Tian · Zihan Wang · Bohao Wu · Shengzhu Yin · Minhui Zhu · Kilian Lieret · Yanxin Lu · Genglin Liu · Yufeng Du · Tianhua Tao · Ofir Press · Jamie Callan · Eliu Huerta · Hao Peng
West Ballroom A-D #5204
As language models (LMs) outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We take steps towards addressing this by proposing SciCode, a scientific coding benchmark curated by scientists. SciCode contains complex research-level code generation problems from various natural science fields. These domains offer high-quality data that most current LMs have had less exposure to, which helps evaluate the models' ability to generalize to unfamiliar settings. This makes SciCode highly challenging: GPT-4o, the best-performing model among those tested, can solve only 3.4% of the problems in the most realistic setting. To ensure high quality, SciCode is annotated by natural scientists and AI researchers at a senior PhD student level or above. The problems are organized hierarchically: every main problem is broken down into multiple easier subproblems. Each problem includes optional descriptions laying out the necessary scientific knowledge, as well as scientist-annotated gold solutions and test cases for evaluation. In total, SciCode contains 305 subproblems decomposed from 73 main problems. SciCode stress-tests the models' comprehensive capabilities. Solving a problem requires a deep understanding of scientific knowledge, the ability to decompose complex problems into easier subproblems and solve each correctly, and the skill to integrate these solutions into a complete solution. Besides being a challenging, high-quality, and realistic coding benchmark, SciCode evaluates LMs' potential to accelerate scientific discovery by assisting scientists' workflows. We believe SciCode can incentivize future research into this area, which has benefited less from LM advancements than general-domain coding has.
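The hierarchical organization described above can be pictured with a minimal Python sketch. It is not the benchmark's actual schema or evaluation harness: the class names (`Subproblem`, `MainProblem`), their fields, and the helper `solve_rate` are hypothetical, and the rule that a main problem counts as solved only when all of its subproblems pass their tests is an assumption made here for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Subproblem:
    """One step of a main problem (hypothetical schema, for illustration only)."""
    description: str
    background: str = ""  # optional scientific background text
    tests: List[Callable[[], bool]] = field(default_factory=list)

    def solved(self) -> bool:
        # A subproblem with no test cases is never counted as solved.
        return bool(self.tests) and all(test() for test in self.tests)


@dataclass
class MainProblem:
    """A main problem decomposed into easier subproblems."""
    name: str
    subproblems: List[Subproblem]

    def solved(self) -> bool:
        # Assumption: every subproblem must pass for the main problem to count.
        return bool(self.subproblems) and all(s.solved() for s in self.subproblems)


def solve_rate(problems: List[MainProblem]) -> float:
    """Fraction of main problems that are fully solved."""
    if not problems:
        return 0.0
    return sum(p.solved() for p in problems) / len(problems)


# Toy usage with made-up problems and trivially passing or failing tests.
toy_problems = [
    MainProblem("toy_a", [Subproblem("step 1", tests=[lambda: True]),
                          Subproblem("step 2", tests=[lambda: True])]),
    MainProblem("toy_b", [Subproblem("step 1", tests=[lambda: True]),
                          Subproblem("step 2", tests=[lambda: False])]),
]
toy_rate = solve_rate(toy_problems)
```

Under this all-or-nothing assumption, a single failing subproblem marks the whole main problem as unsolved, which is one way a benchmark with 305 subproblems across 73 main problems (about 4.2 subproblems each) could yield low main-problem solve rates.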