Poster in NeurIPS 2024 Workshop: Machine Learning and the Physical Sciences
CURIE: Evaluating LLMs on Multitask Scientific Long-Context Understanding and Reasoning
Hao Cui · Zahra Shamsi · Xuejian Ma · Gowoon Cheon · Shutong Li · Maria Tikhanovskaya · Nayantara Mudur · Martyna Plomecka · Peter Norgaard · Paul Raccuglia · Victor V. Albert · Yasaman Bahri · Pranesh Srinivasan · Haining Pan · Philippe Faist · Brian Rohr · Michael Statt · Dan Morris · Drew Purves · Elise Kleeman · Ruth Alcantara · Matthew Abraham · Muqthar Mohammad · Ean Phing VanLee · Chenfei Jiang · Elizabeth Dorfman · Eun-Ah Kim · Michael Brenner · Sameera Ponda · Subhashini Venugopalan
Scientific problem-solving involves synthesizing information while applying expert knowledge. We introduce CURIE, a scientific long-Context Understanding and Reasoning Inference Evaluations benchmark, to measure the potential of Large Language Models (LLMs) to assist scientists in realistic workflows. The benchmark comprises ten challenging tasks curated by experts across six disciplines: materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins, covering both experimental and theoretical workflows in science. We evaluate a range of closed and open LLMs on the CURIE tasks, which require domain expertise, comprehension of long in-context information, and multi-step reasoning. While Claude-3 shows consistently high comprehension across domains, the popular GPT-4o and Command-R+ fail dramatically on protein sequencing tasks. Overall, there is considerable room for improvement across all models. We hope insights from this work can guide the future development of LLMs in the sciences.