Poster in Workshop: MATH-AI: The 4th Workshop on Mathematical Reasoning and AI
HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics
Jingxuan Fan · Sarah Martinson · Erik Wang · Kaylie Hausknecht · Jonah Brenner · Danxian Liu · Nianli Peng · Corey Wang · Michael Brenner
Keywords: [ Reasoning ] [ dataset ] [ benchmark ] [ math ]
Advanced applied mathematics problems are not well represented in the benchmark datasets currently used to evaluate Large Language Models (LLMs). To address this, we introduce HARDMath, the Harvard Approximate Reasoning Dataset for Mathematics, a dataset of 1,466 difficult problems inspired by Harvard University’s graduate course on asymptotic methods. The dataset contains a diverse set of challenging applied mathematics problems with worked solutions that employ various analytical approximation methods. Developing such solutions typically requires multiple modes of analysis, including mathematical reasoning, the use of computational tools, and subjective judgment, which makes these problems challenging for LLMs. We establish a framework that auto-generates an arbitrarily large number of ‘hard’ applied mathematics problems with approximate analytical solutions that include validity checks against numerical ground truths. We evaluate frontier LLMs on HARDMath-mini, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts. Even leading closed-source models like GPT-4 achieve only 43.8% overall accuracy with few-shot Chain-of-Thought prompting, and all models perform significantly worse than they do on existing mathematics benchmark datasets. We additionally conduct a detailed error analysis to gain insight into the failure cases of LLMs. These results demonstrate the limitations of current LLM performance on advanced graduate-level asymptotic math problems and underscore the importance of datasets like HARDMath for advancing the mathematical abilities of LLMs.
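To make the idea of a validity check against a numerical ground truth concrete, the following is a minimal sketch and not the authors' framework. It assumes an illustrative integral, I(x) = ∫₀¹ exp(-x t²) dt, whose standard leading-order approximations are 1 - x/3 for small x and (1/2)√(π/x) for large x. The function names, the Simpson's-rule integrator, and the 1% tolerance are all hypothetical choices made for this example.

```python
import math


def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def numerical_value(x):
    """Numerical ground truth for I(x) = integral of exp(-x t^2) over [0, 1]."""
    return simpson(lambda t: math.exp(-x * t * t), 0.0, 1.0)


def small_x_approx(x):
    """Leading-order small-x approximation: expand the exponential, integrate."""
    return 1.0 - x / 3.0


def large_x_approx(x):
    """Leading-order large-x approximation: extend the upper limit to infinity."""
    return 0.5 * math.sqrt(math.pi / x)


def relative_error(approx, truth):
    return abs(approx - truth) / abs(truth)


def is_valid(approx, truth, tol=0.01):
    """Hypothetical validity check: accept if relative error is within tol."""
    return relative_error(approx, truth) <= tol


if __name__ == "__main__":
    for x in (0.01, 50.0):
        truth = numerical_value(x)
        for name, approx in (("small-x", small_x_approx(x)),
                             ("large-x", large_x_approx(x))):
            status = "valid" if is_valid(approx, truth) else "invalid"
            print(f"x={x}: {name} approximation is {status}")
```

Each approximation passes the check only in its own regime, which is the behavior such a check is meant to confirm: an analytical formula is accepted where it tracks the numerical value and rejected where it does not.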