

Poster
in
Workshop: MATH-AI: The 4th Workshop on Mathematical Reasoning and AI

Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning

Aryan Gulati · Brando Miranda · Eric Chen · Emily Xia · Kai Fronsdal · Bruno de Moraes Dumont · Sanmi Koyejo

Keywords: [ Reasoning ] [ Benchmarks ] [ Large language models ] [ machine learning ] [ mathematical reasoning ] [ mathematics ]


Abstract:

As large language models (LLMs) continue to advance, many existing benchmarks designed to evaluate their reasoning capabilities are becoming saturated. We therefore present the Putnam-AXIOM Original benchmark, consisting of 236 mathematical problems from the William Lowell Putnam Mathematical Competition, along with detailed step-by-step solutions. To preserve the benchmark's validity and mitigate potential data contamination, we created the Putnam-AXIOM Variation benchmark with functional variations of 52 problems. By programmatically altering problem elements such as variables and constants, we can generate unlimited novel, equally challenging problems that are not found online. Almost all models achieve significantly lower accuracy on the variations than on the original problems. Our results reveal that OpenAI's o1-preview, the best-performing model, achieves only 41.95% accuracy on Putnam-AXIOM Original and suffers roughly a 30% reduction in accuracy on the variation dataset compared to the corresponding original problems.
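To illustrate the idea of a functional variation described above, the sketch below shows one way a problem template could resample a constant and recompute the ground-truth answer. This is a minimal, hypothetical example (the function name, problem, and template are assumptions for illustration, not the authors' actual generation code).

```python
import random


def variation_sum_of_squares(rng: random.Random) -> dict:
    """Hypothetical functional variation of a simple competition-style problem.

    The original statement fixes a constant n; the variation resamples n,
    yielding a novel but equally difficult problem with a recomputed answer.
    """
    n = rng.randint(5, 50)  # programmatically altered constant
    problem = f"Evaluate the sum 1^2 + 2^2 + ... + {n}^2."
    answer = n * (n + 1) * (2 * n + 1) // 6  # closed-form ground truth
    return {"problem": problem, "answer": str(answer)}


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducible variants
    for _ in range(3):
        variant = variation_sum_of_squares(rng)
        print(variant["problem"], "->", variant["answer"])
```

Because the answer is recomputed from the altered constants rather than looked up, each generated variant remains automatically gradable while avoiding verbatim overlap with problems available online.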
