Poster
in
Workshop: MATH-AI: The 4th Workshop on Mathematical Reasoning and AI
Not All LLM Reasoners Are Created Equal
Arian Hosseini · Alessandro Sordoni · Daniel Toyama · Aaron Courville · Rishabh Agarwal
Keywords: [ Compositional Generalization ] [ Reasoning Behavior ] [ Evaluating Reasoning ] [ LLM Reasoning ] [ mathematical reasoning ]
We study the depth of problem-solving capabilities of LLMs, and to what extent they perform mathematical reasoning in a compositional manner. To this end, we create a new benchmark by composing pairs of existing math word problems together so that the answer to the second problem depends on correctly answering the first problem. We measure the difference between the performance of solving each question independently and solving the compositional pairs as the reasoning gap of a model. Our findings reveal a significant reasoning gap in most frontier LLMs. This gap is more pronounced in smaller and more cost-efficient models. The objective of this study is not to introduce yet another benchmark, but rather to provide a case study aimed at gaining deeper insights into current models' reasoning abilities, and to reassess existing established training methods and benchmarks.