Poster in Workshop: MATH-AI: The 4th Workshop on Mathematical Reasoning and AI
WILT: A Multi-turn, Memorization-Robust Inductive Logic Benchmark for LLMs
Eryk Banatt · Jonathan Cheng · Tiffany Hwu
Keywords: [ Inductive Logic ] [ Benchmark ] [ Overfitting Robustness ] [ Mathematical Reasoning ]
Abstract:
While large language models (LLMs) have shown impressive capabilities across a wide range of domains, they still encounter significant challenges in reasoning tasks that require gathering evidence over multiple turns and drawing logical conclusions from this evidence. Despite the multi-turn nature of many real-world LLM use cases, most existing benchmarks rely on carefully curated single-turn tests, which often blur the line between memorization and genuine reasoning. To address this, we introduce the $\textbf{Wason Inductive Logic Test (WILT)}$, a simple yet challenging multi-turn reasoning benchmark designed to resist memorization. WILT is inspired by the Wason 2-4-6 task, where participants must infer a basic boolean function involving three variables (e.g., $x < y < z$) by proposing test cases (such as $(2, 4, 6)$). In WILT, each test starts from a clean slate, with only the initial instructions provided, preventing models from relying on pre-learned responses. Our findings reveal that LLMs struggle with this task, with the best-performing model achieving only 28% accuracy, highlighting a significant gap in LLM performance on complex multi-turn reasoning tasks.
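To make the evaluation loop concrete, below is a minimal sketch of a WILT-style episode: a model repeatedly proposes test triples, receives a boolean verdict for each from the hidden rule, and finally submits a guessed rule. The function names, turn budget, and grid-based equivalence check are illustrative assumptions for this sketch, not the benchmark's actual implementation.

```python
# A minimal sketch of the multi-turn inductive-logic loop described in the
# abstract. Hidden rule, turn budget, and the scoring probe grid below are
# illustrative assumptions, not the benchmark's actual protocol.

from typing import Callable, List, Tuple

Triple = Tuple[float, float, float]
History = List[Tuple[Triple, bool]]
Rule = Callable[[float, float, float], bool]

def run_wilt_episode(
    hidden_rule: Rule,
    propose: Callable[[History], Triple],   # model picks the next test case
    guess: Callable[[History], Rule],       # model's final rule hypothesis
    max_turns: int = 30,                    # assumed turn budget
) -> bool:
    """Run one episode: propose up to `max_turns` triples with boolean
    feedback, then check the final guess against the hidden rule."""
    history: History = []
    for _ in range(max_turns):
        x, y, z = propose(history)
        history.append(((x, y, z), hidden_rule(x, y, z)))  # boolean feedback
    guessed = guess(history)
    # Score by agreement on a small probe grid (an assumed equivalence check).
    probes = [(a, b, c) for a in range(-3, 4)
                        for b in range(-3, 4)
                        for c in range(-3, 4)]
    return all(guessed(*p) == hidden_rule(*p) for p in probes)

if __name__ == "__main__":
    # Example: the classic Wason 2-4-6 rule, "strictly increasing".
    rule: Rule = lambda x, y, z: x < y < z
    # A trivial baseline that always probes (2, 4, 6) and guesses the rule.
    ok = run_wilt_episode(
        hidden_rule=rule,
        propose=lambda history: (2, 4, 6),
        guess=lambda history: (lambda x, y, z: x < y < z),
    )
    print("recovered hidden rule:", ok)
```

Because each episode starts from only the initial instructions, a model cannot succeed by pattern-matching a memorized test case; it must choose informative probes and generalize from the observed boolean feedback.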