Poster Session
in
Workshop: Scientific Methods for Understanding Neural Networks

Alice in Wonderland: Simple Tasks Reveal Severe Generalization and Basic Reasoning Deficits in State-Of-the-Art Large Language Models

Marianna Nezhurina · Lucia Cipolina Kun · Mehdi Cherti · Jenia Jitsev

Sun 15 Dec 4:30 p.m. PST — 5:30 p.m. PST

Abstract:

Large Language Models (LLMs) are often described as instances of foundation models, that is, models that possess strong generalization and therefore transfer robustly across various tasks and conditions in a few-shot or zero-shot manner, while exhibiting scaling laws that predict improving generalization with increasing pre-training scale. These claims of strong generalization, and of the advanced functions enabling it such as reasoning, rely on measurements by various standardized benchmarks on which state-of-the-art (SOTA) models score highly. We demonstrate here a dramatic breakdown of generalization and basic reasoning in all SOTA models claiming strong function, including advanced models like GPT-4 or Claude 3 Opus trained at the largest scales, using a simple, short, conventional common-sense problem formulated in concise natural language and easily solvable by humans (the AIW problem). The breakdown is dramatic: it manifests in strong performance fluctuations across mild problem variations that should not affect problem solving at all, while the models also often express strong overconfidence in wrong solutions, backed by plausible-sounding, explanation-like confabulations. Various standard interventions aimed at obtaining the right solution, such as chain-of-thought prompting or urging the models to reconsider wrong solutions via multi-step re-evaluation, fail. We take these initial observations to the scientific and technological community to stimulate an urgent re-assessment of the claimed capabilities of the current generation of LLMs. Such re-assessment also requires common action to create standardized benchmarks that would allow proper detection of such deficits in generalization and reasoning, which evidently manage to remain undetected by current state-of-the-art evaluation procedures. Code for reproducing the experiments and the raw experiment data can be found at https://anonymous.4open.science/r/AITW_anonymous-69A6
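The structure of the evaluation can be illustrated with a short sketch. The exact prompt template and the numeric variations below are assumptions for illustration (the AIW problem family varies the counts of brothers and sisters while the reasoning required stays identical); the authors' actual prompts and variations are in the linked repository.

```python
# Sketch of an AIW-style problem family: mild variations of one
# common-sense question, each with a known ground-truth answer.
# Template wording and the specific (brothers, sisters) pairs are
# illustrative assumptions, not the paper's exact prompts.

def aiw_prompt(n_brothers: int, n_sisters: int) -> str:
    """Build one variation of the AIW common-sense problem."""
    return (
        f"Alice has {n_brothers} brothers and she also has "
        f"{n_sisters} sisters. How many sisters does Alice's brother have?"
    )

def aiw_answer(n_brothers: int, n_sisters: int) -> int:
    """Ground truth: each brother has Alice's sisters plus Alice herself."""
    return n_sisters + 1

# Mild variations that should not change the reasoning required;
# a model with robust generalization should solve all of them alike.
variations = [(3, 6), (4, 2), (1, 4)]
for b, s in variations:
    print(aiw_prompt(b, s), "->", aiw_answer(b, s))
```

Scoring a model then amounts to sending each prompt variation to the model and comparing its answer against `aiw_answer`; the paper reports strong fluctuations in correct-response rates across such variations.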
