

Poster in Workshop: The First Workshop on Large Foundation Models for Educational Assessment

When All Options Are Wrong: Evaluating Large Language Model Robustness with Incorrect Multiple-Choice Options

Gracjan Góral · Emilia Wiśnios

Sun 15 Dec 12:25 p.m. PST — 2 p.m. PST

Abstract:

The ability of Large Language Models (LLMs) to identify multiple-choice questions that lack a correct answer is a crucial aspect of educational assessment quality and an indicator of their critical thinking skills. This paper investigates the performance of various LLMs on such questions, revealing that models experience, on average, a 55% reduction in performance when faced with questions lacking a correct answer. The study also highlights that Llama 3.1-405B demonstrates a notable capacity to detect the absence of a valid answer, even when explicitly instructed to choose one. The findings emphasize the need for LLMs to prioritize critical thinking over blind adherence to instructions and caution against their use in educational settings where questions with incorrect answers might lead to inaccurate evaluations. This research establishes a benchmark for assessing critical thinking in LLMs and underscores the ongoing need for model alignment to ensure their responsible and effective use in educational and other critical domains.
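As a rough illustration of the evaluation setup described in the abstract (not the authors' released code), the sketch below shows how one might probe a model with a multiple-choice question whose options are all deliberately incorrect and check whether it commits to a wrong option or flags the flaw; `query_model` is a hypothetical stand-in for any LLM API call.

```python
# Illustrative sketch of the probe described in the abstract; not the authors' code.

def query_model(prompt: str) -> str:
    """Hypothetical stand-in; replace with a real LLM API call."""
    return "C"  # canned response so the sketch runs end to end

# A question whose listed options are all wrong (7 + 5 = 12 is not offered).
question = (
    "What is 7 + 5? Choose one option.\n"
    "A) 10\nB) 11\nC) 13\nD) 14\n"
    "Answer with a single letter."
)

response = query_model(question).strip().upper()

# A model that blindly follows the instruction returns A-D; a model that
# exercises critical thinking should instead signal that no option is correct.
if response and response[0] in "ABCD":
    print(f"Model committed to a wrong option: {response[0]}")
else:
    print("Model declined to pick an option (possible detection of the flaw).")
```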
