

Poster

IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation

Fan Lin · Shuyi Xie · Yong Dai · Wenlin Yao · TianJiao Lang · Yu Zhang

Thu 12 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

As Large Language Models (LLMs) become more capable of handling increasingly complex tasks, evaluation sets must keep pace with these advancements to remain sufficiently discriminative. Item Discrimination (ID) theory, which is widely used in educational assessment, measures the ability of individual test items to differentiate between high and low performers. Inspired by this theory, we propose an ID-induced prompt synthesis framework for evaluating LLMs so that the evaluation set is continually updated and refined according to model abilities. Our data synthesis framework prioritizes both breadth and specificity. It can generate prompts that comprehensively evaluate the capabilities of LLMs while revealing meaningful performance differences between models, allowing for effective discrimination of their relative strengths and weaknesses across various tasks and domains. To produce high-quality data, we incorporate a self-correction mechanism into our generalization framework and develop two models that predict prompt discrimination and difficulty scores to facilitate our data synthesis framework, contributing valuable tools to evaluation data synthesis research. We apply our generated data to evaluate five SOTA models. The results demonstrate that the data generated by our framework is more challenging and discriminative than that of previous works. We will release a dataset of over 3,000 carefully crafted prompts to facilitate evaluation research on LLMs.
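
To make the notion of item discrimination concrete, the following is a minimal sketch of the classical upper-lower discrimination index from educational testing, applied to prompts and models instead of questions and students. It is an illustration only: the abstract does not state which formula the authors use, and the function name, the 27% group fraction, and the example scores are all assumptions.

```python
from typing import Sequence


def discrimination_index(
    item_scores: Sequence[float],
    total_scores: Sequence[float],
    max_score: float = 1.0,
    group_fraction: float = 0.27,  # conventional cut-off; an assumption here
) -> float:
    """Classical upper-lower discrimination index for one item (prompt).

    item_scores[i]  : score of examinee (model) i on this item
    total_scores[i] : overall score of examinee (model) i on the whole test
    Returns a value in [-1, 1]; higher means the item separates strong
    and weak examinees more clearly.
    """
    if len(item_scores) != len(total_scores):
        raise ValueError("item_scores and total_scores must have equal length")
    n = len(item_scores)
    if n < 2:
        raise ValueError("need at least two examinees")

    # Size of the high and low groups (at least one examinee each).
    k = max(1, int(n * group_fraction))

    # Rank examinees by overall score, ascending.
    order = sorted(range(n), key=lambda i: total_scores[i])
    low_group = order[:k]
    high_group = order[-k:]

    mean_high = sum(item_scores[i] for i in high_group) / k
    mean_low = sum(item_scores[i] for i in low_group) / k
    return (mean_high - mean_low) / max_score


# Hypothetical example: five models ranked by overall score.
overall = [10, 20, 30, 40, 50]
separating_prompt = [0, 0, 1, 1, 1]   # only stronger models succeed
trivial_prompt = [1, 1, 1, 1, 1]      # every model succeeds

d_separating = discrimination_index(separating_prompt, overall)
d_trivial = discrimination_index(trivial_prompt, overall)
```

Under these assumed scores, a prompt that every model answers correctly has an index of zero and so contributes nothing to ranking the models, which is the kind of item a discrimination-driven evaluation set would aim to replace.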
