Oral in Workshop: Evaluating Evaluations: Examining Best Practices for Measuring Broader Impacts of Generative AI
Evaluating Generative AI Systems is a Social Science Measurement Challenge
Hanna Wallach · Meera Desai · Nicholas Pangakis · A. Feder Cooper · Angelina Wang · Solon Barocas · Alexandra Chouldechova · Chad Atalla · Su Lin Blodgett · Emily Corvi · Alex Dow · Jean Garcia-Gathright · Alexandra Olteanu · Stefanie Reed · Emily Sheng · Dan Vann · Jennifer Wortman Vaughan · Matthew Vogel · Hannah Washington · Abigail Jacobs
Keywords: [ Validity ] [ Generative AI ] [ Measurement ] [ Evaluation ] [ Social Sciences ] [ Risks ] [ Measurement Theory ] [ Capabilities ]
Across academia, industry, and government, there is growing awareness that evaluating generative AI (GenAI) systems is challenging: concepts related to their capabilities (e.g., intelligence, reasoning) and risks (e.g., stereotyping, anthropomorphism) are especially difficult to measure. We argue that these difficult measurement tasks are highly reminiscent of the measurement tasks found throughout the social sciences. With this in mind, we present a framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities and risks of GenAI systems. The framework distinguishes four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements. This four-level approach differs from the way measurement is typically done in ML, where researchers and practitioners appear to jump straight from background concepts to measurement instruments, with little to no systematization of those background concepts in between. In addition to surfacing assumptions, thereby making it easier to understand exactly what the resulting measurements do and do not mean, the framework has two important implications for evaluating evaluations. First, it can enable stakeholders from different worlds to participate in conceptual debates, thereby broadening the expertise involved in evaluating GenAI systems. Second, it brings rigor to operational debates by offering a set of lenses for interrogating the validity of measurement instruments and their resulting measurements.
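To make the four levels concrete, here is a minimal illustrative sketch, not taken from the paper: all names (e.g., BackgroundConcept, SystematizedConcept, MeasurementInstrument, measure) and the toy "anthropomorphism" example are hypothetical, chosen only to show how a background concept might be systematized, operationalized as an instrument, and applied to produce instance-level measurements.

```python
# Hypothetical illustration of the four-level measurement framework.
# The paper describes a conceptual framework, not an API; these names are invented.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BackgroundConcept:
    """An abstract, contested concept, e.g. 'stereotyping' or 'reasoning'."""
    name: str
    description: str


@dataclass
class SystematizedConcept:
    """A specific, explicitly articulated definition of the background concept."""
    background: BackgroundConcept
    definition: str
    assumptions: List[str] = field(default_factory=list)


@dataclass
class MeasurementInstrument:
    """A concrete procedure that operationalizes a systematized concept,
    mapping a single system output to a score."""
    concept: SystematizedConcept
    score_output: Callable[[str], float]


@dataclass
class InstanceMeasurement:
    """One instance-level measurement produced by applying the instrument."""
    output: str
    score: float


def measure(instrument: MeasurementInstrument, outputs: List[str]) -> List[InstanceMeasurement]:
    """Apply the instrument to a batch of system outputs."""
    return [InstanceMeasurement(o, instrument.score_output(o)) for o in outputs]


if __name__ == "__main__":
    background = BackgroundConcept(
        "anthropomorphism",
        "System outputs that present the system as having human-like states.",
    )
    systematized = SystematizedConcept(
        background,
        definition="Output contains first-person claims of feelings or desires.",
        assumptions=["text-only outputs", "English-language outputs"],
    )
    # Toy keyword-based instrument; a real instrument would need separate validation.
    instrument = MeasurementInstrument(
        systematized,
        score_output=lambda text: float(any(w in text.lower() for w in ("i feel", "i want"))),
    )
    for m in measure(instrument, ["I feel happy to help!", "The answer is 42."]):
        print(m)
```

The point of the sketch is the ordering the abstract describes: the instrument is defined against an explicitly systematized concept rather than the background concept directly, so the assumptions baked into the resulting instance-level measurements remain visible and open to interrogation.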