Poster in Workshop: Statistical Frontiers in LLMs and Foundation Models
A shared standard for valid measurement of generative AI systems' capabilities, risks, and impacts
Alexandra Chouldechova · Chad Atalla · Solon Barocas · A. Feder Cooper · Emily Corvi · Alex Dow · Jean Garcia-Gathright · Nicholas Pangakis · Stefanie Reed · Emily Sheng · Dan Vann · Matthew Vogel · Hannah Washington · Hanna Wallach
Keywords: [ generative AI ] [ systematization ] [ evaluation framework ] [ evaluation ] [ measurement theory ] [ measurement framework ] [ conceptualization ]
The valid measurement of generative AI (GenAI) systems' capabilities, risks, and impacts forms the bedrock of our ability to evaluate these systems. We introduce a shared standard for such measurement, with the aim of advancing GenAI evaluations from their current state of disparate-seeming and ad hoc practices to formalized and theoretically grounded processes, i.e., a science. Our framework, grounded in measurement theory from the social sciences, extends the work of Adcock & Collier (2001), who formalized valid measurement of concepts in political science via three processes: systematizing background concepts, operationalizing systematized concepts via annotation procedures, and applying those procedures to instances. We argue that valid measurement of GenAI systems' capabilities, risks, and impacts further requires systematizing, operationalizing, and applying not only concepts but also contexts and metrics. This involves both descriptive reasoning about particular instances or data sets and inferential reasoning about underlying populations, which is the purview of statistics. The framework applies to any measurement task that can be templatized as measuring the [amount] of a [concept] in a [population] of [instances]. It places the many disparate-seeming approaches to evaluation of GenAI systems on a common footing, enabling individual evaluations to be understood, interrogated for reliability and validity, and meaningfully compared.
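As a rough illustration of the template and of the descriptive versus inferential distinction, the sketch below fills the four slots for a made-up task and computes a sample proportion together with an approximate confidence interval. The class and function names, the example slot values, and the choice of a Wald interval are all assumptions made for this illustration; none of them come from the framework itself.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class MeasurementTask:
    """Hypothetical container for the four template slots."""
    amount: str      # e.g. "prevalence"
    concept: str     # e.g. "refusal"
    population: str  # e.g. "all responses to a benchmark's prompts"
    instances: str   # e.g. "system responses"

    def describe(self) -> str:
        # Render the task in the bracketed template form.
        return (f"[{self.amount}] of [{self.concept}] "
                f"in [{self.population}] of [{self.instances}]")


def sample_proportion(labels: Sequence[bool]) -> float:
    """Descriptive: share of annotated instances that exhibit the concept."""
    if not labels:
        raise ValueError("need at least one annotated instance")
    return sum(labels) / len(labels)


def wald_interval(labels: Sequence[bool], z: float = 1.96) -> Tuple[float, float]:
    """Inferential: approximate interval for the population proportion.

    Assumes the annotated instances are a simple random sample from the
    population; z = 1.96 corresponds to a nominal 95% level.
    """
    n = len(labels)
    p = sample_proportion(labels)
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)
```

For instance, with 100 annotated instances of which 20 exhibit the concept, `sample_proportion` describes the sample itself, while `wald_interval` makes a claim about the population the sample was drawn from, and that claim is only as good as the sampling assumption behind it.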