Poster in Workshop: Evaluating Evaluations: Examining Best Practices for Measuring Broader Impacts of Generative AI

Gaps Between Research and Practice When Measuring Representational Harms Caused by LLM-Based Systems

Emma Harvey · Emily Sheng · Su Lin Blodgett · Alexandra Chouldechova · Jean Garcia-Gathright · Alexandra Olteanu · Hanna Wallach

Keywords: [ large language models ] [ representational harms ] [ interviews ] [ measurement ]


Abstract:

As the availability and adoption of large language model (LLM)-based systems have increased, so has their potential to cause representational harms. Tools, datasets, metrics, and benchmarks for measuring these harms - collectively, harm measurement instruments - are therefore critical to the responsible development and deployment of such systems. While the NLP research community has produced a rich repository of publicly available harm measurement instruments, it is not yet clear whether these instruments actually meet the needs of practitioners. Through a series of semi-structured interviews with AI practitioners (N=12) in a variety of roles at several types of organizations, we identify four types of challenges that prevent practitioners from effectively using publicly available instruments for measuring representational harms: (1) challenges inherent to publicly available measurement instruments; (2) challenges related to doing measurement in practice; (3) challenges arising from measurement tasks involving LLM-based systems; and (4) challenges specific to measuring representational harms. Our goal is to advance the development of instruments for measuring representational harms that are well-suited to practitioner needs, thus better facilitating the responsible development and deployment of LLM-based systems.