Oral in Workshop: Evaluating Evaluations: Examining Best Practices for Measuring Broader Impacts of Generative AI
Critical human-AI use scenarios and interaction modes for societal impact evaluations
Lujain Ibrahim · Saffron Huang · Lama Ahmad · Markus Anderljung
Keywords: [ sociotechnical evaluations ] [ LLM harms and risks ] [ human-AI interaction ]
Most real-world AI applications involve human-AI interaction, yet current evaluations, such as common benchmarks, do not. These evaluations typically assess the safety of models in isolation and thereby fall short of capturing the complexity of human-model interactions. While generalizing findings from individual-level human interaction evaluations to broader societal effects is challenging, such evaluations remain crucial for societal impact evaluation: they offer valuable insights into how AI systems affect individual users, which can inform interventions with significant societal implications. For instance, understanding how individuals engage with non-factual model outputs can guide effective labeling strategies for AI-generated content; this not only helps individuals recognize synthetic media but also addresses broader concerns about misinformation and trust. As human interaction evaluations become increasingly important, we outline in this paper the evaluation scenarios and human-model interaction modes the field needs to evaluate to better understand the societal impact of generative models.