Workshop
Evaluating Evaluations: Examining Best Practices for Measuring Broader Impacts of Generative AI
Avijit Ghosh · Usman Gohar · Yacine Jernite · Lucie-Aimée Kaffee · Alberto Lusoli · Jennifer Mickel · Irene Solaiman · Arjun Subramonian · Zeerak Talat · Felix Friedrich · Cedric Whitney · Michelle Lin
East Meeting Room 16
Sun 15 Dec, 9:15 a.m. PST
Generative AI systems are becoming increasingly prevalent in society, producing content across modalities such as text, images, audio, and video, with far-reaching implications. The NeurIPS Broader Impact statement has notably shifted norms for AI publications toward considering negative societal impacts. However, no standard exists for how to approach these impact assessments. While new methods for evaluating social impact are being developed, notably through the NeurIPS Datasets and Benchmarks track, the lack of a standard for documenting their applicability and utility, together with their disparate coverage of different social impact categories, stands in the way of broad adoption by developers and researchers of generative AI systems. By bringing together experts on the science and context of evaluation with practitioners who develop and analyze technical systems, we aim to help address this issue through the work of the NeurIPS community.
Schedule
Sun 9:15 a.m. - 9:30 a.m. | Opening Remarks (Oral)
Sun 9:30 a.m. - 10:30 a.m. | Panel Discussion
Sun 10:30 a.m. - 11:30 a.m. | Oral Session 1 (Oral)
Sun 11:30 a.m. - 12:30 p.m. | Oral Session 2 (Oral)
Sun 12:30 p.m. - 2:30 p.m. | Lunch and Poster Session (Poster)
Sun 2:30 p.m. - 3:00 p.m. | Oral Session 3 (Oral)
Sun 3:30 p.m. - 4:05 p.m. | Oral Session 3, part 2 after break (Oral)
Sun 4:05 p.m. - 5:00 p.m. | What's Next - Coalition Development (Oral)
Sun 5:00 p.m. - 5:30 p.m. | Closing Remarks (Oral)
- Surveying Surveys: Surveys’ Role in Evaluating AI’s Labor Market Impact (Poster) | Cassandra Solis
- Evaluating Generative AI Systems is a Social Science Measurement Challenge (Oral) | Hanna Wallach · Meera Desai · Nicholas Pangakis · A. Feder Cooper · Angelina Wang · Solon Barocas · Alexandra Chouldechova · Chad Atalla · Su Lin Blodgett · Emily Corvi · Alex Dow · Jean Garcia-Gathright · Alexandra Olteanu · Stefanie Reed · Emily Sheng · Dan Vann · Jennifer Wortman Vaughan · Matthew Vogel · Hannah Washington · Abigail Jacobs
- Cascaded to End-to-End: New Safety, Security, and Evaluation Questions for Audio Language Models (Oral) | Luxi He · Xiangyu Qi · Inyoung Cheong · Prateek Mittal · Danqi Chen · Peter Henderson
- Can Vision-Language Models Replace Human Annotators: A Case Study with CelebA Dataset (Poster) | Haoming Lu · Feifei Zhong
- Evaluating Refusal (Oral) | Shira Abramovich · Anna J. Ma
- Fairness Dynamics During Training (Poster) | Krishna Patel · Nivedha Sivakumar · Barry-John Theobald · Luca Zappella · Nicholas Apostoloff
- Provocation on Expertise in Social Impact Evaluations for Generative AI (and Beyond) (Poster) | Zoe Kahn · Nitin Kohli
- Is ETHICS about ethics? Evaluating the ETHICS benchmark (Poster) | Leif Hancox-Li · Borhane Blili-Hamelin
- GenAI Evaluation Maturity Framework (GEMF) to assess and improve GenAI Evaluations (Oral) | Yilin Zhang · Frank J. Kanayet
- Democratic Perspectives and Corporate Captures of Crowdsourced Evaluations (Poster) | parth sarin · Michelle Bao
- Motivations for Reframing Large Language Model Benchmarking for Legal Applications (Poster) | Riya Ranjan · Megan Ma
- Evaluations Using Wikipedia without Data Leakage: From Trusting Articles to Trusting Edit Processes (Poster) | Lucie-Aimée Kaffee · Isaac Johnson
- Towards Leveraging News Media to Support Impact Assessment of AI Technologies (Poster) | Mowafak Allaham · Kimon Kieslich · Nicholas Diakopoulos
- AIR-Bench 2024: Safety Evaluation Based on Risk Categories from Regulations and Policies (Oral) | Kevin Klyman
- Rethinking Artistic Copyright Infringements in the Era of Text-to-Image Generative Models (Poster) | Mazda Moayeri · Samyadeep Basu · Sriram Balasubramanian · Priyatham Kattakinda · Atoosa Chegini · Robert Brauneis · Soheil Feizi
- (Mis)use of nude images in machine learning research (Oral) | Arshia Arya · Princessa Cintaqia · Deepak Kumar · Allison McDonald · Lucy Qin · Elissa Redmiles
- Critical human-AI use scenarios and interaction modes for societal impact evaluations (Oral) | Lujain Ibrahim · Saffron Huang · Lama Ahmad · Markus Anderljung
- Gaps Between Research and Practice When Measuring Representational Harms Caused by LLM-Based Systems (Poster) | Emma Harvey · Emily Sheng · Su Lin Blodgett · Alexandra Chouldechova · Jean Garcia-Gathright · Alexandra Olteanu · Hanna Wallach
- Dimensions of Generative AI Evaluation Design (Poster) | Alex Dow · Jennifer Wortman Vaughan · Solon Barocas · Chad Atalla · Alexandra Chouldechova · Hanna Wallach
- LLMs and Personalities: Inconsistencies Across Scales (Poster) | Tommaso Tosato · David Lemay · Mahmood Hegazy · Irina Rish · Guillaume Dumas
- JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark (Oral) | Shota Onohara · Atsuyuki Miyai · Yuki Imajuku · Kazuki Egashira · Jeonghun Baek · Xiang Yue · Graham Neubig · Kiyoharu Aizawa
- Contamination Report for Multilingual Benchmarks (Poster) | Sanchit Ahuja · Varun Gumma · Sunayana Sitaram
- Provocation: Who benefits from “inclusion” in Generative AI? (Oral) | Samantha Dalal · Siobhan Mackenzie Hall · Nari Johnson
- Troubling taxonomies in GenAI evaluation (Poster) | Glen Berman · Ned Cooper · Wesley Deng · Ben Hutchinson
- Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique (Poster) | Suhas Hariharan · Zainab Ali Majid · Jaime Raldua Veuthey · Jacob Haimes
- Statistical Bias in Bias Benchmark Design (Poster) | Hannah Powers · Ioana Baldini · Dennis Wei · Kristin P Bennett
- Assessing Bias in Metric Models for LLM Open-Ended Generation Bias Benchmarks (Poster) | Nathaniel Demchak · Xin Guan · Zekun Wu · Ziyi Xu · Adriano Koshiyama · Emre Kazim
- Using Scenario-Writing for Identifying and Mitigating Impacts of Generative AI (Poster) | Kimon Kieslich · Nicholas Diakopoulos · Natali Helberger
- A Framework for Evaluating LLMs Under Task Indeterminacy (Poster) | Luke Guerdan · Hanna Wallach · Solon Barocas · Alexandra Chouldechova