Oral in Workshop: Evaluating Evaluations: Examining Best Practices for Measuring Broader Impacts of Generative AI

AIR-Bench 2024: Safety Evaluation Based on Risk Categories from Regulations and Policies

Kevin Klyman

Keywords: [ regulation ] [ benchmarking ] [ foundation model ] [ AI safety ] [ AI policy ]


Abstract:

Governments, companies, and researchers have proposed regulatory frameworks, acceptable use policies, and safety benchmarks in response to the risks of foundation models (FMs). However, existing public benchmarks often define safety categories based solely on prior literature or researchers' intuitions, producing risk categorizations that correspond neither to existing regulations nor to developers' own policies and that make it difficult to compare FMs across benchmarks. To bridge this gap, we introduce AIR-Bench 2024, among the first AI safety benchmarks explicitly drawn from government and company policies. Its underlying taxonomy, AIR 2024, decomposes 8 government regulations and 16 company policies into a four-tiered safety taxonomy with 314 granular risk categories in the lowest tier. We examine the gap between the risks considered by leading AI safety benchmarks and those included in government and company policies, finding that these benchmarks address at most 71% of the higher-level risk categories explicitly referenced in government and company policies, and do not address risks related to discrimination, non-consensual intimate imagery (NCII), or automated decision-making in high-risk economic sectors. To help close this gap, we evaluate leading language models on AIR-Bench 2024, providing insights into how sensitive content is treated in different jurisdictions.
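To make the taxonomy and coverage measurement concrete, the sketch below shows one plausible way to model a tiered risk taxonomy and compute the fraction of policy-derived categories a benchmark covers. It is a minimal, hypothetical illustration: the class, helper, and category names are assumptions for exposition, not the paper's released code or its actual 314 category labels.

```python
# Hypothetical sketch (not the authors' code): a tiered risk taxonomy whose
# leaves are the granular lowest-tier categories, plus a coverage metric
# analogous to the abstract's "at most 71%" figure.
from dataclasses import dataclass, field


@dataclass
class RiskCategory:
    """A node in the taxonomy; leaf nodes are granular risk categories."""
    name: str
    children: list["RiskCategory"] = field(default_factory=list)

    def leaves(self):
        """Yield the lowest-tier (granular) risk categories under this node."""
        if not self.children:
            yield self
        else:
            for child in self.children:
                yield from child.leaves()


# Illustrative placeholder categories only, not the real taxonomy contents.
taxonomy = RiskCategory("AIR 2024", [
    RiskCategory("System & Operational Risks", [
        RiskCategory("Security Risks", [
            RiskCategory("Confidentiality"),
            RiskCategory("Integrity"),
        ]),
    ]),
])


def coverage(policy_categories: set[str], benchmark_categories: set[str]) -> float:
    """Fraction of policy-derived risk categories a benchmark addresses."""
    return len(policy_categories & benchmark_categories) / len(policy_categories)


policy = {leaf.name for leaf in taxonomy.leaves()}
print(coverage(policy, benchmark_categories={"Confidentiality"}))  # 0.5
```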
