Poster
in
Workshop: Evaluating Evaluations: Examining Best Practices for Measuring Broader Impacts of Generative AI

Statistical Bias in Bias Benchmark Design

Hannah Powers · Ioana Baldini · Dennis Wei · Kristin P Bennett

Keywords: [ confounding ] [ omitted variable bias ] [ statistical bias ] [ experimental design ]


Abstract:

Social bias benchmarks lack a consistent framework to standardize practices; most center their design and creation on a subset of social biases and aim to test whether language models exhibit them. Current work also often overlooks statistical biases in the benchmark itself, which, when unaccounted for, can lead to inaccurate conclusions in the analysis of benchmark results. We advocate treating benchmark creation as a multi-factor problem and, to support this perspective, propose an experimental approach inspired by health informatics. We recommend that researchers be aware of the potential for statistical biases during benchmark design and analysis. We demonstrate the importance of formalizing explanatory factors and give examples of statistical biases and their possible effects using the BBQ benchmark.
