Poster in Workshop: Evaluating Evaluations: Examining Best Practices for Measuring Broader Impacts of Generative AI
Statistical Bias in Bias Benchmark Design
Hannah Powers · Ioana Baldini · Dennis Wei · Kristin P Bennett
Keywords: [ confounding ] [ omitted variable bias ] [ statistical bias ] [ experimental design ]
Social bias benchmarks lack a consistent framework to standardize design practices; most center on a subset of social biases and test whether language models exhibit them. Current work also often overlooks statistical biases in the benchmarks themselves, which, when left unaccounted for, can lead to inaccurate conclusions in the analysis of benchmark results. We advocate treating benchmark creation as a multi-factor problem and propose an experimental approach inspired by health informatics. We recommend that researchers remain alert to potential statistical biases during benchmark design and analysis. We demonstrate the importance of formalizing explanatory factors, using BBQ to give examples of statistical biases and their possible effects.
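One of the statistical biases named in the keywords, omitted variable bias, can be illustrated with a toy simulation (this sketch is not from the paper; the variable names and effect sizes are hypothetical). A confounder that drives both an explanatory factor and the measured outcome will, if left out of the analysis, distort the estimated effect of that factor:

```python
import numpy as np

# Illustrative sketch of omitted variable bias (assumed toy setup, not the
# paper's method). A confounder z influences both the factor x and the
# outcome y; the true effect of x on y is 1.0.
rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)                       # confounder
x = z + rng.normal(size=n)                   # factor correlated with z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # true coefficient on x is 1.0

# Correctly specified model: regress y on both x and z.
X_full = np.column_stack([x, z])
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# Misspecified model: omit the confounder z. The estimate for x then
# absorbs part of z's effect and is biased away from 1.0.
beta_omit = np.linalg.lstsq(x[:, None], y, rcond=None)[0]

print(f"with confounder:    {beta_full[0]:.2f}")   # close to 1.0
print(f"without confounder: {beta_omit[0]:.2f}")   # close to 2.0
```

The same mechanism applies to benchmark analysis: if an unmodeled factor (e.g., question phrasing or context ambiguity) correlates with both the demographic group being probed and the model's response, conclusions about social bias drawn from the raw scores can be systematically distorted.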