

Poster
in
Workshop: Algorithmic Fairness through the lens of Metrics and Evaluation

Towards Reliable Fairness Assessments of Multi-Label Image Classifiers

Melissa Hall · Bobbie Chern · Laura Gustafson · Denisse Ventura · Harshad Kulkarni · Candace Ross · Nicolas Usunier

Keywords: [ Case studies ] [ Evaluation Methods and Techniques ] [ Metrics and Evaluation ] [ Computer Vision ]

presentation: Algorithmic Fairness through the lens of Metrics and Evaluation
Sat 14 Dec 9 a.m. PST — 5:30 p.m. PST

Abstract:

Benchmarks and metrics in computer vision have been used to measure the fairness of image classification systems, primarily for attribute or single-label classification. However, there is a lack of discussion of the vulnerabilities of these measurements for more complex computer vision tasks, such as recognition of many co-occurring classes. As vision models expand and become ubiquitous, it is even more important that our disparity assessments accurately reflect the true performance of models. In this paper, we demonstrate the impact of design choices on the observed fairness of vision models, using multi-label image classification as our test case. We identify several design choices, including the assignment of groups to images, mapping of classes between dataset and model, disaggregation of common metrics across groups, and balancing co-occurrence of labels, that look like mere implementation details but significantly impact the conclusions of assessments. Through two case studies, we show that naive implementations of these choices are brittle, impacting which concepts show a disparity and which demographic groups appear to be treated more favorably. Furthermore, concepts with large performance disparities between demographic groups tend to have varying definitions and representations between groups, with inconsistencies across datasets and annotators. We close with recommendations for model evaluation and dataset construction to increase the reliability of future fairness assessments of vision models.
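One of the design choices the abstract names, disaggregating a common metric across demographic groups, can be sketched as follows. This is a minimal illustration with hypothetical data, not the paper's method; it assumes one group label per image, which is itself one of the assignment choices the abstract flags as consequential.

```python
from collections import defaultdict

def disaggregated_recall(samples):
    """Compute per-concept recall separately for each demographic group.

    `samples` is a hypothetical list of dicts with keys:
      'group' - demographic group assigned to the image (assumes one
                group per image; real assessments must decide how to
                handle multi-person images)
      'true'  - set of ground-truth labels for the image
      'pred'  - set of labels predicted by the multi-label classifier
    """
    hits = defaultdict(int)    # (group, label) -> true positives
    totals = defaultdict(int)  # (group, label) -> ground-truth positives
    for s in samples:
        for label in s['true']:
            totals[(s['group'], label)] += 1
            if label in s['pred']:
                hits[(s['group'], label)] += 1
    return {key: hits[key] / totals[key] for key in totals}

# Toy data: the same concept 'hat' is recalled at different rates
# for two groups, surfacing a per-concept disparity.
samples = [
    {'group': 'A', 'true': {'hat', 'dog'}, 'pred': {'hat', 'dog'}},
    {'group': 'A', 'true': {'hat'},        'pred': {'hat'}},
    {'group': 'B', 'true': {'hat', 'dog'}, 'pred': {'dog'}},
    {'group': 'B', 'true': {'hat'},        'pred': {'hat'}},
]
recalls = disaggregated_recall(samples)
# recalls[('A', 'hat')] == 1.0, recalls[('B', 'hat')] == 0.5
```

Note that this toy disaggregation makes no attempt to balance label co-occurrence across groups; as the abstract argues, such seemingly minor choices can change which concepts appear disparate.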
