NeurIPS Poster Reproducibility Study of “Quantifying Societal Bias Amplification in Image Captioning”

Poster

Farrukh Baratov · Goksenin Yuksel · Darie Petcu · Jan Bakker

Great Hall & Hall B1+B2 (level 1) #2000

Abstract:

Scope of reproducibility - We study the reproducibility of the paper "Quantifying Societal Bias Amplification in Image Captioning" by Hirota et al. In this paper, the authors propose a new metric to measure bias amplification, called LIC, and evaluate it on multiple image captioning models. Based on this evaluation, they make the following main claims, which we aim to verify: (1) all models amplify gender bias, (2) all models amplify racial bias, (3) LIC is robust against encoders, and (4) the NIC+Equalizer model increases gender bias with respect to the baseline. We also extend the original work by evaluating LIC for age bias.

Methodology - For our reproduction, we were able to run the code provided by the authors without any modifications. For our extension, we automatically labelled the images in the dataset with age annotations and adjusted the code to work with this dataset. In total, 38 GPU hours were needed to perform all experiments.

Results - The reproduced results are close to the original results and support all four main claims. Furthermore, our additional results show that only a subset of the models amplify age bias, and they strengthen the claim that LIC is robust against encoders. However, we acknowledge that our extension to age bias has its limitations.

What was easy - The authors' code and the data needed to run it are publicly available. The code required no modification to run, and the scripts were provided with an extensive argument parser, allowing us to quickly set up our experiments. Moreover, the details of the original experiments were clearly stated in the appendix.

What was difficult - We found the authors' code difficult to interpret, as the provided documentation left room for improvement. Also, the scripts contained repetitive code. While the authors retrained all image captioning models, they did not share the model weights, making it difficult to extend their work.

Communication with original authors - We did not attempt to communicate with the original authors.
