Poster in Workshop: I Can’t Believe It’s Not Better (ICBINB): Failure Modes in the Age of Foundation Models
An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics
Saba Ahmadi · Aishwarya Agrawal
Recently, reference-free metrics such as CLIPScore (Hessel et al., 2021) and UMIC (Lee et al., 2021) have been proposed for the automatic evaluation of image captions. Our focus lies in evaluating the robustness of these metrics in scenarios that require distinguishing between two captions with high lexical overlap but very different meanings. Our findings reveal that, despite their high correlation with human judgments, both CLIPScore and UMIC struggle to identify fine-grained errors. While both metrics exhibit strong sensitivity to visual grounding errors, their sensitivity to caption implausibility errors is limited. Furthermore, we found that both metrics are sensitive to variations in the size of image-relevant objects mentioned in the caption, while CLIPScore is also quite sensitive to how many times image-relevant objects are mentioned. Regarding linguistic aspects of a caption, both metrics show weak comprehension of negation, UMIC is heavily influenced by caption length, and CLIPScore is largely insensitive to the structure of the caption. We hope our findings will guide further improvements in reference-free evaluation of image captioning.
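As background for the metric under study: CLIPScore (Hessel et al., 2021) rescales the cosine similarity between CLIP's image and caption embeddings as w * max(cos(image, caption), 0) with w = 2.5. A minimal sketch of this computation, assuming the embeddings have already been produced by CLIP's encoders (the toy vectors below are stand-ins, not real CLIP features):

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray, w: float = 2.5) -> float:
    """CLIPScore: w * max(cosine_similarity(image, caption), 0)."""
    cos = float(np.dot(image_emb, text_emb) /
                (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))
    return w * max(cos, 0.0)

# Toy embeddings (stand-ins for real CLIP features):
img = np.array([1.0, 0.0, 0.0])
good_caption = np.array([0.9, 0.1, 0.0])   # nearly aligned with the image
bad_caption = np.array([-1.0, 0.0, 0.0])   # opposite direction, clipped to 0

print(clip_score(img, good_caption))  # close to 2.5
print(clip_score(img, bad_caption))   # 0.0
```

Because the score depends only on a single pooled similarity between the two embeddings, two captions with high lexical overlap but different meanings can land close together in embedding space, which is the failure mode this work probes.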