

Poster

Intrinsic Self-Supervision for Data Quality Audits

Fabian Gröger · Simone Lionetti · Philippe Gottfrois · Alvaro Gonzalez-Jimenez · Ludovic Amruthalingam · Matthew Groh · Alexander Navarini · Marc Pouly

Thu 12 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

Benchmark datasets in computer vision often contain off-topic images, near duplicates, and label errors, leading to inaccurate estimates of model performance. In this paper, we revisit the task of data cleaning and formalize it as either a ranking problem, which significantly reduces human inspection effort, or a scoring problem, which allows for automated decisions based on score distributions. We find that a specific combination of context-aware self-supervised representation learning and distance-based indicators is effective in finding issues without annotation biases. This methodology, which we call SelfClean, surpasses state-of-the-art performance in detecting off-topic images, near duplicates, and label errors within widely used computer vision datasets, such as ImageNet-1k, Food-101N, and STL-10, both for synthetic issues and real contamination. We apply the detailed method to multiple image benchmarks, identify up to 16% of issues, and confirm an improvement in evaluation reliability upon cleaning.
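To make the ranking formulation concrete, the following is a minimal sketch of how distance-based indicators over precomputed embeddings could order candidates for human inspection. It is an illustration under stated assumptions, not the SelfClean implementation: the embeddings are assumed to come from some self-supervised encoder trained on the dataset being audited, and the function names (`rank_near_duplicate_pairs`, `rank_off_topic`), the choice of cosine distance, and the mean k-nearest-neighbor distance as an isolation score are all hypothetical stand-ins for the indicators used in the paper.

```python
import numpy as np


def cosine_distance_matrix(embeddings):
    """Pairwise cosine distances between row vectors."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return 1.0 - x @ x.T


def rank_near_duplicate_pairs(embeddings):
    """Rank all pairs (i, j), i < j, from most to least similar.

    Assumption: a small distance in embedding space suggests a near duplicate.
    """
    d = cosine_distance_matrix(embeddings)
    i, j = np.triu_indices(d.shape[0], k=1)
    order = np.argsort(d[i, j], kind="stable")
    return [(int(i[o]), int(j[o]), float(d[i[o], j[o]])) for o in order]


def rank_off_topic(embeddings, k=2):
    """Rank samples from most to least isolated.

    Assumption: the mean distance to the k nearest neighbors serves as a
    simple isolation score; this is a stand-in, not the paper's indicator.
    """
    d = cosine_distance_matrix(embeddings)
    np.fill_diagonal(d, np.inf)  # ignore each sample's distance to itself
    knn = np.sort(d, axis=1)[:, :k]
    score = knn.mean(axis=1)
    order = np.argsort(-score, kind="stable")
    return [int(o) for o in order]


# Toy embeddings: samples 0 and 1 are almost identical, sample 4 is isolated.
toy = np.array([
    [1.00, 0.00, 0.0],
    [0.99, 0.01, 0.0],
    [0.80, 0.60, 0.0],
    [0.70, 0.70, 0.1],
    [0.00, 0.00, 1.0],
])
top_pair = rank_near_duplicate_pairs(toy)[0]
most_isolated = rank_off_topic(toy, k=2)[0]
```

A reviewer would inspect the head of each ranking first, which is the sense in which a ranking reduces inspection effort. The scoring formulation would instead apply a threshold to the same scores for automated decisions.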
