Poster in Workshop: 5th Workshop on Self-Supervised Learning: Theory and Practice
A Graph Matching Approach to Balanced Data Sub-Sampling for Self-Supervised Learning
Hugues Van Assel · Randall Balestriero
Real-world datasets often exhibit inherent imbalances in the distribution of classes or concepts. Recent studies indicate that such imbalances can degrade the performance of Self-Supervised Learning (SSL) models when they are evaluated across the full spectrum of concepts. To address this issue, we propose a data curation method that automatically selects a balanced subset of the data. We cast this problem as a graph matching task, where the goal is to identify the data subset whose points are most distinct from one another in terms of pairwise similarities. We achieve this by mapping an isolated graph onto the similarity graph of the input data, using the semi-unbalanced Gromov-Wasserstein problem from optimal transport. We show that this problem can be solved with linear complexity and is well suited to GPU acceleration. We validate the effectiveness of our method through experiments on small datasets, setting the stage for future exploration of larger-scale problems.
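To make the selection goal concrete, the following is a minimal sketch of picking a subset whose points have low pairwise similarity on an imbalanced toy dataset. It is not the semi-unbalanced Gromov-Wasserstein solver described in the abstract: it swaps in a simple greedy farthest-point heuristic on cosine similarities, purely for illustration. The function names `greedy_diverse_subset` and `make_imbalanced_toy`, the cluster sizes, and the noise level are all assumptions introduced here, and the sketch assumes every feature row is nonzero.

```python
import numpy as np


def greedy_diverse_subset(X, k):
    """Pick k row indices of X with low pairwise cosine similarity.

    Greedy farthest-point heuristic, used here only as a simple stand-in;
    it is NOT the Gromov-Wasserstein solver from the abstract.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm rows
    # Seed: the point least similar, on average, to the whole dataset.
    # Going through the feature mean keeps this step O(n * d).
    first = int(np.argmin(Xn @ Xn.mean(axis=0)))
    selected = [first]
    # max_sim[i] = highest similarity between point i and any selected point.
    max_sim = Xn @ Xn[first]
    max_sim[first] = np.inf  # never re-select a chosen point
    for _ in range(k - 1):
        nxt = int(np.argmin(max_sim))  # point least similar to the subset
        selected.append(nxt)
        max_sim = np.maximum(max_sim, Xn @ Xn[nxt])
        max_sim[nxt] = np.inf
    return np.array(selected)


def make_imbalanced_toy(sizes=(900, 80, 20), dim=16, noise=0.02, seed=0):
    """Hypothetical toy data: one tight cluster per concept, unequal sizes."""
    rng = np.random.default_rng(seed)
    centers = np.eye(dim)[: len(sizes)]  # orthogonal concept directions
    X = np.concatenate(
        [c + noise * rng.standard_normal((n, dim)) for c, n in zip(centers, sizes)]
    )
    labels = np.repeat(np.arange(len(sizes)), sizes)
    return X, labels


X, labels = make_imbalanced_toy()
idx = greedy_diverse_subset(X, k=3)
print("selected indices:", idx)
print("per-concept counts in subset:", np.bincount(labels[idx], minlength=3))
```

The heuristic never forms the full n-by-n similarity matrix: each step costs one matrix-vector product, so the total cost is O(n k d) for n points, subset size k, and feature dimension d. That scaling is a property of this stand-in only and says nothing about how the method in the abstract attains its linear complexity.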