NeurIPS A Graph Matching Approach to Balanced Data Sub-Sampling for Self-Supervised Learning

Poster in Workshop: 5th Workshop on Self-Supervised Learning: Theory and Practice

Hugues Van Assel · Randall Balestriero

Abstract:

Real-world datasets often display inherent imbalances in the distribution of classes or concepts. Recent studies indicate that such imbalances can lead to suboptimal performance of Self-Supervised Learning (SSL) models when evaluated across the full spectrum of concepts. To address this issue, we propose a data curation method that automatically selects a balanced subset of the data. We approach this problem as a graph matching task, where the goal is to identify the data subset whose elements are most distinct from one another in terms of pairwise similarities. We achieve this by mapping an isolated graph onto the similarity graph of the input data, using the semi-unbalanced Gromov-Wasserstein optimal transport problem. We demonstrate that this problem can be solved with linear complexity and is well-suited for GPU acceleration. The effectiveness of our method is validated through experiments on small datasets, setting the stage for future exploration on larger-scale problems.
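
As a rough illustration of the selection criterion only: if each selected point is matched to one node of a graph without edges, one plausible reading of the objective is that pairwise similarities inside the chosen subset should be as close to zero as possible. The sketch below, in Python with NumPy, scores a subset by its summed squared off-diagonal cosine similarity and builds a subset greedily. The function names, the choice of cosine similarity, and the greedy heuristic are all assumptions made for illustration. This is not the authors' semi-unbalanced Gromov-Wasserstein solver, and it does not have the linear complexity stated in the abstract, since it forms the full similarity matrix.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarities between the rows of X (assumed similarity graph)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)  # guard against zero-norm rows
    return Xn @ Xn.T


def subset_distinctness_cost(S, idx):
    """Sum of squared off-diagonal similarities within the subset idx.

    A cost of zero means the selected points are mutually dissimilar,
    like the nodes of a graph with no edges.
    """
    sub = S[np.ix_(idx, idx)]
    off_diagonal = sub - np.diag(np.diag(sub))
    return float((off_diagonal ** 2).sum())


def greedy_distinct_subset(S, m, start=0):
    """Hypothetical greedy heuristic (not an optimal transport solver).

    Starting from `start`, repeatedly add the point with the smallest
    accumulated squared similarity to the points already selected.
    """
    n = S.shape[0]
    if not 1 <= m <= n:
        raise ValueError("m must satisfy 1 <= m <= number of points")
    selected = [start]
    acc = S[start] ** 2      # squared similarity of every point to the selection
    acc[start] = np.inf      # never pick the same point twice
    for _ in range(m - 1):
        j = int(np.argmin(acc))
        selected.append(j)
        acc = acc + S[j] ** 2
        acc[j] = np.inf
    return selected


if __name__ == "__main__":
    # Toy imbalanced data: eight copies of one concept, one copy each of two others.
    X = np.array([[1, 0, 0]] * 8 + [[0, 1, 0], [0, 0, 1]], dtype=float)
    S = cosine_similarity_matrix(X)
    chosen = greedy_distinct_subset(S, m=3)
    print(chosen, subset_distinctness_cost(S, chosen))
```

On this toy input the greedy pass picks one point per concept, whereas a subset drawn only from the dominant concept has a strictly positive cost.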
