Poster in Workshop: Attributing Model Behavior at Scale (ATTRIB)
Just Select Twice: Leveraging Low Quality Data to Improve Data Selection
Yifei Zhang · Yusen Jiao · Jiayi Chen · Jieyu Zhang · Frederic Sala
Data valuation is crucial for assessing the impact and quality of individual data points, enabling the ranking of data by importance for efficient data collection, storage, and training. Many data valuation methods are sensitive to outliers and require a certain level of noise to effectively distinguish low-quality data from high-quality data, which makes them particularly useful for data removal tasks. For instance, optimal transport-based methods exhibit notable performance in outlier detection but show only moderate effectiveness in high-quality data selection, which is attributed to their sensitivity to outliers and insensitivity to small variations. To mitigate this insensitivity to high-quality data and facilitate effective data selection, we propose a straightforward two-stage approach, JST, that first performs data valuation as usual and then performs a second round of data selection in which the identified low-quality data points are designated as the validation set for a repeated data valuation. In this way, high-quality data become outliers with respect to the new validation set and naturally stand out. We empirically evaluate our framework, instantiated with an optimal transport-based method, for data selection and data pruning on several standard datasets; it demonstrates superior performance compared to pure data valuation, especially when the noise level is small. Additionally, we show the general applicability of our framework to influence function-based and reinforcement learning-based data valuation methods.
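The two-stage procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The valuation function `toy_value` is a hypothetical stand-in (negative distance to the validation-set mean) used in place of an optimal transport-based method, and the names `just_select_twice`, `n_low`, and `n_select` are assumptions introduced for illustration. Ranking second-round candidates by how far they sit from the low-quality set is likewise an assumption about how "outliers with respect to the new validation set" would be picked.

```python
import numpy as np

def toy_value(train_x, val_x):
    """Hypothetical stand-in valuation (NOT the OT-based method):
    a point's value is the negative distance to the validation-set mean,
    so points far from the validation data receive low values."""
    center = val_x.mean(axis=0)
    return -np.linalg.norm(train_x - center, axis=1)

def just_select_twice(train_x, val_x, n_low, n_select, value_fn=toy_value):
    """Sketch of the two-stage idea from the abstract.

    Stage 1: value training points against the original validation set
             and take the n_low lowest-valued points as low-quality data.
    Stage 2: use those low-quality points as the new validation set,
             value again, and select the points that look most like
             outliers with respect to it (lowest second-round values).
    """
    first_round = value_fn(train_x, val_x)
    low_idx = np.argsort(first_round)[:n_low]

    second_round = value_fn(train_x, train_x[low_idx])
    order = np.argsort(second_round)              # most outlier-like first
    order = order[~np.isin(order, low_idx)]       # never re-select low-quality points
    return low_idx, order[:n_select]

# Toy data: 20 clean points near the origin, 5 noisy points near (5, 5).
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 0.1, size=(20, 2))
noisy = rng.normal(5.0, 0.1, size=(5, 2))
train_x = np.vstack([clean, noisy])
val_x = rng.normal(0.0, 0.1, size=(10, 2))

low_idx, selected = just_select_twice(train_x, val_x, n_low=5, n_select=10)
```

In this toy setup the first round flags the five noisy points (indices 20 to 24), and the second round selects only clean points. Any valuation method that maps a training set and a validation set to per-point scores could be passed as `value_fn`.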