Tutorial
Advancing Data Selection for Foundation Models: From Heuristics to Principled Methods
Jiachen (Tianhao) Wang · Ludwig Schmidt · Ruoxi Jia
West Ballroom C
Data selection is a critical step in training and fine-tuning foundation models, with a significant effect on both model performance and training efficiency. Existing data curation pipelines for foundation models rely primarily on heuristic methods, which are practical but often lack a theoretical basis and can lead to suboptimal performance. This tutorial aims to bridge the gap between heuristic practices and emerging principled methods that offer systematic, theoretically grounded approaches to data selection.
We will begin by discussing the algorithmic foundations of data selection: attribution-based approaches, diversity-based approaches, and methods that directly optimize for final model performance. These techniques will be introduced as instantiations of a single unified framework, utility function maximization. Next, we will review the data selection techniques currently deployed in foundation model training pipelines, such as rule-based data filtering, and examine their strengths and limitations. Finally, we will introduce recent advances in principled data selection methods for foundation models, covering both data-point-level and source-level selection. By the end of this tutorial, attendees will have a deeper understanding of the theoretical underpinnings of data selection, practical knowledge of current data selection heuristics for foundation models, and insight into the research frontier in principled data selection techniques.
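To make the utility-maximization view concrete, here is a minimal sketch of one diversity-based instantiation. It assumes a facility-location utility over RBF similarities and a simple greedy maximizer. The utility, the kernel, the toy data, and all function names are illustrative assumptions, not the specific methods presented in the tutorial.

```python
import math

def rbf_similarity(points):
    """Pairwise similarity exp(-squared distance); values lie in (0, 1]."""
    return [
        [math.exp(-sum((a - b) ** 2 for a, b in zip(p, q))) for q in points]
        for p in points
    ]

def utility(sim, selected):
    """Facility-location utility: how well the selected set covers every point."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)

def greedy_select(sim, k):
    """Greedily add the candidate with the largest marginal utility gain."""
    n = len(sim)
    selected = []
    cover = [0.0] * n  # best similarity of each point to the current selection
    for _ in range(min(k, n)):
        best_j, best_gain = None, -1.0
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: total improvement in coverage.
            gain = sum(max(sim[i][j] - cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        cover = [max(cover[i], sim[i][best_j]) for i in range(n)]
    return selected

# Hypothetical toy data: two well-separated clusters of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
sim = rbf_similarity(points)
selected = greedy_select(sim, 2)
print(selected, utility(sim, selected))
```

With a budget of two, the greedy step picks one point from each cluster, because a second point from an already covered cluster adds almost no utility. Swapping in a different utility function (for example, one built from attribution scores) changes the selection criterion while leaving the maximization loop unchanged.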