

Poster

SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models

Yu Yang · Siddhartha Mishra · Jeffrey Chiang · Baharan Mirzasoleiman

Wed 11 Dec 11 a.m. PST — 2 p.m. PST

Abstract: Despite the effectiveness of data selection for large language models (LLMs) during pretraining and instruction fine-tuning phases, improving data efficiency in supervised fine-tuning (SFT) for specialized domains poses significant challenges due to the complexity of fine-tuning data. To bridge this gap, we introduce an effective and scalable data selection method for SFT, SmallToLarge (S2L), which leverages training trajectories from small models to guide the data selection for larger models. We theoretically prove that samples within the same loss trajectory cluster exhibit similar gradients during training, which justifies S2L's approach of efficiently approximating the full dataset's gradient by training only on selected subsets. We demonstrate through extensive experiments that S2L significantly improves data efficiency in SFT for mathematical problem-solving, reducing the training data to just $11$% of the original MathInstruct dataset to match full-dataset performance while outperforming state-of-the-art data selection algorithms by an average of $4.7$% across 6 in-domain and out-of-domain evaluation datasets. Remarkably, selecting only 50K examples for SFT, S2L achieves $32.7$% accuracy on the challenging MATH benchmark, improving over Phi-2 by $16.6$%. In clinical text summarization on the MIMIC-III dataset, S2L again outperforms training on the full dataset while using only $50$% of the data. Notably, S2L can perform scalable data selection using a reference model $100\times$ smaller than the target model, proportionally reducing the computational cost.
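To make the selection procedure concrete, below is a minimal sketch of an S2L-style pipeline based only on the abstract's description: record each example's loss at several checkpoints of a small proxy model, cluster those loss trajectories, and spread the selection budget across clusters (since examples in the same trajectory cluster are argued to have similar gradients). The function name `select_subset`, the use of k-means, the balanced per-cluster sampling, and all parameter names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative S2L-style data selection sketch (assumptions, not the paper's code):
# 1) loss_trajectories[i] holds example i's loss at several small-model checkpoints;
# 2) trajectories are clustered (k-means here as one possible choice);
# 3) the selection budget is drawn roughly evenly across clusters, with leftover
#    budget rolling over from small clusters to larger ones.
import numpy as np
from sklearn.cluster import KMeans


def select_subset(loss_trajectories: np.ndarray, budget: int,
                  n_clusters: int = 100, seed: int = 0) -> np.ndarray:
    """loss_trajectories: (n_examples, n_checkpoints) losses from a small proxy model.
    Returns indices of the selected examples (at most `budget` of them)."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(
        loss_trajectories
    )

    selected = []
    # Visit clusters from smallest to largest and take an (approximately) equal
    # share from each; unused budget is passed on to the remaining clusters.
    clusters = sorted(range(n_clusters), key=lambda c: int(np.sum(labels == c)))
    remaining_budget, remaining_clusters = budget, len(clusters)
    for c in clusters:
        members = np.flatnonzero(labels == c)
        take = min(len(members), remaining_budget // max(remaining_clusters, 1))
        selected.extend(rng.choice(members, size=take, replace=False))
        remaining_budget -= take
        remaining_clusters -= 1
    return np.asarray(selected)


if __name__ == "__main__":
    # Synthetic example: 10,000 examples, 5 checkpoints, select 500.
    fake_trajectories = np.random.rand(10_000, 5)
    idx = select_subset(fake_trajectories, budget=500, n_clusters=20)
    print(len(idx), "examples selected")
```

Under this reading, the expensive part (recording loss trajectories) runs only on the small reference model, which is how a proxy $100\times$ smaller than the target model could keep selection cost low.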
