

Poster

SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models

Yu Yang · Siddhartha Mishra · Jeffrey Chiang · Baharan Mirzasoleiman

Wed 11 Dec 11 a.m. PST — 2 p.m. PST

Abstract: Despite the effectiveness of data selection for large language models (LLMs) during pretraining and instruction fine-tuning phases, improving data efficiency in supervised fine-tuning (SFT) for specialized domains poses significant challenges due to the complexity of fine-tuning data. To bridge this gap, we introduce an effective and scalable data selection method for SFT, SmallToLarge (S2L), which leverages training trajectories from small models to guide the data selection for larger models. We theoretically prove that samples within the same loss trajectory cluster exhibit similar gradients during training, which justifies S2L's approach of efficiently approximating the full dataset's gradient by training only on selected subsets. We demonstrate through extensive experiments that S2L significantly improves data efficiency in SFT for mathematical problem-solving, reducing the training data to just $11$% of the original MathInstruct dataset to match full-dataset performance while outperforming state-of-the-art data selection algorithms by an average of $4.7$% across 6 in-domain and out-of-domain evaluation datasets. Remarkably, selecting only 50K examples for SFT, S2L achieves $32.7$% accuracy on the challenging MATH benchmark, improving over Phi-2 by $16.6$%. In clinical text summarization on the MIMIC-III dataset, S2L again outperforms training on the full dataset while using only $50$% of the data. Notably, S2L can perform scalable data selection using a reference model $100\times$ smaller than the target model, proportionally reducing the computational cost.
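To make the selection procedure concrete, below is a minimal sketch of an S2L-style pipeline based only on the abstract's description: record each example's loss at several checkpoints of a small proxy model, cluster those loss trajectories, and spread the selection budget across clusters (since examples in the same trajectory cluster are argued to have similar gradients). The function name `select_subset`, the use of k-means, the balanced per-cluster sampling, and all parameter names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative S2L-style data selection sketch (assumptions, not the paper's code):
# 1) loss_trajectories[i] holds example i's loss at several small-model checkpoints;
# 2) trajectories are clustered (k-means here as one possible choice);
# 3) the selection budget is drawn roughly evenly across clusters, with leftover
#    budget rolling over from small clusters to larger ones.
import numpy as np
from sklearn.cluster import KMeans


def select_subset(loss_trajectories: np.ndarray, budget: int,
                  n_clusters: int = 100, seed: int = 0) -> np.ndarray:
    """loss_trajectories: (n_examples, n_checkpoints) losses from a small proxy model.
    Returns indices of the selected examples (at most `budget` of them)."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(
        loss_trajectories
    )

    selected = []
    # Visit clusters from smallest to largest and take an (approximately) equal
    # share from each; unused budget is passed on to the remaining clusters.
    clusters = sorted(range(n_clusters), key=lambda c: int(np.sum(labels == c)))
    remaining_budget, remaining_clusters = budget, len(clusters)
    for c in clusters:
        members = np.flatnonzero(labels == c)
        take = min(len(members), remaining_budget // max(remaining_clusters, 1))
        selected.extend(rng.choice(members, size=take, replace=False))
        remaining_budget -= take
        remaining_clusters -= 1
    return np.asarray(selected)


if __name__ == "__main__":
    # Synthetic example: 10,000 examples, 5 checkpoints, select 500.
    fake_trajectories = np.random.rand(10_000, 5)
    idx = select_subset(fake_trajectories, budget=500, n_clusters=20)
    print(len(idx), "examples selected")
```

Under this reading, the expensive part (recording loss trajectories) runs only on the small reference model, which is how a proxy $100\times$ smaller than the target model could keep selection cost low.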
