Poster in Workshop: Attributing Model Behavior at Scale (ATTRIB)
Loss-to-Loss Prediction: Language model scaling laws across datasets
David Brandfonbrener · Nikhil Anand · Nikhil Vyas · Eran Malach · Sham Kakade
While scaling laws provide a reliable methodology for predicting train loss across compute scales for a single data distribution, less is known about how to predict losses across distributions. In this paper, we derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets and from pre-training data to downstream task data. Our predictions extrapolate well even at 20x the FLOP budget used to fit the curves. More precisely, we find that there are simple shifted power law relationships between (1) the train losses of two models trained on two separate datasets when the models are paired by training compute, and (2) the train loss and the test loss on any downstream distribution for a single model. The results hold up for pre-training datasets that differ substantially (some are entirely code and others have no code at all) and across a variety of downstream tasks. Finally, we find that in some settings the shifted power law relationships can yield substantially more accurate predictions than extrapolating single-dataset scaling laws.
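As an illustration of what fitting a shifted power law between two losses might look like, here is a minimal sketch. The abstract does not state the exact parameterization, so the form y = K * (x - E_x)^kappa + E_y is an assumption, as are all names (`shifted_power_law`, `fit_shifted_power_law`, `K`, `kappa`, `E_x`, `E_y`) and the fitting procedure (a coarse grid search over the two offsets with a log-space linear fit for the remaining parameters). The data are synthetic and generated inside the example. None of this is taken from the authors' code.

```python
import math

def shifted_power_law(x, K, kappa, E_x, E_y):
    # Assumed form: y = K * (x - E_x)^kappa + E_y, where E_x and E_y act as
    # irreducible-loss offsets on each axis (hypothetical parameterization).
    return K * (x - E_x) ** kappa + E_y

def _fit_log_linear(xs, ys, E_x, E_y):
    # With both offsets fixed, log(y - E_y) = log K + kappa * log(x - E_x)
    # is linear, so ordinary least squares gives K and kappa in closed form.
    u = [math.log(x - E_x) for x in xs]
    v = [math.log(y - E_y) for y in ys]
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suu = sum((a - mu) ** 2 for a in u)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    kappa = suv / suu
    K = math.exp(mv - kappa * mu)
    return K, kappa

def fit_shifted_power_law(xs, ys, steps=40):
    # Grid search over offsets strictly below the smallest observed losses,
    # keeping the candidate with the lowest squared error in the original space.
    x_min, y_min = min(xs), min(ys)
    best = None
    for i in range(steps):
        E_x = x_min * i / steps
        for j in range(steps):
            E_y = y_min * j / steps
            K, kappa = _fit_log_linear(xs, ys, E_x, E_y)
            err = sum(
                (shifted_power_law(x, K, kappa, E_x, E_y) - y) ** 2
                for x, y in zip(xs, ys)
            )
            if best is None or err < best[0]:
                best = (err, K, kappa, E_x, E_y)
    return best[1:]

# Synthetic paired losses (not real results): x could stand for train loss on
# one dataset, y for the loss of the compute-matched model on another dataset.
true_params = (2.0, 0.8, 1.0, 2.0)  # K, kappa, E_x, E_y
xs = [3.5, 3.1, 2.8, 2.5, 2.2, 2.0]
ys = [shifted_power_law(x, *true_params) for x in xs]

K_hat, kappa_hat, Ex_hat, Ey_hat = fit_shifted_power_law(xs, ys)

# Extrapolate to a lower loss than any point used in the fit.
x_new = 1.5
y_pred = shifted_power_law(x_new, K_hat, kappa_hat, Ex_hat, Ey_hat)
y_true = shifted_power_law(x_new, *true_params)
```

The grid search keeps the sketch dependency-free. On real, noisy loss measurements a continuous nonlinear least-squares optimizer with a robust loss would likely be a better choice, and the offsets are often only weakly identified by the data.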