Spotlight Poster
Scaling Laws and Compute-Optimal Training without Fixed Training Duration
Alex Hägele · Elie Bakouch · Atli Kosson · Loubna Ben Allal · Leandro Von Werra · Martin Jaggi
Scale has become a main ingredient in obtaining strong machine learning models. As a result, understanding a model's scaling properties is key to effectively designing both the right training setup and future generations of architectures. In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule, which prevents training across different lengths for the same model size. We investigate the training behavior of a direct alternative -- a constant learning rate followed by a cooldown -- and find that it scales predictably and reliably, similarly to cosine. Additionally, we show that stochastic weight averaging yields almost optimal models along the training trajectory, without additional training costs, across different scales. With these findings, we demonstrate that scaling experiments can be done with significantly reduced compute and GPU hours by utilizing fewer but reusable training runs.
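The constant-learning-rate-plus-cooldown schedule described in the abstract can be sketched as a simple step-to-LR function. This is a minimal illustration only: the function name, the linear warmup and linear cooldown shapes, and the default fractions are assumptions for the sketch, not the authors' exact recipe.

```python
def lr_at_step(step, total_steps, peak_lr=1e-3,
               warmup_steps=100, cooldown_frac=0.2):
    """Constant LR with linear warmup and a linear cooldown to zero.

    A sketch of the schedule family discussed in the paper; the exact
    cooldown shape and hyperparameters used there may differ.
    """
    cooldown_start = int(total_steps * (1 - cooldown_frac))
    if step < warmup_steps:
        # linear warmup from 0 to peak_lr
        return peak_lr * step / warmup_steps
    if step < cooldown_start:
        # constant phase: the run can be extended or stopped here,
        # which is what makes runs reusable across training lengths
        return peak_lr
    # linear decay to zero over the final cooldown window
    return peak_lr * (total_steps - step) / (total_steps - cooldown_start)
```

Because the learning rate is flat until the cooldown begins, a single long run can be "branched" at several points by launching short cooldowns from intermediate checkpoints, rather than retraining from scratch with a cosine schedule for each target duration.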