Poster in Workshop: Optimization for ML Workshop
Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks
Shikai Qiu · Atish Agarwala · Lechao Xiao · Jeffrey Pennington
Studies of scaling ladders have shown that the compute-optimal Pareto frontier of a family of training curves can have a predictable shape, often some kind of power law. We use a series of small transformer models to demonstrate that the full learning curves themselves have a consistent shape, collapsing onto a single universal curve after a simple rescaling. Surprisingly, the deviations of the rescaled curves across model sizes are smaller than the deviations induced by random initialization and data ordering in the raw learning curves, a phenomenon we call supercollapse. We reproduce this phenomenon in the simple setting of MLP regression on a synthetic dataset. By analyzing both the original model and our simplified model, we identify necessary conditions for supercollapse, including compute-optimal training with a power-law loss-compute Pareto frontier, learning rate decay, and a suitable fitting procedure for the irreducible loss. Our study hints at a broader dynamical universality induced by compute-optimal scaling procedures.
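To make the rescaling idea concrete, here is a minimal sketch of one plausible procedure: fit a power law with an irreducible-loss offset, L(C) = L_irr + a * C^(-b), to the compute-optimal Pareto frontier, then normalize each model's curve by its final compute budget and its excess loss over L_irr. The helper names (`fit_irreducible_loss`, `rescale_curve`) and the specific normalization choices are assumptions for illustration, not the authors' exact recipe.

```python
import numpy as np

def fit_irreducible_loss(pareto_compute, pareto_loss):
    """Fit L(C) = L_irr + a * C**(-b) to the compute-optimal Pareto frontier
    via a simple grid search over L_irr (hypothetical helper, not the paper's
    fitting procedure)."""
    candidates = np.linspace(0.0, pareto_loss.min(), 200, endpoint=False)
    best_L_irr, best_err = 0.0, np.inf
    for L_irr in candidates:
        # With L_irr fixed, the excess loss is a pure power law, so it is
        # linear in log-log space and can be fit with a 1-D polyfit.
        y = np.log(pareto_loss - L_irr)
        x = np.log(pareto_compute)
        slope, intercept = np.polyfit(x, y, 1)
        err = np.mean((y - (slope * x + intercept)) ** 2)
        if err < best_err:
            best_L_irr, best_err = L_irr, err
    return best_L_irr

def rescale_curve(compute, loss, L_irr):
    """Rescale one model's learning curve: normalize compute by the final
    (compute-optimal) budget and normalize the excess loss over L_irr by its
    final value. This is one illustrative choice of rescaling."""
    excess = loss - L_irr
    return compute / compute[-1], excess / excess[-1]
```

Plotting the outputs of `rescale_curve` for models of different sizes on shared axes is then the kind of test that would reveal, or refute, a collapse onto a single curve.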