Poster
Analyzing & Reducing the Need for Learning Rate Warmup in Neural Network Optimization
Atli Kosson · Bettina Messmer · Martin Jaggi
Learning Rate Warmup is a popular heuristic for training neural networks, especially at larger batch sizes. By definition, it aids training by downscaling early model updates, but it remains unclear why, and by which criteria, early updates are too large. This work empirically explores this question for small-scale GPT training by measuring and controlling the update size via various metrics. We find the relative update magnitude of weight matrices to be particularly informative; explicitly controlling this simple quantity can sometimes be sufficient to replace warmup, especially when combined with high momentum. However, parameter-based measures fail to account for changes in the "critical batch size" or the signal-to-noise ratio of the gradient throughout training, which warmup can also help counteract. We show how quantifying the update magnitude in terms of neural representations offers a promising approach to address this and effectively control the update size, reducing the need for explicit warmup.
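To make the "relative update magnitude" notion concrete, here is a minimal sketch (not the authors' implementation) of rescaling each weight-matrix update so that its Frobenius norm is a fixed fraction of the current weight norm, i.e. ||ΔW|| / ||W|| is held at a target value instead of relying on warmup. The function name, the target value, and the decision to skip non-matrix parameters are illustrative assumptions.

```python
# Sketch: explicitly controlling the relative update size ||dW|| / ||W||
# of weight matrices, as a stand-in for learning rate warmup.
# Names and hyperparameters are hypothetical, not the paper's code.
import torch

def relative_update_step(params, target_rel=1e-3, eps=1e-12):
    """Rescale each gradient so the applied update has norm
    target_rel * ||W||, then apply it as a plain gradient step."""
    with torch.no_grad():
        for p in params:
            if p.grad is None or p.ndim < 2:
                # Biases / gains would be handled by a standard optimizer.
                continue
            g = p.grad
            scale = target_rel * p.norm() / (g.norm() + eps)
            p -= scale * g  # now ||scale * g|| ~= target_rel * ||p||

# Toy usage:
model = torch.nn.Linear(16, 16)
x, y = torch.randn(8, 16), torch.randn(8, 16)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
relative_update_step(list(model.parameters()))
```

A momentum buffer could be rescaled in the same way; the abstract notes that such parameter-based control works best when combined with high momentum, but does not by itself account for gradient noise effects that warmup also mitigates.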