NeurIPS Poster Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training

Poster

Atli Kosson · Bettina Messmer · Martin Jaggi

East Exhibit Hall A-C #1810

Thu 12 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract: Learning Rate Warmup is a popular heuristic for training neural networks, especially at larger batch sizes, despite limited understanding of its benefits. Warmup decreases the update size $\Delta \mathbf{w}_t = \eta_t \mathbf{u}_t$ early in training by using lower values for the learning rate $\eta_t$. In this work we argue that warmup benefits training by keeping the overall size of $\Delta \mathbf{w}_t$ limited, counteracting large initial values of $\mathbf{u}_t$. Focusing on small-scale GPT training with AdamW/Lion, we explore the following question: *Why and by which criteria are early updates $\mathbf{u}_t$ too large?* We analyze different metrics for the update size including the $\ell_2$-norm, resulting directional change, and impact on the representations of the network, providing a new perspective on warmup. In particular, we find that warmup helps counteract large angular updates as well as a limited critical batch size early in training. Finally, we show that the need for warmup can be significantly reduced or eliminated by modifying the optimizer to explicitly normalize $\mathbf{u}_t$ based on the aforementioned metrics.
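The quantities named in the abstract can be made concrete with a small sketch. The code below is illustrative only and is not the authors' implementation: it assumes a linear warmup schedule for $\eta_t$, and the helper names (`warmup_lr`, `update_metrics`, `angular_change`) are hypothetical. It computes two of the update-size metrics mentioned above for a single step $\Delta \mathbf{w}_t = \eta_t \mathbf{u}_t$: the $\ell_2$-norm and the resulting angular change of the weight vector.

```python
import math


def warmup_lr(step, peak_lr, warmup_steps):
    """Linear warmup (an assumed schedule): ramp eta_t up to peak_lr."""
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, (step + 1) / warmup_steps)


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def angular_change(w, delta_w):
    """Angle in radians between w and w + delta_w."""
    w_new = [a + b for a, b in zip(w, delta_w)]
    dot = sum(a * b for a, b in zip(w, w_new))
    cos = dot / (l2_norm(w) * l2_norm(w_new))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, cos)))


def update_metrics(w, u, lr):
    """Size metrics for the update delta_w = lr * u applied to weights w."""
    delta_w = [lr * x for x in u]
    return {
        "l2": l2_norm(delta_w),
        "relative": l2_norm(delta_w) / l2_norm(w),
        "angle": angular_change(w, delta_w),
    }


if __name__ == "__main__":
    w = [1.0, 0.0]   # toy weight vector
    u = [0.0, 1.0]   # toy optimizer update direction
    for step in (0, 4, 9):
        lr = warmup_lr(step, peak_lr=1.0, warmup_steps=10)
        print(step, lr, update_metrics(w, u, lr))
```

In this toy setting, a smaller $\eta_t$ early in the schedule directly shrinks both the norm and the angle of the update, which is the effect the abstract attributes to warmup. Normalizing $\mathbf{u}_t$ by such a metric instead of lowering $\eta_t$ would be one way to read the abstract's final sentence, though the exact normalization used in the paper is not specified here.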
