Poster in Workshop: OPT 2023: Optimization for Machine Learning
On the Synergy Between Label Noise and Learning Rate Annealing in Neural Network Training
Stanley Wei · Tongzheng Ren · Simon Du
In the past decade, stochastic gradient descent (SGD) has emerged as one of the dominant algorithms in neural network training, with enormous success across application scenarios. However, the implicit bias of SGD under different training techniques remains under-explored. Two common heuristics in practice are 1) using a large initial learning rate and decaying it as training progresses, and 2) using mini-batch SGD instead of full-batch gradient descent. In this work, we show that under certain data distributions, both techniques are necessary to obtain good generalization with neural networks. We consider mini-batch SGD with label noise, and at the heart of our analysis lies the concept of feature learning order, which has previously been characterized theoretically by Li et al. (2019) and Abbe et al. (2021). Notably, we use this to give the first concrete separations in generalization guarantees between training neural networks with both label-noise SGD and learning rate annealing, and training with either of these elements removed.
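For concreteness, the sketch below illustrates the two heuristics the abstract refers to: mini-batch SGD whose labels are randomly flipped within each batch (label noise) and a learning rate that starts large and is annealed late in training. It is a minimal illustration only; the toy linear model, the 20% flip rate, and the single step-decay schedule are assumptions for exposition and are not the paper's construction or analysis setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data with +/-1 labels (illustrative only).
n, d = 1000, 20
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = np.sign(X @ w_star)

def grad(w, Xb, yb):
    # Gradient of the average logistic loss (1/m) * sum log(1 + exp(-y w^T x)).
    margins = yb * (Xb @ w)
    coeff = -yb / (1.0 + np.exp(margins))
    return (Xb * coeff[:, None]).mean(axis=0)

w = np.zeros(d)
batch_size = 32
noise_rate = 0.2            # assumed probability of flipping each label in a batch
lr_init, lr_final = 1.0, 0.01
num_steps = 2000
decay_step = 1500           # anneal the learning rate after this many steps

for t in range(num_steps):
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx].copy()

    # Label noise: independently flip each label in the mini-batch.
    flips = rng.random(batch_size) < noise_rate
    yb[flips] *= -1

    # Learning rate annealing: large initial rate, decayed late in training.
    lr = lr_init if t < decay_step else lr_final

    w -= lr * grad(w, Xb, yb)

print(f"clean-label training accuracy: {np.mean(np.sign(X @ w) == y):.3f}")
```

Removing either ingredient in this sketch corresponds to setting `noise_rate = 0` (no label noise) or `lr_final = lr_init` (no annealing), which is the kind of ablation the abstract's separation result contrasts against.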