Poster
in
Workshop: Mathematics of Modern Machine Learning (M3L)
Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult
Yuqing Wang · Zhenghao Xu · Tuo Zhao · Molei Tao
Large learning rates, when applied to gradient descent for nonconvex optimization, yield various implicit biases including edge of stability (Cohen et al., 2021), balancing (Wang et al., 2022), and catapult (Lewkowycz et al., 2020). These phenomena cannot be well explained by classical optimization theory. Significant theoretical progress has been made to understand these implicit biases, but it remains unclear for which objective functions would they occur. This paper provides an initial step in answering this question, showing that these implicit biases are different tips of the same iceberg. Specifically, they occur when the optimization objective function has certain regularity. This regularity, together with gradient descent using a large learning rate that favors flatter regions, result in these nontrivial dynamical behaviors. To demonstrate this claim, we develop new global convergence theory under large learning rates for two examples of nonconvex functions without global smoothness, departing from typical assumptions in traditional analyses. We also discuss the implications on training neural networks, where different losses and activations can affect regularity and lead to highly varied training dynamics.