

Poster in Workshop: Optimization for ML Workshop

From Gradient Clipping to Normalization for Heavy Tailed SGD

Florian Hübler · Ilyas Fatkhullin · Niao He


Abstract: Recent empirical evidence indicates that many machine learning applications involve heavy-tailed gradient noise, which challenges the standard bounded-variance assumption in stochastic optimization. Gradient clipping has emerged as a popular tool for handling this heavy-tailed noise, as it achieves good performance in this setting both theoretically and practically. However, our theoretical understanding of gradient clipping is constrained by the need for large, predefined clipping thresholds and by its sensitivity to accurate estimation of the tail index. In contrast, practitioners often rely on small clipping thresholds without estimating the tail index, which contradicts these theoretical insights. To address this discrepancy, we empirically observe that an optimally tuned clipping threshold eventually clips gradients at every iteration in language modeling tasks, effectively reducing gradient clipping to Normalized SGD (NSGD). This observation suggests that the empirical success of clipped SGD is more accurately explained by the behavior of NSGD. We show that NSGD is indeed robust to misspecification of the tail index, in line with empirical observations. Furthermore, we establish optimal convergence rates for finding an $\varepsilon$-stationary point with tuned stepsizes, supporting the algorithm's empirical success. Finally, we advance the understanding of NSGD by proving a high-probability convergence result with a mild logarithmic dependence on the failure probability.
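
The reduction alluded to in the abstract follows from the standard update rules: clipped SGD scales the stochastic gradient by $\min(1, c/\|g\|)$, so whenever the threshold $c$ is small enough that every gradient satisfies $\|g\| \ge c$, the update equals a normalized step with effective stepsize $\eta c$. Below is a minimal NumPy sketch of this relation; the function names, the toy gradient, and the chosen constants are illustrative and not taken from the paper.

```python
import numpy as np

def clipped_sgd_step(x, g, lr, clip_threshold):
    """One clipped SGD step: rescale the gradient so its norm never exceeds the threshold."""
    scale = min(1.0, clip_threshold / np.linalg.norm(g))
    return x - lr * scale * g

def normalized_sgd_step(x, g, lr):
    """One Normalized SGD (NSGD) step: move along the gradient direction only."""
    return x - lr * g / np.linalg.norm(g)

# Illustrative values (not from the paper): a stand-in for a heavy-tailed
# stochastic gradient whose norm is well above the clipping threshold c.
x = np.zeros(5)
g = np.array([4.0, -2.5, 0.3, 7.1, -1.2])
eta, c = 0.1, 0.5

# Since ||g|| >= c, the clipping factor is exactly c / ||g||, so clipped SGD
# with stepsize eta coincides with NSGD run with stepsize eta * c.
x_clip = clipped_sgd_step(x, g, lr=eta, clip_threshold=c)
x_nsgd = normalized_sgd_step(x, g, lr=eta * c)
print(np.allclose(x_clip, x_nsgd))  # True: the two updates agree
```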
