Poster in Workshop: Optimization for ML Workshop
Revisiting the Initial Steps in Adaptive Gradient Descent Optimization
Abulikemu Abuduweili · Changliu Liu
Abstract:
Adaptive gradient optimization methods, such as Adam, are widely used for training deep neural networks across various machine learning tasks. While Adam typically leads to faster convergence, it often struggles with poor generalization compared to stochastic gradient descent (SGD), as it tends to get stuck in suboptimal local minima. To address this, several modifications have been introduced, such as bounding the step size or adjusting the second-order moment estimation. In this work, we show that this issue is closely related to the standard initialization of the second-order moment estimation $v_0 = 0$. We propose a simpler yet effective solution: initializing the second-order moment estimation with non-zero random values. Our empirical results show that this adjustment not only stabilizes the convergence process but also improves the final performance of the Adam optimizer. With non-zero initialization, Adam achieves performance comparable to many recently proposed variants of adaptive gradient optimization methods.
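The abstract does not specify the exact initialization scheme or scale, so the sketch below is only a minimal NumPy illustration of the idea: a standard Adam update in which the second-moment buffer $v$ starts from small positive random values rather than zeros. The function name `adam_nonzero_init`, the `v0_scale` parameter, and the choice to keep the usual bias correction are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def adam_nonzero_init(grad_fn, w, steps=1000, lr=1e-3, beta1=0.9,
                      beta2=0.999, eps=1e-8, v0_scale=1e-2, seed=0):
    """Adam with the second-moment buffer initialized to small positive
    random values instead of zeros (illustrative sketch, not the paper's code)."""
    rng = np.random.default_rng(seed)
    m = np.zeros_like(w)                # first moment: standard zero init
    v = v0_scale * rng.random(w.shape)  # second moment: non-zero random init (assumed scheme)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)    # standard bias correction kept as-is (assumption)
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Example: minimize f(w) = ||w||^2, whose gradient is 2w.
w_opt = adam_nonzero_init(lambda w: 2.0 * w, w=np.ones(5), steps=2000, lr=1e-2)
print(w_opt)  # values close to zero
```

A non-zero $v_0$ keeps the effective step size bounded in the earliest iterations, when the bias-corrected second moment would otherwise be dominated by a few noisy gradients; the rest of the update rule is unchanged.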