Poster in Workshop: Mathematics of Modern Machine Learning (M3L)
Implicit Bias of Adam versus Gradient Descent in One-Hidden-Layer Neural Networks
Bhavya Vasudeva · Vatsal Sharan · Mahdi Soltanolkotabi
Keywords: [ optimization ] [ adam ] [ gradient descent ] [ implicit bias ]
Adam is the de facto optimization algorithm for training deep neural networks, but our understanding of its implicit bias, and of how it differs from that of other algorithms, particularly standard gradient descent (GD), remains limited. We investigate the differences in the implicit biases of Adam and GD when training one-hidden-layer ReLU neural networks on a binary classification task, using a synthetic data setting with diverse features. We find that GD exhibits a simplicity bias, resulting in a linear decision boundary, whereas Adam leverages the diverse features, producing a nonlinear boundary that is closer to the Bayes-optimal predictor. We prove this theoretically for a simple data setting in the infinite-width regime by analyzing the population gradients. Our results offer insights towards a better understanding of Adam, which can aid the design of optimization algorithms with superior generalization.
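The sketch below illustrates the kind of comparison the abstract describes, not the authors' exact setup: the same one-hidden-layer ReLU network is trained on a synthetic 2D binary classification task once with full-batch GD and once with Adam, and the learned decision boundaries are probed on a grid. The data distribution (an XOR-like rule, so the Bayes-optimal boundary is nonlinear), the network width, the learning rates, and the number of steps are all illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's): compare decision boundaries
# learned by full-batch GD vs. Adam on the same one-hidden-layer ReLU network.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data with "diverse" features (assumption): the label depends on an
# XOR-like combination of two features, so no linear boundary is Bayes-optimal.
n = 512
X = torch.randn(n, 2)
y = ((X[:, 0] * X[:, 1]) > 0).float()  # labels in {0, 1}

def make_model(width=512):
    # One-hidden-layer ReLU network with a scalar output (logit).
    return nn.Sequential(nn.Linear(2, width), nn.ReLU(), nn.Linear(width, 1))

def train(optimizer_name, steps=2000):
    model = make_model()
    if optimizer_name == "gd":
        # Full-batch SGD with no momentum is plain gradient descent.
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
    else:
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    return model

gd_model = train("gd")
adam_model = train("adam")

# Probe predictions on a coarse grid: a simplicity-biased solution varies mostly
# along one linear direction, while an XOR-like boundary changes sign in both axes.
grid = torch.cartesian_prod(torch.linspace(-2, 2, 5), torch.linspace(-2, 2, 5))
with torch.no_grad():
    print("GD predictions:\n", (gd_model(grid).squeeze(-1) > 0).int().reshape(5, 5))
    print("Adam predictions:\n", (adam_model(grid).squeeze(-1) > 0).int().reshape(5, 5))
```

Under these assumptions, the printed grids give a quick qualitative view of whether each optimizer's boundary is essentially linear or exploits both features; the paper's actual claims rest on its own data model and an infinite-width analysis of population gradients.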