Poster in Workshop: Optimization for ML Workshop

Understanding Adam Requires Better Rotation Dependent Assumptions

Tianyue Zhang · Lucas Maes · Charles Guille-Escuret · Alexia Jolicoeur-Martineau · Ioannis Mitliagkas · Simon Lacoste-Julien · Damien Scieur


Abstract:

Despite its widespread adoption, Adam's advantages in training large language models lack a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space, a property distinguishing it from rotation-invariant optimizers such as Stochastic Gradient Descent (SGD). We demonstrate that Adam's performance in training transformers degrades under random rotations of the objective function, indicating a crucial dependence on the standard basis. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. We examine several rotation-dependent properties proposed in the literature, evaluating them across various rotation types and showing that they are insufficient to explain Adam's behavior. This work highlights the need for new, basis-dependent theoretical frameworks to fully understand Adam's empirical success in modern machine learning tasks.
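The rotation test the abstract describes can be illustrated on a toy problem: optimize an objective f(w) and its rotated counterpart f(Rw), where R is a random orthogonal matrix, with both plain SGD and Adam. The sketch below is not the paper's transformer experiment; the quadratic objective, dimension, step counts, and learning rates are illustrative assumptions. Plain SGD's iterates are equivalent under the rotation (the rotated run is started at R^T w0 so both runs trace the same trajectory in the original coordinates), whereas Adam's per-coordinate second-moment scaling ties its behavior to the standard basis.

```python
# Minimal sketch of a rotation-dependence test: compare plain SGD and Adam on an
# objective f(w) and on its rotated counterpart f(R w) for a random orthogonal R.
import torch

torch.manual_seed(0)
d, steps = 50, 2000

# Ill-conditioned diagonal quadratic: f(w) = 0.5 * sum_i h_i * w_i^2.
h = torch.logspace(0, 4, d)

def loss(w):
    return 0.5 * (h * w * w).sum()

# Random orthogonal matrix R (Q factor of the QR decomposition of a Gaussian matrix).
R, _ = torch.linalg.qr(torch.randn(d, d))

def run(opt_cls, rotate, lr):
    w0 = torch.ones(d)
    # Start the rotated run at R^T w0 so both runs correspond to the same starting
    # point in the original coordinates; a rotation-invariant optimizer then reaches
    # the same loss in both cases (up to floating-point error), Adam need not.
    w = (R.T @ w0 if rotate else w0).clone().requires_grad_(True)
    opt = opt_cls([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        x = R @ w if rotate else w  # the rotated objective evaluates f(R w)
        loss(x).backward()
        opt.step()
    return loss(R @ w if rotate else w).item()

# Learning rates are illustrative, not tuned to match any reported experiment.
for name, opt_cls, lr in [("SGD", torch.optim.SGD, 1e-4),
                          ("Adam", torch.optim.Adam, 1e-2)]:
    print(f"{name:4s}  original: {run(opt_cls, False, lr):.3e}"
          f"   rotated: {run(opt_cls, True, lr):.3e}")
```

On this sketch, SGD prints essentially identical losses for the original and rotated problems, while Adam's result changes with the basis; the paper asks the analogous question for transformer training rather than a toy quadratic.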
