Contributed Talks
in
Workshop: Optimization for ML Workshop
Talk 1: *MindFlayer: Efficient Asynchronous Parallel SGD in the Presence of Heterogeneous and Random Worker Compute Times* and Talk 2: *Provable non-accelerations of the heavy-ball method*
Artavazd Maranjyan · Gauthier Gidel
Talk 1: MindFlayer: Efficient Asynchronous Parallel SGD in the Presence of Heterogeneous and Random Worker Compute Times, Speaker: Artavazd Maranjyan
Abstract: We study the problem of minimizing the expectation of smooth nonconvex functions with the help of several parallel workers whose role is to compute stochastic gradients. In particular, we focus on the challenging situation where the workers' compute times are arbitrarily heterogeneous and random. In the simpler regime characterized by arbitrarily heterogeneous but deterministic compute times, Tyurin and Richtárik (NeurIPS 2023) recently designed the first theoretically optimal asynchronous SGD method, called Rennala SGD, in terms of a novel complexity notion called time complexity. The starting point of our work is the observation that Rennala SGD can have arbitrarily bad performance in the presence of random compute times, a setting it was not designed to handle. To advance our understanding of stochastic optimization in this challenging regime, we propose a new asynchronous SGD method, for which we coin the name MindFlayer SGD. Our theory and empirical results demonstrate the superiority of MindFlayer SGD over existing baselines, including Rennala SGD, in cases when the noise is heavy-tailed.
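To make the asynchronous setting concrete, here is a minimal toy simulation (not the authors' MindFlayer SGD or Rennala SGD) of a server applying stochastic gradients from parallel workers whose compute times are heterogeneous and heavy-tailed. The quadratic objective, the Pareto delay model, and the per-worker scales are illustrative assumptions only.

```python
# Illustrative sketch: asynchronous parallel SGD with random, heterogeneous
# worker compute times. Objective and delay distributions are assumptions.
import numpy as np

rng = np.random.default_rng(0)
dim, n_workers, time_budget = 10, 4, 100.0
x = rng.normal(size=dim)      # iterate held by the server
step_size = 0.05

def stoch_grad(x):
    # gradient of f(x) = 0.5 * ||x||^2 plus Gaussian noise
    return x + 0.1 * rng.normal(size=dim)

# heavy-tailed (Pareto) compute-time model, one scale per worker
scales = rng.uniform(0.5, 2.0, size=n_workers)
next_ready = scales * (rng.pareto(1.5, size=n_workers) + 1.0)
snapshots = [x.copy() for _ in range(n_workers)]  # point each worker started from

t = 0.0
while t < time_budget:
    w = int(np.argmin(next_ready))        # worker that finishes first
    t = next_ready[w]
    g = stoch_grad(snapshots[w])          # (possibly stale) gradient
    x = x - step_size * g                 # server applies the update
    snapshots[w] = x.copy()               # worker restarts from the current iterate
    next_ready[w] = t + scales[w] * (rng.pareto(1.5) + 1.0)

print("final ||x|| after the time budget:", np.linalg.norm(x))
```

In a simulation like this, a single extremely slow draw can dominate the wall-clock time, which is the failure mode that motivates methods designed to be robust to random, heavy-tailed compute times.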
Talk 2: Provable non-accelerations of the heavy-ball method, Speaker: Gauthier Gidel, Authors: Baptiste Goujaud, Adrien Taylor, Aymeric Dieuleveut
Abstract: In this work, we show that the heavy-ball (HB) method provably does not reach an accelerated convergence rate on smooth strongly convex problems. More specifically, we show that for any condition number and any choice of algorithmic parameters, either the worst-case convergence rate of HB on the class of L-smooth and μ-strongly convex quadratic functions is not accelerated (that is, slower than 1 − O(κ)), or there exists an L-smooth μ-strongly convex function and an initialization such that the method does not converge.
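For reference, the heavy-ball iteration referred to above is the classical Polyak update, written here in its textbook form with a generic step size α and momentum β; this is only a reminder of the method, not a statement about the specific parametrizations analyzed in the talk.

```latex
% Polyak's heavy-ball method on a differentiable f,
% with step size \alpha > 0 and momentum parameter \beta:
x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta \left( x_k - x_{k-1} \right)
```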
To the best of our knowledge, this result closes a simple yet open question on one of the most used and iconic first-order optimization techniques.
Our approach builds on finding functions for which HB fails to converge and instead cycles over finitely many iterates. We analytically describe all parametrizations of HB that exhibit this cycling behavior on a particular cycle shape, whose choice is supported by a systematic and constructive approach to the study of cycling behaviors of first-order methods. We show the robustness of our results to perturbations of the cycle, and extend them to classes of functions that also satisfy higher-order regularity conditions.