Skip to yearly menu bar Skip to main content


Spotlight
in
Workshop: OPT 2021: Optimization for Machine Learning

Better Linear Rates for SGD with Data Shuffling

Grigory Malinovsky · Alibek Sailanbayev · Peter Richtarik


Abstract:

Virtually all state-of-the-art methods for training supervised machine learning models are variants of SGD, enhanced with a number of additional tricks, such as minibatching, momentum, and adaptive stepsizes. However, one of the most basic questions in the design of successful SGD methods, one that is orthogonal to the aforementioned tricks, is the choice of the {\em next} training data point to be learning from. Standard variants of SGD employ a {\em sampling with replacement} strategy, which means that the next training data point is sampled from the entire data set, often independently of all previous samples. While standard SGD is well understood theoretically, virtually all widely used machine learning software is based on {\em sampling without replacement} as this is often empirically superior. That is, the training data is randomly shuffled/permuted, either only once at the beginning, strategy known as {\em random shuffling} (RS), or before every epoch, strategy known as {\em random reshuffling} (RR), and training proceeds in the data order dictated by the shuffling. RS and RR strategies have for a long time remained beyond the reach of theoretical analysis that would satisfactorily explain their success. However, very recently, Mishchenko et al (2020) provided tight {\em sublinear} convergence rates through a novel analysis, and showed that these strategies can improve upon standard SGD in certain regimes. Inspired by these results, we seek to further improve the rates of shuffling-based methods. In particular, we show that it is possible to enhance them with a variance reduction mechanism, obtaining {\em linear} convergence rates in the strongly convex regime. To the best of our knowledge, our linear convergence rates are the best for any method based on sampling without replacement. Moreover, we obtained improved convergence guarantees in convex and non-convex regimes as well.