Stochastic Gradient Descent (SGD) is the workhorse of most modern machine-learning solutions. The method was first introduced in 1951 (the Robbins-Monro algorithm) and has since generated tremendous interest and impact, especially for training deep neural networks. However, there is a significant gap between the practical versions of SGD and the stylized versions used for theoretical analysis. In this tutorial, we will highlight some of these gaps and present a few recent results that attempt to bridge them. In particular, we will discuss how we can rigorously understand the following practical variants of standard SGD: a) SGD with mini-batching, b) SGD with acceleration, c) SGD with random reshuffling, d) SGD with the last-point iterate.
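As a rough illustration of what several of these variants look like in practice (this sketch is not part of the tutorial material), the code below runs mini-batch SGD with random reshuffling on a toy least-squares problem and returns both the last iterate and the averaged iterate for comparison. The objective, function name, and hyperparameters are arbitrary choices made purely for illustration.

```python
# Illustrative sketch only: mini-batching, random reshuffling, and
# last-iterate vs. averaged-iterate output on a toy least-squares problem.
import numpy as np

def sgd_least_squares(A, b, lr=0.01, batch_size=8, epochs=20, seed=0):
    """Mini-batch SGD with random reshuffling on (1/2n)*||Ax - b||^2."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    avg = np.zeros(d)   # running average of iterates (Polyak-Ruppert style)
    steps = 0
    for _ in range(epochs):
        perm = rng.permutation(n)               # random reshuffling: a fresh
        for start in range(0, n, batch_size):   # permutation each epoch,
            idx = perm[start:start + batch_size]  # sampled without replacement
            grad = A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)
            x -= lr * grad
            steps += 1
            avg += (x - avg) / steps            # incremental iterate average
    return x, avg                               # last iterate, averaged iterate

# Toy usage: compare the last iterate with the averaged iterate.
rng = np.random.default_rng(1)
A = rng.standard_normal((256, 10))
x_star = rng.standard_normal(10)
b = A @ x_star + 0.1 * rng.standard_normal(256)
x_last, x_avg = sgd_least_squares(A, b)
print("last-iterate error:", np.linalg.norm(x_last - x_star))
print("averaged-iterate error:", np.linalg.norm(x_avg - x_star))
```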
The tutorial is based on joint work with Praneeth Netrapalli, Sham Kakade, Rahul Kidambi, Dheeraj Nagaraj, and Aaron Sidford.