Sanjay Shakkottai, UT Austin

Title: On the Throughput vs Accuracy Trade-Off for Streaming Unsupervised Classification

We study an online and streaming unsupervised classification system. Our setting consists of a collection of classifiers (with unknown confusion matrices) each of which can classify one sample per unit time, and which are accessed by a stream of unlabeled samples. Each sample is dispatched to one or more classifiers, and depending on the labels collected from these classifiers, may be sent to other classifiers to collect additional labels. The labels are continually aggregated. Once the aggregated label has high enough accuracy (a pre-specified threshold for accuracy) or the sample is sent to all the classifiers, the now labeled sample is ejected from the system. For any given pre-specified threshold for accuracy, the objective is to sustain the maximum possible rate of arrival of new samples, such that the number of samples in memory does not grow unbounded.

In this talk, we characterize the Pareto-optimal region of accuracy and arrival rate, and discuss an algorithm that can operate at any point within this region. Our algorithm uses queueing-based routing and scheduling approaches combined with online tensor decomposition method to learn the hidden parameters, and provides Pareto-optimality guarantees. We finally verify our theoretical results through simulations on two ensembles formed using AlexNet, VGG, and ResNet deep image classifiers. Based on joint work with Soumya Basu, Steven Gutstein and Brent Lance.


Shivani Agarwal, U Penn

Title: Designing Statistically Consistent Learning Algorithms: The Role of Surrogate Loss Functions

For any machine learning problem, a basic desideratum of a good learning algorithm is that as it gets more and more training data, the performance of the learned model should converge to that of a Bayes optimal model -- in other words, the learning algorithm should be statistically consistent. How, then, can one design statistically consistent learning algorithms that are also computationally efficient?

In particular, in many learning problems where directly optimizing the target performance measure is computationally difficult, a practical approach to designing computationally efficient learning algorithms is that of minimizing a convex surrogate loss. A natural question that arises then is: Can such surrogate losses provide statistical consistency guarantees?

This tutorial will provide answers to some of these questions. In particular, we will review the theory of convex calibrated surrogate losses, which lead to statistically consistent learning algorithms, as well as "proper" surrogate losses and their connections with property elicitation. We will also discuss recent results establishing that all calibrated surrogates can effectively be viewed as proper surrogates that estimate a suitable "sufficient" property of the data distribution. With these tools, the task of designing of a statistically consistent learning algorithm reduces to the task of identifying a suitable "sufficient" property for the target learning problem, and then designing a proper surrogate loss for estimating this property.


Aaditya Ramdas, CMU

Title: Assumption-free uncertainty quantification for black-box algorithms

The fields of statistics and machine learning have made tremendous progress in the last few decades in designing accurate black-box prediction methods (boosting, random forests, bagging, neural nets, etc.) but for deployment in the real world, it would be useful to have uncertainty quantification for those point-predictions. In this tutorial, I will summarize recent work that my collaborators and I have done over the last few years, for designing a large class of such methods that work without any assumptions on the algorithm, or the distribution of the covariates, or the distribution of the labels, etc, just relying on “exchange ability” of the test and training points.

This talk is based on a sequence of joint works by the BaCaRaTi group (Rina Barber, Emmanuel Candes, myself, Ryan Tibshirani), + some recent work with a collaborator, Arun Kumar Kuchibhotla, and my student Chirag Gupta.


Sanjoy Mitter, MIT

Title:  Information and Entry Flow in Statistical Filtering.

In this talk I discuss Statistical Filtering as an Informationally Dissipative System leading to a relation between Shannon Information and fisher Information. When viewed as a problem in Bayesian Inference the Optimal Fileter (Conditional Distribution ) corresponds to a variational Principle related to Free Energy Minimization. These ideas are related to the recent work of Otto, Jordan and Kinderlehrer on the variational Principle for the Fokker Planck Equation.


Wouter Koolen, CWI

Title: Exploration and Exploitation in Structured Stochastic Bandits

How can we design methods for interactive learning? In this tutorial we will model the environment as a  stochastic multi-armed bandit, and we will consider two classical tasks: regret minimization (reward maximization) and active testing (pure exploration). For each of these tasks, we will first study the limits of learning by developing information-theoretic lower bounds. These lower bounds reveal that the complexity of our sequential learning problems is governed by certain two-player zero-sum games. This games perspective allows us to develop practical algorithms based on iterative saddle point solvers. The methods that we obtain are general, powerful and flexible. They can be made to incorporate prior knowledge about the world in the form of structured bandit constraints, and they asymptotically match the lower bounds.

This talk is based on a series of papers with Emilie Kaufmann, Aurelien Garivier, Rémy Degenne, Pierre Ménard and Han Shao.


Prateek Jain, MSR India

Title: Stochastic Gradient Descent: Marrying Theory with Practice

Stochastic Gradient Descent (SGD) is the workhorse of most modern ML based solutions. The method was discovered first in 1951 (Robbins-Monro algorithm) and has since generated tremendous interest and impact especially for training deep neural networks. However, there is a significant gap between the practical versions of SGD and the stylized versions used for theoretical understanding. In this tutorial, we will highlight some of the gaps and present a few latest results that attempt at bridging it. In particular, we will talk how we can rigorously understand the following practical variants of standard SGD: a) SGD + mini-batching, b) SGD + acceleration, c) SGD with random reshuffling,  d) SGD with last point iterate.

The tutorial is based on joint works with Praneeth Netrapalli, Sham Kakade, Rahul Kidambi, Dheeraj Nagaraj, Aaron Sidford.