In this talk I discuss Statistical Filtering as an Informationally Dissipative System leading to a relation between Shannon Information and fisher Information. When viewed as a problem in Bayesian Inference the Optimal Fileter (Conditional Distribution ) corresponds to a variational Principle related to Free Energy Minimization. These ideas are related to the recent work of Otto, Jordan and Kinderlehrer on the variational Principle for the Fokker Planck Equation.

The fields of statistics and machine learning have made tremendous progress in the last few decades in designing accurate black-box prediction methods (boosting, random forests, bagging, neural nets, etc.) but for deployment in the real world, it would be useful to have uncertainty quantification for those point-predictions. In this tutorial, I will summarize recent work that my collaborators and I have done over the last few years, for designing a large class of such methods that work without any assumptions on the algorithm, or the distribution of the covariates, or the distribution of the labels, etc, just relying on “exchange ability” of the test and training points.

This talk is based on a sequence of joint works by the BaCaRaTi group (Rina Barber, Emmanuel Candes, myself, Ryan Tibshirani), + some recent work with a collaborator, Arun Kumar Kuchibhotla, and my student Chirag Gupta.

How can we design methods for interactive learning? In this tutorial we will model the environment as a stochastic multi-armed bandit, and we will consider two classical tasks: regret minimization (reward maximization) and active testing (pure exploration). For each of these tasks, we will first study the limits of learning by developing information-theoretic lower bounds. These lower bounds reveal that the complexity of our sequential learning problems is governed by certain two-player zero-sum games. This games perspective allows us to develop practical algorithms based on iterative saddle point solvers. The methods that we obtain are general, powerful and flexible. They can be made to incorporate prior knowledge about the world in the form of structured bandit constraints, and they asymptotically match the lower bounds.

This talk is based on a series of papers with Emilie Kaufmann, Aurelien Garivier, Rémy Degenne, Pierre Ménard and Han Shao.

Graphical models are used throughout the natural and social sciences to model statistical relationships between variables of interest. In the first lecture, we will discuss directed graphical models for causal inference; in particular, how perturbation data such as gene knockout data can be used for causal structure discovery. In the second lecture, we will discuss undirected graphical models in the context of positive dependence. In applications ranging from finance to phylogenetics latent variables introduce positive dependence between variables. We will discuss how to model this using total positivity and the intriguing properties resulting from total positivity with respect to graphical models.

We study an online and streaming unsupervised classification system. Our setting consists of a collection of classifiers (with unknown confusion matrices) each of which can classify one sample per unit time, and which are accessed by a stream of unlabeled samples. Each sample is dispatched to one or more classifiers, and depending on the labels collected from these classifiers, may be sent to other classifiers to collect additional labels. The labels are continually aggregated. Once the aggregated label has high enough accuracy (a pre-specified threshold for accuracy) or the sample is sent to all the classifiers, the now labeled sample is ejected from the system. For any given pre-specified threshold for accuracy, the objective is to sustain the maximum possible rate of arrival of new samples, such that the number of samples in memory does not grow unbounded.

In this talk, we characterize the Pareto-optimal region of accuracy and arrival rate, and discuss an algorithm that can operate at any point within this region. Our algorithm uses queueing-based routing and scheduling approaches combined with online tensor decomposition method to learn the hidden parameters, and provides Pareto-optimality guarantees. We finally verify our theoretical results through simulations on two ensembles formed using AlexNet, VGG, and ResNet deep image classifiers. Based on joint work with Soumya Basu, Steven Gutstein and Brent Lance.

Surrogate loss functions are widely used in machine learning. In particular, for many machine learning problems, the ideal objective or loss function is computationally hard to optimize, and therefore one instead works with a (usually convex) surrogate loss that can be optimized efficiently. What are the fundamental design principles for such surrogate losses, and what are the associated statistical behaviors of the resulting algorithms?

This tutorial will some provide answers to some of these questions. In particular, we will review the theories of "proper" surrogate losses and "calibrated" surrogate losses, both of which yield statistically consistent learning algorithms for the true learning objective. We will also discuss recent results that draw connections between these two classes of surrogate losses via the framework of property elicitation, establishing that all calibrated surrogates can effectively be viewed as proper surrogates that estimate a suitable "sufficient" property of the data distribution. With these tools, the task of designing of a statistically consistent learning algorithm reduces to the task of identifying a suitable "sufficient" property for the target learning problem, and then designing a proper surrogate loss for estimating this property.

Stochastic Gradient Descent (SGD) is the workhorse of most modern ML based solutions. The method was discovered first in 1951 (Robbins-Monro algorithm) and has since generated tremendous interest and impact especially for training deep neural networks. However, there is a significant gap between the practical versions of SGD and the stylized versions used for theoretical understanding. In this tutorial, we will highlight some of the gaps and present a few latest results that attempt at bridging it. In particular, we will talk how we can rigorously understand the following practical variants of standard SGD: a) SGD + mini-batching, b) SGD + acceleration, c) SGD with random reshuffling, d) SGD with last point iterate.

The tutorial is based on joint works with Praneeth Netrapalli, Sham Kakade, Rahul Kidambi, Dheeraj Nagaraj, Aaron Sidford.