Abstracts:
Siddharth Barman, IISc
Title: Online Learning for Structured Loss Spaces
Abstract: In this talk we will consider prediction with expert advice when the loss vectors are assumed to lie in a set described by the sum of atomic norm balls. We will discuss regret bounds for a general version of the online mirror descent (OMD) algorithm that uses a combination of regularizers, each adapted to the constituent atomic norms. This general result recovers standard OMD regret bounds, and yields regret bounds for new structured settings where the loss vectors are (i) noisy versions of points from a lowrank subspace, (ii) sparse vectors corrupted with noise, and (iii) sparse perturbations of lowrank vectors.
Upinder Bhalla, NCBS
Title: Chemical learning logic in biological neural networks
Abstract: Biological systems are very efficient at learning by example and by reinforcement. An organism typically only has a few trials to learn about environmental associations before it starves or is eaten. These learning rules are further constrained by the architecture of the nervous system, in which plausible learning rules are local to each connection (synapse) and backward propagation of information is excluded. We have been working to systematically catalogue what is experimentally known about the natural rules for learning, and to model these using mechanistically detailed signaling pathways. In the brain, these learning rules are implemented through networks of chemical signalling pathways, which we model through systems of differential equations. I will show how our detailed chemical network model works to account for a wide range of experiments which we have structured into a database for automated model testing and optimization. I propose that these signaling networks implement learning rules that are a good reference point for analyzing biological neural network learning and may have implications for machine learning calculations.
Vivek Borkar, IITBombay
Title: A reinforcement learning algorithm for restless bandits
Abstract: This talk will report some recent work on a reinforcement learning algorithm for learning the Whittle indices for restless bandits, illustrate it with an application to scheduling of web crawlers for ephemeral content, and then try to make a case for exploiting structural properties of Markov decision processes, when available, for better reinforcement learning algorithms
Munther Dahleh, MIT
Title: Networks and Data for Society
Abstract: The emergence of large networked systems has brought about new challenges to researchers and practitioners alike. While such systems perform well under normal operations, they can exhibit fragility in response to certain disruptions that may lead to catastrophic cascades of failures. This phenomenon, sometimes referred to as systemic risk, emphasizes the role of the system interconnection in causing such, possibly rare, events. The flash crash of 2010, the financial crisis of 2008, the New England power outage of 2003, or simply extensive delays in air travel, are just a few of many examples of fragility and systemic risk present in complex interconnected systems.
In this talk I will discuss this emerging area for critical infrastructures. Such applications involve the interaction between physical systems and social networks. I will highlight some of the interesting research questions involving fragility and cascaded failure. I will discuss new problems in learning unstructured dynamical systems. And finally, I will highlight the important of understanding the economic value of data in the context of such underlying structures.
Bio:
Munther A. Dahleh the William A. Coolidge Professor of EECS. He is currently the director of the newly formed MIT Institute for Data, Systems, and Society. Dr. Dahleh is interested in Networked Systems with applications to Social and Economic Networks, Transportation Networks, Neural Networks, and the Power Grid. His recent work focuses on market design for digital goods and services. His work draws from various fields including game theory, optimal control, distributed optimization, information theory, and distributed learning. Dr. Dahleh is the coauthor (with Ignacio DiazBobillo) of the
Book: Control of Uncertain Systems: A Linear Programming Approach, published by PrenticeHall, and the coauthor (with Nicola Elia) of the book Computational Methods for Controller Design published by Springer. He is a fellow of IEEE and IFAC professional societies. He earned his PhD in ECE from Rice University.
Yash Deshpande, MIT
Title: Estimating lowrank matrices in noise: phase transitions from spin glass theory
Abstract: Estimating lowrank matrices from noisy observations is a common task in statistical and engineering applications. Following the seminal work of Johnstone, Baik, BenArous and Peche, versions of this problem have been extensively studied using random matrix theory. In this talk, we will consider an alternative viewpoint based on tools from mean field spin glasses. We will present two examples that illustrate how these tools yield information beyond those from classical random matrix theory. The first example is the twogroups stochastic block model (SBM), where we will obtain a full informationtheoretic understanding of the estimation phase transition. In the second example, we will augment the SBM with covariate information at nodes, and obtain results on the altered phase transition.
This is based on joint works with Emmanuel Abbe, Andrea Montanari, Elchanan Mossel and Subhabrata Sen
Ali Jadbabaie, MIT
Title: Optimisation for Machine Learning: Demystifying the Role of Acceleration
Abstract: In this talk, we will clarify the connection between the widely popular accelerated first order optimisation methods and the secondorder heavy ball ordinary differential equation (ODE) studied in the seminal works of Polyak and Nesterov. We further provide the utility of such approaches in creating new distributed optimisation algorithms suitable for distributed Empirical Risk Minimisation.
When the function is convex with Lipschitz gradient, we show that acceleration achieved by the ODE discretization scheme depends on the integration order and the flatness of the function around itsminimum. In the strongly convex setting, we show how the convergence rate approaches the optimal rate as the order of discretization is increased.
Joint work with Jingzhao Zhang, Suvrit Sra, and Aryan Mokhtari
Prateek Jain, Microsoft
Title: Hard Thresholding based methods for Robust Learning
Abstract: Learning in presence of outliers is a critical problem that can heavily affect performance of the learning algorithms in practice. In this talk, we present a general approach for learning with outliers, where we iteratively estimate the model parameters with estimated inliers and threshold out point which seems unlikely to be generated from the model to obtain more refined set of inliers. We instantiate this general approach for the outlier efficient PCA problem and demonstrate that it leads to nearly optimal solution in O(PCA) computation time.
Varun Jog, WisconsinMadison
Tutorial:
Title: Bridging the inequality gap
Abstract: Reconstructing probability distributions from projections is a fundamental problem in many engineering applications, including compressed sensing, CT/PET imaging, and seismology. Geometric and information theoretic inequalities provide important mathematical tools for understanding the behavior of such projectionsin particular, characterizing extremal distributions with respect to different lowerdimensional properties of interest. This talk will consist of two parts: First, we introduce new methods to bound volumes and surface areas of unseen geometric objects using information derived from lowerdimensional projections. Second, we explore inequalities relating the entropies of a distribution and its lowerdimensional projections. The common theme is a powerful family of functional inequalities known as the BrascampLieb Inequalities. Building upon the work of Carlen & CorderoErausquin (2008) and Geng & Nair (2014), we develop a superadditivity result for a new notion of Fisher information and a novel proof strategy for proving an inequality that unifies the BrascampLieb Inequalities and the Entropy Power Inequality.
Lecture:
Title: Information theoretic perspectives on learning algorithms
Abstract: In statistical learning theory, generalization error is used to quantify the degree to which a supervised machine learning algorithm may overfit to training data. We overview some recent work [Xu and Raginsky (2017)] that bounds generalization error of empirical risk minimization based on the mutual information I(S;W) between the algorithm input S and the algorithm output W. We leverage these results to derive generalization error bounds for a broad class of iterative algorithms that are characterized by bounded, noisy updates with Markovian structure, such as stochastic gradient Langevin dynamics (SGLD). We describe certain shortcomings of mutual informationbased bounds, and propose alternate bounds that employ the Wasserstein metric from optimal transport theory. We compare the Wasserstein metricbased bounds with the mutual informationbased bounds and show that for a class of data generating distributions, the former leads to stronger bounds on the generalization error
Sandeep Juneja, TIFR
Title: Sample complexity of partition identification using multiarmed bandits (jointly with Subhashini Krishnasamy)
Abstract: Given a vector of probability distributions, or arms, each of which can be sampled independently, we consider the problem of identifying the partition to which this vector belongs from a finitely partitioned universe of such vector of distributions. We study this as a pure exploration problem in multi armed bandit settings and develop sample complexity bounds on the total mean number of samples required for identifying the correct partition with high probability. This framework subsumes well studied problems in the literature such as finding the best arm or the best few arms. We consider distributions belonging to the single parameter exponential family and primarily consider partitions where the vector of means of arms lie either in a given set or its complement. The sets considered correspond to distributions where there exists a mean above a specified threshold, where the set is a half space and where either the set or its complement is convex. This includes polytopes and their complements. In all these settings, we characterize the lower bounds on mean number of samples for each arm. Further, we propose algorithms that can match these bounds asymptotically with decreasing probability of error.
PoLing Loh, WisconsinMadison
Title: Theory for local optima of nonconvex highdimensional Mestimators
Abstract: In this short course, we will survey a variety of recent results concerning local optima of penalized Mestimators, designed for highdimensional regression problems. The nonconvexity is allowed to arise in either the loss function or the regularizer. Although the overall landscape of the objective function is nonconvex in high dimensions, we are able to establish that both local and global optima are statistically consistent under appropriate conditions. Our theory is applicable to settings involving errorsinvariables models and other contaminated data scenarios. In the second part of the course, we will discuss statistical and optimization theory for nonconvex Mestimators suited for robust regression in high dimensions.
Sanjoy Mitter, MIT
Title: Inference , Learning and Approximation .
Abstract: In this talk I discuss the similarities and distinctions between the problems of Approximation, Inference and Learning
Karthyek Murthy, SUTD
Title: Learning with Distributionally Robust Optimization based on OptimalTransport Distances
Abstract: The objective of this talk is to showcase optimal transport distances (which include the wellknown Wasserstein distances as a special case) as an effective tool towards improving robustness and generalization while using empirical risk as a vehicle to minimize population risk. Unlike empirical risk minimization which identifies the “best” parameter choice for the given data, here we propose to look for parameter choices that perform uniformly good for any choice of probability distribution that is within a fixed radius (quantified by a suitable optimal transport distance)from the empirical distribution representing the data. Such a “distributionally robust approach” is known to recover widely employed regularization based schemes in special cases. We extend the applicability by showing that, for a large class of useful models, the proposed distributionally robust formulations can be solved at least as fast, or even faster than the empirical risk minimization problem. We also discuss the asymptotic normality and robustness properties of the resulting estimators.
Chandra Nair, CUHK
Title: Nonconvex optimization and multiuser information theory
Abstract: Capacity regions in multiuser information theory were traditionally established using an achievability argument and converses to corresponding rate regions. Recently, however, capacity regions have been established, for nontrivial and important classes of channels (such as additive Gaussian noise models and others), by optimizing functionals on probability spaces generated from inner or outer bounds and showing that the bounds match for these channels. Optimization ideas have also been used to determine if certain achievable regions proposed in the literature are optimal or suboptimal in general. The study of these nonconvex optimization problems required developing new insights and techniques. In this talk, I will outline the ideas involved in obtaining the results, as well as illustrate the relationship of these problems to hypercontractivity and related notions studied primarily outside information theory. I will present some unifying observations across the family of optimization problems that we have managed to solve, and state some elementary instances of similar problems whose solutions would be of immense interest.
Biography: Chandra Nair is an Associate Professor with the Information Engineering department at The Chinese University of Hong Kong. His research interests and contributions have been in developing ideas, tools, and techniques to tackle families of combinatorial and nonconvex optimization problems arising primarily in the information sciences.
His recent research focus has been on studying the optimality of certain inner and outer bounds to capacity regions for fundamental problems in multiuser information theory. He, along with his doctoral student Yanlin Geng, received the 2016 Information Theory Society paper award for developing a novel way to establish the optimality of Gaussian distributions. A proof of the Parisi and CoppersmithSorkin conjectures in the Random Assignment Problem constituted his doctoral dissertation; and he resolved some conjectures related to Random Energy model approximation of the Number Partition Problem during his postdoctoral years.
Chandra Nair got his Bachelor's degree, B.Tech (EE), from IIT Madras (India) where he was the Philips (India) and Siemens (India) award winner for the best academic performance. Subsequently he was a Stanford graduate fellow (0004) and a Microsoft graduate fellow (0405) during his graduate studies at the EE department of Stanford university. Later, he became a postdoctoral researcher (0507) with the theory group at Microsoft Research, Redmond. He has been a faculty member of the information engineering department at The Chinese university of Hong Kong since Fall 2007. He was an associate editor for the IEEE Transactions on Information Theory (20142016) and is currently a distinguished lecturer of the IEEE Information theory society. He is a fellow of the IEEE.
He serves as the Programme Director of the undergraduate program on Mathematics and Information Engineering and the Director of the Institute of Theoretical Computer Science and Communications.
Hariharan Narayanan, TIFR
Title: Fitting a putative manifold to noisy data.
Abstract: We give a solution to the following question from manifold learning:
Suppose data belonging to a high dimensional Euclidean space is drawn independently, identically distributed from a measure supported on a low dimensional twice differentiable embedded compact manifold M, and is corrupted by a small amount of i.i.d gaussian noise.
How can we produce a manifold M'; whose Hausdorff distance to M is small and whose reach (normal injectivity radius) is not much smaller than the reach of M?
This is joint work with Charles Fefferman, Sergei Ivanov, Yaroslav Kurylev, and Matti Lassas.
Praneeth Netrapalli, Microsoft Research
Title: Smoothed Analysis of Low Rank Solutions to Semidefinite Programs via Burer Monteiro Factorization
Abstract: Semidefinite programs (SDP) are important in learning and combinatorial optimization with numerous applications. In pursuit of lowrank solutions and low complexity algorithms, we consider the BurerMonteiro factorization approach for solving SDPs. We show that all approximate local optima are global optima for the penalty formulation of appropriately rankconstrained SDPs as long as the number of constraints scales subquadratically with the desired rank of the optimal solution. Our result is based on a simple penalty function formulation of the rankconstrained SDP along with a smoothed analysis to avoid worstcase cost matrices.
Sunita Sarawagi, IITBombay
Title: Training Deep Models for better Generalization, Calibration, and Adaptation.
Abstract: Training deep neural networks entails huge investment of labeled data, compute power, and human effort. We can maximize our return on this investment, by expanding the breadth of usage of a trained deep model. We discuss three ways of achieving this goal. First, we show how to train the model to generalize to new domains without labeled/unlabeled data in that domain. Here we exploit the natural availability of multidomain training data to induce a continuous representation of domains, and interpolate in this space to hallucinate new domains during training.
Next, we show how to enhance interoperability with humans and other models by training wellcalibrated models. We present a new kernel embeddings inspired trainable calibration loss that is significantly more effective than pure likelihoodbased training in outputting sound probabilities with predictions. Finally, we show how to adapt the model to online feedback at deployment time. We combine a memorybased architecture with known ideas from boosting and online kernel learning to adaptively harness online labeled data in a
globally trained neural network.
Sridevi Sarma, Johns Hopkins
Title: Learning How Neural Systems Govern Behavior: Motor Control, Decisionmaking, and Epilepsy
Abstract: In this tutorial, we will demonstrate how control and estimation theory can be applied to understand how activity in various brain regions govern behavior. We will first describe how control theory can be used to describe performance limitations in sensorimotor control. In particular, we will derive tradeoffs between neural computing and accuracy in tracking fast movements (e.g. a lion chasing a gazelle). Then, we will describe how statespace modeling can capture variability in human behaviors during financial decision making (i.e., gambling), which can then be related back to neural activity. Here, we explain why some humans display the "gambler's fallacy" belief (they think they will lose due to a streak of recent wins despite odds of winning remaining the same) and why other display the "hothandedness" belief (they think they will win due to a streak of recent wins despite odds of winning remaining the same) by examining activity in the left and right hemispheres of the brain. Finally, we will conclude with how stability theory of dynamical networks systems applied to EEG networks can robustly identify where seizures start in the brain, which can inform clinicians on how to treat the most severe epilepsy patients.
Devavrat Shah, MIT
Title: Explaining the Success of Nearest Neighbor Methods in Prediction
Abstract: Many modern methods for prediction leverage nearest neighbor search to find past training examples most similar to a test example, an idea that dates back in text to at least the 11th century and has stood the test of time. In this short course, our aim is to explain the success of these methods. In particular, we wish to discuss foundational nonasymptotic statistical guarantees on nearestneighborbased regression and classification. We shall also discuss prominent methods for approximate nearest neighbor search that have been essential to scaling prediction systems reliant on nearest neighbor analysis to handle massive datasets. We discuss connections to learning distances for use with nearest neighbor methods, including how random decision trees and ensemble methods learn nearest neighbor structure, as well as recent developments in crowdsourcing and graphons. Timepermitting, we shall cover recent theoretical guarantees on nearest neighbor prediction in the three case studies of time series forecasting, recommending products to people over time, and delineating human organs in medical images by looking at image patches. In these case studies, clustering structure, which is easier to verify in data and more readily interpretable by practitioners, enables successful prediction.
Gautam Shroff, TCS
Title : Enterprise AI: Automation, Amplification, and Managing Complexity
Abstract: In this talk I shall describe in how TCS Research is applying AI techniques in traditional enterprises across verticals, not only for automation but also amplification of human tasks.
For example, I shall describe our experience with deploying deeplearning based knowledge assistants at scale within TCS. Next I shall give some examples of how AI and IOT are transforming manufacturing, by using deeplearning for detecting, diagnosing and sometimes predicting faults, as well as how deepreinforcement learning can be used to manage supply chains from source to shelf, trade in inefficient markets, or even negotiate contracts. Finally I shall explain two challenges to the scaling of AI within traditional enterprises: embracing the culture of field experimentation, and figuring out how to marry mechanisms for developing and managing complex IT systems with the AL/ML paradigm.
Piyush Srivastava, TIFR
Title: Zeros of polynomials and estimation in graphical models
Abstract: Markov chain Monte Carlo and belief propagation are two of the main techniques for estimation in graphical models. Both of these are related to a rather probabilistic paradigm of studying phase transitions in statistical mechanics, i.e., that of decay of longrange correlations.
Another classical (and perhaps more familiar and intuitive) paradigm of phase transitions is in terms of nonsmooth behavior of observable quantities. Through the classical work of Lee and Yang, this notion connects with the study of *complex* roots of "partition functions" of graphical models.
This talk surveys some recent work, starting with an idea of Alexander Barvinok, which leverages this latter theory to estimate the partition function algorithmically. Surprisingly, one currently has examples of graphical models (e.g. the Ising model with nonzero field) where this is the only known method for deterministic approximation of the partition function. This talk will be survey this curious method and perhaps serve as an exhortation for a closer look at it.
Part of this talk is based on joint work with Jingcheng Liu and Alistair Sinclair, and with Nicholas J. A. Harvey and Jan Vondrák, while the rest is a survey of recent work by several authors.
Gregory Wornell, MIT
Title: An InformationGeometric Framework for Learning in High Dimensions
Abstract: We describe a framework for addressing the problem of data feature selection prior to inference task specification, which is central to highdimensional learning. Introducing natural notions of universality for such problems, we develop a local equivalence among them. We further express the information geometry associated with this analysis, which represents a conceptually and computationally useful learning methodology. The development will highlight the interrelated roles of the singular value decomposition, HirschfeldGebeleinRenyi maximal correlation, canonical correlation and principle component analyses, Tishby's information bottleneck, Wyner's common information, Ky Fan knorms, and Brieman and Friedman's alternating conditional expectation algorithm. As will be discussed, this framework provides a basis for understanding and optimizing aspects of learning systems, including neural network architectures, matrix factorization methods for collaborative filtering, rankconstrained multivariate linear regression, and semisupervised learning, among others.
Poster Session:

Anish Agarwal: Model Agnostic Time Series Analysis Via Matrix Estimation.

Shubhada Agrawal:BestArm Identification in MultiArmed Bandits: A generalization beyond exponential families

Arghya Roy Chaudhuri : QuantileRegret Minimisation in Infinitely ManyArmed Bandit.

Somnath Chakraborty: Generating an equidistributed net on a sphere.

Anand Deo : Simple Closed Form Approximate Maximum Likelihood Estimation.

Alexander Reiffers: Information Design for Reducing Energy Consumption.

Sandhya Tripathi: Cost sensitive learning in the presence of symmetric label noise