Exploration and Exploitation in Structured Stochastic Bandits
Wouter Koolen, CWI

How can we design methods for interactive learning? In this tutorial we will model the environment as a  stochastic multi-armed bandit, and we will consider two classical tasks: regret minimization (reward maximization) and active testing (pure exploration). For each of these tasks, we will first study the limits of learning by developing information-theoretic lower bounds. These lower bounds reveal that the complexity of our sequential learning problems is governed by certain two-player zero-sum games. This games perspective allows us to develop practical algorithms based on iterative saddle point solvers. The methods that we obtain are general, powerful and flexible. They can be made to incorporate prior knowledge about the world in the form of structured bandit constraints, and they asymptotically match the lower bounds.

This talk is based on a series of papers with Emilie Kaufmann, Aurelien Garivier, Rémy Degenne, Pierre Ménard and Han Shao.

Lecture Slides