STOR-i Seminar: David Leslie

David Leslie, University of Bristol

Monday 25 November 2013, 1500-1600
A54, Postgraduate Statistics Centre Lecture Theatre

Thompson sampling in multi-armed bandits games

When individuals are learning how to behave in an unknown environment, a statistically sensible thing to do is form posterior distributions over unknown quantities of interest (such as features of the environment and 'opponent' strategy) then select an action by integrating with respect to these posterior distributions. However reasoning with such distributions is very troublesome, even in a machine learning context with extensive computational resources; Savage himself indicated that Bayesian decision theory is only sensibly used in reasonably "small" situations.

Random beliefs is a framework in which individuals instead respond to a single sample from a posterior distribution. This is a strategy known as Thompson sampling, after its introduction in a medical trials context by Thompson (1933), and is used by both Microsoft and Google to select which adverts to show you. There is evidence from the psychological and animal behaviour disciplines to suggest that both humans and animals may also use such a strategy. In our work we prove that such behaviour 'solves' the exploration-exploitation dilemma in a contextual bandit setting, and demonstrate empirical that it does so 'better' than other provably convergent strategies. We also prove that such behaviour results in convergence to a Nash equilibrium of a normal form game.