Multi-Armed Bandits

My research interests lie in multi-armed bandits under real-world conditions. I am supervised by Steffen Grünewälder (Lancaster University) and Giovanni Zappella (Amazon Research Berlin).

Multi-Armed Bandits

The multi-armed bandit problem, named after the "one-armed bandit" slot machines found in casinos such as those of Las Vegas, is a sequential optimisation problem in which we aim to maximise the total payoff over a finite time horizon. We face a row of arms, each of which yields a random payoff when pulled. We begin with very limited knowledge of the environment and must play the arms to learn about their payoffs, so we need a method that balances exploring unfamiliar arms with exploiting the arm that looks best under our current knowledge. One family of approaches uses index methods, such as upper confidence bounds, in which we play the arm with the highest index and update the indices as rewards are observed. Another is Thompson sampling, a heuristic that draws a plausible expected reward for each arm from a posterior distribution and plays the arm whose draw is largest.
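As a rough illustration of these two strategies, here is a minimal Python sketch on Bernoulli arms. The arm probabilities, horizon and helper names are illustrative assumptions, not anything from the discussion above:

```python
import math
import random

def pull(p):
    """Simulate one play of a Bernoulli arm with success probability p."""
    return 1 if random.random() < p else 0

def ucb1(probs, horizon):
    """UCB1: play the arm with the highest empirical-mean-plus-confidence index."""
    k = len(probs)
    counts = [0] * k          # times each arm has been played
    means = [0.0] * k         # empirical mean reward of each arm
    total = 0
    for t in range(horizon):
        if t < k:             # play each arm once to initialise
            arm = t
        else:
            arm = max(range(k),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(probs[arm])
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # running mean update
        total += r
    return total, counts

def thompson(probs, horizon):
    """Thompson sampling: draw a plausible mean for each arm from its
    Beta posterior and play the arm whose draw is largest."""
    k = len(probs)
    wins = [1] * k            # Beta(1, 1) uniform prior on each arm
    losses = [1] * k
    total = 0
    for _ in range(horizon):
        arm = max(range(k), key=lambda i: random.betavariate(wins[i], losses[i]))
        r = pull(probs[arm])
        wins[arm] += r
        losses[arm] += 1 - r
        total += r
    return total, [wins[i] + losses[i] - 2 for i in range(k)]

random.seed(0)
reward_ucb, plays_ucb = ucb1([0.2, 0.5, 0.8], horizon=2000)
reward_ts, plays_ts = thompson([0.2, 0.5, 0.8], horizon=2000)
```

Both policies end up concentrating their plays on the best arm (probability 0.8 here), but they get there differently: UCB1 via a deterministic confidence index, Thompson sampling via posterior randomisation.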

Multi-Armed Bandits in Practice

Multi-armed bandit problems appear in a variety of applications, and are widely studied at STOR-i in settings such as search algorithms, recommendation systems and clinical trials. In recommendation systems, for example, each arm could correspond to an advert, with a payoff of 1 if the user clicks on it and 0 otherwise. However, this is an oversimplification of how users interact with adverts, so we must alter the bandit problem to capture the finer details of the real world. These alterations mean that the standard methods need to be adapted, and in some cases entirely new approaches need to be considered.

Real World Conditions

Returning to the advertisement setting, a business may care more about the revenue generated when an advert leads to a sale than about the immediate feedback of a click. Such feedback is delayed and usually anonymised, and this is one alteration to the bandit problem that has been studied previously. Another assumption of the standard problem is that rewards are stationary, whereas in advertising we would expect adverts to become less effective the more they are shown. Adverts can also die out entirely: once a new phone is released, there is little gain in advertising the models that preceded it. Finally, one can question the rigidity of existing approaches and construct more adaptive algorithms, which requires relaxing the assumptions that restrict the current form of bandit algorithms.
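To make the non-stationarity point concrete, here is a hedged sketch of one standard remedy: a sliding-window variant of UCB (in the spirit of Garivier and Moulines) that discards old observations so that stale evidence about a drifting advert is forgotten. The click probabilities, window length and horizon are illustrative assumptions rather than anything from the text above:

```python
import collections
import math
import random

def sliding_window_ucb(reward_fns, horizon, window=200):
    """Sliding-window UCB: indices use only the last `window` plays,
    so old observations of a drifting arm stop influencing decisions."""
    k = len(reward_fns)
    history = collections.deque()   # (arm, reward) pairs inside the window
    total = 0
    for t in range(horizon):
        counts = [0] * k
        sums = [0.0] * k
        for arm, r in history:      # recompute window statistics (O(window) per step; fine for a sketch)
            counts[arm] += 1
            sums[arm] += r

        def index(i):
            if counts[i] == 0:      # force a play of any arm unseen in the window
                return float("inf")
            return sums[i] / counts[i] + math.sqrt(2 * math.log(len(history)) / counts[i])

        arm = max(range(k), key=index)
        r = reward_fns[arm](t)
        history.append((arm, r))
        if len(history) > window:   # forget the oldest observation
            history.popleft()
        total += r
    return total

# Two hypothetical adverts whose effectiveness swaps halfway through:
# advert 0 is best early, advert 1 is best late.
random.seed(1)
arms = [
    lambda t: 1 if random.random() < (0.8 if t < 1000 else 0.2) else 0,
    lambda t: 1 if random.random() < (0.2 if t < 1000 else 0.8) else 0,
]
reward = sliding_window_ucb(arms, horizon=2000, window=200)
```

A standard UCB policy would keep averaging over the whole history and take a long time to notice the change; the window bounds how long old evidence can mislead the algorithm, at the cost of some extra exploration within each window.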
