{"id":204,"date":"2021-02-02T12:30:00","date_gmt":"2021-02-02T12:30:00","guid":{"rendered":"http:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/maddie-smith\/?p=204"},"modified":"2024-01-19T09:51:08","modified_gmt":"2024-01-19T09:51:08","slug":"learn-from-your-mistakes-multi-armed-bandits","status":"publish","type":"post","link":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/maddie-smith\/2021\/02\/02\/learn-from-your-mistakes-multi-armed-bandits\/","title":{"rendered":"Learn From Your Mistakes &#8211; Multi-armed Bandits"},"content":{"rendered":"\n<p>In a recent talk given to the MRes students, I was asked for my opinion on a multi-armed bandit problem. In these working from home times, I\u2019m sure most of us know of the combined dread and panic that comes with taking your microphone off mute to speak on a call. I contemplated the question, and then gave my answer. As you might have guessed from the title of this talk, I was wrong. But I certainly wouldn\u2019t get this problem wrong again, because I had learned from my mistake.&nbsp;<\/p>\n\n\n\n<p>Ironically enough, learning from your mistakes, or past behaviour, is an idea that is strongly rooted in the multi-armed bandit problem.<em> And thus, a blog post was inspired!&nbsp;<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span style=\"text-decoration: underline\">Multi-armed bandits<\/span><\/h2>\n\n\n\n<p class=\"has-text-color\" style=\"color:#03989e\"><strong>So, let\u2019s get started. What exactly is a multi-armed bandit?<\/strong><\/p>\n\n\n\n<p>When most people hear \u2018multi-armed bandit\u2019, they may think of gambling. That is because a one-armed bandit is a machine where you can pull a lever, with the hopes of winning a prize. But of course, you may not win anything at all. 
It is this idea which constitutes a multi-armed bandit problem.<\/p>\n\n\n\n<p>Multi-armed bandit problems are a class of problems where we can associate a particular <strong>score<\/strong> with each of our decisions at each point in time. This score includes the <strong>immediate benefit<\/strong> or cost of making that decision, plus some <strong>future benefit.&nbsp;<\/strong><\/p>\n\n\n\n<p>Imagine that we have K independent one-armed bandits, and we get to play one of these bandits at each time <em>t<\/em> = 0, 1, 2, 3, \u2026. These are very simple bandits, where we either win or lose upon pulling the arm. We\u2019ll define losing as simply winning nothing.<\/p>\n\n\n\n<p>Now, if a win occurs at some time <em>t<\/em>, then you gain some discounted reward <em>a<sup>t<\/sup><\/em>, where 0 &lt; <em>a<\/em> &lt; 1. Rewards are therefore discounted over time; this means that a reward in the future is worth less to you than a reward now. <em>The mathematically minded among you may realise that this means that the total reward we could possibly earn is bounded.<\/em><\/p>\n\n\n\n<p>The probability of a success upon pulling bandit <em>i<\/em> is unknown, and denoted by \u03c0<sub>i<\/sub>. Since these success probabilities are unknown, we have to learn what the \u03c0<sub>i<\/sub>s are and profit from them as we go. This means that, in the early stages, we have to pull some of the arms just to see what \u03c0<sub>i<\/sub> might be like.<\/p>\n\n\n\n<p>At each time <em>t<\/em>, our choice of which bandit to play is informed by each bandit\u2019s record of successes and failures to date. For example, if I know that bandit 2 has given me a success every time I pulled it, I might be inclined to pull bandit 2 again. On the other hand, if bandit 4 has given me a failure most of the time, I might want to avoid this bandit. 
Thus, we are using previous data which we have obtained about each of the bandits in order to update our beliefs about the bandits\u2019 probability distributions.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span style=\"text-decoration: underline\">The Maths<\/span><\/h2>\n\n\n\n<p>Updating our beliefs about the probability distributions of the bandits in this way uses the <strong>Bayesian<\/strong> interpretation of statistics. Let\u2019s imagine that we have a parameter that we want to determine (in our case, the probability of success for each of the K bandits). Maybe we have some prior ideas about what the probability of success for each bandit will be. This could be due to previous experiments we know about, or maybe just personal beliefs. These prior beliefs are described by the <strong>prior distributions<\/strong> for the parameters.&nbsp;<\/p>\n\n\n\n<p>Then, let\u2019s imagine we begin our bandit experiment. After time <em>t<\/em>, we have made <em>t<\/em> pulls, and we now have information detailing the number of successes and failures for each of the bandits pulled. At this time, we take this observed data into account in order to modify what we think the probability distributions for the parameters look like. These updated distributions are called the <strong>posterior distributions<\/strong>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span style=\"text-decoration: underline\">The Question\u2026<\/span><\/h2>\n\n\n\n<p class=\"has-text-color\" style=\"color:#03989e\"><strong>How do we play the one-armed bandits in order to maximise the total gain?&nbsp;<\/strong><\/p>\n\n\n\n<p>Well, this is where the importance of learning from your mistakes comes in. 
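<\/p>\n\n\n\n<p>As a concrete illustration of this updating step, here is a small sketch in Python. It assumes a Beta prior on each bandit\u2019s success probability; this is a standard conjugate choice for win\/lose outcomes, and my assumption rather than something specified above. Under it, the posterior is simply another Beta distribution whose parameters accumulate the observed successes and failures.<\/p>\n\n\n\n

```python
# Sketch of the Bayesian update for a win/lose bandit, assuming
# (illustratively) a Beta prior on the unknown success probability.
# With a Beta(a, b) prior, observing s successes and f failures
# gives a Beta(a + s, b + f) posterior.

def posterior(successes, failures, prior_a=1, prior_b=1):
    """Beta posterior parameters after the observed pulls (uniform prior by default)."""
    return prior_a + successes, prior_b + failures

def posterior_mean(successes, failures, prior_a=1, prior_b=1):
    """Updated point estimate of the success probability."""
    a, b = posterior(successes, failures, prior_a, prior_b)
    return a / (a + b)

# A bandit that has given 3 successes and 1 failure:
print(posterior(3, 1))       # -> (4, 2), i.e. a Beta(4, 2) posterior
print(posterior_mean(3, 1))  # -> 0.666..., pulled up from the prior mean of 0.5
```

\n\n\n\n<p>The more pulls a bandit has, the more its posterior is dominated by data rather than by the prior, which is exactly the sense in which we learn from past behaviour.<\/p>\n\n\n\n<p>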
Imagine the case where K = 5, and at time t = 7 our observed data are as follows:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"343\" src=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/maddie-smith\/wp-content\/uploads\/sites\/27\/2021\/01\/Screenshot-2021-01-31-at-12.08.43-1024x343.png\" alt=\"Table showing the results from the past seven arm plays. Bandit 1 has been pulled 4 times, with 3 successes and 1 fail. Bandit 2 hasn't been played. Bandit 3 has been pulled once and achieved one success. Bandit 4 has been pulled twice and received one success and one fail. Bandit 5 hasn't been played. \" class=\"wp-image-212\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/maddie-smith\/wp-content\/uploads\/sites\/27\/2021\/01\/Screenshot-2021-01-31-at-12.08.43-1024x343.png 1024w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/maddie-smith\/wp-content\/uploads\/sites\/27\/2021\/01\/Screenshot-2021-01-31-at-12.08.43-300x100.png 300w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/maddie-smith\/wp-content\/uploads\/sites\/27\/2021\/01\/Screenshot-2021-01-31-at-12.08.43-768x257.png 768w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/maddie-smith\/wp-content\/uploads\/sites\/27\/2021\/01\/Screenshot-2021-01-31-at-12.08.43.png 1242w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Looking at our observed data, Bandit 1 has the highest proportion of successes out of the number of&nbsp;times it was played. So maybe we should pull Bandit 1 again at the next time step?<\/p>\n\n\n\n<p class=\"has-text-color\" style=\"color:#03989e\"><strong>Well, maybe not. <\/strong><\/p>\n\n\n\n<p>Notice that Bandits 2 and 5 haven\u2019t been played yet at all; therefore we can\u2019t really infer anything about them. Pulling Bandit 1 might give us a success on our next attempt, but it could also give us a fail. 
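<\/p>\n\n\n\n<p>To make this concrete, we can compute each bandit\u2019s posterior mean success probability from the table above. I am assuming a uniform Beta(1, 1) prior here purely for illustration; under it, the posterior mean after w wins in n pulls is (1 + w) \/ (2 + n).<\/p>\n\n\n\n

```python
# Posterior mean success probability for each bandit in the table,
# under an assumed uniform Beta(1, 1) prior: (1 + wins) / (2 + pulls).
records = {1: (3, 1), 2: (0, 0), 3: (1, 0), 4: (1, 1), 5: (0, 0)}  # bandit: (wins, losses)

means = {bandit: (1 + wins) / (2 + wins + losses)
         for bandit, (wins, losses) in records.items()}

for bandit, mean in means.items():
    print(f"Bandit {bandit}: {mean:.2f}")
# Bandit 1: 0.67, Bandit 2: 0.50, Bandit 3: 0.67, Bandit 4: 0.50, Bandit 5: 0.50
```

\n\n\n\n<p>Bandit 1 looks good, but the 0.50s for Bandits 2 and 5 come entirely from the prior: their posteriors are still spread across the whole interval, so a point estimate hides how little we actually know about them.<\/p>\n\n\n\n<p>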
We know nothing about Bandits 2 and 5; perhaps these bandits have a probability of success of 1?&nbsp;<\/p>\n\n\n\n<p>The idea of pulling the best bandits to maximise your expected number of successes relies on a balance between <strong>exploration and exploitation<\/strong>. Exploration refers to playing bandits that we don\u2019t know much about; in this case, pulling arms 2 and 5. Exploitation, on the other hand, means pulling the arms of bandits that we already know might give us a good result; in our case, pulling Bandit 1 again.&nbsp;<\/p>\n\n\n\n<p>Every time we pull an arm, we are actually receiving both a present benefit and a future benefit. The present benefit is what we win, and the future benefit is what we learn. Recall that at time <em>t<\/em>, if we win, we receive a discounted reward of <em>a<sup>t<\/sup><\/em>. Therefore, how close <em>a<\/em> is to 1 determines how important it is to learn about the future. For example, if <em>a<\/em> is close to 1, many future pulls will make a big difference. On the other hand, if <em>a<\/em> is closer to 0, our present benefit is more important. Again, this relates to the importance of balancing exploration and exploitation.<\/p>\n\n\n\n<p>So how do we quantify both the immediate benefit and the future benefit from these bandits, in such a way that we can play the bandits to maximise our total gain? It is possible to take the posterior distribution for each bandit at each point in time, and associate a score with it that represents both the future and present benefit of pulling that bandit.&nbsp;<\/p>\n\n\n\n<p>It turns out that there is something called the <strong>Gittins index<\/strong>, a value that can be assigned to every bandit. If we compute the Gittins index for all K bandits at time <em>t<\/em>, it is Bayes optimal to play any bandit with the largest Gittins index, which is pretty cool! 
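<\/p>\n\n\n\n<p>Computing Gittins indices exactly takes real work, so as a taster, here is a sketch of <strong>Thompson sampling<\/strong>, a simpler Bayesian strategy in the same exploration-exploitation spirit (to be clear, this is a related heuristic, not the Gittins index itself): draw a success probability from each bandit\u2019s posterior and play the bandit with the largest draw. The true probabilities below are made up purely for the simulation.<\/p>\n\n\n\n

```python
import random

def thompson_choice(records, rng):
    """Pick an arm by sampling each Beta posterior (uniform Beta(1, 1) prior assumed)."""
    draws = {arm: rng.betavariate(1 + wins, 1 + losses)
             for arm, (wins, losses) in records.items()}
    return max(draws, key=draws.get)

# Hidden true success probabilities, invented for illustration.
true_p = {1: 0.7, 2: 0.4, 3: 0.55}
records = {arm: (0, 0) for arm in true_p}

rng = random.Random(0)  # fixed seed so the run is reproducible
for t in range(1000):
    arm = thompson_choice(records, rng)
    wins, losses = records[arm]
    if rng.random() < true_p[arm]:
        records[arm] = (wins + 1, losses)
    else:
        records[arm] = (wins, losses + 1)

pulls = {arm: wins + losses for arm, (wins, losses) in records.items()}
print(pulls)  # most pulls should concentrate on arm 1, the best arm
```

\n\n\n\n<p>Arms that currently look weak still receive occasional draws, because their wide posteriors sometimes produce a large sample, so the strategy keeps exploring while mostly exploiting, much like the index-based policy described above.<\/p>\n\n\n\n<p>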
<\/p>\n\n\n\n<p>So there we have it: learning from past experience is critical in the world of multi-armed bandits, and also in many other areas of mathematics, including reinforcement learning and Bayesian optimisation. The trade-off between exploration and exploitation is central, and perhaps even serves as a reminder of the importance of not being afraid to make mistakes in our everyday lives&#8230; just a thought.<\/p>\n\n\n\n<p>If you enjoyed this post, be sure to check out the following further reading resources!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Further Reading<\/h2>\n\n\n\n<p><strong><a href=\"https:\/\/arxiv.org\/abs\/1904.07272\">Introduction to Multi-Armed Bandits &#8211; Aleksandrs Slivkins<\/a><\/strong> This book gives a good introduction to multi-armed bandits for those interested, and goes into detail on some of their many uses. <\/p>\n\n\n\n<p><a href=\"https:\/\/www.cs.cornell.edu\/courses\/cs6840\/2017sp\/lecnotes\/6840sp17R_Kleinberg.pdf\"><strong>These lecture notes <\/strong><\/a>are really great for those of you looking for a gentle introduction to the Gittins index and multi-armed bandits! Those new to the subject may also want to look at <a href=\"https:\/\/en.wikipedia.org\/wiki\/Gittins_index\">the Wikipedia page<\/a> as a good starting point.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.jstor.org\/stable\/2335176?seq=2#metadata_info_tab_contents\"><strong>A Dynamic Allocation Index for the Discounted Multiarmed Bandit Problem &#8211; J. C. Gittins and D. M. Jones. <\/strong><\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/www.jstor.org\/stable\/2984953?seq=1#metadata_info_tab_contents\"><strong>Multi-Armed Bandits and the Gittins Index &#8211; P. 
Whittle.<\/strong><\/a><\/p>\n\n\n\n<p>If you&#8217;re interested in exploration vs exploitation in reinforcement learning, <a href=\"https:\/\/towardsdatascience.com\/exploration-in-reinforcement-learning-e59ec7eeaa75\">check out this post<\/a> and <a href=\"https:\/\/www.manifold.ai\/exploration-vs-exploitation-in-reinforcement-learning\">this post<\/a>.<\/p>\n\n\n\n<p>If you&#8217;re interested in exploration vs exploitation in Bayesian optimisation, <a href=\"http:\/\/krasserm.github.io\/2018\/03\/21\/bayesian-optimization\/\">check out this post<\/a>. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>In a recent talk given to the MRes students, I was asked for my opinion on a multi-armed bandit problem. In these working from home times, I\u2019m sure most of us know of the combined dread and panic that comes with taking your microphone off &hellip; <\/p>\n","protected":false},"author":28,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[14],"tags":[11,15,6],"class_list":["post-204","post","type-post","status-publish","format-standard","hentry","category-optimisation","tag-operational-research","tag-optimisation","tag-statistics"],"_links":{"self":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/maddie-smith\/wp-json\/wp\/v2\/posts\/204","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/maddie-smith\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/maddie-smith\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/maddie-smith\/wp-json\/wp\/v2\/users\/28"}],"replies":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/maddie-smith\/wp-json\/wp\/v2\/comments?post=204"}],"version-history":[{"count":16,"href":"https:\/\/www.lancaster.ac.uk\/stor-
i-student-sites\/maddie-smith\/wp-json\/wp\/v2\/posts\/204\/revisions"}],"predecessor-version":[{"id":393,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/maddie-smith\/wp-json\/wp\/v2\/posts\/204\/revisions\/393"}],"wp:attachment":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/maddie-smith\/wp-json\/wp\/v2\/media?parent=204"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/maddie-smith\/wp-json\/wp\/v2\/categories?post=204"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/maddie-smith\/wp-json\/wp\/v2\/tags?post=204"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}