{"id":487,"date":"2021-04-18T10:23:48","date_gmt":"2021-04-18T10:23:48","guid":{"rendered":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/?p=487"},"modified":"2021-05-10T10:27:42","modified_gmt":"2021-05-10T10:27:42","slug":"reinforcement-learning-the-end-game","status":"publish","type":"post","link":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/2021\/04\/18\/reinforcement-learning-the-end-game\/","title":{"rendered":"Reinforcement Learning: The Ancient game of Go &#8211; Humans vs. AI."},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"487\" class=\"elementor elementor-487\">\n\t\t\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-02e3be3 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"02e3be3\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-a302292\" data-id=\"a302292\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-c1cafb4 elementor-widget elementor-widget-text-editor\" data-id=\"c1cafb4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><em>Last time we saw that the limitation of the traditional Q-Learning and Sarsa methods lies in the need to store a Q-value for every state-action pair in some giant array. While this is fine for millions or even tens of millions of states, the ancient board game Go has some \(10^{170} \) unique states. 
<br \/><\/em><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-a49ff6a elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"a49ff6a\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-a46fc1e\" data-id=\"a46fc1e\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b7e8c1e elementor-widget-divider--view-line elementor-widget elementor-widget-divider\" data-id=\"b7e8c1e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"divider.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"elementor-divider\">\n\t\t\t<span class=\"elementor-divider-separator\">\n\t\t\t\t\t\t<\/span>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-b0a7f56 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b0a7f56\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-3d5b223\" data-id=\"3d5b223\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element 
elementor-element-1230d19 elementor-widget elementor-widget-text-editor\" data-id=\"1230d19\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>For extremely large-scale problems we are limited in applying traditional RL methods not only by the need to hold the estimated learned values in memory (typically in some kind of giant table or array), but also by the time required for our agent to discover every state and complete the array. In such large problems our agent will regularly encounter new states, and so some general function approximation is needed to represent the necessary value or policy functions. In addition, although not discussed further here, a continuous observation\/action space or an infinite MDP causes significant computational limitations under traditional methods, so introducing some approximate method becomes necessary in these scenarios as well.<\/p><p>Common methods include Stochastic Gradient Descent and Linear Function Approximation, but I&#8217;d like to discuss the application of Artificial Neural Networks to RL. 
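To make linear function approximation a little more concrete, here is a toy sketch of a semi-gradient TD(0) update for a linear value function (the feature vectors, step size and discount factor below are all made up for illustration, not taken from this post):

```python
import numpy as np

def semi_gradient_td0(w, x, x_next, reward, alpha=0.05, gamma=0.9):
    """One semi-gradient TD(0) update for a linear value function v(s) = w . x(s).

    w      : weight vector being learned
    x      : feature vector of the current state
    x_next : feature vector of the next state
    reward : reward received on the transition
    """
    td_error = reward + gamma * (w @ x_next) - (w @ x)
    # the gradient of v(s) = w . x(s) with respect to w is simply x(s)
    return w + alpha * td_error * x

# made-up one-hot state features for a tiny 3-state problem
w = np.zeros(3)
w = semi_gradient_td0(w, np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]), reward=1.0)
```

The point of the approximation: instead of one table entry per state, we store one weight per feature, so memory scales with the feature representation rather than with the size of the state space.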
For general approximation methods, Sutton &amp; Barto (2018) has the better part of 200 pages dedicated to this topic.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-5d62f94 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"5d62f94\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-3f2b3af\" data-id=\"3f2b3af\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-618068a elementor-widget-divider--view-line elementor-widget elementor-widget-divider\" data-id=\"618068a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"divider.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"elementor-divider\">\n\t\t\t<span class=\"elementor-divider-separator\">\n\t\t\t\t\t\t<\/span>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-7fbaf80 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"7fbaf80\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-ae0aa6a\" data-id=\"ae0aa6a\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div 
class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-f820206 elementor-widget elementor-widget-text-editor\" data-id=\"f820206\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>An Artificial Neural Network (ANN) is fundamentally a model designed to imitate how the human brain works. It is a collection of nodes arranged into layers, with connections running between the nodes of adjacent layers (see below). Each node contains an activation function which determines what the node outputs. Typically this is a sigmoid function, which squashes the node&#8217;s input into a signal between 0 and 1; this signal is received by the next layer of nodes deeper in the structure, which each in turn perform the same process. Each layer learns about some feature of the system, and the network as a whole is trained to minimise some error term. RL uses ANNs in this way to estimate value functions, maximise expected reward or parameterise the policy in policy-gradient algorithms.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-2e22648 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"2e22648\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-44d5cec\" data-id=\"44d5cec\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6261e08 elementor-widget elementor-widget-image\" data-id=\"6261e08\" 
data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"461\" height=\"280\" src=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/ANNexamplebetter.jpg\" class=\"attachment-large size-large wp-image-494\" alt=\"ANNexamplebetter\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/ANNexamplebetter.jpg 461w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/ANNexamplebetter-300x182.jpg 300w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/ANNexamplebetter-24x15.jpg 24w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/ANNexamplebetter-36x22.jpg 36w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/ANNexamplebetter-48x29.jpg 48w\" sizes=\"(max-width: 461px) 100vw, 461px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">An ANN with Four Inputs &amp; Three Hidden Layers (Miller 2017)<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-dd2ed88 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"dd2ed88\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 
elementor-top-column elementor-element elementor-element-97bb304\" data-id=\"97bb304\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-75f89db elementor-widget elementor-widget-text-editor\" data-id=\"75f89db\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Deep Learning (using a Deep Neural Network, or DNN &#8211; a neural network comprising many layers) is most commonly associated with image recognition, classification or advanced text generation (see everything the amazing OpenAI team is working on). The general process is the same for DNNs: when an image is shown to the network, for example, it is split up into different inputs, processed by the network, and some output is determined. Any error in the output is pushed back through the network, adjusting the weightings of the different nodes\/neurons that make up the network. This is the process of back-propagation, and it is how both Artificial Neural Networks and Deep Learning systems <em>learn<\/em>. As Deep Learning tends to learn very slowly, it can be difficult to apply practically. However, in RL our agent can experience thousands of episodes, and therefore obtain thousands of training observations, very quickly, making Deep Learning particularly well suited to approximating the various learning functions required for RL. 
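The forward-then-backward process described above can be sketched in a few lines of numpy (a toy illustration: the layer sizes, learning rate and training input are all made up, and this is not any network from this post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy network: 2 inputs -> 2 hidden nodes -> 1 output, random starting weights
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))
W2 = rng.normal(size=(2, 1))

def backprop_step(x, target, lr=0.5):
    """One forward pass followed by one back-propagation step."""
    global W1, W2
    h = sigmoid(x @ W1)   # hidden layer activations
    y = sigmoid(h @ W2)   # network output
    # output error, pushed back through the network via the chain rule
    d_out = (y - target) * y * (1 - y)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    # adjust the weights to reduce the error
    W2 -= lr * np.outer(h, d_out)
    W1 -= lr * np.outer(x, d_hid)
    return float(((y - target) ** 2).sum())

# repeated steps on one example should steadily shrink the error
losses = [backprop_step(np.array([1.0, 0.5]), np.array([1.0])) for _ in range(50)]
```

Each call is one "observation": the output error adjusts every weight a little, which is why having thousands of RL episodes to learn from matters so much.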
This process is called Deep Reinforcement Learning (DRL).<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-49e1395 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"49e1395\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-86c75fc\" data-id=\"86c75fc\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5c0ffd3 elementor-widget-divider--view-line elementor-widget elementor-widget-divider\" data-id=\"5c0ffd3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"divider.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"elementor-divider\">\n\t\t\t<span class=\"elementor-divider-separator\">\n\t\t\t\t\t\t<\/span>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-8589d1c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"8589d1c\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-12a8fef\" data-id=\"12a8fef\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div 
class=\"elementor-element elementor-element-58a338d elementor-widget elementor-widget-text-editor\" data-id=\"58a338d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>On the 9th of April 2015, Google DeepMind patented an algorithm (https:\/\/patents.google.com\/patent\/US20150100530A1\/en) called the Deep Q-Network (DQN). DQN uses a Deep Neural Network to map a high-dimensional state-action space into Q-Value estimates, to which traditional Q-Learning can then be applied. DQN uses Experience Replay, Target Network and Reward Clipping methods to handle the high variance in learning caused by the typically high correlation between the inputs the agent receives. Experience Replay randomises the order of a set of recent observations to remove this high correlation, and the estimates of the state-action values are only periodically synchronised between the Neural Network and the Q-Learning algorithm. The state-value update step for DQN follows<\/p><p>\begin{equation}\label{DQN}<br \/>Q(s_t,a_t) \xleftarrow[]{} Q(s_t,a_t) + \alpha[r_{t+1} + \gamma \max_{a} Q(s_{t+1},a | \theta) - Q(s_t,a_t | \theta)],<br \/>\end{equation}<\/p><p>where \( \theta \) denotes the weights of the Neural Network, hence \( Q(s_t,a_t | \theta) \) is the predicted Q-Value and \( Q(s_{t+1},a | \theta)\) is the target\/objective Q-Value of the Neural Network. 
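To make the update step concrete, here is a minimal numerical sketch of it (all the numbers are made up; in a real DQN the two Q-values come from the prediction and target networks rather than being passed in directly):

```python
import numpy as np

def dqn_update(q_pred, q_next, reward, alpha=0.1, gamma=0.99):
    """One DQN-style update for a single transition.

    q_pred : predicted Q-value, Q(s_t, a_t | theta)
    q_next : vector of target Q-values Q(s_{t+1}, a | theta), one per action a
    """
    target = reward + gamma * np.max(q_next)   # r_{t+1} + gamma * max_a Q(s_{t+1}, a | theta)
    return q_pred + alpha * (target - q_pred)  # move the estimate toward the target

# a single made-up transition: reward 1, two actions available in the next state
new_q = dqn_update(q_pred=0.0, q_next=np.array([1.0, 2.0]), reward=1.0)
```

The learning rate \( \alpha \) controls how far the estimate moves toward the target on each transition; the target itself is held fixed between the periodic target-network updates mentioned above.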
Lanham (2020) provides a detailed working example in python of DQNs using PyTorch.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-8454825 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"8454825\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-ef2bfe1\" data-id=\"ef2bfe1\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-2274ee2 elementor-widget-divider--view-line elementor-widget elementor-widget-divider\" data-id=\"2274ee2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"divider.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"elementor-divider\">\n\t\t\t<span class=\"elementor-divider-separator\">\n\t\t\t\t\t\t<\/span>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-22c9719 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"22c9719\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-edd6d2f\" data-id=\"edd6d2f\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap 
elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1ea7486 elementor-widget elementor-widget-text-editor\" data-id=\"1ea7486\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Because Deep Reinforcement Learning is designed for the most complex of scenarios, and is capable of handling online situations where the agent lacks complete information, many of the highest-profile examples of Reinforcement Learning apply methods from Deep Reinforcement Learning.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-4421f0f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"4421f0f\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-9403e55\" data-id=\"9403e55\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-ce57d75 elementor-widget elementor-widget-image\" data-id=\"ce57d75\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"612\" height=\"408\" src=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/gettyimages-1251189536-612x612-1.jpg\" class=\"attachment-large size-large wp-image-498\" alt=\"Close up wooden board 
for playing go with black and white stones on shadow\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/gettyimages-1251189536-612x612-1.jpg 612w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/gettyimages-1251189536-612x612-1-300x200.jpg 300w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/gettyimages-1251189536-612x612-1-24x16.jpg 24w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/gettyimages-1251189536-612x612-1-36x24.jpg 36w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/gettyimages-1251189536-612x612-1-48x32.jpg 48w\" sizes=\"(max-width: 612px) 100vw, 612px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-763075b elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"763075b\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-4f1acfc\" data-id=\"4f1acfc\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-38d11ff elementor-widget elementor-widget-text-editor\" data-id=\"38d11ff\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>During the last 30 years, humans have been 
surpassed by Artificial Intelligence in many board games, such as Backgammon and Chess, but for the game of <em>Go<\/em> it proved difficult to produce an AI algorithm that could challenge the highest-rated players in the world. In 2016, AlphaGo, developed by DeepMind, defeated the then 18-time world Go champion Lee Sedol in Korea by 4 wins to 1 loss. AlphaGo was designed with a combination of Deep Reinforcement Learning, Supervised Learning and Monte Carlo Tree Search, and relied on a huge database of professional human moves. In 2017 DeepMind released AlphaGo Zero, a successor algorithm that relied solely on Deep Reinforcement Learning, learning exclusively from games played against itself. Its neural network takes an image of the raw board state and the history of states, and outputs a vector of probabilities for the agent&#8217;s next action together with a value estimating the probability of the agent winning from the current state.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-aeedc9e elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"aeedc9e\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-5a31fb3\" data-id=\"5a31fb3\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-d8f67a3 elementor-widget elementor-widget-video\" data-id=\"d8f67a3\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-settings=\"{&quot;youtube_url&quot;:&quot;https:\\\/\\\/www.youtube.com\\\/watch?v=WXuK6gekU1Y&quot;,&quot;video_type&quot;:&quot;youtube&quot;,&quot;controls&quot;:&quot;yes&quot;}\" data-widget_type=\"video.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"elementor-wrapper elementor-open-inline\">\n\t\t\t<div class=\"elementor-video\"><\/div>\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-0dca7c2 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"0dca7c2\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-ffa4c43\" data-id=\"ffa4c43\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-0a19e26 elementor-widget elementor-widget-text-editor\" data-id=\"0a19e26\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The occasion was a significant event, with many analysts predicting that Artificial Intelligence research was more than 10 years away from conquring the game of <em>Go<\/em>.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-12bef5a elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"12bef5a\" data-element_type=\"section\" 
data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-54e9fd7\" data-id=\"54e9fd7\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-e19983a elementor-widget-divider--view-line elementor-widget elementor-widget-divider\" data-id=\"e19983a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"divider.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"elementor-divider\">\n\t\t\t<span class=\"elementor-divider-separator\">\n\t\t\t\t\t\t<\/span>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-451cf8c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"451cf8c\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-8de218a\" data-id=\"8de218a\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-ed9c312 elementor-widget elementor-widget-text-editor\" data-id=\"ed9c312\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Thank you for joining me through the steps of Reinforcement Learning, and I hope you found some of the power of RL enlightening. 
There might be one more blog post to come from me on Reinforcement Learning, discussing the current Endgame of RL research, although that will be a much shorter post if I decide to write it. Thank you for your time and all the best,<\/p><p style=\"text-align: center\">&#8211; Jordan J Hood<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Last time we introduced that the limit to the traditional Q-Learning and Sarsa methods lies in the need to store&hellip;<\/p>\n","protected":false},"author":29,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"slim_seo":{"title":"Reinforcement Learning: The Ancient game of Go - Humans vs. AI. - Jordan J Hood","description":"Last time we introduced that the limit to the traditional Q-Learning and Sarsa methods lies in the need to store each Q-value for every state-action pair in 
som"},"footnotes":""},"categories":[3,5],"tags":[],"class_list":["post-487","post","type-post","status-publish","format-standard","hentry","category-academic","category-mres"],"_links":{"self":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/posts\/487","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/users\/29"}],"replies":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/comments?post=487"}],"version-history":[{"count":5,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/posts\/487\/revisions"}],"predecessor-version":[{"id":503,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/posts\/487\/revisions\/503"}],"wp:attachment":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/media?parent=487"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/categories?post=487"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/tags?post=487"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}