{"id":474,"date":"2021-04-12T15:10:23","date_gmt":"2021-04-12T15:10:23","guid":{"rendered":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/?p=474"},"modified":"2021-05-10T09:54:01","modified_gmt":"2021-05-10T09:54:01","slug":"reinforcement-learning-temporal-difference-td-learning","status":"publish","type":"post","link":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/2021\/04\/12\/reinforcement-learning-temporal-difference-td-learning\/","title":{"rendered":"Reinforcement Learning: Temporal Difference (TD) Learning"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"474\" class=\"elementor elementor-474\">\n\t\t\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-6041451 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"6041451\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-8b61162\" data-id=\"8b61162\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-c8443d7 elementor-widget elementor-widget-text-editor\" data-id=\"c8443d7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><em>So following up from the last post, we are looking to estimate the value of the different states and actions our agent can take as part of some stochastic process. But how? By learning from experience. 
Trial and Error, over potentially thousands or millions of simulated episodes.<br \/><\/em><\/p><p><em>In particular, there are two methods in Reinforcement Learning that will show up everywhere, Q-Learning and SARSA, and you cannot read an RL text without encountering either of them or any of their bigger, better cousins. Frankly, it would be rude of me <strong>not<\/strong> to talk about them here.<\/em><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-2add1d2 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"2add1d2\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-9f36c5c\" data-id=\"9f36c5c\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-fef7243 elementor-widget-divider--view-line elementor-widget elementor-widget-divider\" data-id=\"fef7243\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"divider.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"elementor-divider\">\n\t\t\t<span class=\"elementor-divider-separator\">\n\t\t\t\t\t\t<\/span>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-1ddcee4 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"1ddcee4\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div 
class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-d693d1d\" data-id=\"d693d1d\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-e1294e3 elementor-widget elementor-widget-text-editor\" data-id=\"e1294e3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Temporal Difference (TD) learning is arguably the core concept in Reinforcement Learning. As the name suggests, it focuses on the differences the agent experiences in time. The methods aim, for some policy \\( \\pi \\), to maintain an estimate \\(V\\) of the value function \\(v_{\\pi}\\) of the policy over all states or state-action pairs, updating it as the agent experiences them.<\/p><p>The most basic method for TD learning is the TD(0) method. TD(0) updates the estimated value \\(V\\) of a state under the policy based on the reward the agent received and the value of the state it transitioned to. Specifically, if our agent is in the current state \\(s_{t}\\), takes the action \\(a_t\\) and receives the reward \\(r_{t+1}\\) on moving to \\(s_{t+1}\\), then we update our estimate of \\(V\\) following<\/p><p>\\begin{equation}\\label{TD0}<br \/>V(s_t) \\xleftarrow[]{} V(s_t) + \\alpha[r_{t+1} + \\gamma V(s_{t+1}) - V(s_t)],<br \/>\\end{equation}<\/p><p>a simple diagram of which can be seen below. The value \\([r_{t+1} + \\gamma V(s_{t+1}) - V(s_t)]\\) is commonly called the TD Error and is used in various forms throughout Reinforcement Learning. 
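In code, the update above is a single line. A minimal sketch, assuming a toy three-state chain (the states, rewards, step size and discount here are invented for illustration, not from the post):

```python
# Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)].
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    td_error = r + gamma * V[s_next] - V[s]  # the TD error
    V[s] += alpha * td_error
    return V

# Toy chain 0 -> 1 -> 2 (terminal), reward 1 on reaching the terminal state.
V = {0: 0.0, 1: 0.0, 2: 0.0}
for _ in range(500):
    td0_update(V, 0, 0.0, 1)   # transition 0 -> 1, reward 0
    td0_update(V, 1, 1.0, 2)   # transition 1 -> 2, reward 1
# V[1] creeps towards 1, and V[0] towards gamma * V[1] = 0.9
```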
Here the TD error is the difference between the current estimate \\(V(s_t)\\) and the sum of the actual reward gained in transitioning from \\(s_t\\) to \\(s_{t+1}\\) and the discounted estimate \\(\\gamma V(s_{t+1})\\). Repeated updates hence correct the error in \\(V(s_t)\\) slowly over many passes through the process. \\(\\alpha\\) is a constant step-size parameter that controls how quickly the Temporal Difference algorithm learns. For the algorithms that follow, we generally require \\(\\alpha\\) to be suitably small to guarantee convergence; however, the smaller the value of \\(\\alpha\\), the smaller the change made by each update, and therefore the slower the convergence.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-b4ee91f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b4ee91f\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-174d36b\" data-id=\"174d36b\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-a7e1352 elementor-widget elementor-widget-image\" data-id=\"a7e1352\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"688\" height=\"351\" src=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/TD_zero.png\" class=\"attachment-large size-large wp-image-477\" alt=\"TD zero\" 
srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/TD_zero.png 797w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/TD_zero-300x153.png 300w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/TD_zero-768x392.png 768w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/TD_zero-24x12.png 24w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/TD_zero-36x18.png 36w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/TD_zero-48x25.png 48w\" sizes=\"(max-width: 688px) 100vw, 688px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Example of the TD(0) Update<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-3314441 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3314441\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-f999b74\" data-id=\"f999b74\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-4975176 elementor-widget elementor-widget-text-editor\" data-id=\"4975176\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Temporal Difference learning is just trying to estimate the value function \\(v_{\\pi}(s_t)\\), a measure of how much the agent wants to be in a certain state, which we repeatedly improve via the observed reward and the current estimate of \\(v_{\\pi}(s_{t+1})\\). This way, the estimate of the current state relies on the estimates of all future states, so information slowly trickles down over many runs through the chain.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-5f51ccc elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"5f51ccc\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-a9f1d9e\" data-id=\"a9f1d9e\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-d253e3c elementor-widget-divider--view-line elementor-widget elementor-widget-divider\" data-id=\"d253e3c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"divider.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"elementor-divider\">\n\t\t\t<span class=\"elementor-divider-separator\">\n\t\t\t\t\t\t<\/span>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-b270fdd elementor-section-boxed elementor-section-height-default elementor-section-height-default\" 
data-id=\"b270fdd\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-97e7b2e\" data-id=\"97e7b2e\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-86478ef elementor-widget elementor-widget-text-editor\" data-id=\"86478ef\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Rather than estimating the state-value function, it is commonly more effective to estimate the action-value function for a particular policy, \\(q_{\\pi}(s,a)\\) for \\( s \\in S \\) and \\(a \\in A\\), commonly referred to as Q-Values (because a certain something came first). These are typically stored in an array, each cell referring to a specific state-action Q-Value.<\/p><p>Q-Learning is arguably the most popular Reinforcement Learning policy method. Formally it is an off-policy Temporal Difference control method, but I just want to introduce the method here. It is particularly popular as its update formula is both simple to follow and cheap to compute. 
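Concretely, such a table of Q-Values is just a 2-D array indexed by state and action; a minimal sketch (the gridworld size of 12 states and 4 cardinal actions is an assumption for illustration):

```python
import numpy as np

n_states, n_actions = 12, 4            # assumed small gridworld
Q = np.zeros((n_states, n_actions))    # one cell per state-action Q-Value

def greedy_action(Q, s):
    """Return the action with the highest current Q-Value in state s."""
    return int(np.argmax(Q[s]))

Q[3, 2] = 1.5   # e.g. our running estimate for action 2 in state 3
```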
The aim is to learn an estimate \\(Q(s,a)\\) of the optimal \\(q_{*}(s,a)\\) by having our agent play through and experience the series of states and actions, updating our estimates following<\/p><p>\\begin{equation}\\label{Q_Learning}<br \/>Q(s_t,a_t) \\xleftarrow[]{} Q(s_t,a_t) + \\alpha[r_{t+1} + \\gamma \\max_{a} Q(s_{t+1},a) - Q(s_t,a_t)].<br \/>\\end{equation}<\/p><p>Here our Q-Values are estimated by comparing the current Q-Value to the reward gained plus the maximal greedy option available to our agent in the next state \\(s_{t+1}\\) (a similar figure to the one for TD(0) is below), and hence we can calculate our estimated action-value function \\(Q(s,a)\\) directly. This estimate is independent of the policy currently being followed. The current policy only affects which states will be visited once the agent selects its action in the new state and moves thereafter. Q-learning performs its updates purely as a function of the seemingly optimal actions, regardless of what action will actually be chosen.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-e277750 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"e277750\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-b8ff1f2\" data-id=\"b8ff1f2\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-54b129e elementor-widget elementor-widget-image\" data-id=\"54b129e\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"688\" height=\"367\" src=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/Q_Learny.png\" class=\"attachment-large size-large wp-image-481\" alt=\"Q Learny\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/Q_Learny.png 768w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/Q_Learny-300x160.png 300w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/Q_Learny-24x13.png 24w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/Q_Learny-36x19.png 36w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/Q_Learny-48x26.png 48w\" sizes=\"(max-width: 688px) 100vw, 688px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Example of the Q-Learning update<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-0800f84 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"0800f84\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-5ac75a6\" data-id=\"5ac75a6\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div 
class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-8fff904 elementor-widget-divider--view-line elementor-widget elementor-widget-divider\" data-id=\"8fff904\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"divider.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"elementor-divider\">\n\t\t\t<span class=\"elementor-divider-separator\">\n\t\t\t\t\t\t<\/span>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-d5f73f4 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"d5f73f4\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-378a259\" data-id=\"378a259\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-84116f2 elementor-widget elementor-widget-text-editor\" data-id=\"84116f2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>SARSA is an on-policy Temporal Difference control method and can be seen as a more complex Q-Learning method. By <em>on-policy<\/em>, we refer to the idea that the estimate of \\(q_{\\pi}(s_t,a_t)\\) depends on our current policy \\(\\pi\\): when we make the update, we assume the agent will continue following \\(\\pi\\) for the remainder of its current episode, whatever states and actions that may lead it to choose. 
For the SARSA method, we make the update<\/p><p>\\begin{equation}\\label{SARSA}<br \/>Q(s_t,a_t) \\xleftarrow[]{} Q(s_t,a_t) + \\alpha[r_{t+1} + \\gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)].<br \/>\\end{equation}<\/p><p>The SARSA algorithm has one conceptual problem: the update assumes we know in advance what the next action \\(a_{t+1}\\) will be for any possible next state. This requires that we step forward and compute the next action of our policy when updating, and therefore learning is highly dependent on the current policy the agent is following. This complicates the exploration process, and it is therefore common to use some form of \\(\\epsilon\\)-soft policy for on-policy methods.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-290147f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"290147f\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-b674a26\" data-id=\"b674a26\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-f97e5bf elementor-widget-divider--view-line elementor-widget elementor-widget-divider\" data-id=\"f97e5bf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"divider.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"elementor-divider\">\n\t\t\t<span 
class=\"elementor-divider-separator\">\n\t\t\t\t\t\t<\/span>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-5b454f3 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"5b454f3\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-5ea5f9a\" data-id=\"5ea5f9a\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-a870358 elementor-widget elementor-widget-text-editor\" data-id=\"a870358\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>But which one is better? The cliff walking example, originally from Sutton &amp; Barto (2018), is commonly used to compare Q-Learning and SARSA, and appears in various other texts discussing the differences between the two, such as Dangeti (2017), who also provides a fully working Python example. Here, our agent may move in any cardinal direction at the cost of one unit, and if it falls off the cliff it incurs a cost of 100. Our aim is to find the shortest path across. The cliff example highlights the differences caused by the TD error terms of the two methods. 
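The contrast between the two update targets can be sketched side by side (the states, rewards and step size below are placeholders for illustration, not the Dangeti (2017) implementation):

```python
import numpy as np

alpha, gamma = 0.5, 1.0   # assumed step size and discount

def q_learning_update(Q, s, a, r, s_next):
    # Off-policy: bootstrap from the greedy (max) action in the next state.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: bootstrap from the action the policy actually takes next.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

# Two states, two actions; the next state already values action 1 highly.
Q = np.zeros((2, 2))
Q[1] = [0.0, 10.0]
q_learning_update(Q, 0, 0, 1.0, 1)   # target uses max Q[1] = 10
sarsa_update(Q, 0, 1, 1.0, 1, 0)     # target uses Q[1, 0] = 0
```

Q-Learning's target always grabs the greedy value of the next state even when the exploring policy would rarely pick it, while SARSA's target stays faithful to what the policy actually does.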
Q-Learning could be seen as the greedy method, as it always evaluates using the greedy option of the next state regardless of the current policy; so, as it walks the optimal path, there remains a small random chance of stepping off the cliff, where the agent is punished significantly. SARSA, however, always takes the longer yet safer route. Over many episodes, SARSA regularly outperforms Q-Learning on this problem.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-b754d94 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b754d94\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-4f4dbde\" data-id=\"4f4dbde\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1d83c1e elementor-widget elementor-widget-image\" data-id=\"1d83c1e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"688\" height=\"370\" src=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/cliff.png\" class=\"attachment-large size-large wp-image-483\" alt=\"Cliff\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/cliff.png 786w, 
https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/cliff-300x161.png 300w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/cliff-768x413.png 768w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/cliff-24x13.png 24w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/cliff-36x19.png 36w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/cliff-48x26.png 48w\" sizes=\"(max-width: 688px) 100vw, 688px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">SARSA vs Q-Learning (Dangeti 2017)<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-94256f9 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"94256f9\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-4c372fd\" data-id=\"4c372fd\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-c4384a9 elementor-widget-divider--view-line elementor-widget elementor-widget-divider\" data-id=\"c4384a9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"divider.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"elementor-divider\">\n\t\t\t<span 
class=\"elementor-divider-separator\">\n\t\t\t\t\t\t<\/span>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-679a448 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"679a448\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-2dcc476\" data-id=\"2dcc476\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-30bd98d elementor-widget elementor-widget-text-editor\" data-id=\"30bd98d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>So what&#8217;s the limit then? Well, both big-dogs of Reinforcement Learning have many improvements: double learning to reduce bias and TD(\\( \\lambda \\)) methods to improve convergence, but all are limited by storing our Q-Values in some great table or array. Now this is fine for millions, even tens-of-millions of states, but there are roughly \\(10^{170} \\) unique states in the ancient boardgame <em>Go<\/em>. Next time, we&#8217;ll be taking a look at combining Q-Learning with Artificial Neural Networks, and how computers finally trumped humans at that very game. 
All the best,<\/p>\n<p style=\"text-align: center\">\u2013 Jordan J Hood<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>So following up from the last post, we are looking to estimate the value of different states and actions or&hellip;<\/p>\n","protected":false},"author":29,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"slim_seo":{"title":"Reinforcement Learning: Temporal Difference (TD) Learning - Jordan J Hood","description":"So following up from the last post, we are looking to estimate the value of different states and actions or agent can take as part of some stochastic process. B"},"footnotes":""},"categories":[3,5],"tags":[],"class_list":["post-474","post","type-post","status-publish","format-standard","hentry","category-academic","category-mres"],"_links":{"self":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/posts\/474","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/users\/29"}],"replies":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/comments?post=474"}],"version-history":[{"count":13,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/posts\/474\/revisions"}],"predecessor-version":[{"id":497,"href":"https:\/
\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/posts\/474\/revisions\/497"}],"wp:attachment":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/media?parent=474"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/categories?post=474"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/tags?post=474"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}