{"id":465,"date":"2021-03-27T19:11:31","date_gmt":"2021-03-27T19:11:31","guid":{"rendered":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/?p=465"},"modified":"2021-05-10T04:49:59","modified_gmt":"2021-05-10T04:49:59","slug":"reinforcement-learning-markov-decision-processes-mdps","status":"publish","type":"post","link":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/2021\/03\/27\/reinforcement-learning-markov-decision-processes-mdps\/","title":{"rendered":"Reinforcement Learning: Markov Decision Processes (MDPs)"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"465\" class=\"elementor elementor-465\">\n\t\t\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-19b6e24 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"19b6e24\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-22a53b9\" data-id=\"22a53b9\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-2908a76 elementor-widget elementor-widget-text-editor\" data-id=\"2908a76\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>For starters, what is Reinforcement Learning? When we learn in the real world, we are subconsciously aware of our surroundings and how they might respond to us when we take an action, the goal we wish to achieve via our actions, along with any repercussions our actions might have. 
Once we complete an action, we learn based on the consequences how effective our decisions were, and if necessary we adjust our actions for the future. In some sense, Reinforcement Learning hopes to replicate this more natural style of learning within a computational framework.\u00a0<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-bfe1d41 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"bfe1d41\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-c431ed2\" data-id=\"c431ed2\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b1eebb0 elementor-widget-divider--view-line elementor-widget elementor-widget-divider\" data-id=\"b1eebb0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"divider.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"elementor-divider\">\n\t\t\t<span class=\"elementor-divider-separator\">\n\t\t\t\t\t\t<\/span>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-dea89bc elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"dea89bc\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 
elementor-top-column elementor-element elementor-element-775caed\" data-id=\"775caed\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-ba562fd elementor-widget elementor-widget-text-editor\" data-id=\"ba562fd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><em>As part of the MRes course we&#8217;re going off to dive into the literature of an area of our choosing, with the aim of producing a longer report on the literature available. For me, this has been Reinforcement Learning. This will be the first in a series of posts in which I try to do a little bit of justice to the topic of RL.<\/em><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-6d970cc elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"6d970cc\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-3a2c1a4\" data-id=\"3a2c1a4\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-962eafd elementor-widget-divider--view-line elementor-widget elementor-widget-divider\" data-id=\"962eafd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"divider.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"elementor-divider\">\n\t\t\t<span 
class=\"elementor-divider-separator\">\n\t\t\t\t\t\t<\/span>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-75f58f6 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"75f58f6\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-befd26b\" data-id=\"befd26b\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6ab238d elementor-widget elementor-widget-text-editor\" data-id=\"6ab238d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Let&#8217;s start with a little example I thought up while with a flatmate. If the goal is to recycle a can as fast as possible, one might simply walk over and place the can in the recycling. This option might be too slow for some people, however, and alternatively one could throw the can at the recycling and hope it lands safely in the recycling bin. We are aware that while the second option could achieve the stated goal of correctly disposing of the can, and might save us a small amount of time, there exists the possible outcome where the can lands somewhere on the floor. 
We then must gather up the item and place it into the recycling bin while groaning as your pasta secretly boils over in the background.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-94ea025 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"94ea025\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-b88d789\" data-id=\"b88d789\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-2ffe930 elementor-widget elementor-widget-image\" data-id=\"2ffe930\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"688\" height=\"370\" src=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/recycle_ran.png\" class=\"attachment-large size-large wp-image-466\" alt=\"Recycle ran\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/recycle_ran.png 691w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/recycle_ran-300x162.png 300w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/recycle_ran-24x13.png 24w, 
https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/recycle_ran-36x19.png 36w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-content\/uploads\/sites\/28\/2021\/05\/recycle_ran-48x26.png 48w\" sizes=\"(max-width: 688px) 100vw, 688px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Recycling Can Example<\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-b77dc01 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b77dc01\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-e1671ce\" data-id=\"e1671ce\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-cc7d607 elementor-widget elementor-widget-text-editor\" data-id=\"cc7d607\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>While not the most life-changing example, the Recycling example does show how many of the basic things we do in life are all made up of the same core parts as our Reinforcement Learning (and Markov Decision) problems.<\/p><p>We begin with an <strong>agent<\/strong> (the human in the recycling example) that must be able to make decisions within the scenario and an <strong>environment<\/strong> with which the agent must be able to interact. 
This environment includes various <strong>states<\/strong> that describe the conditions that the agent is in (the recycling example has three states: the can in hand, in the recycling bin, or somewhere on the floor). The state encompasses all of the information that is available to the agent at a given point in time. There must be a set of <strong>actions<\/strong> that the agent can take depending on the state they are in (the recycling agent can choose between throwing the can or carrying it), and there must be a <strong>reward signal<\/strong> that the agent receives after every action they take. The reward signal can be negative so as to act as a disincentive for certain actions (as in the recycling example).<\/p><p>All in all, this is the basic framework for Markov Decision Processes.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-17ae509 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"17ae509\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-0cb0de6\" data-id=\"0cb0de6\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-a38d8e3 elementor-widget-divider--view-line elementor-widget elementor-widget-divider\" data-id=\"a38d8e3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"divider.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"elementor-divider\">\n\t\t\t<span 
class=\"elementor-divider-separator\">\n\t\t\t\t\t\t<\/span>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-115bd35 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"115bd35\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-7454750\" data-id=\"7454750\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-2de5c37 elementor-widget elementor-widget-text-editor\" data-id=\"2de5c37\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Markov Decision Processes (MDP) provide a classical formalisation for ordered decisions with stochastic components, and can be used to represent shortest path problems by constructing a general Markov decision problem. A Markov Decision Process relies on the notion of state, action, reward (just like above) and some transitional distribution for each action that describes how the agent moves between states. An MDP can be described as a controlled Markov chain, where the control is given at each step by the chosen action. The process then visits a sequence of states and can be evaluated through the observed rewards. 
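To make the controlled-Markov-chain view concrete, here is a minimal Python sketch of the recycling example. Every state name, action name, reward, and probability below is invented purely for illustration; it is just one way of writing down some dynamics p(s', r | s, a) and sampling from them.

```python
import random

# A toy, hand-invented model of the recycling example as a controlled
# Markov chain. dynamics[(state, action)] is a list of
# (next_state, reward, probability) outcomes, i.e. p(s', r | s, a).
dynamics = {
    ("in_hand", "carry"):    [("in_bin", 1.0, 1.0)],             # slow but safe
    ("in_hand", "throw"):    [("in_bin", 2.0, 0.6),              # quick success
                              ("on_floor", -1.0, 0.4)],          # messy failure
    ("on_floor", "pick_up"): [("in_bin", 0.0, 1.0)],             # clean up
}

def step(state, action, rng=random.random):
    """Sample (next_state, reward) from p(s', r | s, a)."""
    outcomes = dynamics[(state, action)]
    u, cum = rng(), 0.0
    for next_state, reward, prob in outcomes:
        cum += prob
        if u <= cum:
            return next_state, reward
    return outcomes[-1][:2]  # guard against floating-point rounding

s, r = step("in_hand", "carry")
print(s, r)  # in_bin 1.0
```

Controlling the chain then just means choosing an action at each step and observing the sampled rewards, exactly as described above.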
Formally, we define the probability of transitioning between certain states, called the <em>dynamics<\/em> of the MDP, as the probability distribution:<\/p><p>\\begin{equation}\\label{dynamics}<br \/>\u00a0 p( s&#8217;, r | s , a ) = \\mathcal{P}\\{ S_t = s&#8217;, R_t = r | S_{t-1} = s,\u00a0 A_{t-1} = a\\} .<br \/>\\end{equation}<\/p><p>The issue is, we are generally interested in the long-term benefit we get from our actions rather than any short-term benefit. We can then consider there to be some <strong>value function<\/strong> \\(v(s)\\) that describes the long-term value that being in a given state provides, which will let us navigate choosing an action over time. This value is generally given as the expected future reward \\(R_{t+k+1}\\) we will receive after we&#8217;ve been in a state \\(S_t=s\\), discounted at each step by some fraction \\( \\gamma \\); this just means that each reward is less and less impactful the further away in time it is. This value, called the <strong>State-value function<\/strong> for a policy \\(\\pi\\) (the rule the agent uses to pick its actions), is given as<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-faac2f5 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"faac2f5\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-484ce25\" data-id=\"484ce25\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1789059 elementor-widget elementor-widget-text-editor\" data-id=\"1789059\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>\\begin{equation}\\label{state-value}<br \/>v_{\\pi}(s) = \\textbf{E}_{\\pi}[\\sum_{k=0}^{\\infty} \\gamma^{k} R_{t+k+1}|S_t=s].<br \/>\\end{equation}<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-ff705ea elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"ff705ea\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-3142707\" data-id=\"3142707\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-7b3cc0e elementor-widget elementor-widget-text-editor\" data-id=\"7b3cc0e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>There is similar motivation to define the value of any action \\(a \\in \\mathcal{A} \\) the agent might take while in state \\(s\\) following a decision policy \\(\\pi\\), hence we can define the value of the state-action pair under policy \\(\\pi\\), \\(q_{\\pi}(s,a)\\), called the <strong>Action-Value function<\/strong>, as<\/p><p>\\begin{equation}\\label{action-value}<br \/>q_{\\pi}(s,a) = \\textbf{E}_{\\pi}[\\sum_{k=0}^{\\infty} \\gamma^{k} R_{t+k+1}|S_t=s, A_t = a].<br \/>\\end{equation}<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section 
elementor-element elementor-element-a32aeb9 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"a32aeb9\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-6de0296\" data-id=\"6de0296\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-e7849f4 elementor-widget-divider--view-line elementor-widget elementor-widget-divider\" data-id=\"e7849f4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"divider.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"elementor-divider\">\n\t\t\t<span class=\"elementor-divider-separator\">\n\t\t\t\t\t\t<\/span>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-9ebed1a elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"9ebed1a\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-540d4ba\" data-id=\"540d4ba\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-8e228b8 elementor-widget elementor-widget-text-editor\" data-id=\"8e228b8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>This brings us to the crux of MDPs. The true objective is to use our agent to find an optimal policy that maximises either \\(v(s)\\) or \\(q(s,a)\\). This is done by considering the Bellman Optimality Equations. This is the objective when solving MDPs, and almost all of traditional Reinforcement Learning is about estimating these value functions. The Bellman Optimality Equations highlight that if the agent is using an optimal policy, the value of a state must equal the expected return of the most valuable action from that state (if you&#8217;re doing the best then you&#8217;re making the best choices). The Bellman equations, which can be derived via some tomfoolery with the recursive nature of the reward equation and the value equations, are given by<\/p><p>\\begin{equation}\\label{bell_v}<br \/>V_b(s) = \max_\pi v(s) = \max_a \sum_{s&#8217;,r} p( s&#8217;, r | s , a ) [r + \gamma V_b(s&#8217;)]<br \/>\\end{equation}<\/p><p>or for the Action-Value function<\/p><p>\\begin{equation}\\label{bell_q}<br \/>q_b(s,a) = \max_\pi q(s,a) = \sum_{s&#8217;,r} p( s&#8217;, r | s , a ) [r + \gamma \max_{a&#8217;} q_b(s&#8217;,a&#8217;)].<br \/>\\end{equation}<\/p><p>As any optimal policy would share both an optimal State-Value function and an Action-Value function, it is only necessary to solve one of the Bellman equations. 
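As a concrete illustration of solving the Bellman optimality equation numerically, here is a small value-iteration sketch in Python. The toy recycling-style MDP it runs on (the states, actions, rewards, transition probabilities, and the discount gamma = 0.9) is entirely invented for illustration; the post itself only discusses solving the equations directly, so this iterative scheme is a standard alternative, not the author's method.

```python
# Value iteration on a tiny, hand-invented MDP: repeatedly apply the
# Bellman optimality backup
#   V(s) <- max_a sum_{s', r} p(s', r | s, a) [r + gamma * V(s')]
# until the values stop changing.
GAMMA = 0.9

# dynamics[state][action] = list of (next_state, reward, probability)
dynamics = {
    "in_hand":  {"carry":   [("in_bin", 1.0, 1.0)],
                 "throw":   [("in_bin", 2.0, 0.6), ("on_floor", -1.0, 0.4)]},
    "on_floor": {"pick_up": [("in_bin", 0.0, 1.0)]},
    "in_bin":   {},  # terminal: no actions, value stays 0
}

def value_iteration(dynamics, gamma=GAMMA, tol=1e-10):
    V = {s: 0.0 for s in dynamics}
    while True:
        delta = 0.0
        for s, actions in dynamics.items():
            if not actions:
                continue  # terminal state
            best = max(
                sum(p * (r + gamma * V[s2]) for s2, r, p in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

V = value_iteration(dynamics)
print(V)  # V["in_hand"] == 1.0: carrying (sure reward 1) beats throwing here
```

With these made-up numbers the backup converges in a couple of sweeps: throwing is worth 0.6(2) + 0.4(-1) = 0.8 in expectation, so the sure reward of 1.0 for carrying wins.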
For finite MDPs, the equations are guaranteed to have a unique solution that can be solved computationally as a set of non-linear equations, one for each state in the state space.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-a32b935 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"a32b935\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-74fb99a\" data-id=\"74fb99a\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-11091a0 elementor-widget-divider--view-line elementor-widget elementor-widget-divider\" data-id=\"11091a0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"divider.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"elementor-divider\">\n\t\t\t<span class=\"elementor-divider-separator\">\n\t\t\t\t\t\t<\/span>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-4bcfba6 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"4bcfba6\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-85fae8e\" data-id=\"85fae8e\" 
data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-fc6622b elementor-widget elementor-widget-text-editor\" data-id=\"fc6622b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>So, while we have complete information on the problem, and while the state space isn&#8217;t <em>too<\/em> big, we can solve a set of non-linear equations to solve the MDP directly. However, as you might expect, this isn&#8217;t always possible. In fact, for interesting problems this isn&#8217;t possible a lot of the time, so there is a need to <em>learn<\/em> some estimate of these value equations&#8230; possibly through experience. See you in the next post, where we&#8217;ll talk about Reinforcement Learning proper! All the best,<\/p><p style=\"text-align: center\">&#8211; Jordan J Hood<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>For starters, what is Reinforcement Learning? When we learn in the real world, we are subconsciously aware of our surroundings&hellip;<\/p>\n","protected":false},"author":29,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"slim_seo":{"title":"Reinforcement Learning: Markov Decision Processes (MDPs) - Jordan J Hood","description":"For starters, what is Reinforcement Learning? 
When we learn in the real world, we are subconsciously aware of our surroundings and how they might respond to us"},"footnotes":""},"categories":[3,5],"tags":[],"class_list":["post-465","post","type-post","status-publish","format-standard","hentry","category-academic","category-mres"],"_links":{"self":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/posts\/465","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/users\/29"}],"replies":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/comments?post=465"}],"version-history":[{"count":7,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/posts\/465\/revisions"}],"predecessor-version":[{"id":473,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/posts\/465\/revisions\/473"}],"wp:attachment":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/media?parent=465"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/categories?post=465"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jordan-j-hood\/wp-json\/wp\/v2\/tags?post=465"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}