{"id":135,"date":"2025-01-27T17:03:40","date_gmt":"2025-01-27T17:03:40","guid":{"rendered":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/?p=135"},"modified":"2025-01-27T22:47:10","modified_gmt":"2025-01-27T22:47:10","slug":"learning-about-q-learning-part-1","status":"publish","type":"post","link":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/2025\/01\/27\/learning-about-q-learning-part-1\/","title":{"rendered":"Learning about Q-Learning (Part 1): Tabular Q-Learning"},"content":{"rendered":"\n<p>Often, whether we are thinking as individuals or as a business, we have to make decisions while taking future events and decision times into account. An example could be whether we decide to spend money today, or save it for tomorrow. This can be modelled by what is known as a Markov Decision Process (MDP). Here, we have a model with a current state, and at certain time periods we make decisions, each of which is associated with a probability of transitioning to another state.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"610\" height=\"580\" src=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-content\/uploads\/sites\/66\/2025\/01\/image.png\" alt=\"\" class=\"wp-image-136\" style=\"width:426px;height:auto\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-content\/uploads\/sites\/66\/2025\/01\/image.png 610w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-content\/uploads\/sites\/66\/2025\/01\/image-300x285.png 300w\" sizes=\"auto, (max-width: 610px) 100vw, 610px\" \/><figcaption class=\"wp-element-caption\">Sample Sketch of a Markov Decision Process. Note that an MDP does not necessarily require a finite end point despite the diagram.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>An example of this could be a shortest-path problem. 
Say we want to get from place A to place B within a road network. Our decision epochs occur at each junction, where we choose, say, to turn left, go straight, or turn right. Each decision may influence future decisions, and so this can be modelled as an MDP.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"868\" height=\"571\" src=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-content\/uploads\/sites\/66\/2025\/01\/image-3.png\" alt=\"\" class=\"wp-image-161\" style=\"width:508px;height:auto\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-content\/uploads\/sites\/66\/2025\/01\/image-3.png 868w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-content\/uploads\/sites\/66\/2025\/01\/image-3-300x197.png 300w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-content\/uploads\/sites\/66\/2025\/01\/image-3-768x505.png 768w\" sizes=\"auto, (max-width: 868px) 100vw, 868px\" \/><figcaption class=\"wp-element-caption\">Sketch of Shortest-Path Problem in a road network (left) and decision epoch at initial state (right).<\/figcaption><\/figure>\n<\/div>\n\n\n<p>One approach to solving an MDP exactly is to solve the Bellman equation, but this is often computationally intractable in higher dimensions and requires full knowledge of the model. This is why we may opt for approximate alternatives which can deal with high dimensions and are model-free. 
A first approach here is the reinforcement learning framework, which models an agent interacting with its environment.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"405\" height=\"259\" src=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-content\/uploads\/sites\/66\/2025\/01\/image-1.png\" alt=\"\" class=\"wp-image-138\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-content\/uploads\/sites\/66\/2025\/01\/image-1.png 405w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-content\/uploads\/sites\/66\/2025\/01\/image-1-300x192.png 300w\" sizes=\"auto, (max-width: 405px) 100vw, 405px\" \/><figcaption class=\"wp-element-caption\">Sketch of the Reinforcement Learning Framework.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>Q-Learning is a common approach to solving MDPs which utilises a form similar to the Bellman equation, defining a value <span class=\"wp-katex-eq\" data-display=\"false\">Q(s,a)<\/span> as a function of a state-action pair. It is defined as the expected reward from taking the action in the current state, plus the expected discounted rewards over future decision periods,<\/p>\n\n\n<span class=\"wp-katex-eq katex-display\" data-display=\"true\"> Q(s,a) = \\mathbb{E}\\left[\\sum_{t=0}^{T} \\gamma^t R(s_t,a_t) \\:|\\: s_0 = s, a_0 = a\\right]. <\/span>\n\n\n\n<p class=\"has-text-align-left\">Our goal is to find the optimal <span class=\"wp-katex-eq\" data-display=\"false\">Q_*(s,a)<\/span>, which maximises the above function, together with the actions that attain this maximum. 
This <span class=\"wp-katex-eq\" data-display=\"false\">Q<\/span> function can be rewritten in the recursive form,<\/p>\n\n\n<span class=\"wp-katex-eq katex-display\" data-display=\"true\"> Q_*(s,a) = \\mathbb{E}\\left[R(s,a) + \\gamma\\max_{a&#039;}Q_*(s&#039;,a&#039;)\\right].<\/span>\n\n\n\n<p>This equation can be interpreted as follows: for a given state-action pair, its value is the expected reward obtained in the current period, plus the discounted value of all future events, given that the optimal action is taken thereafter. Note that <span class=\"wp-katex-eq\" data-display=\"false\">s&#039;<\/span> and <span class=\"wp-katex-eq\" data-display=\"false\">a&#039;<\/span> denote the state and action corresponding to the following period.<\/p>\n\n\n\n<p>To learn the <span class=\"wp-katex-eq\" data-display=\"false\">Q<\/span> function values, we typically run simulations and adjust our estimates of <span class=\"wp-katex-eq\" data-display=\"false\">Q<\/span> accordingly. In Q-Learning, as opposed to exact dynamic programming solutions, we focus our exploration on states which are more commonly reached. But how do we start learning? How do we select our actions so that we can explore different options whilst still collecting high rewards? This problem is commonly referred to as the exploration-exploitation trade-off. I will explain the <span class=\"wp-katex-eq\" data-display=\"false\">\\epsilon<\/span>-greedy heuristic approach to this problem, but there are a multitude of other methods in the literature which could be explored in a later blog.<\/p>\n\n\n\n<p>In an <span class=\"wp-katex-eq\" data-display=\"false\">\\epsilon<\/span>-greedy framework, we select actions based on the <span class=\"wp-katex-eq\" data-display=\"false\">Q(s,a)<\/span> values for the current state <span class=\"wp-katex-eq\" data-display=\"false\">s<\/span>. 
The idea is simple: we explore (pick an action at random) with probability <span class=\"wp-katex-eq\" data-display=\"false\">\\epsilon<\/span>, and exploit (pick the best option) with probability <span class=\"wp-katex-eq\" data-display=\"false\">1-\\epsilon<\/span>.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"400\" src=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-content\/uploads\/sites\/66\/2025\/01\/image-5-1024x400.png\" alt=\"\" class=\"wp-image-180\" style=\"width:694px;height:auto\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-content\/uploads\/sites\/66\/2025\/01\/image-5-1024x400.png 1024w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-content\/uploads\/sites\/66\/2025\/01\/image-5-300x117.png 300w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-content\/uploads\/sites\/66\/2025\/01\/image-5-768x300.png 768w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-content\/uploads\/sites\/66\/2025\/01\/image-5.png 1178w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Illustration of <span class=\"wp-katex-eq\" data-display=\"false\">\\epsilon<\/span>-greedy selection. Exploit (left) is selected with probability <span class=\"wp-katex-eq\" data-display=\"false\">1-\\epsilon<\/span> and explore (right) is selected with probability <span class=\"wp-katex-eq\" data-display=\"false\">\\epsilon<\/span>.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>The chosen value for <span class=\"wp-katex-eq\" data-display=\"false\">\\epsilon \\in (0,1]<\/span> is subjective. One can choose <span class=\"wp-katex-eq\" data-display=\"false\">\\epsilon<\/span> as a constant, or, as is often recommended, use a form of decay function, so that we explore less and less as we go along. 
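As a concrete aside, ε-greedy selection is only a few lines of code. Below is a minimal Python sketch; the Q-table, state, and action names are hypothetical illustrations, not taken from this post.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Select an action: explore uniformly at random with probability
    epsilon, otherwise exploit the highest-valued action for `state`."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit

# Hypothetical Q-table for a single state "s0" with two actions.
Q = {("s0", "left"): 1.0, ("s0", "right"): 0.5}
action = epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.1)
```

A decay schedule can be as simple as `epsilon = max(0.01, 0.99 ** episode)`, so exploration tapers off as learning progresses.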
Note, however, that theory requires each state-action pair to be visited infinitely often (in expectation) for Q-learning to converge to the optimal policy.<\/p>\n\n\n\n<p>The most basic form of Q-learning is known as tabular Q-learning. Similar to the above illustration, we store each <span class=\"wp-katex-eq\" data-display=\"false\">Q<\/span> value in a table indexed by state and action. The table is typically initialised with zeroes. Using temporal-difference updates, we adjust the <span class=\"wp-katex-eq\" data-display=\"false\">Q<\/span> function at each iteration based on the observed reward of the current action plus the discounted value of the best action in the next state, compared against the current <span class=\"wp-katex-eq\" data-display=\"false\">Q<\/span> value. Define the <span class=\"wp-katex-eq\" data-display=\"false\">Q<\/span> update as<\/p>\n\n\n<span class=\"wp-katex-eq katex-display\" data-display=\"true\"> Q(s,a) \\leftarrow Q(s,a) + \\alpha\\left[R(s,a) + \\gamma\\max_{a&#039;} Q(s&#039;,a&#039;) - Q(s,a)\\right] <\/span>\n\n\n\n<p>Here, <span class=\"wp-katex-eq\" data-display=\"false\">\\alpha<\/span> is the learning rate, a tuneable parameter. We should be careful not to set it too high; a constant value is sufficient in this case, but one may also choose a decay function, as we may want to learn less over time. For convergence, the standard conditions are <span class=\"wp-katex-eq\" data-display=\"false\">\\sum_{i=0}^\\infty \\alpha_i = \\infty<\/span> and <span class=\"wp-katex-eq\" data-display=\"false\">\\sum_{i=0}^\\infty \\alpha_i^2 &lt; \\infty<\/span>.<\/p>\n\n\n\n<p>And that is it, a simple introduction to tabular Q-learning. The only issue is that there are many practical challenges that come with using Q-learning. First of all, what if there are many state-action pairs? 
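To make the table concrete, here is a hedged Python sketch of the tabular update on a toy two-state problem; the environment, rewards, and parameter values are illustrative assumptions, not taken from this post. Note that the dictionary holds one entry per state-action pair, which is exactly what becomes costly at scale.

```python
import random

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One temporal-difference step:
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error

# Toy problem: from state "s0", "go" earns reward 1 and moves to "s1",
# while "stay" earns 0 and remains in "s0".
actions = ["stay", "go"]
Q = {}  # entries default to zero: one cell per (state, action) pair
for episode in range(500):
    a = random.choice(actions)  # pure exploration, for brevity
    r, s_next = (1.0, "s1") if a == "go" else (0.0, "s0")
    q_update(Q, "s0", a, r, s_next, actions)
# Q[("s0", "go")] climbs towards 1, so "go" becomes the greedy choice.
```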
Often the state may have multiple dimensions, or we may have a continuous set of actions, so memory will play a big part. We will talk more about potential extensions in a second part of Learning About Q-Learning.<\/p>\n\n\n\n<p><strong>Recommended Further Reading Material<\/strong>:<br>Bertsekas, D. P. (2007). Dynamic Programming and Optimal Control, Vol. I &amp; II. Athena Scientific, Belmont, MA, 3rd edition.<br>Murphy, K. (2024). Reinforcement learning: An overview. arXiv preprint arXiv:2412.05265.<br>Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, 2nd edition.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Often times whether we are thinking about individuals or as a business, we may have to make decisions while considering&hellip;<\/p>\n","protected":false},"author":85,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[4,5,3],"class_list":["post-135","post","type-post","status-publish","format-standard","hentry","category-uncategorised","tag-dynamic-programming","tag-q-learning","tag-reinforcement-learning"],"_links":{"self":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\/posts\/135","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\/users\/85"}],"replies":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\/comments?post=135"}],"version-history":[{"count":34,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\
/v2\/posts\/135\/revisions"}],"predecessor-version":[{"id":188,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\/posts\/135\/revisions\/188"}],"wp:attachment":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\/media?parent=135"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\/categories?post=135"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\/tags?post=135"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}