{"id":169,"date":"2025-01-30T12:00:00","date_gmt":"2025-01-30T12:00:00","guid":{"rendered":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/?p=169"},"modified":"2025-01-27T18:26:54","modified_gmt":"2025-01-27T18:26:54","slug":"learning-about-q-learning-part-2-double-q-learning","status":"publish","type":"post","link":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/2025\/01\/30\/learning-about-q-learning-part-2-double-q-learning\/","title":{"rendered":"Learning about Q-Learning (Part 2): Double Q-Learning"},"content":{"rendered":"\n<p>In the previous blog, we briefly discussed tabular Q-learning; however, this method can be sensitive to noise in the reward realisations. In this blog, we cover double Q-learning, an extension of Q-learning that addresses this, and see how its core idea carries over to more complex settings.<\/p>\n\n\n\n<p>One way we can hedge against the overestimation bias caused by noise is to use a method known as <strong>double Q-learning<\/strong>. This method is analogous to tabular Q-learning, except that we maintain two tables of <span class=\"wp-katex-eq\" data-display=\"false\">Q<\/span> values instead of one. <\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"434\" height=\"214\" src=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-content\/uploads\/sites\/66\/2025\/01\/image-4.png\" alt=\"\" class=\"wp-image-172\" style=\"width:456px;height:auto\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-content\/uploads\/sites\/66\/2025\/01\/image-4.png 434w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-content\/uploads\/sites\/66\/2025\/01\/image-4-300x148.png 300w\" sizes=\"auto, (max-width: 434px) 100vw, 434px\" \/><figcaption class=\"wp-element-caption\">Image credit to Sutton and Barto (2018). 
In Q-learning, the noise in the rewards generated from B causes the greedy maximum to overestimate the value of B, so the agent initially favours the action leading to B over the one terminating from A; this is what is known as maximisation bias. Using double Q-learning reduces the risk of this happening.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>The difference from tabular Q-learning, beyond storing two tables of <span class=\"wp-katex-eq\" data-display=\"false\">Q<\/span> values, lies in the update rule: at each step we randomly choose one <span class=\"wp-katex-eq\" data-display=\"false\">Q<\/span> table to update, selecting the &#8220;optimal&#8221; future action with that table but evaluating it with the other. The updates are defined by,<\/p>\n\n\n<span class=\"wp-katex-eq katex-display\" data-display=\"true\">Q_1(s,a) \\leftarrow Q_1(s,a) + \\alpha \\left[R(s,a) + \\gamma Q_2\\left(s&#039;, \\arg\\max_{a&#039;} Q_1(s&#039;,a&#039;)\\right) - Q_1(s,a) \\right], <\/span>\n\n\n<span class=\"wp-katex-eq katex-display\" data-display=\"true\">Q_2(s,a) \\leftarrow Q_2(s,a) + \\alpha \\left[R(s,a) + \\gamma Q_1\\left(s&#039;, \\arg\\max_{a&#039;} Q_2(s&#039;,a&#039;)\\right) - Q_2(s,a) \\right]. <\/span>\n\n\n\n<p>This method has the same computational cost per step as tabular Q-learning but requires double the memory, so it scales even less well. However, the underlying idea of decoupling action selection from action evaluation has been fundamental to later extensions of Q-learning designed to handle large state and\/or action spaces.<\/p>\n\n\n\n<p><strong>References:<\/strong><br>Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, 2nd edition.<br>Van Hasselt, H. (2010). Double Q-learning. 
Advances in Neural Information Processing Systems, 23.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the previous blog, we briefly discussed tabular Q-learning; however, this method can be sensitive to noise in the reward&hellip;<\/p>\n","protected":false},"author":85,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[4,5,3],"class_list":["post-169","post","type-post","status-publish","format-standard","hentry","category-uncategorised","tag-dynamic-programming","tag-q-learning","tag-reinforcement-learning"],"_links":{"self":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\/posts\/169","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\/users\/85"}],"replies":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\/comments?post=169"}],"version-history":[{"count":6,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\/posts\/169\/revisions"}],"predecessor-version":[{"id":178,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\/posts\/169\/revisions\/178"}],"wp:attachment":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\/media?parent=169"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\/categories?post=169"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/wp-json\/wp\/v2\
/tags?post=169"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}