In the previous blog, we talked briefly about tabular Q-learning; however, this method can be prone to noise in the reward realisations. In this blog, we briefly cover two extensions to Q-learning, and how these ideas can be further extended to more complex settings.
One way to mitigate the overestimation bias caused by noisy rewards is a method known as double Q-learning. This method is analogous to tabular Q-learning, except that we maintain two tables of Q values instead of one.

Beyond storing two Q tables, the key difference from tabular Q-learning is in the update rule: at each step we randomly pick one table to update, selecting the greedy ("optimal") future action with that table but evaluating it with the other. The updates can be defined by,
Q_1(s,a) \leftarrow Q_1(s,a) + \alpha \left[ R(s,a) + \gamma\, Q_2\left(s', \arg\max_{a'} Q_1(s',a')\right) - Q_1(s,a) \right],

Q_2(s,a) \leftarrow Q_2(s,a) + \alpha \left[ R(s,a) + \gamma\, Q_1\left(s', \arg\max_{a'} Q_2(s',a')\right) - Q_2(s,a) \right].

Double Q-learning has roughly the same per-step computational cost as tabular Q-learning, but requires twice the memory, so it scales even less well. However, the idea has been fundamental to later extensions of Q-learning that aim to deal with large state and/or action spaces.
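To make the update rule concrete, below is a minimal sketch of tabular double Q-learning in Python. The environment interface (`env.reset()` returning a state, `env.step(action)` returning `(next_state, reward, done)`), the table sizes, and the hyperparameter values are assumptions for illustration, not part of any particular library.

```python
import numpy as np

def double_q_learning(env, n_states, n_actions, episodes=500,
                      alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Sketch of tabular double Q-learning (Van Hasselt, 2010).

    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done).
    """
    rng = np.random.default_rng(seed)
    Q1 = np.zeros((n_states, n_actions))
    Q2 = np.zeros((n_states, n_actions))

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behaviour policy based on both tables.
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q1[s] + Q2[s]))

            s_next, r, done = env.step(a)

            # Randomly choose which table to update.
            if rng.random() < 0.5:
                # Select the greedy next action with Q1, evaluate it with Q2.
                a_star = int(np.argmax(Q1[s_next]))
                target = r + (0.0 if done else gamma * Q2[s_next, a_star])
                Q1[s, a] += alpha * (target - Q1[s, a])
            else:
                # Select the greedy next action with Q2, evaluate it with Q1.
                a_star = int(np.argmax(Q2[s_next]))
                target = r + (0.0 if done else gamma * Q1[s_next, a_star])
                Q2[s, a] += alpha * (target - Q2[s, a])

            s = s_next

    return Q1, Q2
```

The decoupling of action selection (one table) from action evaluation (the other table) is what reduces the overestimation bias that the single-table max operator introduces.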
References:
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, 2nd edition.
Van Hasselt, H. (2010). Double Q-learning. Advances in Neural Information Processing Systems, 23.
