Research

My primary research interest for my PhD is the broad area of Markov Decision Processes, or MDPs, which can be used to model complex sequential decision-making problems. When modelling a problem in this way, our goal is to find a “policy”: a function that maps each possible “state” of the system to an “action”, chosen so that operating the system under this policy minimises the long-run average cost (or maximises the long-run average reward). For our application, we use this framework to model the problem of actively maintaining network-based infrastructure, such as pipelines or telecommunications networks. The standard approach to finding an optimal policy for an MDP is Dynamic Programming, which dates back to the 1950s. However, the complexity of our problem leads to the well-known “curse of dimensionality”: the number of possible states the system can be in grows exponentially in the number of components, which rules out Dynamic Programming as a practical approach.
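For reference, the objective we minimise can be written in standard MDP notation (this is generic textbook notation rather than our specific cost model): if s_t denotes the state of the system at time t, π is a policy, and c(s, a) is the cost of taking action a in state s, then we seek

```latex
\min_{\pi} \;\; \lim_{T \to \infty} \frac{1}{T} \, \mathbb{E}\!\left[ \sum_{t=0}^{T-1} c\bigl(s_t, \pi(s_t)\bigr) \right]
```

that is, the policy whose average cost per time period, over an infinitely long horizon, is as small as possible.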

One approach to solving this problem instead focuses on Approximate Dynamic Programming, or ADP, which is known as Reinforcement Learning in other academic communities. This approach can exploit the structure and theoretical understanding of a specific application, allowing progress to be made on an otherwise intractable problem. A key component of this is Value Function Approximation, or VFA, where the value function assigns each state a score reflecting the long-run cost (or reward) of operating the system from that state onwards. Rather than storing an exact value for every state, VFA estimates the values of all states using an approximation architecture that bakes in our understanding of the problem and is tuned via a learning process driven by simulation. VFA is then used alongside ADP algorithms such as the well-known Q-Learning, or lesser-known algorithms such as Approximate Value Iteration.
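To make this concrete, below is a minimal Python sketch of a linear VFA trained by a temporal-difference update in the average-cost setting. The feature map, the one-step simulator, and the placeholder policy are hypothetical stand-ins for illustration only, not our actual architecture or model.

```python
import numpy as np

def phi(state):
    """Hypothetical feature map: encodes a network state as a feature vector
    (here simply the deterioration level of each component)."""
    return np.asarray(state, dtype=float)

def simulate_step(state, action, rng):
    """Hypothetical one-step simulator returning (cost, next_state).
    A real implementation would encode the network's deterioration dynamics."""
    cost = float(np.sum(state)) + 0.1 * float(np.sum(action))
    next_state = np.clip(state + rng.integers(0, 2, size=len(state)) - action, 0, 3)
    return cost, next_state

def td_update(theta, state, cost, next_state, avg_cost, step_size=0.01, eta=0.001):
    """One average-cost temporal-difference update of the VFA weights theta:
    delta = cost - avg_cost + V(next_state) - V(state)."""
    delta = cost - avg_cost + theta @ phi(next_state) - theta @ phi(state)
    theta = theta + step_size * delta * phi(state)
    avg_cost = avg_cost + eta * delta  # running estimate of the long-run average cost
    return theta, avg_cost

# Tiny usage example on a made-up 4-component network.
rng = np.random.default_rng(0)
theta, avg_cost = np.zeros(4), 0.0
state = np.zeros(4, dtype=int)
for _ in range(1000):
    action = (state >= 2).astype(int)          # placeholder policy: repair worn components
    cost, next_state = simulate_step(state, action, rng)
    theta, avg_cost = td_update(theta, state, cost, next_state, avg_cost)
    state = next_state
```

The approximation architecture here is the simplest possible one, a linear combination of features; in practice the interest lies in choosing features (and architectures) that capture what we understand about how network structure drives maintenance costs.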

Our research must compare and contrast the policies learned from different VFA architectures trained with different ADP algorithms, and we hope to settle on one VFA and one algorithm that work well for any network structure. While certain simplifications are made to fit the problem into the MDP framework, we should also ensure that the policies we learn perform well on more accurate simulations of the problem. On top of all this, we could also design heuristic policies that fall outside the realm of ADP but serve as alternative baselines against which the ADP policies can be compared.
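Whichever policies are being compared, the comparison itself can be as simple as estimating each policy's long-run average cost over a long simulated run. The sketch below assumes a hypothetical detailed simulator and hypothetical policies; all names are placeholders.

```python
import numpy as np

def estimate_average_cost(policy, simulator, initial_state, horizon=100_000, seed=0):
    """Monte Carlo estimate of a policy's long-run average cost on a simulator.
    `simulator(state, action, rng)` is assumed to return (cost, next_state)."""
    rng = np.random.default_rng(seed)
    state, total_cost = initial_state, 0.0
    for _ in range(horizon):
        action = policy(state)
        cost, state = simulator(state, action, rng)
        total_cost += cost
    return total_cost / horizon

# Usage (names are illustrative): compare an ADP-trained policy with a heuristic.
# avg_adp = estimate_average_cost(adp_policy, detailed_simulator, s0)
# avg_heuristic = estimate_average_cost(repair_worst_first, detailed_simulator, s0)
```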