PhD: Ranking Systems in Sport

Listen to a recent talk, presented at the Oct 9, 2020 meeting of the Sports Analytics Lab at Harvard University/American Statistical Association Section on Statistics in Sports seminar series.

A brief clip from the Royal Statistical Society's audio podcast series, where I speak to Brian Tarran at the 2019 RSS conference in Belfast.

Lancaster University Sports Statistics Motivation


Pick your favourite sport. Chances are, you have an opinion on who the best in the world is, at this current moment, or of all time, or who would win if A played B. But is it possible to develop a system which returns an objective answer to these questions?

In developing such systems, it is crucial to capture as much information as possible about the dynamic world in which we live. Understand it. Learn from it. Predict it. Athlete’s injuries, the weather, and even economic factors all impact the outcome of these events and the implied ability of the athletes or teams. This project requires a wide range of strategies in order to capture these signals, from graph theory to extreme value theory, and contextual information from news websites, so that the most accurate system of ranking sports teams or athletes is formulated.

Ranking systems in sport are not only interesting to the inquisitive fan, but a fair and accurate system is at the core of all sports organisational bodies and the multi-billion pound industries that they represent.

But these systems are not exclusive to sports.

Methodological advances in the field of sports ranking systems have far-reaching consequences. Ranking systems are used to rank webpages, or to rank schools and hospitals, or even to determine the most essential medical treatments. So, a ranking system based on poor methodology can have much more severe repercussions than incorrectly seeding a tennis tournament… Ultimately, the importance of ranking systems is self-evident, and sport creates a fruitful playground in which ample advancements can be made.


Abstract: The International Swimming Federation (FINA) uses a very simple points system with the aim to rank swimmers across all Olympic events. The points acquired is a function of the ratio of the recorded time and the current world record for that event. With some world records considered better than others however, bias is introduced between events, with some being much harder to attain points where the world record is hard to beat. A model based on extreme value theory will be introduced, where swim-times are modelled through their rate of occurrence, and with the distribution of the best times following a generalised Pareto distribution. Within this framework, the strength of a particular swim is judged based on its position compared to the whole distribution of swim-times, rather than just the world record. This model also accounts for the date of the swim, as training methods improve over the years, as well as changes in technology, such as full body suits. The parameters of the generalised Pareto distribution, for each of the 34 individual Olympic events, will be shown to vary with a covariate, leading to a novel single unified description of swim quality over all events and time. This structure, which allows information to be shared across all strokes, distances, and genders, improves the predictive power as well as the model robustness compared to equivalent independent models. A by-product of the model is that it is possible to estimate other features of interest, such as the ultimate possible time, the distribution of new world records for any event, and to correct swim times for the effect of full body suits. The methods will be illustrated using a dataset of the best 500 swim-times for each event in the period 2001-2018.

Previous mini-projects

Exploring Solutions to the Stochastic Multi-Armed Bandit Problem

This report begins with an introduction to machine learning and its various classifications, and the crossover between machine learning and statistical learning in multi-armed bandit problems. A literature review of solutions to multi-armed bandit problems are presented: the ε-greedy algorithm, Thompson sampling algorithms, and Upper-Confidence-Bound (UCB) algorithms. A simulation is then formulated and coded in R and tested on a 10-armed Bernoulli bandit to compare these methods along with an algorithm introduced in this report, the greedy-η(t) algorithm. It is shown that the greedy-η(t) algorithm outperforms the standard ε-greedy algorithm after approximately a 10th of the horizon time, and will always outperform the ε-greedy algorithm asymptotically, independent of parameter selection. In a macro test of the code, it is shown that the greedy-η algorithm is more robust and its performance is less dependent on prior information. The simulation is based in a Bayesian setting with a Beta posterior distribution. View the R code for this simulation. Finally a comparison of these methods is presented based on theoretical guarantees on regret bounds, the simulation in this report, and other simulations in the literature.

View article.

Sequential Monte Carlo Methods in State Space Models.

This report formulates the mathematics behind hidden Markov models, before introducing and simulating a Kalman filter. Later, approximate techniques are presented and applied to a hidden Markov model, with a simulation of one particular example, the Bootstrap filter. Throughout this report it is assumed that the parameters in the models are known, however in practice this is not the case.

View article.

Markov Chains Applied to Time Dependent Queuing Systems.

Continuous Time Markov chains are applied to an M(t)/M/S queue using a transition rate matrix and the expected number in the queue E(n(t)) is tracked over time. This is tested on two different arrival rate functions and the experimental findings agree with the theoretical values. The upper 10 % tail of the distribution of the number in the queue varies in a similar way to the expected number E(n(t)) in the queue. For 3 servers, this peaks at 97 in our example. We conclude that the number of servers S, has more effect on E(n(t)) than on the upper 10 % tail of the distribution. Discrete Time Markov Decision Chains are applied to and M(t)/M/1 queue and backwards recursion is used to find a closed form for the optimal policy.

View article.

The Existence, Nature, and Effect of Dark Energy.

The cosmic age problem is approached by comparing observations of globular clusters with the theoretical age of the Universe derived from the Friedmann equation, assuming a cosmological constant and a Universe which is flat and matter dominated. My data concludes that there is indeed a need for dark energy. The possibility of a cyclic Universe is considered, and is shown how this can solve many current problems in the Standard Cosmological Model such as the flatness problem, the coincidence problem, the horizon problem, and the dark energy problem.

View article.

Further topics

For a light read into a broad range of topics in statistics and operational research, visit my blog.


Harry Spearing Royal Statistical Society Belfast 2019

At 24 I am now embarking on the final year of my PhD, reading statistics and operational research, via the The STOR-i programme at Lancaster Univesity.

I am currently supervised by Professor Jonathan Tawn, at the department of Mathematics and Statistics at Lancaster Univesrity. My research concentrates on statistical modelling of sports, in particular ranking and prediction. More generally, I am interested in using a broad range of mathematical and statistical techniques to help understand areas such as climatology, oceanography, medicine and sport.

Hobbies and Interests

I have played the violin and drums from an early age and enjoy playing in various ensembles, be that orchestral, jazz, or rock. Check out the below clip of my band, Vedanna, which was recorded during Manchester's battle of the bands!

I'm passionate about anything outdoors, and love exploring via cycling, walking or wild swimming, whatever the weather!