Below you can find details of the summer 2018 interns, including a description of each intern's research project.
Lancaster University, BSc Mathematics
Supervisor: Sarah Oscroft
Estimation of Diffusivity in the Ocean
Diffusivity plays an important role in many real world problems, such as recovering missing objects lost at sea or predicting how an oil spill will spread. Specifically, it measures the rate at which particles spread out over time, for instance organisms or sediments transported through water. We can estimate diffusivity using satellite-tracked drifting instruments known as drifters. However, the ocean is highly unpredictable – two particles that start at the same location at the same time can end up following completely different paths to very different locations. This requires a statistical approach for the estimation of diffusivity.
Current techniques for estimating diffusivity provide inconsistent results, so we aim to improve these techniques through statistical research. My project compares some of these different methods and uses them to estimate diffusivity for a part of the ocean using real data collected by the Global Drifter Program. This project applies time series techniques, with a particular focus on spectral analysis. I have used MATLAB to compare different estimators using both simulated and real data before plotting my results.
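As a rough illustration of what estimating diffusivity from simulated data can look like, the sketch below is written in Python rather than MATLAB and is not code from the project. It assumes the simplest possible setting: particles following a one-dimensional Brownian motion, with diffusivity kappa defined through Var(displacement) = 2 * kappa * t. The function names and parameter values are hypothetical, and the estimator shown is the basic variance-based one, not the spectral estimators studied in the project.

```python
import random


def simulate_displacements(kappa, dt, n_steps, n_particles, rng):
    """Final displacement of each particle after n_steps of a 1-D random walk."""
    # Assumption: Brownian increments, each with variance 2 * kappa * dt.
    step_sd = (2.0 * kappa * dt) ** 0.5
    return [
        sum(rng.gauss(0.0, step_sd) for _ in range(n_steps))
        for _ in range(n_particles)
    ]


def estimate_diffusivity(displacements, elapsed_time):
    """Estimate kappa from the spread of the particles, using Var = 2 * kappa * t."""
    n = len(displacements)
    mean = sum(displacements) / n
    variance = sum((x - mean) ** 2 for x in displacements) / (n - 1)
    return variance / (2.0 * elapsed_time)


rng = random.Random(1)  # fixed seed so the run is reproducible
true_kappa, dt, n_steps = 0.5, 0.1, 50
final = simulate_displacements(true_kappa, dt, n_steps, n_particles=2000, rng=rng)
kappa_hat = estimate_diffusivity(final, elapsed_time=dt * n_steps)
print(f"true kappa = {true_kappa}, estimate = {kappa_hat:.3f}")
```

Real drifter trajectories are correlated in time and affected by currents, which is one reason more careful estimators are needed than this toy example suggests.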
Lancaster University, BSc Mathematics
Supervisor: Zak Varty
Investigating models for potential self-excitation
This project explores models for data in which the points occur randomly in space and time. The aim with this type of data is to model the locations of the data points, or events, in addition to any information, or marks, associated with each occurrence. This can be achieved through point process models. The simplest example is the homogeneous Poisson process. In the homogeneous Poisson process model, events occur independently at random with uniform intensity.
The first aim of the project is to look at methods for assessing whether a data set satisfies the assumptions of the homogeneous Poisson process model, and to fit the model where they are satisfied. The next aim is to study complex data sets where these assumptions no longer hold, then to use different models which have fewer or weaker assumptions and subsequently assess any improvement in the model fit.
During the project there is a choice of two data sets. The first is about armed conflicts across the globe. The second is about earthquakes above magnitude 1.5 in the Netherlands, where the events are induced by gas extraction from the reservoir below the region.
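To make the homogeneous Poisson process concrete, here is a minimal Python sketch, not taken from the project. It simulates event times on a time interval only (no spatial locations or marks), using the standard fact that waiting times between events are independent exponential variables with the same rate. The function name, rate and time horizon are illustrative assumptions.

```python
import random


def simulate_homogeneous_poisson(rate, horizon, rng):
    """Return the event times in [0, horizon) of a homogeneous Poisson process."""
    times = []
    t = rng.expovariate(rate)  # waiting time until the first event
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)  # independent exponential gap to the next event
    return times


rng = random.Random(2018)  # fixed seed so the run is reproducible
events = simulate_homogeneous_poisson(rate=2.0, horizon=1000.0, rng=rng)

# Under the model, the expected number of events is rate * horizon
# and the mean gap between consecutive events is 1 / rate.
gaps = [later - earlier for earlier, later in zip(events, events[1:])]
mean_gap = sum(gaps) / len(gaps)
print(f"{len(events)} events, mean gap {mean_gap:.3f}")
```

Comparing summaries such as the event count and the mean gap with their theoretical values is one simple way to start checking whether a data set is consistent with the model.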
London School of Economics, BSc Statistics with Finance
Supervisor: Hankui Peng
Clustering On Web-Scraped Data
The Office for National Statistics (ONS) are currently experimenting with new data sources to improve the representativeness of the Consumer Price Index (CPI), which is the official indicator of the inflation and deflation rates for the country. Web-scraped data are considered a promising data source that comes in huge volume and can be scraped easily and at high frequency. Therefore, if web-scraped data could be incorporated into the index-generating procedure, price indices could be generated more effectively and at higher frequency.
However, web-scraped data do not always come in a way that can be immediately used for price index generation. The category labels for web-scraped prices usually follow the website categorisation that the data are scraped from, which does not necessarily match the categorisation that is used for the national price index generation. Also, some product information (product name, price, etc.) might be incorrectly scraped, due to the quality of the web-scrapers.
Clustering methods are a useful tool for tackling the aforementioned challenges that come with web-scraped data. The problems that we are interested in include both recognising the main clusters of products in the web-scraped data and identifying the incorrectly scraped products. In this project, we will start by exploring the fundamental clustering methods that exist in the literature (k-means and spectral clustering methods, in particular). At a later stage, we will apply these techniques to a web-scraped dataset. Clustering performance will be evaluated to compare the existing methods, and further extensions to the existing techniques will be explored.
The University of Cambridge, BA Natural Sciences
Supervisor: Euan Mcgonigle
Investigating Trend in the Locally Stationary Wavelet Model
Outside of neat theoretical settings, time series are most commonly non-stationary. In fields from finance to biomedical statistics, time series with constant mean and/or autocovariance rarely occur.
Wavelets are a class of oscillatory functions which are well localised in both time and frequency, allowing wavelet-based transforms to capture information in a time series by examining it over a range of time scales. One prominent method for doing so with non-stationary time series is the locally stationary wavelet (LSW) model of Nason et al. (2000). Time series in the LSW model are assumed to be zero-mean. In practice this is rarely the case. Our aim is to explore the behaviour of the model when this assumption is weakened by investigating the effect of different trends on the LSW estimate of the wavelet spectrum.
We also plan to examine the treatment of boundary effects that appear in the wavelet coefficients of data near the end points of the time series. The time series are usually assumed to be periodic; however, this too is a poor assumption in most non-zero mean cases. Our project will attempt to analyse the boundary effects caused by a trend and implement methods to reduce them.
Newcastle University, BSc Mathematics and Statistics
Supervisor: Sean Ryan
Detecting Changes through Transformations
Changepoint detection relates to the problem of locating abrupt changes in data when the properties of a given time series have changed. This can be extended into finding whether or not a changepoint has actually occurred and if there are multiple changepoints. This area of statistics is hugely important and has many real world applications such as medical condition monitoring and financial fluctuation detection.
The most studied method for detecting changepoints looks at changes in mean within a time series. This is a popular approach because other kinds of change can often be detected by transforming the data and then analysing changes in the mean of the transformed data. Other methods which may prove more accurate at detecting changepoints include looking at changes in variance.
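The transformation idea can be sketched in Python as follows. This is an illustrative example and not one of the algorithms built in the project. It assumes a single changepoint, uses a standard CUSUM statistic to locate a change in mean, and detects a change in variance by applying that same mean detector to the squared data. The function name and the simulated series are hypothetical.

```python
import random


def cusum_changepoint(xs):
    """Estimate the location of a single change in mean.

    Returns k such that xs[:k] and xs[k:] are the two estimated segments.
    """
    n = len(xs)
    total = sum(xs)
    best_k, best_stat = 1, -1.0
    running = 0.0
    for k in range(1, n):
        running += xs[k - 1]
        # Scaled gap between the partial sum and its value if the mean never changed.
        stat = abs(running - k * total / n) / (k * (n - k) / n) ** 0.5
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k


rng = random.Random(7)  # fixed seed so the run is reproducible
# Simulated series: the mean stays at zero, but the standard deviation
# jumps from 1 to 3 after the first 1000 observations.
series = [rng.gauss(0.0, 1.0) for _ in range(1000)]
series += [rng.gauss(0.0, 3.0) for _ in range(1000)]

# Squaring turns the change in variance into a change in mean.
squared = [x * x for x in series]
k_hat = cusum_changepoint(squared)
print(f"estimated changepoint at index {k_hat} (true value 1000)")
```

Applying the detector to the raw series instead of the squared one would be much less informative here, since the mean of the raw data does not change.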
My project aims to analyse various methods of identifying changepoints, whilst studying the advantages and limitations of each approach. This involves the construction and evaluation of numerous algorithms which are used to detect changepoints.
Lancaster University, MSci Natural Sciences
Supervisor: Georgia Souli
Optimisation Problems with Fixed Charges Associated with Subsets
Optimisation problems appear in a wide range of applications from investment banking to manufacturing. They involve finding the values of a number of decision variables (for example, the amount of different products that should be manufactured) to maximise (or minimise) a particular objective function (for example, profit), subject to a number of constraints. In many situations, the value of one or more of the decision variables must be an integer to give a feasible solution. Such problems are called Mixed Integer Programs (MIPs).
The particular focus of my project is cutting planes. These are inequalities which are satisfied by all the feasible solutions to the MIP but not by all of the solutions that would be feasible if we ignored the integer constraints. The aim is to investigate different cutting planes in problems where we have fixed charges associated with subsets. In these problems, we have a set of continuous variables whose sum is bounded. We also have subsets of variables defined such that, if any variable in that subset takes a positive value, then a fixed charge is incurred. For example, the variables may represent the amounts of various items to be manufactured and the fixed charges would be start-up costs associated with machines involved in the production of subsets of these items. Cutting planes can be used to remove infeasible solutions to the MIP to focus in on the feasible region and hence the optimal solution to the problem.
Warwick, BSc Mathematics
Supervisor: Alexander Fisch
Modelling the behaviour of Kepler light curve data with the aim of exoplanet detection
Many exoplanets are detected via the so-called transit method. This involves measuring the luminosity of a certain star at regular time intervals to obtain graphs known as light curves. A regular, short, sharp dip in luminosity could be caused by an exoplanet passing in front of the star. This sounds simple in theory, but in reality there is a lot of random noise, and the signal induced by planetary transits is very weak (even a planet the size of Jupiter reduces the luminosity of the Sun by only about 1% during a transit).
In order to remove some of the noise caused by phenomena such as sunspots, NASA preprocesses its data to produce a so-called whitened light curve. However, the current method introduces complications and affects the signature of the transits, which makes the detection of planets from the whitened data much harder.
My project will be focussed on modelling the data in such a way as to not distort the transit signals. So far I have been using R to remove dominant sine waves from the data and will go on to investigate periodicity and autocorrelation within the data.
Lancaster University, MSci Mathematics and Statistics
Supervisor: Stephen Ford
Allocation of a limited number of assets
Having just completed my third year at Lancaster University and considering doing a PhD, I found the STOR-i internship a great way to gain an insight into PhD life. The project I have been assigned concerns assigning limited assets to a dynamical system. The problem that arises is that if we choose to deploy an asset in the present, it can’t be used later, but it may be more rewarding to use it in the future. We wish to deploy the assets so that the reward gained is optimal. To do this, we use dynamic programming, which starts from the end and works backwards to the start, optimising in stages. This doesn’t always yield an optimal solution, but under certain assumptions on the system it will. The task at hand is to find the optimal policy, where a policy is a mathematical rule for deciding which action to take in the present given the current state of the system.
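The backward-working idea can be illustrated with a small Python sketch. The model here is a hypothetical toy problem and not the system studied in the project: in each period a reward is drawn from a known discrete distribution, and we either deploy one asset to collect it or hold the asset for later. The function name, the reward values and the probabilities are all assumptions made for illustration.

```python
def solve_asset_allocation(horizon, n_assets, rewards, probs):
    """Backward induction for a toy 'deploy now or hold' problem.

    value[t][n] is the expected total reward from period t onwards with n
    assets left, before the reward on offer in period t has been seen.
    threshold[t][n] is the smallest reward worth deploying an asset for.
    """
    value = [[0.0] * (n_assets + 1) for _ in range(horizon + 1)]
    threshold = [[None] * (n_assets + 1) for _ in range(horizon)]
    for t in range(horizon - 1, -1, -1):  # start at the end, work backwards
        for n in range(1, n_assets + 1):
            keep = value[t + 1][n]     # future value if the asset is held
            use = value[t + 1][n - 1]  # future value after deploying one asset
            threshold[t][n] = keep - use
            # Take the better of the two actions for each possible reward.
            value[t][n] = sum(
                p * max(r + use, keep) for r, p in zip(rewards, probs)
            )
    return value, threshold


# One asset, three periods, reward of 1 or 2 with equal probability.
value, threshold = solve_asset_allocation(
    horizon=3, n_assets=1, rewards=[1.0, 2.0], probs=[0.5, 0.5]
)
print(value[0][1])      # expected reward under the optimal policy
print(threshold[0][1])  # deploy in the first period only if the reward is at least this
```

The resulting policy is a threshold rule: in state (t, n), deploy an asset if the reward on offer is at least threshold[t][n]. In this toy model the threshold falls as the end of the horizon approaches, because an unused asset is worth nothing once time runs out.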
University of Bath, MMath Mathematics with Industrial Placement
Supervisor: Edwin Reynolds
Heuristics for Real-time Railway Rescheduling
In railway networks, a single delayed train can delay other trains by getting in their way. This is called reactionary delay and is responsible for over half of all railway delays in the UK. Railway controllers therefore have to make decisions in real time that minimise the amount of reactionary delay. Such decisions include ‘should I cancel a train, and if so which one?’ and ‘which train should leave the station first if they can only go one at a time?’ There currently exist algorithms that can find the optimal solution to these problems. However, the amount of computational time required to run them, especially on a large network, makes solving these problems in real time infeasible. An alternative approach is to use a heuristic, which solves the problem with a lower degree of accuracy but produces an answer in much less time. My project involves developing multiple heuristics, comparing their advantages and limitations and deciding on a final idea.
The University of Manchester, BSc Mathematics
Supervisor: Henry Moss
Preventing overfitting in Natural Language Processing
Natural Language Processing (NLP) allows computers to understand human speech and writing. The standard approach in NLP is to fit the model in a way that avoids relying on features over-represented in the sample (such reliance is known as overfitting). There are two methods for doing so: regularization and term-frequency weighting. There is no clear consensus on which method is best. The project’s aim is to investigate the relationship between these two approaches, alongside tests across a range of NLP tasks.