13-1-2016

Welcome to my blog! Feel free to contact me for any comments and/or corrections. These are greatly appreciated.

Last Thursday and Friday, the 7th and 8th of January, I attended the STOR-i annual conference. It was held at the LICA building, at the University of Lancaster. I had been in the building before for registration in October, but this time I had a better look. The entrance opens into a big hall with wooden padded walls and a ceiling about 5m tall with an opaque window at the top. Here we got to mingle with the speakers and eat croissants, just outside the Imagination Laboratory.

The talks were given in one third of the main action room. They were divided into manageable time frames of an hour and a half with two speakers each, and were gratefully followed by food and refreshments.

**Day 1**

Our first speaker was Pitu Mirchandani*, who gave a really interesting talk on the use of OR to model routing for electric vehicles. Electric vehicles require this (at the moment) as they don't have as many recharging stations as their rivals. This means that electric car drivers may have to detour from the shortest route to their destination to recharge. A model is therefore needed to suggest the "best" route to the electric car user.

A specific type of recharging station was preferred: battery replacement stations. These require very little stopping time, as the current battery can be swapped for a full one rather quickly. The other option, perhaps not yet considered to its full potential, is the "fast" recharging station, which typically takes thirty minutes and (partially) recharges the battery for free. This could be useful if the drivers are planning a long trip and wish to stop for a coffee break.

For the individual user, a model/system can be implemented in a GPS, for example, to plan a route that takes into account the mileage the current battery allows and reserves a battery at the necessary recharging stations. It also aims to reduce detouring and to minimise anxiety levels, which were defined to increase as the battery power is used up. The system uses a modified Dijkstra's algorithm to find the "best" path.
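To make the idea concrete, here is a small sketch of a battery-aware shortest-path search. This is my own illustration, not Dr Mirchandani's actual model: it runs Dijkstra's algorithm over (node, remaining range) states, resetting the range at battery replacement stations, and ignores reservations and anxiety levels.

```python
import heapq

def ev_route(graph, stations, start, goal, max_range):
    """Shortest feasible route for an electric car: Dijkstra over
    (node, remaining range) states. Hypothetical sketch, not the model
    from the talk. `stations` are battery replacement points, where the
    remaining range resets to max_range."""
    pq = [(0, start, max_range)]  # (distance so far, node, charge left)
    best = {}
    while pq:
        d, node, charge = heapq.heappop(pq)
        if node == goal:
            return d
        if best.get((node, charge), float("inf")) <= d:
            continue
        best[(node, charge)] = d
        for nxt, w in graph.get(node, []):
            if w <= charge:  # enough battery to traverse this edge
                c = max_range if nxt in stations else charge - w
                heapq.heappush(pq, (d + w, nxt, c))
    return None  # destination unreachable with this battery range

# toy road network: edge weights are distances in battery-range units
roads = {'A': [('B', 4), ('C', 3)], 'B': [('D', 4)], 'C': [('D', 6)]}
```

With a replacement station at B, the car can take the longer A-B-D route and swap batteries on the way; with no stations, no route is feasible on a range of 5.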

There exists, however, a separate problem for the battery replacement stations: how many batteries should be in stock and charged at any given time? They need to take into account that if a customer takes a full battery, the one that has been dropped off requires some time to charge. With an estimate of the number of customers and their arrival times, this can be approximated using dynamic programming, which simulates the drivers and future delays. Actual testing in Arizona showed that following these predictions gives considerably fewer delays.

For the infrastructure investors, where is it best to build a recharging station? And, of what type?

The first question was partially answered by Dr. Mirchandani; I say partially since he is currently making further progress on it. A very basic outline: if we have one electric car user and know the origin (O), destination (D) and (current) maximum vehicle battery life, we can calculate the optimal location for our recharging station. It can be shown to lie in some interval around the centre of the journey, which will also minimise the maximum anxiety level. As the number of drivers increases, there are effectively more OD trips to take into account, which makes the problem much more difficult.

The type of station could perhaps depend on the distance between densely populated areas: more battery replacement stations in the densely populated areas, and recharging stations, with a wider spread, in less populated areas. Say our origin is a big city (where we have conveniently changed our battery before departure) and our destination is a town 230 miles away. With such a long trip it is advisable to take at least one break and, moreover, the battery probably won't last the whole journey. Placing the "fast" recharging stations on most main roads in a 140 mile radius outside the city with no reasonably close towns (essentially "in the middle of nowhere") is therefore likely to be quite successful, as many drivers will prefer having their battery charged for free while they relax to paying for a replacement battery.

*"They do pretty much anything chemical, it's what I found" — Lisa Turner (about Lubrizol).*

*School of Computing, Informatics and Decision Systems Engineering, ASU.

10-2-2016

It's been three weeks since our Masterclass week with Prof. Barry L. Nelson and I find myself doing some research on Simulation Optimisation. Particularly on improving its performance using parallel computing.

Some of the problems Prof. Nelson described can be solved in Simio (simulation software), which conveniently has an option to run processes in parallel. In the simulation case this means that multiple model simulations are able to run at the same time for different servers/workers/customers. Before I try to explain how parallel computing works, I will clarify what these simulation models involve. First, there is some source of randomness $U$, drawn iid from some distribution, usually $U(0,1)$. We can then generate input for our simulation using a specific algorithm: let $X$ be the input random variables with known distributions. After processing, these become the output random variables $Y$, whose properties the simulation aims to estimate.
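As a minimal sketch of this pipeline (my own toy example, not one of Prof. Nelson's models), uniforms $U$ can be turned into exponential interarrival times $X$ by inverse-transform sampling, with the output $Y$ being the resulting arrival times:

```python
import math
import random

def simulate_arrivals(n, rate, seed=0):
    """Illustrative U -> X -> Y pipeline: uniforms U become exponential
    interarrival times X via inverse-transform sampling; the output Y
    collects the resulting arrival times."""
    rng = random.Random(seed)
    U = [rng.random() for _ in range(n)]        # source of randomness U
    X = [-math.log(1.0 - u) / rate for u in U]  # input random variables X
    Y, t = [], 0.0
    for x in X:                                 # outputs Y of the model
        t += x
        Y.append(t)
    return Y

arrivals = simulate_arrivals(5, rate=2.0)
```

Each replication of such a model is independent given its random number stream, which is exactly why many replications can be farmed out to processors in parallel.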

$$U \longrightarrow X \longrightarrow Y$$

Why would we want to use parallel computing, apart from the obvious reduction in computation time (which is already a good enough reason to use it!)?

**1.** Geographical distribution. It allows people from different parts of the world to create a virtual world they can all explore.

**2.** Integrating simulators. It also enables different simulators to be linked together to create a new virtual environment.

**3.** Fault tolerance. If more than one processor is running a particular simulation, and one of the processors fails, another one can be allocated to that same process in order to keep the simulation online.

How does it actually work?

The hardware platforms which can sustain parallel computing are distributed computers (normal ones) connected as networked workstations and, interestingly enough, parallel computers. The processors of the latter are usually physically close to each other, typically in a cabinet in a big room, so they can be accessed remotely.

The virtual environment usually allocates the processors on a first-come, first-served basis: if no memory is available, a task must wait in a queue.

22-2-2016

Last Friday we successfully presented a solution to a vehicle transportation problem brought to us by three members of the West Yorkshire Police Transport Division. The problem was the allocation of different types of vehicles to police stations throughout the five regions of West Yorkshire.

Here follows an outline of the problem.

Types of vehicles:

**1.** Marked cars,

**2.** Plain cars,

**3.** Vans,

**4.** Motorcycles,

**5.** Other (LGV, 4x4, MPV,...).

Some of these vehicles are not allowed to service certain types of demand. For example, if a priority call is received, an officer is required to attend as quickly as possible. This may require breaking traffic laws, and the only vehicle allowed to do so is a marked car; hence at least one such vehicle must be available at that time. In reality, a plain car could attend if it is considerably closer to the location, as the aim is to protect and avoid potentially lethal situations.

Regions:

**BD -** Bradford,

**CD -** Calderdale,

**KD -** Kirklees,

**LD -** Leeds,

**WD -** Wakefield.

Regions have different requirements: population, the concentration of urban or rural areas, surface area, etc., lead to varying crime rates and service demand. Police stations around the five districts have different numbers of vehicles, which at the moment does not seem optimal, as officers do not always have the required vehicle. As was pointed out, this is not always due to a lack of vehicles but because officers take preference over certain vehicles and/or misuse them.

Due to the time restrictions we decided to focus on marked cars, as these are the most constraining, and to predict the number required to serve all priority/emergency calls. To achieve this we developed a model around the constraint: $$P\left(\text{ E or P } \cap \text{ no Marked Car available}\right) \leq \varepsilon.$$

This means that the probability of receiving an Emergency or Priority call and not having a marked car available to service it is smaller than a fixed small $\varepsilon>0$. This ensures that with very high probability, there will be a marked car available.

With this in mind, we thought about different ways of modelling. We came up (mainly Lucy's contribution) with a neat idea of formulating the problem as a $M/G/\infty$ queue. Assuming a fixed arrival rate $\lambda$, which I shall discuss shortly, of Emergency/Priority calls we could use marked cars as the equivalent of servers without worrying about their location. A marked car (server) becomes free (idle) as soon as they have dealt with the call, whether this means going back to the police station to fill in some paper work or carrying on patrol after securing the perimeter.

In a pessimistic manner, we determined the arrival rate $\lambda$ using a cut-off point of $10\%$ of the maximum number of these types of calls in a three month period. This may lead to an overestimation of the required number of cars, but since we could not move vehicles around regions or between police stations within the same region, it seemed like a reasonable assumption, as people's lives are potentially at stake. More sophisticated methods from Extreme Value Theory could have been applied here.

One more parameter remains unknown: the service time $\mu$. It turned out to be fairly similar across these types of calls, and since we had accurate data on when a patrol arrived at and left the scene, we could easily estimate it.

The required number is the minimum number of marked cars $S$ such that an $M/G/S$ model behaves like an $M/G/\infty$ queue. For a given $\varepsilon$, say $0.01$ to satisfy $99\%$ of the demand, we could easily determine the required number of marked cars.
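As an illustrative sketch of this final step (my own, not our exact implementation): in an $M/G/\infty$ queue the number of busy servers is Poisson distributed with mean equal to the offered load $\lambda \times$ mean service time, so the smallest $S$ with $P(N \geq S)\leq\varepsilon$ can be found by accumulating the Poisson cdf:

```python
from math import exp, factorial

def required_cars(lam, mean_service, eps):
    """Smallest S with P(N >= S) <= eps, where N ~ Poisson(offered load)
    is the number of busy servers in an M/G/infinity queue."""
    rho = lam * mean_service  # offered load = arrival rate x mean service time
    cdf, k = 0.0, 0
    while True:
        cdf += exp(-rho) * rho ** k / factorial(k)  # add P(N = k)
        if 1.0 - cdf <= eps:  # tail P(N >= k + 1) is now small enough
            return k + 1
        k += 1

# e.g. 2 calls per hour, 1 hour average service, 1% shortfall probability
cars = required_cars(2.0, 1.0, 0.01)
```

The arrival rate and mean service time here are made-up numbers purely for illustration.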

29-2-2016

In this blog post, I introduce a topic on which we had a talk by Dr Marc Goerigk. The topic, as the title suggests, is interesting in both its mathematics and its applications. The technical aspect involves Mixed Integer Linear Programming changing dynamically across time, for whose formulation we use concepts from Graph Theory. There are methods for continuous time, but we will stick to discrete time. The application is to evacuate people from a certain location; this may be a school, a skyscraper, a stadium or even a whole city.

There are two main types of modelling in emergency evacuation: microscopic and macroscopic. Microscopic models take every person individually and attempt to simulate irrational and collective behaviour. Macroscopic modelling focuses more on tactical decisions and treats the crowd as a homogeneous group, not considering individual decisions. For the purpose of this blog post, we shall focus on the latter. The main advantage of the methods to be discussed is that they are widely applicable to multiple scenarios, due to their generic formulation in the form of a graph. For a given building or road map, we can very easily obtain a graph with the vertices being the intersection points (between corridors, roads, doors) or sources of evacuees; we denote these by $i$. The edges $(i,j)$ are the corridors, roads or possible routes connecting vertices towards the nearest evacuation point or sink $t$.

The notation in Fig. 1 is: $\lambda_{ij}$ is the time taken for one person to traverse edge $(i,j)$ and $c_{ij}$ is the maximum number of people this particular route section can take, or the capacity of the edge. We could easily apply a minimum cost flow algorithm to find the quickest path, where the cost in this case is the total evacuation time; however, it doesn't account for time, so we create a dynamic graph. A dynamic graph is a static graph repeated several times, once for each time frame, until we reach the maximum time $T$. This requires relabelling of vertices for each time frame.

Let $G=(V,E)$ be a digraph with set of vertices $V$ and set of edges $E$. Each edge $(i,j)\in E$ has travel time $\lambda_{ij}$. The time expansion of $G$ over a time horizon $T$ defines the dynamic graph $G_T=(V_T,E_T)$ where $$V_T = \{i(t)~|~i\in V;~ t=0,1,\dots,T\}$$ and $E_T = E_M \cup E_H$ with $$E_M = \{(i(t),j(t+\lambda_{ij}))~|~(i,j)\in E;~t+\lambda_{ij}\leq T;~t=0,1,\dots,T \}$$ and $$E_H = \{(i(t),i(t+1))~|~i\in V;~ t=0,1,\dots,T-1 \}.$$ The components of $E_T$ can be seen as movement edges $E_M$, which connect vertices among the different time frames using the appropriate time step ($\lambda_{ij}$), and holdover edges $E_H$, which connect copies of the same vertex over consecutive time frames. Further on, the latter will represent people who prefer, or have no other option than, to stay where they are for the next time step $(t+1)$.
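The construction above is purely mechanical, so it is easy to sketch in code (node names and data structures here are my own choices):

```python
def time_expand(V, E, lam, T):
    """Build the time-expanded digraph G_T of a static digraph G.
    V: vertices; E: list of (i, j) edges; lam: dict (i, j) -> integer
    travel time; T: time horizon. Vertices of G_T are pairs (i, t)."""
    VT = [(i, t) for i in V for t in range(T + 1)]
    # movement edges E_M: start traversing (i, j) at t, arrive at t + lam
    EM = [((i, t), (j, t + lam[i, j]))
          for (i, j) in E for t in range(T + 1) if t + lam[i, j] <= T]
    # holdover edges E_H: stay at the same vertex one more time step
    EH = [((i, t), (i, t + 1)) for i in V for t in range(T)]
    return VT, EM + EH

# one edge (1, 2) with travel time 2, horizon T = 3
VT, ET = time_expand([1, 2], [(1, 2)], {(1, 2): 2}, T=3)
```

For this tiny graph there are $2\times 4 = 8$ time-stamped vertices, two copies of the movement edge (starting at $t=0$ and $t=1$) and six holdover edges.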

To consolidate the ideas and introduce the example we shall see next time, here is how it looks.

This shows three different paths that come from applying the Maximum Dynamic Flow Algorithm (MDF). Note that unnecessary edges have been removed.

Hamacher, H. W. and Tjandra, S. A. (2001). *Mathematical Modelling of Evacuation Problems: A State of the Art*. Berichte des Fraunhofer ITWM, Report (24).

6-3-2016

Following from the previous blog post, we will see how we can formulate and then solve a simple evacuation example using the MDF algorithm.

Let $S$ and $D$ denote the sets of source and sink nodes of the initial graph $G$, and let $f_{ij}(t)$ be the number of evacuees leaving vertex $i$ at time $t$ and arriving at node $j$ at time $t+\lambda_{ij}$, i.e. the flow. Also let $q_i$ denote the initial number of people at node $i$. To include the people staying at the same node, as discussed earlier, we introduce a new time-dependent variable $y_i(t)$, defined simply as $$y_i(t+1) = f_{i(t),i(t+1)}.$$ We can now formulate the problem as follows

\begin{alignat}{3} \text{min}&~~~~~~~ \sum_{t=0}^T\sum_{i\in D} tf_{id}(t)~~~~~~~~~&\ \\ \text{s.t.}&~~~~~~~ \sum_{t=0}^T\sum_{i\in D} f_{id}(t) = \sum_{j\in S}q_j&\\ &~~~~~~~ y_i(t+1) - y_i(t)~~= &\sum_{k\in \text{in}(i)}f_{ki}(t - \lambda_{ki})~~-\sum_{j\in\text{out}(i)}f_{ij}(t)\\ &~~~~~~~ f_{si}(0) = q_i &\forall~i\in S\\ &~~~~~~~y_i(0) = 0 &\forall~i\in V\\ &~~~~~~~y_i(t) = 0 &t=0,\dots,T;~\forall~i\in D\\ &~~~~~~~0\leq y_i(t)\leq c_{ii} & t=1,\dots,T;~\forall~i\in V\setminus D\\ &~~~~~~~0\leq f_{ij}(t) \leq c_{ij} &t=0,\dots,T-\lambda_{ij};~~\forall~(i,j)\in E\end{alignat}

where $\text{in}(i)=\{j~|~(j,i)\in E\}$ and $\text{out}(i) = \{j~|~(i,j)\in E\}$.

The objective function minimises the average evacuation time per person (excluding the constant factor). The sum runs over the flows from the set of sinks into the super sink $d$.

Eq. (2) ensures that everyone leaves the danger zone.

Eq. (3) serves as a sanity check for the flow, usually called the dynamic flow conservation constraint: people cannot disappear, and no more people can leave a vertex than have arrived there or were already there. It holds for $t=0,\dots,T$ and $i\in V$.

Eq. (4), states that the initial flow from the super source $s$ to the graph sources $i\in S$ has to be equal to the initial number of people there.

Eq. (5), due to the way the $y_i$ are defined: there cannot be a flow at negative time.

Eq. (6), people cannot stay at a sink as they have already been evacuated.

Eq. (7), no more people can stay at a certain vertex than the maximum amount of people that vertex can hold (denoted as $c_{ii}$).

Eq. (8), standard flow condition, no more flow can go through an edge than the maximum that edge can hold.

This formulation assumes that the number of people in the building is known, which is not always the case. MDF takes this into account and maximises the number of people brought to safety within a given time period. This simplifies the problem: Eq. (1) does not need the $t$ multiplying the flow and we change the objective to maximisation, while Eq. (2) and Eq. (4) cannot be included as they involve information about the number of evacuees. The MDF algorithm then gives us an effective way of solving the remaining linear program.
Before we look at the example, the concepts of **chains** and **chain flows** need to be introduced. A chain is a sequence of vertices $P = \{i_1,i_2,\dots,i_k\}$, $2\leq k \leq n$, such that $(i_j,i_{j+1})\in E$ and $i_j\neq i_{j'}$ for $j,~j' = 1,\dots,k-1$. A chain flow $\gamma = \langle|P|,P\rangle$ is a static flow of value $|P|$ along the chain $P$, where
$$|P| = \min_{(i,j)\in E(P)} c_{ij}$$ and $E(P)$ denotes the set of edges that the chain traverses.

**Example**

In the example, $V = \{1,2,3,4,5,6\}$, $E = \{(1,2),(1,3),(2,4),(2,6), (3,6), (4,3), (5,2)\}$, $S = \{1\}$ (source vertex) and $D=\{6\}$ (sink vertex).

**Step 1** of the algorithm is to apply the minimum cost flow algorithm to obtain an optimal solution, a flow of value 7 shown in green on the right of Fig. 2. In this case we don't obtain a minimum cost but a minimum evacuation time.

**Step 2** requires splitting the minimum cost flow into chain flows which, as defined above, cannot go over the same vertex twice. The decomposition into chain flows is as follows:

$\gamma_1 = \langle|P_1|,P_1\rangle$, $P_1 = \{1,2,6\}$ with $|P_1|=1$ and $\lambda(\gamma_1) = 4$ (lower bound for the time taken to evacuate 1 person along chain $P_1$).

$\gamma_2= \langle|P_2|,P_2\rangle$, $P_2 = \{1,2,4,3,6\}$ with $|P_2|=5$ and $\lambda(\gamma_2) = 7$.

$\gamma_3= \langle|P_3|,P_3\rangle$, $P_3 = \{1,3,6\}$ with $|P_3|=1$ and $\lambda(\gamma_3) = 4$.

**Step 3** of the algorithm is to repeat every chain flow $\gamma_i$ for $t=0,\dots,T-\lambda(\gamma_i)$.
Chain flows can be seen in Fig. 2 coloured as:

$\gamma_1$ is red, repeated for $t = 0,1,2,3$.

$\gamma_2$ is yellow, repeated for $t = 0$.

$\gamma_3$ is blue, repeated for $t = 0,1,2,3$.

Note how the edges traverse across the time frames in $t+\lambda_{ij}$ steps. Summing the flows into the sink copies gives a total of $13$. That means that given the current graph and time horizon, we can evacuate at most $13$ people.
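Since each chain flow $\gamma_i$ is repeated for $t=0,\dots,T-\lambda(\gamma_i)$, the total number evacuated is $\sum_i |P_i|\,(T-\lambda(\gamma_i)+1)$. A quick check of the example's arithmetic (with time horizon $T=7$, consistent with the repetition counts above) might look like:

```python
def total_evacuated(chain_flows, T):
    """chain_flows: list of (value, time) pairs (|P_i|, lambda(gamma_i)).
    Each chain flow is repeated once per start time t = 0, ..., T - time."""
    return sum(value * (T - time + 1) for value, time in chain_flows if time <= T)

# gamma_1, gamma_2, gamma_3 from the example: values 1, 5, 1 and times 4, 7, 4
total = total_evacuated([(1, 4), (5, 7), (1, 4)], T=7)  # 4 + 5 + 4 = 13
```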

13-3-2016

Before this post, the time taken to traverse an edge was constant, not taking into account the rest of the evacuees also using that same route. A way of incorporating this into the model is by creating a time-dependent $\lambda_{ij}(t)$.

For $t=1,\dots, T$ define \begin{alignat}{2} \lambda_{ij}(t) = \left\{ \begin{array}{lr} 0 & ~~~~~\text{if }c_{ij}(t-1) = 0\\ g_{ij}(f_{ij}(t)) & \text{otherwise} \end{array} \right. \end{alignat} with initial $\lambda_{ij}(0) = \lambda_{ij}$ as defined previously, $c_{ij}(t+1)=c_{ij}(t)-r_{ij}(t)c_{ij}(t)$, and $g_{ij}$ a function that behaves like the one in Fig. 4 below (this corresponds to the edge (3,5) in the example in the previous post, where $\lambda_{ij}(0) = 2$ and $c_{ij} = 5$). Such a function comes from experiments on the relationship between the density of people and walking speed (Zhang, 2013). Initially, when the flow is only one person, the time taken should be the initial one. As time goes on, the flow along that edge might increase, making the time to traverse increase. If the flow is larger than the capacity of the edge, the edge cannot be used, so the function $g_{ij}$ returns 0 seconds. Due to the different $\lambda_{ij}(0)$ and $c_{ij}$, we require $nm(T+1)$ function evaluations in total, corresponding to each node in each time frame. This can be reduced considerably by applying the MDF algorithm first.
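The piecewise behaviour is easy to sketch. The linear congestion form below is purely my own assumption (Zhang's $g_{ij}$ is fitted to pedestrian-density experiments); it returns the initial time for a flow of one person, grows with the flow, and returns 0 for an unusable edge:

```python
def travel_time(lam0, cap, flow, cap_prev):
    """Sketch of the piecewise traversal time lambda_ij(t).
    lam0: initial travel time; cap: edge capacity; flow: current flow on
    the edge; cap_prev: the edge's capacity at the previous time step.
    The linear congestion function is an illustrative stand-in for g_ij."""
    if cap_prev == 0:   # edge damaged/destroyed: unusable
        return 0.0
    if flow > cap:      # over capacity: g_ij returns 0 seconds
        return 0.0
    return lam0 * (1.0 + (flow - 1) / cap)  # grows with congestion

# edge (3, 5) from the example: lam0 = 2, capacity 5
```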

Furthermore, the recurrence relation for the edge capacity uses a risk probability $r_{ij}(t)$ for a given time $t$. If the edge has been damaged or destroyed, the time to go along it drops to $0$ and stays that way. However, it is not clear in the literature how to explicitly obtain this matrix, so we restrict the edge capacity to be either its initial value or 0 if necessary.

Even though this model is clearly more realistic than the previous ones, it is more computationally expensive and perhaps even impractical: since $g_{ij}$ is nonlinear, our dynamic flow conservation constraint is also nonlinear, which makes the problem considerably harder to solve. There are several methods for approximating travelling times, and another option could be simply to linearise $g_{ij}$; see Hamacher H. W. et al. (2001), pp. 23-26 (reference in the first post of the series).

Zhang, B. (2013). *Application of Mathematical Model of Evacuation for Large Stadium Building.* Research Journal of Applied Sciences, Engineering and Technology 5(4): pp. 1432-1440.

3-4-2016

The dramatic rise in population over the last 65 years has led to a natural need for more resources and living space. The burning of fossil fuels and release of greenhouse gases, as well as the reduction of flora, is causing a rise in global temperatures. This is provoking the polar ice sheets to melt, sea levels to rise and, ultimately, a disruption in weather patterns. As a result, more extreme weather events are being experienced all over the globe: floods, heat waves, typhoons, hurricanes, tsunamis, droughts, etc. These extreme weather events seem to be increasing in magnitude as well as frequency and, considering their great human and economic costs, there is an interest in developing methods to predict them.

In this series of blog posts we discuss several methods to achieve this, including a quick tutorial on how to use them in R.

The use of scientific methods for large-scale weather forecasting has existed since the beginning of the 20th century and bloomed in the mid 50's with access to programmable computers. But it wasn't until the 60's, with the launch of the first (successful) weather satellite, when meteorologists combined the data provided by satellites and by weather stations (local-scale), that predictions became significantly better. Nowadays, we have the same situation; data can be divided into two categories: large-scale and local-scale data. Both are collected in the form of a time series at the different locations, and the two sets may contain different numbers of variables simply due to the apparatus used for collection.

Large-scale data, from a Global Climate Model (GCM) for example, typically have a resolution of between 1 and 5 degrees in latitude or longitude, and each grid cell constitutes a single data point. The smallest GCM grids are bigger than 100 by 100 km; examples of different types of GCM grids are shown in Figure 1. Local-scale data, or Regional Climate Model (RCM) data, are recorded by weather stations and usually subdivide the GCM grids into cells no more than 5 or 10 km wide. Downscaling deals with the question of how we can best use the coarse GCM data (predictor) over a large area, in combination with RCM data, to make predictions of extreme events at the local scale (predictand). There are several methods to achieve this; four of them will be covered over the course of the next few posts.

First we wish to find significant relationships between the variables within each of the two data sets, and then between the variables across the data sets. Multivariate Extreme Value Theory techniques are suitable to quantify these relationships between covariates. The methods all have a common element: they attempt to estimate the distribution at the global scale to capture the effect of climate change, and then, under certain assumptions, proceed with the predictions of interest.

There is no consensus on which is the "best" method. This depends greatly on the type of data, for example some methods are more appropriate when seasonality or unlabelled data is present. As for evaluating the effectiveness of these, methods such as Multiple Linear Regression (MLR) and Quantile Regression (QR) are used as a baseline comparison since these don't generally fit the tails very well (Abraham et al., 2011).

Most of us already know quite a bit about MLR, but to establish the notation let's just state the objective function with the mean square of the residuals and its corresponding solution. It solves a linear regression model of the form $y=x^T \beta + \varepsilon$, where $x$ is a $d$-dimensional vector of predictor variables and $y$ is the corresponding response. The whole data set is a collection of $n$ such pairs $(x,y)$. The term $\varepsilon\sim\mathcal{N}(0,\sigma^2)$ accounts for the random error, and $\beta$ is the $d$-dimensional parameter vector to be estimated. MLR minimises the sum of squared residuals, represented by the following objective function
$$\min_{\boldsymbol\beta} (y -\mathbf{X} \boldsymbol\beta)^T(y -\mathbf{X} \boldsymbol\beta)$$
where $\mathbf{X}$ is the $n\times d$ matrix of covariates, $y$ is the $n$-vector of responses and $\boldsymbol\beta$ is the vector of parameters to be estimated.

The solution is given as
$$\hat{\boldsymbol\beta} = (\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^T y.$$
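As a quick numerical check of this closed-form solution (a toy example of my own, with noise-free data so the fit is exact):

```python
import numpy as np

def ols_fit(X, y):
    """Least squares estimate, equivalent to (X'X)^{-1} X'y when X has
    full column rank; lstsq is used for numerical stability."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# design matrix with an intercept column; responses follow y = 1 + 2x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
beta = ols_fit(X, y)  # recovers (1, 2)
```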

In order to relax our assumptions and focus more on the outliers, avoiding a central tendency around the mean, we introduce Quantile Regression (QR). QR estimates the $\tau$-th conditional quantile of the response given the covariates by minimising an asymmetric weighting of the sum of the residuals. Ordinary regression also minimises a sum of (squared) residuals but uses equal weights. To obtain this unequal weighting, we use a loss function dependent on the $\tau$-th quantile, $\rho_\tau$. The $\tau$-th quantile ($\tau\in(0,1)$) for a random variable $Y$ with CDF $F$ is given by $$Q_Y(\tau) = F_Y^{-1}(\tau) = \text{inf}\{y~|~F_Y(y)\geq \tau\}.$$ The regression quantile $\hat{\beta}$ and the loss function are combined in the objective function \begin{equation}\label{eq:qr} \hat{\beta}(\tau) = \operatorname*{arg\,min}_{\beta}\sum_{i=1}^n \rho_\tau(y_i - x_i^T \beta) \end{equation} where $\{y_1,\dots,y_n\}$ is a random sample of the response variable $Y$, $x_i\in\mathbb{R}^d$ is the covariate vector corresponding to the $i$-th observation and the loss function $\rho_\tau$ is defined as $$\rho_\tau(u) = \left\{ \begin{array}{lr} \tau u & ~~~~~~~~~~\text{if }u > 0\\ (\tau - 1)u & \text{otherwise} \end{array}. \right.$$
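To see the loss function in action, here is a small sketch (my own toy data): minimising the summed $\rho_\tau$ loss over a grid of candidate values recovers the $\tau$-th sample quantile, which for $\tau=0.5$ is the median, even in the presence of a large outlier.

```python
import numpy as np

def pinball(u, tau):
    """The loss rho_tau: tau*u for u > 0, (tau - 1)*u otherwise."""
    return np.where(u > 0, tau * u, (tau - 1) * u)

# minimising E[rho_tau(Y - q)] over q yields the tau-th quantile of Y;
# for tau = 0.5 the minimiser is the median (3.0), not the mean (22.0)
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
grid = np.linspace(0.0, 100.0, 10001)
losses = np.array([pinball(y - q, 0.5).sum() for q in grid])
best = grid[np.argmin(losses)]
```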

The main algorithms to solve this linear programming problem are: Reversible Jump Markov chain Monte Carlo (RJMCMC) (Yu, 2002), simplex, interior point (Karmarkar, 1984), and finite smoothing (Chen, 2007). The last three are compared in Chen and Wei (2005) with specific attention to their computational efficiency, exploring modern computation techniques for improving the performance of the algorithms using parallelisation and sparse computing. The algorithms have their own advantages; a particularly interesting one for the finite smoothing algorithm is that, due to its simple formulation and fast computational speed, it can be applied easily to more complex problems. The three can also be combined to create a more stable and fast adaptive algorithm, as suggested by Chen (2004).

To apply and assess this method in R, we require three packages: ismev, quantreg and splines. The first includes extreme value distributions and very useful tools related to these, the second is specific to QR, and the third deals with splitting the data into splines.

Once we have obtained and pre-processed the data, we can fit a quantile regression model quite easily. In this example we use data on temperatures from different stations in Greenland (GISS, see references below). Picking a suitable station, such as Jan Mayen (70.9N, 8.7W), the pre-processed data range from 1924 to 2015 and include the mean temperature per year. To fit a QR model to this dataset, we can use the following piece of code:

```r
plot(years.jan_omit, jan.mean_omit, cex = .75, xlab = "years", ylab = "Temperature (C)", main = "Mean Temperatures in Jan Mayen (Greenland)")
X <- model.matrix(jan.mean_omit ~ bs(years.jan_omit, df = 30))
fit <- rq(jan.mean_omit ~ bs(years.jan_omit, df = 30), tau = tau, method = "fn")
jan.mean.fit <- X %*% fit$coef
lines(years.jan_omit, jan.mean.fit, col = i)
```

This code plots a quantile regression line that varies with the tau given in line 3. Line 2 generates a design matrix for a regression model with the specified input parameters; it uses the function bs, which splits the data into piecewise polynomials, here with 30 degrees of freedom. Line 3 uses the rq function, which performs quantile regression using a Frisch-Newton interior point method. Lines 4 and 5 fit the regression and plot it; the colour index i is included so we can run a for-loop to plot the QR fit for different quantiles in different colours. For the three values $\tau = 0.25,~0.5,~0.75$ the plot looks as in Figure 2, and the fit is summarised by the RMSE: $0.833,~ 0.681$ and $0.922$ respectively for each $\tau$.

We can also study the distribution at the extremes, for which we need to select a suitable threshold. For this, we can use the ismev package and two of its functions: mrl.plot and gpd.fitrange. The first produces a mean residual life plot, which we can inspect to see which threshold is most appropriate; this is usually the threshold above which the mean residual life is linear, that is, where the line appears to be most straight. Having a range, we can then use the second function to see which threshold in that range gives the best GPD fit.

Finally, we can repeat a procedure similar to the one outlined in the piece of code above, but using a histogram plot and focusing only on the extreme events, namely those greater than the previously picked threshold. This will look something like Figure 3.

Even though QR performs better than MLR for extremes, it still tends to contain bias particularly for the frequency and magnitude of extreme events; this can be seen in Figure 3. With an RMSE of $1.328$ there is still plenty of room for improvement. One way to do this is by including a smoothing term to the current objective function to reduce this bias.

Abraham, Z., Xin, F., and Tan, P. T. (2011). *Smoothed quantile regression for statistical downscaling of extreme events in climate modelling.* Proceedings of the October 2011 CIDU, Mountain View, California, USA.

Chen, C. (2004). *An adaptive algorithm for quantile regression*. Chapter in: Theory and Applications of Recent Robust Methods, Springer, pp. 39-48.

Chen, C. and Wei, Y. (2005). *Computational issues for quantile regression*. The Indian Journal of Statistics, Volume 67, Part 2, pp. 399-417.

Chen, C. (2007). *A finite smoothing algorithm for quantile regression*. Journal of Computational and Graphical Statistics, Vol. 16, pp. 136-164.

NASA. GISS Surface Temperature Analysis. Available at: Station List Search.

14-4-2016

A more sophisticated approach to QR is introduced in Abraham et al. (2011), where a smoothness function $w_{ij}$ is introduced that acts as a measure of similarity between two predictions. Smoothed Quantile Regression (LSQR), as its name suggests, is a refined version of QR that achieves a better fit. It is obtained by taking the optimisation problem defined for the QR case and simply including an extra smoothing term which depends on the smoothing function $w_{ij}$.

This smoothness can be expressed as the constraint $$\sum_i^n\sum_{j}^n w_{ij}(f_i - f_j)^2 < c$$ where $f_i$ is the predicted value of the response variable for observation $i$ and $c$ is a constant that determines how much smoothness we wish to impose. This constraint clusters data points into classes of certain similarity, as opposed to solely splitting the data into extreme and non-extreme.

The function $w_{ij}$ quantifies this measure of similarity using the radial basis function, defined as $$w_{ij} = \exp \left(-\frac{||x_i - x_j||^2}{\sigma}\right)$$ for $i,j = 1,2,\dots, n$, where $x_i$ and $x_j$ are the predictor vectors for data points $i$ and $j$. The scale parameter $\sigma$ controls the maximum distance at which data points are considered dependent.

Using the predictor variables and the radial basis function, we can easily generate an $n\times n$ symmetric matrix $W$, called the similarity matrix. The number of comparisons to be made can be reduced considerably by lowering $\sigma$, which makes most similarities negligible; this can be done as a preprocessing routine in order to avoid unnecessary calculations for $W$. Letting $D$ be the $n\times n$ diagonal matrix with $d_{ii} = \sum_j w_{ij}$, we can formulate the optimisation problem as follows \begin{equation} \hat{\beta}(\tau) = \operatorname*{arg\,min}_{\beta}\sum_{i=1}^n \rho_\tau(y_i - x_i^T \beta) + \lambda\beta^T{\boldsymbol\Sigma}\beta \end{equation} where ${\boldsymbol\Sigma} = \mathbf{X}^T(D-W)\mathbf{X}$. The user-defined parameter $\lambda$ is a trade-off between the two terms in the equation. As $\lambda$ increases, the second term penalises differences between predicted values more harshly; in fact, as $\lambda\to\infty$, $\beta\to0$ for all covariates except the intercept, which converges quite nicely to the $\tau$-th quantile of the response variable (Theorem 1, Abraham et al., 2011).
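As a minimal sketch (in Python with NumPy; the function names are my own), the similarity matrix $W$, the diagonal matrix $D$ and the penalty matrix ${\boldsymbol\Sigma} = \mathbf{X}^T(D-W)\mathbf{X}$ can be built as follows:

```python
import numpy as np

def similarity_matrix(X, sigma=1.0, tol=1e-8):
    """RBF similarity matrix with w_ij = exp(-||x_i - x_j||^2 / sigma).

    Entries below `tol` are zeroed, mimicking the preprocessing step
    that discards negligible similarities.
    """
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dists / sigma)
    W[W < tol] = 0.0
    return W

def smoothing_penalty(X, sigma=1.0):
    """Penalty matrix Sigma = X^T (D - W) X used in the smoothed objective."""
    W = similarity_matrix(X, sigma)
    D = np.diag(W.sum(axis=1))  # d_ii = sum_j w_ij
    return X.T @ (D - W) @ X
```

Since $D - W$ is a graph Laplacian it is positive semi-definite, so the penalty $\beta^T{\boldsymbol\Sigma}\beta$ is always non-negative.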

The second term comes from simplifying the following expression \begin{equation} \sum_i^n\sum_{j}^n w_{ij}(x_i^T\beta - x_j^T\beta)^2 = \beta^T\mathbf{X}^T(D-W)\mathbf{X}\beta. \end{equation}

Due to the independence of response variables, this method can be easily extended to unlabelled data. This further improvement proves to be really useful, especially in the extreme value setting, as the number of observations can be quite limited. In a machine learning setting, using this type of data is known as semi-supervised learning. For us, unlabelled data can be taken as predictions from a simulation model or as more recent historic data; the idea is to have some data which we can use to test and evaluate the model built on "training" data. Unlabelled data can be included in the optimisation problem by simply modifying the smoothing term in the main objective function so that both sums run up to $n+m$, where $m$ is the size of the unlabelled data set. This method was termed Linear Semi-Supervised Quantile Regression (LSSQR) by Abraham et al. (2011).

The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is one of the most popular quasi-Newton algorithms due to its speed, and it can be used to solve both the LSQR and LSSQR optimisation problems.
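As a hedged sketch, the smoothed QR objective can be minimised with BFGS via SciPy. The softplus smoothing of the pinball loss (`eps`) is my own device to give the optimiser gradients, not the paper's exact smoothing scheme, and `Sig` is the precomputed penalty matrix $\mathbf{X}^T(D-W)\mathbf{X}$:

```python
import numpy as np
from scipy.optimize import minimize

def fit_smoothed_qr(X, y, Sig, tau=0.9, lam=0.1, eps=1e-3):
    """Minimise sum_i rho_tau(y_i - x_i^T beta) + lam * beta^T Sig beta
    with BFGS, using a softplus-smoothed pinball loss."""
    p = X.shape[1]

    def objective(beta):
        r = y - X @ beta
        # rho_tau(r) ~ tau*r + eps*log(1 + exp(-r/eps))  (smooth near r = 0)
        rho = tau * r + eps * np.logaddexp(0.0, -r / eps)
        return rho.sum() + lam * beta @ Sig @ beta

    return minimize(objective, np.zeros(p), method="BFGS").x
```

With `lam=0` and `tau=0.5` this reduces to (smoothed) median regression, which gives a quick sanity check of the implementation.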

The same paper also compared the predictions of these two methods, LSQR and LSSQR. The data used came from the NCEP for the global scale and from the Canadian Climate Change Scenarios Network for the local scale. Each of the data sets, from 37 different weather stations, consisted of 26 variables, and the variable of interest was temperature. The performance of the temperature predictions was summarised by three metrics, each accounting for a distinct evaluation criterion: RMSE (magnitude), F-measure (the timing of extreme events) and frequency. Comparing the relative performance of both methods against baseline MLR and QR models revealed that LSSQR outperformed all other methods, including LSQR, on all performance measures. Moreover, it was better at predicting the magnitude of the extreme events than their timing. The paper argues that the reason for this accuracy is that the method produces consistently good predictions rather than a few extremely precise ones.

In future blog posts we shall discuss a method called Quantile Matching (QM), which uses Extreme Value distributions, and Multivariate Extremes, which will also need an introduction.

Abraham, Z., Xin, F., and Tan, P. T. (2011). *Smoothed quantile regression for statistical downscaling of extreme events in climate modelling*. Proceedings of the October 2011 CIDU, Mountain View, California, USA.

15-4-2016

In previous blog posts in this series we have been looking at different Statistical Downscaling methods without focusing much on producing an accurate distribution for extreme events. In the next posts we will cover this matter more closely. To do so we require Extreme Value Theory (EVT). Here we present a brief introduction to EVT and Multivariate Extreme Value Theory, which will become the theoretical spine of Quantile Matching, the last statistical downscaling method we will see.

There are several distributions that commonly and accurately depict the behaviour of the tails of a distribution (where extreme events occur). These arise from narrowing down the possible distributions to those satisfying certain properties. One of these properties is stability: loosely, this means that if we fit a model for a certain time period or threshold, we want the model to be consistent under changes to these. This leads naturally to some asymptotic results.

Suppose $X_1,\dots,X_n$ are independent identically distributed random variables with common cumulative distribution function $F$. As we are considering extremes, it is natural to look at the maximum of the random variables. Let $M_n = \max\{X_1,\dots,X_n\}$ (if we were interested in minima, we could simply set $M_n = -\max\{-X_1,\dots,-X_n\}$). If, for some sequences of constants $a_n$ and $b_n>0$, \begin{equation} \lim_{n\to\infty}\mathbb{P}\left(\frac{M_n-a_n}{b_n}\leq x\right)= H(x) \end{equation} and $H$ is a non-degenerate distribution, then $H$ is said to follow a Generalised Extreme Value (GEV) distribution. This result is widely known as the Extremal Types Theorem. The GEV distribution has the following form \begin{equation} H(x) = \exp\left\{-\left[1+\xi\left(\frac{x-\mu}{\sigma}\right)\right]^{-1/\xi}_+\right\}, \end{equation} where $x_+ = \max\{0,x\}$ and $1+\xi\left(\frac{x-\mu}{\sigma}\right) > 0$. The parameters $\mu$ and $\sigma$ are the location and scale parameters respectively. The parameter $\xi$ is called the shape parameter and controls the behaviour of the tails. There are three possibilities: $(1)$ a light-tailed Gumbel ($\xi = 0$), $(2)$ a heavy-tailed Fréchet ($\xi > 0$) or $(3)$ a short-tailed negative Weibull ($\xi < 0$) (Coles, 2001). Common estimation methods for the parameters $(\mu,\sigma,\xi)$ include maximum likelihood estimation and MCMC.
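As an illustration (using SciPy, which the sources do not use), a GEV can be fitted to simulated block maxima. Note that `scipy.stats.genextreme` parametrises the shape as $c = -\xi$:

```python
import numpy as np
from scipy.stats import genextreme

# Simulated block maxima: the maximum of 365 standard normal "days"
# for each of 100 "years" (illustrative data only).
rng = np.random.default_rng(42)
block_maxima = rng.normal(size=(100, 365)).max(axis=1)

# scipy's shape is c = -xi: c < 0 is Frechet (xi > 0), c > 0 Weibull.
c_hat, mu_hat, sigma_hat = genextreme.fit(block_maxima)
xi_hat = -c_hat  # back to the xi convention used in this post
```

Normal maxima lie in the Gumbel domain of attraction, so the fitted $\xi$ should be close to zero here.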

There are similar asymptotic results for the multivariate case. To begin with, let us consider the bivariate case, which then extends quite nicely; let $(X_i,Y_i)$, $i=1,\dots,n$, be an independent sequence of bivariate random vectors. Now consider the vector of componentwise maxima $M_n=(\max\{X_1,\dots,X_n\},\max\{Y_1,\dots,Y_n\})$. As before, we are interested in the behaviour of $M_n$ as $n\to\infty$. Before we move on to the asymptotic result for the distribution of $M_n$, a marginal standardisation to the unit Fréchet distribution is really convenient. Once the margins are unit Fréchet, the transformation is obtained by dividing each component of $M_n$ by $n$, $$M_n=\left(\frac{\max\{X_1,\dots,X_n\}}{n},\frac{\max\{Y_1,\dots,Y_n\}}{n}\right) = (Z_1, Z_2).$$ As in the univariate case, if the limiting joint distribution exists and is non-degenerate, we can find a distribution. More formally, \begin{equation} \lim_{n\to\infty} \mathbb{P}\left(Z_1\leq z_1, Z_2\leq z_2\right)= H(z_1,z_2) = \exp(-V(z_1,z_2)) \end{equation} where $V$ is called the exponent measure and takes the form $$V(x,y) = 2\int_0^1 \max\left(\frac{w}{x},\frac{1-w}{y}\right)dM(w).$$ The function $M$ is a distribution function on $[0,1]$ which captures the dependence between the two random variables. It satisfies the mean constraint (Coles, 2001) $$\int_0^1 w\,dM(w)=\frac12.$$ Therefore $V$ has the following properties: $V(z_1,\infty)=\frac{1}{z_1}$ and $V(\infty,z_2)=\frac{1}{z_2}$ (the marginal distributions are unit Fréchet), and it is homogeneous of order $-1$, $V(tz_1,tz_2) = \frac{1}{t}V(z_1,z_2)$. This last property extends the max-stability property to the bivariate case (Davison et al., 2012).

If, for example, we pick $M$ placing mass $\frac12$ at each of $0$ and $1$, then $Z_1$ and $Z_2$ are completely independent and $H(z_1,z_2) = \exp\left[ -\left(\frac{1}{z_1} + \frac{1}{z_2}\right) \right]$. For the case of complete dependence, with all mass at $\frac{1}{2}$, we have $H(z_1,z_2) = \exp\left[ -\max\left\{\frac{1}{z_1},\frac{1}{z_2}\right\}\right]$.

The multivariate case has a similar definition. If the limit exists and $H$ is non-degenerate, then $$\lim_{n\to\infty} \mathbb{P} \left(Z_1\leq z_1,\dots,Z_m\leq z_m \right) = H(z_1, \dots,z_m) = \exp(-V(z_1,\dots,z_m))$$ where the exponent measure generalises to $$V(z_1,\dots,z_m) = \int_{\mathcal{R}_1}\dots\int_{\mathcal{R}_m}\max\left\{\frac{w_1}{z_1},\dots,\frac{w_m}{z_m}\right\}dM(w_1,\dots,w_m)$$ where $\mathcal{R}_i$ are the regions on which the $z_i$ are defined, and $$\int w_i \,dM(w_1,\dots,w_m) = 1~~~~\text{for } i= 1,\dots,m.$$ These satisfy the properties mentioned above, and complete independence and dependence take the same form as in the bivariate case.

The standard dependence measure for extremes is given by \begin{equation} \lim_{z\to\infty}\mathbb{P}\left(Z_2 > z \mid Z_1 > z\right) = \chi. \end{equation} If $\chi = 0$ the variables are asymptotically independent; if $\chi>0$ they are asymptotically dependent. This generates one main limitation: apart from the degenerate case of complete independence, these models are always asymptotically dependent. This provides a poor approximation to dependence at finite levels, which leads to inconsistencies in the magnitude or rareness of an extreme event; a model fitted at any specific threshold will overestimate dependence under extrapolation. Recent work has looked at inference for asymptotically independent models: Heffernan and Tawn (2004), Heffernan and Resnick (2007), Keef et al. (2013) or, more recently, Papastathopoulos and Tawn (2014). A more detailed discussion of a refined dependence measure will be covered in the next blog post.
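The conditional probability in this definition can be estimated empirically at a finite level. Below is a small sketch; the helper `chi_hat` and the use of per-variable quantiles as the common threshold are my own choices:

```python
import numpy as np

def chi_hat(z1, z2, p=0.95):
    """Empirical estimate of P(Z2 > u2 | Z1 > u1) with u1, u2 the p-th
    marginal quantiles -- a finite-level proxy for the limit chi."""
    u1, u2 = np.quantile(z1, p), np.quantile(z2, p)
    exceed1 = z1 > u1
    if exceed1.sum() == 0:
        return np.nan
    return np.mean(z2[exceed1] > u2)
```

For perfectly dependent samples the estimate is $1$; for independent samples it is close to $1-p$, illustrating how slowly $\chi$ is approached at finite levels.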

Also under the assumption of stability, we can find the distribution of a variable of interest given that it exceeds a certain threshold $u$. More formally, this is summarised by the generalised Pareto distribution (GPD): for a sufficiently large threshold $u$, \begin{equation} \mathbb{P} \left(X\leq x \mid X>u\right) \approx \left\{ \begin{array}{lr} 1- \left(1+\frac{\xi (x-u)}{\sigma}\right)^{-\frac{1}{\xi}}_+ & ~~~~~~~~~~\text{if }\xi \neq 0\\ 1-\exp\left(-\frac{x-u}{\sigma}\right) & \text{if }\xi=0 \end{array} \right. \end{equation} The parameters $\xi$ and $\sigma$ are the shape and scale parameters.
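A quick illustration of threshold exceedance modelling with SciPy (not used in the sources); excesses of an exponential variable over a high threshold are exactly exponential, i.e. GPD with $\xi=0$:

```python
import numpy as np
from scipy.stats import genpareto

# Illustrative data: Exp(scale=2) excesses over a high threshold are
# exactly Exp(scale=2), i.e. GPD with xi = 0 and sigma = 2.
rng = np.random.default_rng(7)
x = rng.exponential(scale=2.0, size=20000)

u = np.quantile(x, 0.95)                  # threshold at the 95th percentile
excesses = x[x > u] - u
# scipy's genpareto shape c corresponds directly to xi here.
xi_hat, _, sigma_hat = genpareto.fit(excesses, floc=0.0)
```

The fitted shape should be near zero and the scale near $2$, matching the theory above.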

Coles, S. (2001). *An introduction to statistical modelling of extreme values*. London, United Kingdom: Springer-Verlag.

Davison, A. C., Padoan S. A., and Ribatet, M. (2012). *Statistical modelling of spatial extremes*. Statistical Science, Vol. 27, No. 2, pp. 161-186.

Heffernan, J. E. and Resnick, S. I. (2007). *Limit laws for random vectors with an extreme component*. Ann. Appl. Probab. 17, no. 2, pp. 537-571.

Heffernan, J. E. and Tawn, J. A. (2004). *A conditional approach for multivariate extreme values (with discussion)*. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66, pp. 497-546.

Keef, C., Papastathopoulos, I. and Tawn, J.A. (2013). *Estimation of the conditional distribution of a vector variable given that one of its components is large: additional constraints for the Heffernan and Tawn model*. Journal of Multivariate Analysis, vol 115, pp. 396-404.

Papastathopoulos, I. and Tawn, J. A. (2014). *New representations for the conditional approach to multivariate extremes*. arXiv preprint.

17-4-2016

Here we discuss how to apply EVT techniques to a Statistical Downscaling method named Quantile Matching, which uses a specific type of CDF to model the behaviour of climate change at the large scale and then produces an appropriate CDF for the local scale that captures this behaviour. This CDF is defined using the generalised Pareto distribution (GPD) and relies on the underlying assumption that local-scale variables behave similarly to large-scale ones and therefore undergo similar changes at similar times. This seems to hold in practice (Michelangeli et al., 2009).

The first step is to split the data into a calibration set (historic and present data, $P$) and a validation or model-fitting set (future/predicted data, $F$). The future set can be obtained either from simulations, by choosing more recent data, or by an appropriate cut based on some assumptions. Let $X$ and $Y$ denote the variables of interest in the large-scale and local-scale data respectively. Further, the need to distinguish between present and future data introduces $X_{P},~X_{F}$ and $Y_P,~Y_F$, which represent present and future data for the large-scale and local-scale variables. We have observations for the first three and would like to predict $Y_F$.

Keeping in mind that we want to model extreme events, or in this case observations over a certain threshold (of temperature, precipitation, sea level, etc.), we can make use of the GPD. Even though the main aim is to model extreme events, we still wish to estimate the values below the threshold well. For this we can apply a standard non-parametric model, such as a Gaussian kernel smoother ($\tilde{H}$); this is done in Towe et al. (2011).

All these notions are summarised in the XCDF transform method introduced by Kallache et al. (2011) and further explored in Towe et al. (2011). It is defined as follows \begin{equation} H_{Y_F}(y) = H_{Y_P}\left(H^{-1}_{X_P}\left(H_{X_F}\left(y\right)\right)\right) \end{equation} where \begin{equation}\label{eq:q match} H_Y(y) = \left\{ \begin{array}{lr} \tilde{H}_Y(y) & ~~~~~\text{for }y\leq u\\ 1-\phi_Y\left(1+\frac{\xi_Y(y-u)}{\sigma_Y}\right)^{-1/\xi_Y} & \text{for }y > u \end{array} \right. \end{equation} Here $\phi_Y = 1 - \tilde{H}_Y(u)$ is a normalising constant. This construction takes the assumption discussed above into account: the change in distribution from $Y_P$ to $Y_F$ corresponds, on the quantile scale, to the change in distribution from $X_P$ to $X_F$. The change in the distributions at the global scale may be attributed to climate change.
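A sketch of the semiparametric CDF $H_Y$ used above, with a plain empirical CDF standing in for the kernel smoother $\tilde{H}$ (that substitution and the function name are my own):

```python
import numpy as np
from scipy.stats import genpareto

def semiparametric_cdf(y, data, u, sigma, xi):
    """H(y): empirical CDF below the threshold u, GPD tail above it,
    glued via the normalising constant phi = 1 - H~(u).

    An empirical CDF stands in for the kernel smoother H~; sigma and xi
    are GPD parameters assumed fitted to the excesses over u.
    """
    data = np.sort(np.asarray(data))
    y = np.atleast_1d(np.asarray(y, dtype=float))
    h_tilde = np.searchsorted(data, y, side="right") / data.size
    phi = np.mean(data > u)                       # 1 - H~(u)
    tail = 1.0 - phi * genpareto.sf(y - u, c=xi, scale=sigma)
    return np.where(y <= u, h_tilde, tail)
```

By construction the two pieces meet at $u$, since the tail equals $1-\phi_Y=\tilde{H}_Y(u)$ there, so the resulting CDF is continuous and non-decreasing.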

Denoting the tail distribution as GPD($u,\phi_Y,\sigma_Y,\xi_Y$) requires four parameters for each of the three fitted variables $X_P$, $X_F$ and $Y_P$, that is $12$ parameters in the expression for $H_{Y_F}$. However, making the assumption that the thresholds are equal reduces these to $8$; an inflation factor can be introduced so that an exceedance threshold applies to each of the data sets (Michelangeli et al., 2009). Further, checking the assumption that $\xi_{X_P} = \xi_{X_F}$ with a likelihood ratio test may reduce the number of parameters to $6$.

Using the definitions above, we can obtain an expression for the tail distribution of $Y_F$: \begin{equation} H_{Y_F}(y) = 1 - \phi \left(1+\frac{\xi_{Y_P}}{\sigma_{Y_P}}\left\{u_{X_P}-u_{Y_P}+\frac{\sigma_{X_P}}{\xi_{X_P}}\left [ 1+\xi_{X_F}\left(\frac{y-u_{X_F}}{\sigma_{X_F}}\right)\right ]^{\frac{\xi_{X_P}}{\xi_{X_F}}}_+-1 \right\}\right)^{-\frac{1}{\xi_{Y_P}}} \end{equation} Applying the assumption that $\xi_{X_P}=\xi_{X_F}$, the expression above reduces to a $$\text{GPD}(u_{Y_F},\phi,\sigma_{Y_F},\xi_{Y_P})$$ where $u_{Y_F} = u_{X_F}+\frac{\sigma_{X_F}}{\sigma_{X_P}}(u_{Y_P}-u_{X_P})$ and $\sigma_{Y_F}=\frac{\sigma_{Y_P}\sigma_{X_F}}{\sigma_{X_P}}$. These parameters, as well as the shape parameter, can be estimated using the methods discussed in the previous post; maximum likelihood is the usual one. We could also restrict the sign of the shape parameter to be positive, as this often seems to be the case when studying climate modelling; however, there have been some exceptions, so we will not make such an assumption.
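As a quick arithmetic check of the derived threshold and scale formulae, with hypothetical fitted values (the numbers below are illustrative only, not from any real data set):

```python
# Hypothetical GPD parameters for X_P, X_F and Y_P (illustrative only).
u_XP, sigma_XP = 10.0, 2.0
u_XF, sigma_XF = 11.0, 2.5
u_YP, sigma_YP = 4.0, 1.5

# Under xi_{X_P} = xi_{X_F}, the derived Y_F threshold and scale are:
u_YF = u_XF + (sigma_XF / sigma_XP) * (u_YP - u_XP)   # 11 + 1.25 * (-6) = 3.5
sigma_YF = sigma_YP * sigma_XF / sigma_XP             # 1.5 * 2.5 / 2 = 1.875
```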

To model dependence, we can introduce a refined dependence measure related to $\chi$. In Towe et al. (2011), the authors use Laplace margins instead of the Fréchet margins discussed previously. These were proposed by Keef et al. (2013) as an extension of the formulation first presented by Heffernan and Tawn (2004) and formalised by Heffernan and Resnick (2007). The exponential tails and symmetry of the Laplace distribution allow us to model negative dependence.

Suppose we wish to model the dependence of two variables of interest $(Y_1,Y_2)$ with corresponding distributions (as defined above) $H_{Y_1}$ and $H_{Y_2}$ respectively. The transformation to Laplace margins is as follows \begin{equation} S_i(Y_i) = \left\{ \begin{array}{lr} \log\{2H_{Y_i}(Y_i)\} & ~~~~~\text{for }Y_i < H_{Y_i}^{-1}(0.5)\\ -\log\{2(1-H_{Y_i}(Y_i))\} & \text{for }Y_i \geq H_{Y_i}^{-1}(0.5) \end{array},~~~~i = 1,2. \right. \end{equation} Then \begin{equation} \mathbb{P}(S_i < s) = \left\{ \begin{array}{lr} \frac{\exp(s)}{2} & ~~~~~~~~~~\text{if }s < 0\\ 1- \frac{\exp(-s)}{2} & \text{if }s \geq 0 \end{array},~~~~i = 1,2, \right. \end{equation} so both the upper and lower tails of $S_i$ are exponentially distributed.
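The marginal transformation is a one-liner in practice; a minimal sketch (the helper name is my own), mapping probabilities $H_{Y_i}(Y_i)$ to Laplace margins:

```python
import numpy as np

def to_laplace(p):
    """Map probabilities p = H_{Y_i}(Y_i) in (0, 1) to Laplace margins S_i."""
    p = np.asarray(p, dtype=float)
    return np.where(p < 0.5,
                    np.log(2.0 * p),            # lower half: log{2 H(y)}
                    -np.log(2.0 * (1.0 - p)))   # upper half: -log{2(1 - H(y))}
```

The median maps to $0$ and the transform is antisymmetric about it, reflecting the symmetry of the standard Laplace distribution.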

With this transformation we can introduce the conditional dependence model proposed in Keef et al. (2013) \begin{equation} \lim_{u\to\infty}\mathbb{P}\left(S_1-u>x,~\frac{S_2 -\alpha S_1}{S_1^\beta}\leq z \;\middle\vert\; S_1>u \right)=\exp(-x)G(z) \end{equation} with $-1\leq\alpha\leq 1$ and $-\infty<\beta<1$. The parameter $\alpha$ captures the positive or negative dependence of $S_2$ given large $S_1$. The case $\alpha = 1$, $\beta=0$ corresponds to asymptotic dependence, i.e. $\chi>0$ for $$\chi=\lim_{p\to1}\mathbb{P}\left(Y_2>H_{Y_2}^{-1}(p)~\middle\vert~Y_1>H_{Y_1}^{-1}(p)\right).$$ An extra condition is required to ensure $G$ is well defined, namely $\lim_{z\to\infty}G(z)=1$.

The only drawback of this transformation is that $G$ has no simple parametric model; we therefore estimate the limit using regression, under the assumption of a sufficiently large threshold $u$, as suggested in Keef et al. (2013). From this assumption, it follows that $$Z=\frac{S_2 -\alpha S_1}{S_1^\beta}$$ is a random variable with distribution function $G$, mean $\mu$ and variance $\sigma^2$, independent of $S_1$ (Towe et al., 2011).

The regression model is therefore $$S_2=\alpha S_1 + S_1^\beta Z.$$

This now allows us to use moment estimation or likelihood methods to estimate the parameters $\left(\hat{\alpha},\hat{\beta},\hat{\mu},\hat{\sigma}^2\right)$ under the working assumption that $Z$ is normally distributed. The assumption that $Z$ follows a Laplace distribution was also explored, but the normal distribution was found to be more appropriate (Keef et al., 2013). With these estimates, we can use a nonparametric bootstrap or a kernel-smoothed CDF to estimate $G$. This in turn enables us to apply the inverse transform back to the original $(Y_1,Y_2)$ variables and ultimately provide the required predictions at the local scale.
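A sketch of estimating $(\alpha,\beta,\mu,\sigma)$ by maximising the Gaussian working likelihood on exceedances $S_1>u$. The function is my own, and Nelder–Mead is used here for robustness to the parameter bounds rather than any method named in the sources:

```python
import numpy as np
from scipy.optimize import minimize

def fit_ht_model(s1, s2, u):
    """Fit the regression S2 = alpha*S1 + S1^beta * Z with Z ~ N(mu, sigma^2),
    using exceedances S1 > u (Gaussian working assumption for Z)."""
    x, y = s1[s1 > u], s2[s1 > u]

    def nll(theta):
        alpha, beta, mu, log_sigma = theta
        if not (-1.0 <= alpha <= 1.0) or beta >= 1.0:
            return np.inf                      # enforce the parameter ranges
        sd = np.exp(log_sigma) * x ** beta     # conditional sd of S2 given S1
        resid = y - alpha * x - mu * x ** beta # residual about conditional mean
        return np.sum(np.log(sd) + 0.5 * (resid / sd) ** 2)

    res = minimize(nll, x0=[0.5, 0.1, 0.0, 0.0], method="Nelder-Mead")
    alpha, beta, mu, log_sigma = res.x
    return alpha, beta, mu, np.exp(log_sigma)
```

With the fitted parameters, the empirical residuals $\hat{Z}$ can then be bootstrapped or kernel-smoothed to estimate $G$, as described above.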

Heffernan, J. E. and Resnick, S. I. (2007). *Limit laws for random vectors with an extreme component*. Ann. Appl. Probab. 17, no. 2, pp. 537-571.

Heffernan, J. E. and Tawn, J. A. (2004). *A conditional approach for multivariate extreme values (with discussion)*. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66, pp. 497-546.

Kallache, M., Vrac, M., Naveau, P., and Michelangeli, P. A. (2011). *Nonstationary probabilistic downscaling of extreme precipitation*. J. Geophys. Res., 116, D05113.

Keef, C., Papastathopoulos, I. and Tawn, J.A. (2013). *Estimation of the conditional distribution of a vector variable given that one of its components is large: additional constraints for the Heffernan and Tawn model*. Journal of Multivariate Analysis, vol 115, pp. 396-404.

Towe, R., Eastoe, E., Tawn, J. A., and Jonathan, P. (2011). *Statistical downscaling for future extreme wave heights in the North Sea*. Preprint.