'Models' and 'models': integrating process information in spatial statistics



Vector River Map © Nelson Minar - https://github.com/NelsonMinar/vector-river-map

Within environmental science, the meaning of the term 'model' varies widely depending on whom you ask. To a more 'classical' environmental scientist focused on the representation of physical processes, a model typically means a computation designed to directly reflect the interaction of a set of physical laws. A process model produces largely deterministic outputs that depend on the way those laws are represented, along with a set of boundary conditions. A very simple process model of the volume of a glacier, for example, might calculate mass loss from melt, mass gain from snowfall, and ice thickness from a parameterisation of ice flow, constrained by an observation of the known surface area. To a statistician, however, a model can instead mean a framework for describing the probability distribution from which a dataset is sampled, in a way that is agnostic of the physical processes that generate the dataset's values. A very simple statistical model, again of glacier volume, might be a regression of a set of volumes calculated from direct measurements against glacier surface area raised to some power, which can then be used to estimate the volume of any glacier from its surface area.
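As an illustration of the statistical style of model, the volume–area regression described above can be sketched as a power law, V = c·A^γ, fitted by ordinary least squares in log–log space. This is a minimal sketch using invented calibration values rather than real glacier data, and the function names are our own:

```python
import math

# Hypothetical calibration data: glacier surface areas (km^2) and
# volumes (km^3) from direct measurements. Values are illustrative only.
areas = [1.2, 3.5, 8.0, 15.0, 42.0]
volumes = [0.045, 0.21, 0.72, 1.8, 8.5]

# Fit V = c * A**gamma by least squares on log V = log c + gamma * log A.
xs = [math.log(a) for a in areas]
ys = [math.log(v) for v in volumes]
n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
gamma = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
c = math.exp(y_mean - gamma * x_mean)

def predict_volume(area_km2):
    """Estimate volume (km^3) for a glacier of the given area (km^2)."""
    return c * area_km2 ** gamma

print(f"gamma = {gamma:.2f}, c = {c:.3f}")
print(f"predicted volume for 20 km^2: {predict_volume(20.0):.2f} km^3")
```

Nothing here knows anything about ice physics: the fitted exponent simply summarises the sample, which is exactly the sense in which such a model is agnostic of process.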

This distinction between uses of the term 'model' is not simply useful to keep in mind in discussions between people with different assumptions; it also suggests that, with two largely disparate definitions of a method used to similar ends, we risk missing out on the strengths of one approach when focusing on the other. In the broadest sense, a model of an environmental property is in both cases an attempt to produce good estimates of the values that property takes where we have no direct observations, so at the highest level a statistical model and a process model work towards the same goal, and we can examine the potential for each to inform the other. The Integrated Statistical Modelling theme within DSNE identifies a set of challenges for statistical modelling in environmental science, and in challenge 5 – spatial stochastic process-based modelling – the aim is to make better use of process understanding in the creation and selection of spatial statistical models for environmental phenomena. This goes some way towards bridging the gap between modelling paradigms, in this case starting from the statistical side and attempting to enhance this style of modelling with additional process information.

A particular class of properties for which we hope to improve spatial-stochastic models are those where the straight-line distance between two points is not a good predictor of how closely values at those points are related; more formally, where covariance functions based on Euclidean distance are not appropriate. This is common within the earth system because the properties we observe and estimate are often embedded within a more complex underlying geometry or topography that determines how values at different points can influence each other. A clear example is river networks, as addressed by O'Donnell et al. (2014). Consider two locations within the river system which are (by Euclidean distance) close together: they could be within the same tributary, or in neighbouring tributaries which converge some distance downstream. Which of these is the case plays a major role in the expected level of correlation between observations of river properties – pollutant concentration, for example – made at those locations. Here, the use of Euclidean distance can easily create artefacts in statistical models: an observation of elevated concentration in one stream can produce elevated estimates in a neighbouring stream, despite a downstream convergence where concentrations are lower, in a way which runs contrary to a physical understanding of the system. Using a metric which respects the along-flowline distance between points on the river network (intuitively: “How far does water actually have to travel?”) reduces such problems, because it better represents the physical process by which points on the network influence each other.
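The along-flowline idea can be made concrete by treating the river network as a weighted graph and computing shortest-path distances along the channels. The sketch below uses a toy four-node network with invented coordinates and segment lengths (not taken from any real catchment or from O'Donnell et al.); Dijkstra's algorithm gives the along-network distance between two sites that are Euclidean-close but sit in neighbouring tributaries:

```python
import heapq
import math

# Toy river network: nodes are sampling sites/junctions, edges are stream
# segments weighted by along-channel length (km). All values illustrative.
coords = {
    "A": (0.0, 2.0),    # site on tributary 1
    "B": (0.5, 2.0),    # site on neighbouring tributary 2
    "J": (0.25, 0.0),   # confluence downstream
    "O": (0.25, -1.0),  # outlet
}
edges = {
    ("A", "J"): 2.8,  # winding channels, longer than straight-line
    ("B", "J"): 2.6,
    ("J", "O"): 1.0,
}

# Build an undirected adjacency list from the edge weights.
graph = {}
for (u, v), w in edges.items():
    graph.setdefault(u, []).append((v, w))
    graph.setdefault(v, []).append((u, w))

def network_distance(src, dst):
    """Shortest along-channel distance via Dijkstra's algorithm."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, math.inf):
            continue  # stale queue entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return math.inf

euclid = math.dist(coords["A"], coords["B"])  # 0.5 km as the crow flies
along = network_distance("A", "B")            # 5.4 km of actual water travel
print(f"Euclidean: {euclid:.1f} km, along-network: {along:.1f} km")
```

A covariance built on the second number, rather than the first, would treat the two tributary sites as nearly unrelated until their confluence, matching the physical intuition.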

River networks are a particularly clean example, but there are properties across environmental science for which the use of Euclidean distance metrics poses problems, and for which different representations of distance might yield benefits. For ice sheet surfaces, the along-flowline distance is also relevant, but the situation is complicated by the fact that flow is not well represented by a network of one-dimensional stream segments as it is in a river, and is better described by a two-dimensional surface of flow vectors. In air quality applications, physical constraints on air flow – from natural topography forming barriers between valleys to buildings forming barriers between roads – affect the influence of points on each other in a way which does not correspond to simple Euclidean distance. On larger scales, the existence of persistent flows and the non-equivalence of north–south and east–west distances skew 'best-effort' representations away from purely Euclidean ones in both air and ocean science. For properties that relate to human society, spatial differences in population density might suggest useful adjustments to purely Euclidean metrics, or there may be cases where metrics describing distance as some combination of Euclidean distance and 'social connectivity' can improve models – the spread of infectious diseases being an example with which we are now all familiar.
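One way to see why the choice of metric matters is to plug different distances into the same covariance function. The sketch below evaluates a simple exponential covariance, σ²·exp(−d/ρ), at an illustrative straight-line separation and at a much longer along-flow separation between the same pair of sites (both distances invented for the example). A caveat worth flagging: substituting an arbitrary metric into a covariance function that is valid for Euclidean distance does not in general guarantee the result is still a valid (positive-definite) covariance, which is part of what makes this modelling challenge non-trivial.

```python
import math

def exp_covariance(d, sigma2=1.0, rho=2.0):
    """Exponential covariance: sigma^2 * exp(-d / rho).
    d is a distance in km; rho is the range parameter (km)."""
    return sigma2 * math.exp(-d / rho)

# Two sites: close in a straight line, far apart along the flow path.
# Both separations are illustrative values, not measurements.
euclidean_d = 0.5
along_flow_d = 5.4

print(exp_covariance(euclidean_d))   # strong implied correlation
print(exp_covariance(along_flow_d))  # near-independence under the flow metric
```

Under the Euclidean separation the model would borrow heavily between the two sites; under the along-flow separation it would treat them as almost unrelated, which is the behaviour a physical understanding of the system suggests.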

The above examples are far from a comprehensive list, but they serve to illustrate how the nature of the space in which a property exists influences the way we need to define the distance between points in that space. Space in the environment, whether natural or human-influenced, is rarely simple and uniform, so simple and uniform measures of distance rarely capture the degree to which one location influences another. While this introduces new complexity and additional parameters, if we can create spatial statistical models that represent the space they cover in the way most relevant to the behaviour of the property being modelled, we can improve their results. There is a philosophical appeal to this approach in addition to the potential practical benefits: it allows us to move models closer to operating on a space that looks like reality, rather than relying on assumptions of simplicity and linearity that try to force reality towards what is easiest for us to model.

Disclaimer

The opinions expressed by our bloggers and those providing comments are personal, and may not necessarily reflect the opinions of Lancaster University. Responsibility for the accuracy of any of the information contained within blog posts belongs to the blogger.

