Motivating Challenges

Monitoring of Epidemics

Prompt public health responses to epidemic outbreaks rely on real-time monitoring and prediction of their spread. Work by investigators has included methods and models used for past outbreaks of foot-and-mouth disease and avian flu. These applications need Bayesian solutions, as the most important decisions about how to contain an outbreak are those made at an early stage, when there is substantial uncertainty. Furthermore, this uncertainty affects the likely course of the epidemic in a highly non-linear way. Correctly quantifying the risk of a sizeable epidemic depends on estimating the tails of the posterior, so approximations such as variational methods cannot be used.

Unfortunately, approaches for real-time monitoring are at the limit of what is computationally feasible using current MCMC methods. At the same time, new technologies have led to the routine collection of other data, such as genetic information for samples of cases, that can help determine the historic path of the epidemic. Taking full advantage of this new information requires Bayesian methods that can fuse information from disparate data types, together with more efficient and scalable SMC and MCMC approaches.

Genomics and Phenotyping

There is currently a step change in the number of very large cohort studies that are collecting and analysing detailed genetic information as well as rich phenotypic data. For example, a recent genetic association study analysed 36 different quantitative blood traits, which have been linked to nearly 30 million imputed variants in a sample of approximately 180,000 individuals. Rich intermediate phenotypes (e.g. proteomics, lipidomics and metabolomics traits) are also collected on large subsamples of individuals.

In related studies there is interest in including lifestyle data collected from wearable devices. The current analysis of these studies tends to rely on simple univariate association estimation and testing between each trait and each genetic feature, followed by post-processing of results using meta-analysis and enrichment-like approaches.

Scientifically, it is of great interest to expand the genomic analysis paradigm towards multifeature investigation of structured associations, uncovering groups of traits or intermediate phenotypes that are linked to multiple features, e.g. genetic variants. Statistically, this requires a massive scaling up of efficient Bayesian model choice and search strategies, and algorithms that can cope with the joint analysis of complex correlated trait structures and modalities.

Related challenges arise when analysing data collected by the International Mouse Phenotyping Consortium (IMPC). The goal of the IMPC is to create a comprehensive catalogue of mammalian gene function through the analysis of high-dimensional data captured on multiple mice, each carrying an induced null mutation at a gene, and to repeat this for every gene in the mouse genome.

Personalized Medicine

Customizing medical care to each individual patient has huge potential for improving patient care through earlier diagnoses and through more timely and appropriate interventions and medication. The data science challenges underpinning personalized medicine involve the need to borrow information from other, similar individuals when making inferences and decisions about a given patient.

Bayesian methods are ideally suited to such a task. Their potential has been shown recently in real-time patient monitoring, where Bayesian, model-based approaches were able to predict a patient's clinical deterioration 8 hours prior to intensive care admission, with substantially greater accuracy than state-of-the-art critical care risk scoring technologies. Widespread use of these ideas requires scaling up Bayesian computational methods and fusing information across different sources for each patient.