# Research Outputs

## 2023 Publications

Benton, J., Shi, Y., De Bortoli, V., Deligiannidis, G. and Doucet, A. (2023) From Denoising Diffusions to Denoising Markov Models *Royal Statistical Society, Series B, together with its discussion.* A unified framework for designing generative models with general noising processes. This work matters because it enables the ideas of diffusion generative models to be applied to data that are not real-valued, such as discrete data, networks, graphs, or data restricted to lie on a manifold.

Rimella, L., Alderton, S., Sammarro, M., Rowlingson, B., Cocker, D., Feasey, N., Fearnhead, P. & Jewell, C. (2023) Inference on Extended-Spectrum Beta-Lactamase Escherichia coli and Klebsiella pneumoniae data through SMC^2 *Journal of the Royal Statistical Society Series C: Applied Statistics*, qlad055 **(Short Summary Below)**

Rimella, L., Jewell, C. & Fearnhead, P. (2023) Approximating Optimal SMC Proposal Distributions in Individual-Based Epidemic Models *To appear in Statistica Sinica, doi:10.5705/ss.202022.0198.* **(Short Summary Below)**

Cornish, R., Taufiq Muhammad, F., Doucet, A. & Holmes, C. (2023) Causal Falsification of Digital Twins *arXiv e-prints, pp. arXiv:2301.07210.*

Sutton, M. & Fearnhead, P. (2023) Concave-Convex PDMP-based sampling *To appear in Journal of Computational and Graphical Statistics.* **(Short Summary Below)**

## 2022 Publications

Andrieu, C., Lee, A., Power, S. & Wang, A.Q. (2022) Poincaré inequalities for Markov chains: a meeting with Cheeger, Lyapunov and Metropolis *arXiv preprint arXiv:2208.05239*

Chevallier, A., Fearnhead, P. & Sutton, M. (2022) Reversible Jump PDMP Samplers for Variable Selection *To appear in Journal of the American Statistical Association, doi:10.1080/01621459.2022.2099402* **(Short Summary Below)**

Corbella, A., Spencer, S. & Roberts, G. (2022) Automatic Zig-Zag sampling in practice *Statistics and Computing, 32, 107 doi:10.1007/s11222-022-10142-x*

Sutton, M., Salomone, R., Chevallier, A. & Fearnhead, P. (2022) Continuously-Tempered PDMP Samplers *Advances in Neural Information Processing Systems*

Andrieu, C., Lee, A., Power, S. & Wang, A.Q. (2022) Explicit convergence bounds for Metropolis Markov chains: isoperimetry, spectral gaps and profiles *arXiv preprint arXiv:2211.08959*

Andrieu, C., Lee, A., Power, S. & Wang, A. (2022) Comparison of Markov chains via weak Poincaré inequalities with application to pseudo-marginal MCMC *The Annals of Statistics, 50, pp. 3592-3618.* **(Short Summary Below)**

Fong, E., Holmes, C. & Walker, S. G. (2022) Martingale posterior distributions *To appear in Journal of the Royal Statistical Society, Series B.* **(Short Summary Below)**

Buchholz, A., Ahfock, D. & Richardson, S. (2022) Distributed Computation for Marginal Likelihood based Model Choice *Bayesian Analysis, 18 pp. 607-638*

Jersakova, R., Lomax, J., Hetherington, J., Lehmann, B., Nicholson, G., Briers, M. & Holmes, C. (2022) Bayesian imputation of COVID-19 positive test counts for nowcasting under reporting lag *Applied Statistics 71 834-860*

Nicholson, G., Blangiardo, M., Briers, M., Diggle, P. J., Erlend Fjelde, T., Ge, H., Goudie, R.J.B., Jersakova, R., King, R.E., Lehmann, B.C.L., Mallon, A., Padellini, T., Teh, Y.W., Holmes, C. & Richardson, S. (2022) Interoperability of statistical models in pandemic preparedness: principles and reality *Statistical Science 37, pp. 183–206*

Aicher, C., Putcha, S., Nemeth, C., Fearnhead, P., & Fox, E. B. (2022) Stochastic Gradient MCMC for Nonlinear State Space Models *Society for Industrial and Applied Mathematics Vol.1, Iss.3* **(Short Summary Below)**

Vono, M., Paulin, D., & Doucet, A. (2022) Efficient MCMC Sampling with Dimension-Free Convergence Rate using ADMM-type Splitting *Journal of Machine Learning Research 23(25):1−69, 2022*

## 2021 Publications

Read, J.M., Bridgen, J.R.E., Cummings, D.A.T., Ho, A. & Jewell, C.P. (2021) Novel coronavirus 2019-nCoV (COVID-19): early estimation of epidemiological parameters and epidemic size estimates *Philosophical Transactions of the Royal Society B 376 (1829), 20200265*

Birrell, P., Blake, J., Van Leeuwen, E., Gent, N. & De Angelis, D. (2021) Real-time nowcasting and forecasting of COVID-19 dynamics in England: the first wave *Philosophical Transactions of the Royal Society B 376 (1829), 20200279*

Nicholson, G., Lehmann, B., Padellini, T., Pouwels, K.B., Jersakova, R., Lomax, J., King, R.E., Mallon, A.M., Diggle, P.J., Richardson, S., Blangiardo, M. & Holmes, C. (2021) Local prevalence of transmissible SARS-CoV-2 infection: an integrative causal model for debiasing fine-scale targeted testing data *MedRxiv*

Nemeth, C., & Fearnhead, P. (2021) Stochastic gradient Markov chain Monte Carlo *Taylor & Francis Online* **(Short Summary Below)**

## 2020 Publications

Behr, M., Ansari, M. A., Munk, A., & Holmes, C. (2020) Testing for dependence on tree structures *Proceedings of the National Academy of Sciences of the United States of America, 117(18), 9787–9792* **(Short Summary Below)**

Vats, D., Gonçalves, F., Łatuszyński, K., & Roberts, G. O. (2020) Efficient Bernoulli factory MCMC for intractable likelihoods

Touloupou, P., Finkenstädt, B., & Spencer, S. E. F. (2020) Scalable Bayesian Inference for Coupled Hidden Markov and Semi-Markov Models *Journal of Computational and Graphical Statistics, 29(2), 238–249*

Chapman, L. A. C., Spencer, S. E. F., Pollington, T. M., Jewell, C. P., Mondal, D., Alvar, J., Deirdre Hollingsworth, T., Cameron, M. M., Bern, C., & Medley, G. F. (2020) Inferring transmission trees to guide targeting of interventions against visceral leishmaniasis and post-kala-azar dermal leishmaniasis *medRxiv, 2020.02.24.20023325*

Pollock, M., Fearnhead, P., Johansen, A. M., & Roberts, G. O. (2020) Quasi-stationary Monte Carlo and the ScaLE Algorithm *Journal of the Royal Statistical Society. Series B, Statistical Methodology*

Chevallier, A., Fearnhead, P., & Sutton, M. (2020) Reversible Jump PDMP Samplers for Variable Selection *Taylor & Francis Online*

Bierkens, J., Grazzi, S., Kamatani, K., & Roberts, G. (2020) The Boomerang Sampler *Proceedings of the 37th International Conference on Machine Learning, PMLR 119:908-918, 2020*

Fong, E., & Holmes, C. C. (2020) On the marginal likelihood and cross-validation *Biometrika*, *107*(2), 489–496 **(Short Summary Below)**

## 2019 Publications

Wang, A. Q., Pollock, M., Roberts, G. O., & Steinsaltz, D. (2019) Regeneration-enriched Markov processes with application to Monte Carlo

Buchholz, A., Ahfock, D., & Richardson, S. (2019) Distributed Computation for Marginal Likelihood based Model Choice

Mider, M., Jenkins, P. A., Pollock, M., Roberts, G. O., & Sørensen, M. (2019) Simulating bridges using confluent diffusions

## Short Paper Summaries

### Quasi-stationary Monte Carlo and the ScaLE Algorithm

The key idea of the paper is to draw samples from the posterior distribution by simulating a Markov process with killing. Such processes will eventually die out, and their long-term behaviour is often described by their quasi-stationary distribution. That is, we consider the limiting distribution of the process conditional on it not dying.

This leads to new algorithms for sampling from a posterior, though not without challenges: simulating from the quasi-stationary distribution is much harder than simulating from the stationary distribution in MCMC. The algorithm presented in the paper is based on constructing a killed Brownian motion, for which it is straightforward to derive the killing rate so that the killed Brownian motion has the posterior distribution as its quasi-stationary distribution. However, simulating killed Brownian motion is non-trivial, and requires ideas from exact simulation of diffusions. Simulating from the quasi-stationary distribution is then achieved by embedding the forward simulation of the killed Brownian motion within a sequential Monte Carlo algorithm. The paper shows that the resulting algorithm, called ScaLE, can scale well with the number of data points, because one can simulate the killed Brownian motion exactly whilst only using subsamples of the data at each iteration.

This work was developed as part of the i-like research project, but has also been supported by both the Bayes4Health and CoSInES grants.

### Stochastic gradient Markov chain Monte Carlo

Stochastic gradient Markov chain Monte Carlo algorithms are increasingly popular approaches for approximate sampling from posterior distributions in big data settings. One example is stochastic gradient Langevin dynamics, which can be viewed as a computationally efficient approximation to the MALA algorithm, but one where the Metropolis-Hastings accept-reject step is removed and the gradient of the log-posterior is approximated using a small sub-sample of the data at each iteration. This algorithm is much faster to run than MALA, but no longer targets the true posterior. However, in many applications where compute resource is limited, it can give more accurate results than standard MCMC algorithms.

This paper reviews this class of algorithms, gives an informal overview of recent theoretical results on their Monte Carlo and approximation error, compares them against standard MCMC algorithms, and points to open research challenges.
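As an illustration of the stochastic gradient Langevin dynamics update described above, here is a minimal sketch on a toy problem. The model (Gaussian observations with unit variance and a Gaussian prior on the mean), the function name `sgld`, and all tuning values are assumptions chosen for the example; they are not taken from the paper.

```python
import math
import random


def sgld(data, n_iter, batch_size, step, rng, prior_var=100.0):
    """Toy SGLD for the mean of N(theta, 1) data with a N(0, prior_var) prior.

    Hypothetical illustration: no accept-reject step, and the gradient of the
    log-posterior is estimated from a random sub-sample of the data.
    """
    n_data = len(data)
    theta = 0.0
    samples = []
    for _ in range(n_iter):
        batch = rng.sample(data, batch_size)
        # Unbiased estimate of the log-posterior gradient from the mini-batch.
        grad = -theta / prior_var + (n_data / batch_size) * sum(y - theta for y in batch)
        # Langevin step: half-step along the gradient plus Gaussian noise.
        theta += 0.5 * step * grad + math.sqrt(step) * rng.gauss(0.0, 1.0)
        samples.append(theta)
    return samples


rng = random.Random(1)
data = [rng.gauss(2.0, 1.0) for _ in range(1000)]
draws = sgld(data, n_iter=2000, batch_size=50, step=1e-4, rng=rng)
kept = draws[500:]  # discard an initial burn-in period
sgld_mean = sum(kept) / len(kept)
data_mean = sum(data) / len(data)
```

With a fixed step size the draws concentrate near the posterior mean but are more spread out than the true posterior, which reflects the approximation error discussed in the summary.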

### On the marginal likelihood and cross-validation

In Bayesian statistics, the marginal likelihood, also known as the evidence, is used to evaluate model fit as it quantifies the joint probability of the data under the prior. In contrast, non-Bayesian models are typically compared using cross-validation on held-out data, either through k-fold partitioning or leave-p-out subsampling. We show that the marginal likelihood is formally equivalent to exhaustive leave-p-out cross-validation averaged over all values of p and all held-out test sets when using the log posterior predictive probability as the scoring rule.
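The equivalence can be checked numerically on a small example. The sketch below assumes a Beta-Bernoulli model (chosen here only because its posterior predictive is available in closed form) and compares the log marginal likelihood with the sum over p of the exhaustive leave-p-out scores, where each score averages the log posterior predictive over all held-out sets of size p and over the points within each set. The function names are hypothetical.

```python
import itertools
import math


def log_predictive(y_new, train, a=1.0, b=1.0):
    """Log posterior predictive of one Bernoulli observation under a Beta(a, b) prior."""
    p_one = (a + sum(train)) / (a + b + len(train))
    return math.log(p_one if y_new == 1 else 1.0 - p_one)


def log_marginal(y, a=1.0, b=1.0):
    """Log marginal likelihood via the chain rule of one-step-ahead predictives."""
    return sum(log_predictive(y[i], y[:i], a, b) for i in range(len(y)))


def leave_p_out_score(y, p, a=1.0, b=1.0):
    """Exhaustive leave-p-out score: mean log predictive over all held-out sets."""
    n = len(y)
    scores = []
    for held in itertools.combinations(range(n), p):
        train = [y[i] for i in range(n) if i not in held]
        scores.append(sum(log_predictive(y[j], train, a, b) for j in held) / p)
    return sum(scores) / len(scores)


y = [1, 0, 1, 1, 0, 1]
lhs = log_marginal(y)
rhs = sum(leave_p_out_score(y, p) for p in range(1, len(y) + 1))
```

Here `lhs` and `rhs` agree up to floating-point error. The p = n term uses an empty training set, so it scores the data under the prior predictive alone.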

### Testing for dependence on tree structures

Tree-like structures are abundant in the empirical sciences as they can summarize high-dimensional data and show latent structure among many samples in a single framework. Prominent examples include phylogenetic trees or hierarchical clustering derived from genetic data. Currently, users employ ad hoc methods to test for association between a given tree and a response variable, which reduces reproducibility and robustness. This paper introduces treeSeg, a simple-to-use and widely applicable methodology with high power for testing between all levels of hierarchy for a given tree and the response while accounting for the overall false positive rate. This method allows for precise uncertainty quantification and therefore increases the interpretability and reproducibility of such studies across many fields of science.

### Stochastic gradient MCMC for nonlinear state space models

Current stochastic gradient MCMC methods have been designed for models where the log-likelihood can be written as a sum and it is easy to calculate the gradient of each term in the sum. Unfortunately, the log-likelihood for state-space models cannot be written in this way as the gradient terms involve calculating integrals. This paper shows how one can still apply stochastic gradient MCMC to state space models by introducing a further approximation (using buffering ideas) when calculating the gradients; it presents theory on the resulting error of the method and gives comparisons that show when the method can be more efficient than standard MCMC approaches.

### Reversible Jump PDMP Samplers for Variable Selection

This paper extends recent PDMP-based sampling methods so that they can be used for variable choice problems, such as fitting GLMs jointly with deciding which covariates to include in the model. The algorithm developed is easy to implement if we have a PDMP sampler that can explore the posterior for a given model (i.e. a given set of covariates to include). The only difference is two additional moves: one re-introduces covariates into the model at a constant rate, and the other removes a covariate whenever the coefficient associated with its effect hits 0.

### Martingale posterior distributions

This paper presents a new approach to sampling from posterior distributions. It is based on the idea that parameters of interest would be known exactly if we had an infinitely large data set. Thus if we could simulate new data given the observed data, then we could simulate a large data set and hence obtain a parameter value from the posterior. This idea turns sampling from a posterior distribution into repeated sampling from a predictive distribution. This opens up a new approach to Bayesian non-parametrics, and has potential as a way to marry Bayesian approaches to uncertainty quantification with powerful machine learning algorithms for prediction.
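The idea of turning posterior sampling into repeated predictive sampling can be sketched with the simplest possible predictive: the empirical distribution of the data seen so far, updated as a Pólya urn. This choice is an assumption made for illustration (it corresponds to the Bayesian bootstrap) and is not necessarily the predictive used in the paper. The function name, the data, and the finite `n_future` truncation of the "infinitely large data set" are likewise hypothetical.

```python
import random


def predictive_resample_mean(data, n_future, rng):
    """One approximate posterior draw of the mean via predictive resampling.

    Each future observation is drawn from the current empirical distribution
    and then added to it (a Polya urn), so the predictive is updated as the
    imputed data set grows.
    """
    urn = list(data)
    for _ in range(n_future):
        urn.append(rng.choice(urn))
    # The parameter of interest is computed from the completed data set.
    return sum(urn) / len(urn)


rng = random.Random(0)
observed = [1.2, 0.4, 2.3, 1.9, 0.8]
posterior_draws = [predictive_resample_mean(observed, 500, rng) for _ in range(500)]
posterior_mean = sum(posterior_draws) / len(posterior_draws)
```

Each call imputes a different future data set, so the spread of `posterior_draws` reflects uncertainty about the mean given only the five observed values.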

### Concave-Convex PDMP-based sampling

The implementational bottleneck for PDMP samplers is simulating the event times at which the dynamics of the sampler change. Simulating these event times is equivalent to simulating events in a time-inhomogeneous Poisson process. We show that if the rate of this time-inhomogeneous Poisson process can be decomposed into the sum of a concave and a convex function, then there are efficient and automatic methods for simulating the event times that only need to evaluate the concave and convex functions. These methods are simple to implement, and we provide code for doing so; they can also be more efficient than other existing simulation methods.
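A simplified sketch of how such a decomposition can be used is given below. On each short interval, the convex part is bounded above by its chord and the concave part by its tangent at the left endpoint, so their sum is bounded by a linear function whose maximum lies at an endpoint. For brevity, this sketch then thins a constant-rate Poisson process at that maximum rather than using the tighter linear bound, which is a simplification and not the paper's implementation. The function name, the example rate, and the interval length are assumptions, and the rate is assumed to be non-negative.

```python
import math
import random


def first_event_time(convex, concave, concave_grad, horizon, rng):
    """First event of a Poisson process with rate convex(t) + concave(t) >= 0."""
    t0 = 0.0
    while True:
        t1 = t0 + horizon
        # Endpoint values of the linear bound (chord of convex + tangent of concave).
        at_start = convex(t0) + concave(t0)
        at_end = convex(t1) + concave(t0) + concave_grad(t0) * horizon
        bound = max(at_start, at_end)
        if bound > 0.0:
            t = t0
            while True:
                t += rng.expovariate(bound)  # candidate from the bounding process
                if t >= t1:
                    break  # no accepted event here; move to the next interval
                if rng.random() * bound < convex(t) + concave(t):
                    return t  # thinning: accept with probability rate / bound
        t0 = t1


rng = random.Random(42)
# Example rate: t^2 (convex) + 1 - exp(-t) (concave).
times = [
    first_event_time(
        lambda t: t * t,
        lambda t: 1.0 - math.exp(-t),
        lambda t: math.exp(-t),
        0.5,
        rng,
    )
    for _ in range(4000)
]
survival_at_1 = sum(t > 1.0 for t in times) / len(times)
```

For this example rate the integrated rate up to time 1 is 1/3 + exp(-1), so `survival_at_1` should be close to exp(-(1/3 + exp(-1))).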

### Comparison of Markov chains via weak Poincaré inequalities

When carrying out Bayesian inference in complex statistical models, it can be the case that standard computational methods cannot be applied, as certain key quantities (likelihoods, normalising constants, etc.) cannot be evaluated. Nevertheless, for some important model classes, there exist “exact-approximate” methods which are able to bypass this intractability and deliver an implementable algorithm which emulates the ‘ideal’ method which one would like to use, without incurring systematic bias.

In this paper, we study the theoretical properties of some of these “exact-approximate” MCMC algorithms, using the perspective of Markov chain comparison theory. This approach neatly mirrors the practical construction of the algorithms themselves, showing that for well-designed algorithms, one can perform “almost as well as” the ideal algorithm, in a precise quantitative sense. Moreover, relative to previous studies, the comparison techniques used herein are particularly straightforward and transparent.

With this perspective, we are able to provide new insights into the practical use of so-called “pseudo-marginal” algorithms, with applications to Approximate Bayesian Computation (ABC), particle marginal Metropolis-Hastings (PMMH), and related procedures.

### Inference on extended-spectrum beta-lactamase E. coli and Klebsiella pneumoniae

The paper introduces an innovative modelling approach to investigate the spread of antimicrobial-resistant bacteria in Malawi. The model adopts an individual-based framework that tracks the colonisation status of each individual in the population while including relevant factors in the colonisation rate, such as individuals’ covariates, seasonal influence, and environmental effects. Inference is carried out with the SMC^2 algorithm, in which one SMC is nested inside another to compute particle approximations of the posterior distributions over both the individuals and the parameters.

### Approximating optimal SMC proposal distribution in... epidemic models

Conventional SMC implementations, such as the bootstrap particle filter and the auxiliary particle filter, prove to be inefficient when applied to high-dimensional state-space models and, in particular, to individual-based models for infectious disease modelling. This inefficiency arises mainly from the discrepancy between the proposal distribution on the individuals’ states and future observations. To overcome this issue, the paper proposes building proposal distributions that take future observations into account by exploiting two crucial properties of the model: (i) the ability to analytically calculate the optimal proposal distribution for a single individual given future observations and the future infection rate of each individual, and (ii) the independence of dynamics among individuals when conditioned on their infection rates.