# Research Outputs

## Selected Publications

Short summaries are available for some selected publications.

Read, J. M., Bridgen, J. R. E., Cummings, D. A. T., Ho, A., & Jewell, C. P. (2021). Novel coronavirus 2019-nCoV (COVID-19): early estimation of epidemiological parameters and epidemic size estimates. *Philosophical Transactions of the Royal Society B*, *376*(1829), 20200265.

Birrell, P., Blake, J., Van Leeuwen, E., Gent, N., & De Angelis, D. (2021). Real-time nowcasting and forecasting of COVID-19 dynamics in England: the first wave. *Philosophical Transactions of the Royal Society B*, *376*(1829), 20200279.

Nicholson, G., Lehmann, B., Padellini, T., Pouwels, K. B., Jersakova, R., Lomax, J., King, R. E., Mallon, A. M., Diggle, P. J., Richardson, S., Blangiardo, M., & Holmes, C. (2021). Local prevalence of transmissible SARS-CoV-2 infection: an integrative causal model for debiasing fine-scale targeted testing data. medRxiv. https://www.medrxiv.org/content/10.1101/2021.05.17.21256818v1

Chevallier, A., Fearnhead, P., & Sutton, M. (2020). Reversible Jump PDMP Samplers for Variable Selection. In *arXiv [stat.CO]*. arXiv. https://arxiv.org/abs/2010.11771

Bierkens, J., Grazzi, S., Kamatani, K., & Roberts, G. (2020). The Boomerang Sampler. *Proceedings of the 37th International Conference on Machine Learning*, PMLR 119:908-918, 2020. http://proceedings.mlr.press/v119/bierkens20a.html

Fong, E., & Holmes, C. C. (2020). On the marginal likelihood and cross-validation. *Biometrika*, *107*(2), 489–496. https://doi.org/10.1093/biomet/asz077 **[short summary]**

Behr, M., Ansari, M. A., Munk, A., & Holmes, C. (2020). Testing for dependence on tree structures. *Proceedings of the National Academy of Sciences of the United States of America*, *117*(18), 9787–9792. https://doi.org/10.1073/pnas.1912957117 **[short summary]**

Vats, D., Gonçalves, F., Łatuszyński, K., & Roberts, G. O. (2020). Efficient Bernoulli factory MCMC for intractable likelihoods. In *arXiv [stat.CO]*. arXiv. http://arxiv.org/abs/2004.07471

Touloupou, P., Finkenstädt, B., & Spencer, S. E. F. (2020). Scalable Bayesian Inference for Coupled Hidden Markov and Semi-Markov Models. *Journal of Computational and Graphical Statistics*, *29*(2), 238–249. https://doi.org/10.1080/10618600.2019.1654880

Chapman, L. A. C., Spencer, S. E. F., Pollington, T. M., Jewell, C. P., Mondal, D., Alvar, J., Deirdre Hollingsworth, T., Cameron, M. M., Bern, C., & Medley, G. F. (2020). Inferring transmission trees to guide targeting of interventions against visceral leishmaniasis and post-kala-azar dermal leishmaniasis. medRxiv, 2020.02.24.20023325. https://doi.org/10.1101/2020.02.24.20023325

Pollock, M., Fearnhead, P., Johansen, A. M., & Roberts, G. O. (2020). Quasi-stationary Monte Carlo and the ScaLE Algorithm. *Journal of the Royal Statistical Society. Series B, Statistical Methodology*. https://arxiv.org/abs/1609.03436 **[short summary]**

Wang, A. Q., Pollock, M., Roberts, G. O., & Steinsaltz, D. (2019). Regeneration-enriched Markov processes with application to Monte Carlo. In *arXiv [math.PR]*. arXiv. http://arxiv.org/abs/1910.05037

Buchholz, A., Ahfock, D., & Richardson, S. (2019). Distributed Computation for Marginal Likelihood based Model Choice. In *arXiv [stat.CO]*. arXiv. https://arxiv.org/abs/1910.04672

Nemeth, C., & Fearnhead, P. (2019). Stochastic gradient Markov chain Monte Carlo. In *arXiv [stat.CO]*. arXiv. http://arxiv.org/abs/1907.06986 **[short summary]**

Vono, M., Paulin, D., & Doucet, A. (2019). Efficient MCMC Sampling with Dimension-Free Convergence Rate using ADMM-type Splitting. In *arXiv [stat.CO]*. arXiv. http://arxiv.org/abs/1905.11937

Mider, M., Jenkins, P. A., Pollock, M., Roberts, G. O., & Sørensen, M. (2019). Simulating bridges using confluent diffusions. In *arXiv [stat.ME]*. arXiv. http://arxiv.org/abs/1903.10184

Aicher, C., Putcha, S., Nemeth, C., Fearnhead, P., & Fox, E. B. (2019). Stochastic Gradient MCMC for Nonlinear State Space Models. In *arXiv [stat.ML]*. arXiv. http://arxiv.org/abs/1901.10568 **[short summary]**

## Short Paper Summaries

### Quasi-stationary Monte Carlo and the ScaLE Algorithm

The key idea of the paper is to draw samples from the posterior distribution by simulating a Markov process with killing. Such a process will eventually die out, and its long-term behaviour is often described by its quasi-stationary distribution: the limiting distribution of the process conditional on it not having died.

This leads to new algorithms for sampling from a posterior, though not without challenges: simulating from the quasi-stationary distribution is much harder than simulating from the stationary distribution in MCMC. The algorithm presented in the paper is based on constructing a killed Brownian motion, for which it is straightforward to derive the killing rate so that the killed Brownian motion has the posterior distribution as its quasi-stationary distribution. However, simulating killed Brownian motion is non-trivial and requires ideas from the exact simulation of diffusions. Simulating from the quasi-stationary distribution is then achieved by embedding the forward simulation of the killed Brownian motion within a sequential Monte Carlo algorithm. The paper shows that the resulting algorithm, called ScaLE, can scale well with the number of data points, because the killed Brownian motion can be simulated exactly whilst using only subsamples of the data at each iteration.
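To give a feel for the quasi-stationary idea, the following is a minimal sketch and not the ScaLE algorithm itself: it uses a crude time discretisation, no exact simulation and no subsampling. It assumes a one-dimensional standard normal target, for which the killing rate works out as `x**2 / 2`, and approximates the quasi-stationary distribution with a simple particle system in which each killed particle is restarted at the position of a randomly chosen survivor. The function names, particle count and step size are illustrative choices.

```python
import math
import random


def killing_rate(x):
    # Assumed target: pi(x) proportional to exp(-x^2 / 2).
    # (|d/dx log pi|^2 + d^2/dx^2 log pi) / 2 = (x^2 - 1) / 2,
    # shifted by its minimum (-1/2) so that the rate is non-negative.
    return 0.5 * x * x


def killed_bm_particles(n_particles=400, n_steps=1000, dt=0.01, seed=1):
    """Crude particle approximation to the quasi-stationary distribution
    of Brownian motion killed at rate killing_rate(x)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n_particles)]
    pooled = []
    for step in range(1, n_steps + 1):
        # Move every particle by a Brownian increment over a time step dt.
        xs = [x + math.sqrt(dt) * rng.gauss(0.0, 1.0) for x in xs]
        # Each particle survives the step with probability exp(-rate * dt).
        alive = [rng.random() < math.exp(-killing_rate(x) * dt) for x in xs]
        survivors = [x for x, ok in zip(xs, alive) if ok]
        if survivors:
            # Restart each killed particle at a randomly chosen survivor.
            xs = [x if ok else rng.choice(survivors)
                  for x, ok in zip(xs, alive)]
        # Discard the first half as burn-in, then keep occasional snapshots.
        if step > n_steps // 2 and step % 50 == 0:
            pooled.extend(xs)
    return pooled


def mean_and_variance(samples):
    m = sum(samples) / len(samples)
    v = sum((s - m) ** 2 for s in samples) / len(samples)
    return m, v
```

Calling `mean_and_variance(killed_bm_particles())` should give a mean near 0 and a variance near 1, up to discretisation and Monte Carlo error, since the particles never target the normal distribution through an accept-reject step but only through the killing and restarting mechanism.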

This work was developed as part of the i-like research project, but has also been supported by both the Bayes4Health and CoSInES grants.

### Stochastic gradient Markov chain Monte Carlo

Stochastic gradient Markov chain Monte Carlo algorithms are increasingly popular approaches for approximately sampling from posterior distributions in big-data settings. One example is stochastic gradient Langevin dynamics, which can be viewed as a computationally efficient approximation to the MALA algorithm in which the Metropolis-Hastings accept-reject step is removed and the gradient of the log posterior is approximated using a small subsample of the data at each iteration. This algorithm is much faster to run than MALA, but no longer targets the true posterior. However, in many applications where compute resources are limited, it can give more accurate results than standard MCMC algorithms.
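The update described above can be sketched on a toy problem. The example below is an illustration under stated assumptions, not code from the paper: the model (unit-variance Gaussian observations with a Gaussian prior on the mean), the function names, the step size and the batch size are all choices made here so that the exact posterior is available for comparison.

```python
import math
import random


def sgld_gaussian_mean(data, n_steps=5000, step_size=1e-4, batch_size=50,
                       prior_var=100.0, seed=0):
    """Stochastic gradient Langevin dynamics for the mean theta of
    y_i ~ N(theta, 1), with prior theta ~ N(0, prior_var)."""
    rng = random.Random(seed)
    n_data = len(data)
    theta = 0.0
    samples = []
    for _ in range(n_steps):
        # Subsample the data without replacement.
        batch = rng.sample(data, batch_size)
        # Gradient of the log prior, plus an unbiased estimate of the
        # gradient of the log likelihood rescaled to the full data size.
        grad_prior = -theta / prior_var
        grad_lik = (n_data / batch_size) * sum(y - theta for y in batch)
        # Langevin step with injected Gaussian noise and no accept-reject.
        noise = math.sqrt(step_size) * rng.gauss(0.0, 1.0)
        theta += 0.5 * step_size * (grad_prior + grad_lik) + noise
        samples.append(theta)
    return samples


def exact_posterior(data, prior_var=100.0):
    """Exact posterior mean and variance for the same conjugate model."""
    precision = 1.0 / prior_var + len(data)
    return sum(data) / precision, 1.0 / precision
```

Each iteration touches only `batch_size` observations rather than the full data set, which is the source of the speed-up. Comparing the sample mean of the later iterations with `exact_posterior(data)` gives a quick check that the chain settles near the right place, while the spread of the samples need not match the posterior variance exactly because the accept-reject correction has been dropped.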

This paper reviews this class of algorithms, gives an informal overview of recent theoretical results on their Monte Carlo and approximation error, compares them against standard MCMC algorithms, and points to open research challenges.

### On the marginal likelihood and cross-validation

In Bayesian statistics, the marginal likelihood, also known as the evidence, is used to evaluate model fit as it quantifies the joint probability of the data under the prior. In contrast, non-Bayesian models are typically compared using cross-validation on held-out data, either through k-fold partitioning or leave-p-out subsampling. We show that the marginal likelihood is formally equivalent to exhaustive leave-p-out cross-validation averaged over all values of p and all held-out test sets when using the log posterior predictive probability as the scoring rule.
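The equivalence can be checked numerically on a small example. The sketch below assumes a Beta-Bernoulli model, chosen here only because its posterior predictive is available in closed form; the function names are illustrative. It computes the log marginal likelihood by the chain rule, computes the leave-p-out score (averaged over all held-out sets of size p and over the points within each set) for every p from 1 to n, and confirms that these scores sum to the log marginal likelihood.

```python
import itertools
import math


def log_predictive(y_new, train, a=1.0, b=1.0):
    """Log posterior predictive of one Bernoulli outcome given training
    outcomes, under a Beta(a, b) prior on the success probability."""
    p_one = (a + sum(train)) / (a + b + len(train))
    return math.log(p_one if y_new == 1 else 1.0 - p_one)


def log_marginal(y, a=1.0, b=1.0):
    """Log marginal likelihood via the chain rule of probability."""
    return sum(log_predictive(y[i], y[:i], a, b) for i in range(len(y)))


def leave_p_out_score(y, p, a=1.0, b=1.0):
    """Exhaustive leave-p-out score: the log predictive of each held-out
    point given the n - p training points, averaged over the p points in
    each held-out set and over all held-out sets of size p."""
    n = len(y)
    scores = []
    for test in itertools.combinations(range(n), p):
        train = [y[i] for i in range(n) if i not in test]
        scores.append(sum(log_predictive(y[j], train, a, b)
                          for j in test) / p)
    return sum(scores) / len(scores)


def summed_cv(y, a=1.0, b=1.0):
    """Sum of the leave-p-out scores over p = 1, ..., n."""
    return sum(leave_p_out_score(y, p, a, b) for p in range(1, len(y) + 1))
```

For `y = [1, 0, 1, 1, 0, 1]`, `summed_cv(y)` and `log_marginal(y)` agree to floating-point precision. Note that the p = n term uses no training data at all, so it scores the prior predictive, which is one way to see why the marginal likelihood is sensitive to the choice of prior.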

### Testing for dependence on tree structures

Tree-like structures are abundant in the empirical sciences as they can summarize high-dimensional data and show latent structure among many samples in a single framework. Prominent examples include phylogenetic trees and hierarchical clusterings derived from genetic data. Currently, users employ ad hoc methods to test for association between a given tree and a response variable, which reduces reproducibility and robustness. This paper introduces treeSeg, a simple-to-use and widely applicable methodology with high power for testing for association between the response and all levels of hierarchy in a given tree, while accounting for the overall false positive rate. The method allows for precise uncertainty quantification and therefore increases the interpretability and reproducibility of such studies across many fields of science.

### Stochastic gradient MCMC for nonlinear state space models

Current stochastic gradient MCMC methods have been designed for models where the log-likelihood can be written as a sum and it is easy to calculate the gradient of each term in the sum. Unfortunately, the log-likelihood for state space models cannot be written in this way, as the gradient terms involve calculating integrals. This paper shows how one can still apply stochastic gradient MCMC to state space models by introducing a further approximation (using buffering ideas) when calculating the gradients. It presents theory on the resulting error of the method and gives comparisons that show when the method can be more efficient than standard MCMC approaches.
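The role of the integrals and of the buffer can be made concrete with Fisher's identity. The notation below (latent states $x_t$, observations $y_t$, parameters $\theta$, series length $T$, subsequence length $S$, buffer length $B$) is assumed here for illustration and is not taken from the paper, and the initial state distribution is assumed not to depend on $\theta$. The gradient of the log-likelihood is a sum of expectations under the smoothing distribution:

$$
\nabla_\theta \log p(y_{1:T} \mid \theta) = \sum_{t=1}^{T} \mathbb{E}\left[ \nabla_\theta \log p(x_t, y_t \mid x_{t-1}, \theta) \,\middle|\, y_{1:T}, \theta \right]
$$

Each term is an integral over latent states that conditions on the whole series, so no term can be evaluated cheaply on its own. A buffered estimate, sketched here under the assumption of a uniformly chosen contiguous subsequence $s+1, \dots, s+S$ and not necessarily in the form used in the paper, conditions only on the subsequence plus $B$ observations either side:

$$
\widehat{g}(\theta) = \frac{T}{S} \sum_{t=s+1}^{s+S} \mathbb{E}\left[ \nabla_\theta \log p(x_t, y_t \mid x_{t-1}, \theta) \,\middle|\, y_{s+1-B:s+S+B}, \theta \right]
$$

The estimate is biased, because the conditioning set is truncated, but one would expect the bias to shrink as $B$ grows whenever the smoothing distribution forgets distant observations.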