# Term 2: Optional Modules

Term 2 allows for further specialisation and application in areas in which there is a considerable demand for data scientists. You will study one module (15 credits) strengthening the foundations for your Specialism modules (Statistical Inference or Computing) together with a number of elective modules (30 credits) which can be chosen to form designated pathways:

• Bioinformatics
• Population Health
• Environmental
• Societal
• Computing

The available Specialism and Elective modules (listed below) build upon the Core set and have a prerequisite level of skills and knowledge.

## Specialism Modules

In the second term, you will build on your specialism with one of the following modules (15 credits each).

• Bayesian Inference for Data Science MATH555
 Category: Statistics module, requires Statistics Modules II

This module aims to introduce the Bayesian view of statistics, stressing its philosophical contrasts with classical statistics, its facility for including information other than the data into the analysis and its coherent approach towards inference and model selection. The module will also introduce students to MCMC (Markov chain Monte Carlo), a computationally intensive method for efficiently applying Bayesian methods to complex models. By the end of the course the students should be able to formulate an appropriate prior for a variety of problems, calculate, simulate from and interpret the posterior and the predictive distribution, with or without MCMC as appropriate and to carry out Bayesian model selection using the marginal likelihood. Students should be able to carry out all of this using the programming language R.

#### Topics covered will include

• Inference by updating belief
• The ingredients of Bayesian inference: the prior, the likelihood, the posterior, the predictive and the marginal distribution
• Methods for formulating the prior
• Conjugate priors for single parameter models
• Normal distribution, known and unknown variance, regression
• Sampling for the posterior and predictive distributions
• Model checking and model selection
• Gibbs sampling, data augmentation, hierarchical models
• The Metropolis-Hastings algorithm, random walk Metropolis, independence sampler
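
As an illustration of the conjugate-prior topic listed above, the following is a minimal sketch in Python rather than the R used on the module. It assumes a Beta prior for a binomial success probability. The function names and the data (7 successes in 10 trials) are hypothetical and chosen only for the example.

```python
from fractions import Fraction


def beta_binomial_update(a, b, successes, trials):
    """Return the Beta posterior parameters after observing binomial data.

    A Beta(a, b) prior combined with `successes` out of `trials` gives a
    Beta(a + successes, b + trials - successes) posterior.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return a + successes, b + (trials - successes)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution (integer parameters assumed here)."""
    return Fraction(a, a + b)


# Hypothetical data: uniform Beta(1, 1) prior, 7 successes in 10 trials.
post_a, post_b = beta_binomial_update(1, 1, successes=7, trials=10)

# The posterior mean is also the predictive probability that the next
# single trial is a success.
posterior_mean = beta_mean(post_a, post_b)
print(post_a, post_b, posterior_mean)  # prints: 8 4 2/3
```

Because the prior is conjugate, the update is a closed-form change of parameters and no MCMC is needed; sampling methods become relevant for models without this property.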

#### On successful completion students will be able to

• Understand the Bayesian statistical framework and its philosophy
• Demonstrate knowledge of key concepts: the prior, the likelihood, the posterior, the predictive and the marginal distribution
• Calculate, simulate from and interpret the posterior and the predictive distribution
• Construct an MCMC algorithm for a variety of statistical models and implement them in R

#### Bibliography

• Hoff, P. (2008) A first course in Bayesian statistics. Springer
• Gamerman, D. and Lopes, H. F. (2006) Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. Chapman and Hall, 2nd Edition
• Gilks, W.R., Richardson, S. and Spiegelhalter, D. (1996) Markov chain Monte Carlo in Practice. Chapman and Hall
• Statistical Learning CFAS420
 Category: Statistics module, available to all

The module will provide students with the statistical tools needed to understand the analysis of large data sets, and the statistical background to such tools. It will seek to integrate the various methods used in such analysis into a common modelling framework. An important part of the course is on interpretation and on communicating the results to others.

#### Topics covered will include

• Introduction to statistical learning; problems of missing data, biased samples and recency
• Statistical significance and big data
• Sample splitting. Calibration, training and validation samples. Entropy and likelihood
• Unsupervised learning: K-means, PAM and CLARA for big data. Mixture models. Latent class analysis
• Variable reduction methods and variable selection. The Lasso
• Classification methods: logistic and multinomial logistic models. Probability cutoffs; the ROC curve; sensitivity and specificity
• Classification methods: regression trees, random forests and boosted trees
• Classification methods: neural networks as generalised linear modelling extensions
• Classification methods: other methods (PRIM)
• Smoothing models through GAMs
• Bayesian networks
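
The probability cutoff, sensitivity and specificity mentioned in the topics above can be sketched as follows. This is an illustrative Python example, not module material: the function name, the predicted probabilities and the labels are all hypothetical, and the sketch assumes both classes are present in the data.

```python
def sensitivity_specificity(probabilities, labels, cutoff=0.5):
    """Classify as positive when probability >= cutoff, then return
    (sensitivity, specificity) against the true 0/1 labels.

    Assumes at least one positive and one negative label.
    """
    tp = fn = tn = fp = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= cutoff else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predicted probabilities and true classes.
probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

# Each cutoff gives one point (1 - specificity, sensitivity) on the ROC curve.
for cutoff in (0.35, 0.5, 0.7):
    sens, spec = sensitivity_specificity(probs, labels, cutoff)
    print(cutoff, round(sens, 3), round(spec, 3))
```

Lowering the cutoff raises sensitivity at the cost of specificity, which is the tradeoff the ROC curve summarises.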

#### Learning

Students will learn through the application of concepts and techniques covered in the module to real data sets. Students will be encouraged to examine issues of substantive interest in these studies.

#### On successful completion students will be able to

• Understand the need for a statistical basis for data analytics
• Appreciate that different terminologies used in different data analytic technologies can be integrated through statistical modelling concepts and the idea of likelihood
• Understand the tradeoff between interpretability and predictive performance
• Choose appropriate statistical learning methods for various forms of real-life problem
• Build predictions for logistic and multinomial logistic models
• Choose an appropriate clustering method which has a statistical basis
• Split big datasets appropriately, and understand that predictive performance should be assessed on the validation sample
• Carry out a regression tree analysis including pruning, assessing its performance
• Carry out more complex forms of regression tree ensemble techniques, including random forests
• Carry out simple neural network analyses, while understanding the need for construction of a start value strategy

#### Assessment

Two assignments (100%) to be submitted in the form of reports covering all aspects of the module material. The projects involve the analysis of datasets that require the student to investigate

• Applied Data Mining SCC413
 Category: Computing module, available to all

This module provides students with up-to-date information on current applications of data in both industry and research. Expanding on the module ‘Fundamentals of Data’, students will gain a more detailed level of understanding about how data is processed and applied on a large scale across a variety of different areas.

#### Topics covered will include

• The Semantic Web: primer, crawling and spidering Linked Data, open-track large-scale problems (e.g. Billion Triples Challenge), distributed and federated querying, distributed reasoning, ontology alignment
• The Social Web: primer, user-generated content and crowd-sourced data, social networks (theories, analysis), recommendation (collaborative filtering, content recommendation challenges, and friend recommendation/link prediction)
• The Scientific Web: from big data to big science, open data, citizen science, and case studies (virtual environmental observatories, collaboration networks)
• Scalable data processing: primer, scaling the semantic web (scaling distributed reasoning using MapReduce), scaling the social web (collaborative filtering, link prediction), and scalable network analysis for the scientific web

#### On successful completion of this module students will be able to

• Create scalable solutions to problems involving data from the semantic, social and scientific web
• Process networks and perform network analysis to identify key actors in information flow
• Understand the current trends of research in the semantic, social and scientific web and what challenges still remain to be solved
• Demonstrate working knowledge of distributing workloads for scalable applications

## Elective Modules

You will also study a number of elective modules (2 or 3) worth a total of 30 credits. You can choose to follow the designated pathways, or you can formulate a bespoke programme of study by selecting modules from across pathways, subject to prerequisites and scheduling.

• Societal Pathway: Multi-Level Models
 Pathway: Societal | Category: Statistics module, available to all | Credits: 10

The aim of this module is to introduce how to analyse data that has a multi-level, hierarchical structure. The mathematical form of multilevel models is described. The models are developed first for continuous outcomes moving from linear regression to the random intercept model to the random coefficient model. Multilevel models are then shown for binary and other outcomes. Software implementation is described with the lme4 package in R. Some use of MLwiN is also made.

#### Topics covered will include

• The intra class correlation coefficient
• Two level random intercept and random coefficient models with continuous outcomes
• Checking model assumptions and residual diagnostics
• Models with three or more levels
• Generalized multilevel models including two-level logistic regression models, multilevel ordinal logistic regression models, and multilevel Poisson regression models
• Worked examples are shown of fitting such models in statistical software (mainly in R, but also some in MLwiN)
• Students will also gain insight into the different estimation algorithms available for multilevel models

#### On successful completion students will be able to

• Comprehend the notation used to describe multilevel models
• Demonstrate knowledge of multilevel models by formulating appropriate models to answer specific questions
• Demonstrate and understand how to use statistical software to fit multilevel models and how to interpret the relevant output
• Demonstrate how to perform model diagnostics for such models
• Interpret the results of fitting multilevel models

#### Bibliography

• Bryk, A. S., and Raudenbush, S. W., (1992) Hierarchical Linear Models, Sage
• Goldstein, H., (2003) Multilevel Statistical Models. London, Edward Arnold
• Holmes Finch W, Bolin JE, Kelley K. Multilevel Modeling Using R. Chapman & Hall. 2014
• Hox, J., (2002) Multilevel Analysis: Techniques and Applications, Mahwah, NJ: Lawrence Erlbaum Associates
• Longford, N. T., (1993) Random Coefficient Models. Oxford University Press
• Snijders, T. A. B., and Bosker, R. J., (1999) Multilevel Analysis. An Introduction to Basic and Advanced Multilevel Modelling. London: Sage
• Societal Pathway: Methods for Missing Data
 Pathway: Societal | Category: Statistics module, available to all | Credits: 10

This module deals with the ubiquitous and often neglected problem of missing data, which is common in many types of statistical analysis. We survey some ad hoc strategies for dealing with missing data and show how they can lead to bias and inefficiency. We advocate a principled approach based on formulating the underlying missing data mechanism, and we look at several principled methods. First, we present a fully Bayesian approach using WinBUGS. Second, we create multiply imputed datasets using chained equations and then apply Rubin’s rules for combining the analyses of the models. Third, we repeat this using multivariate techniques rather than chained equations as the method of multiple imputation. Finally, we look at examples where no imputation is needed at all. All of the methods will be illustrated through examples using the appropriate tools for exploration and diagnostics. We will also touch on imputation for hierarchical models involving mixed effects.

#### Topics covered will include

• The missing data mechanisms: Illustration using directed graphical models and exploration of the missingness models using appropriate software
• A survey of ad hoc methods illustrating their drawbacks
• Missing data in the covariates or explanatory variables
• Full Bayesian imputation using WinBUGS to demonstrate the role of the three models (the model for missingness, the imputation model and the substantive model)
• Multiple imputation using chained equations and multivariate methods
• Rubin’s rules for combining the modelling of multiply imputed datasets
• Diagnostics of the imputation process
• A survey of methods of dealing with missingness in hierarchical datasets

#### On successful completion students will be able to

• Demonstrate mastery of tools for exploring missingness patterns using the VIM and mice software libraries for R
• Formulate a possible missing data mechanism for a given scenario, and identify cases where the missing data mechanism is ignorable
• Formulate and differentiate the model for missingness, the imputation model and the substantive model (model of interest)
• Differentiate between sampling and parameter uncertainty, and recognise that the predictive distribution of the missing data incorporates both types of uncertainty
• Implement some naive methods for dealing with missingness (such as single imputation or listwise deletion), recognise the limitations of each method and identify situations where their use may be appropriate
• Explain the differences between a multivariate imputation model and one using chained equations
• Estimate the between-imputation variability and the within-imputation variability, and combine them in a sensible way to estimate the total variability and the fraction of information lost through missingness
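
The last outcome above, combining within- and between-imputation variability, can be sketched with the standard form of Rubin’s rules. This Python example is illustrative only (the module itself uses R): the function name and the three imputed-data estimates are hypothetical, and the returned ratio is the simple large-sample proportion of variance due to missingness, without the degrees-of-freedom adjustment.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors.

    Returns (pooled estimate, within variance W, between variance B,
    total variance T, proportion of T attributable to missingness).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # W: average within-imputation variance
    between = variance(estimates)    # B: sample variance (divisor m - 1)
    total = within + (1 + 1 / m) * between   # T = W + (1 + 1/m) B
    lam = (1 + 1 / m) * between / total
    return q_bar, within, between, total, lam


# Hypothetical results from m = 3 imputed datasets.
q_bar, W, B, T, lam = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(q_bar, W, B, round(T, 4), round(lam, 4))
```

The factor (1 + 1/m) accounts for using a finite number of imputations; as m grows, the total variance approaches W + B.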

#### Bibliography

• Stef van Buuren, 2012 Flexible Imputation of Missing Data, (Chapman & Hall/CRC Interdisciplinary Statistics Series)
• James R. Carpenter and Michael G. Kenward , 2013. Multiple Imputation and Its Application (Statistics in Practice). Wiley
• Societal Pathway: Structural Equation Modelling
 Pathway: Societal | Category: Statistics module, available to all | Credits: 10

This module will introduce participants to latent variables (variables which are not directly measured themselves) and to the use of factor analysis in investigating relationships between latent variables and observed, or measured, variables. These techniques will then be extended into the wider area of structural equation modelling, where complex models involving several latent variables will be introduced.

The module is aimed at researchers and research students who have experience of statistical modelling (up to linear regression) and hypothesis testing, and who wish to develop techniques to analyse more complex data involving latent variables. The aim of the module is to provide a background of theory with opportunities to apply the techniques in practice, and each session will consist of a lecture/demonstration and a practical. The software packages used will be IBM SPSS and AMOS. No prior knowledge of the structural equation modelling package AMOS will be assumed.

#### Topics covered will include

• introduction to latent variables and measurement error
• exploratory and confirmatory factor analysis; measurement models
• structural equation models
• theoretical issues involved in the development and application of structural equation models

#### Learning

Students will learn through the application of concepts and techniques covered in the module to real data sets. Students will be encouraged to examine issues of substantive interest in these studies.

#### On successful completion students will be able to

• investigate data using factor analysis
• build and verify appropriate measurement models for latent constructs
• confirm hypotheses and develop structural equation models
• apply theoretical concepts
• identify and solve problems
• analyse data using appropriate techniques
• interpret statistical output

#### Assessment

One assignment (100%) to be submitted in the form of two reports covering all aspects of the module material. The projects involve datasets that require the student to investigate a substantive issue using appropriate statistical techniques and to interpret the results.

#### Bibliography

• Byrne, B.M. (2010) Structural Equation Modelling with AMOS: Basic Concepts, Applications and Programming. New York: Routledge
• Kline, R. B. (2010) Principles and Practice of Structural Equation Modeling. London: The Guilford Press
• Business Intelligence Pathway: Data Mining for Marketing, Sales and Finance
 Pathway: Business Intelligence | Category: Statistics module, available to all | Credits: 10

The course extends the concepts of statistical model building and the models from the Introductory Statistics module towards methods from machine learning and artificial intelligence.

#### Topics covered will include

• Introduction to Data Mining
• Data Mining Process: Methods for data exploration & manipulation; Methods for data reduction & feature selection; Evaluating Classification Accuracy
• Data Mining Methods for Classification: Logistic Regression; Decision Trees; Nearest neighbour classification; Artificial Neural Networks
• Data Mining applications in Credit Scoring

#### On successful completion students will be able to

• Understand general modelling concepts in relation to complex models
• Use a wide range of data mining methods to handle data of different types & applications
• Understand how these methods may be applied in practical management contexts
• Use & apply SAS Enterprise Miner to deal with complexity and large datasets

#### Bibliography

• Tan, P. N., M. Steinbach, et al. (2005). Introduction to data mining. Boston, Pearson Addison Wesley
• Berry, M. J. A. and G. Linoff (2000). Mastering data mining: the art and science of customer relationship management. New York, NY [u.a.], Wiley Computer Publ
• Berry, M. J. A. and G. Linoff (2004). Data mining techniques: for marketing, sales, and customer relationship management. Indianapolis, Ind., Wiley Pub
• Linoff, G. and M. J. A. Berry (2001). Mining the Web: transforming customer data into customer value. New York, John Wiley & Sons
• Weiss, S. M. and N. Indurkhya (1998). Predictive data mining: a practical guide. San Francisco, Morgan Kaufmann Publishers
 Pathway: Business Intelligence | Category: Statistics module, available to all | Credits: 10

The module introduces time series and causal forecasting methods so that passing students will be able to prepare methodologically competent, understandable and concisely presented reports for clients. By the end of the course, students should be able to model causal and time series models, assess their accuracy and robustness and apply them in a real world problem domain.

#### Topics covered will include

• Introduction to Forecasting in Organisations: Extrapolative vs. Causal Forecasting; History & academic research in Forecasting; Forecasting case studies
• Data Exploration: Time Series Patterns; Univariate & Multivariate Visualisation; Naïve Forecasting Methods & Averages
• Exponential Smoothing Methods: Single, Seasonal & Trended Exponential Smoothing; Model Selection; Parameter Selection
• ARIMA Methods: AR-, MA-, ARMA and ARIMA Models; ARIMA Model specification & estimation; Automatic selection
• Time Series Regression: Simple & multiple regression on time series; Hypothesis testing; Model evaluation; Diagnostics
• Time Series Regression: Model specification and constraints; Dummy Variables, Lag, Non-linearities; Stationarity; Building regression models
• Applications in operations and marketing
• Judgmental Forecasting: Judgmental methods for forecasting; Biases and heuristics
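
Single exponential smoothing, one of the methods listed above, can be sketched in a few lines. This Python example is illustrative only: the function name and the demand series are hypothetical, and initialising the level with the first observation is an assumption (other initialisations are common).

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    level[t] = alpha * y[t] + (1 - alpha) * level[t - 1],
    with level[0] set to the first observation (an assumed initialisation).
    The last level is the forecast for the next period.
    """
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    levels = [series[0]]
    for y in series[1:]:
        levels.append(alpha * y + (1 - alpha) * levels[-1])
    return levels


# Hypothetical demand series with smoothing parameter 0.5.
levels = simple_exponential_smoothing([10, 12, 14], alpha=0.5)
print(levels)       # prints: [10, 11.0, 12.5]
print(levels[-1])   # one-step-ahead forecast: 12.5
```

A larger alpha makes the forecast react faster to recent observations, while a smaller alpha smooths more heavily; choosing it is the parameter selection problem named in the topic list.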

#### Bibliography

• Ord K. & Fildes R. (2013), Principles of Business Forecasting, South-Western Cengage Learning
• Business Intelligence Pathway: Optimisation and Heuristics
 Pathway: Business Intelligence | Category: Statistics module, available to all | Credits: 10

Optimisation, sometimes called mathematical programming, has applications in many fields, including Operational Research, Computer Science, Statistics, Finance, Engineering and the Physical Sciences. Commercial optimisation software is now capable of solving many industrial-scale problems to proven optimality. On the other hand, there are still many practical applications where finding a provably-optimal solution is not computationally viable. In such cases, heuristic methods can allow good solutions to be found within a reasonable computation time.

The course is designed to enable students to apply optimisation techniques to business problems. Building on the introduction to optimisation in MSCI502 and/or MSCI519, students will be introduced to different problem formulations and algorithmic methods to guide decision making in business and other organisations.

#### Topics covered will include

• Linear Programming
• Non-Linear Programming
• Integer and Mixed-Integer Programming
• Dynamic Programming
• Heuristics

#### On successful completion students will be able to

• know how to formulate problems as mathematical programs and solve them
• be aware of the power, and the limitations, of optimisation methods
• be able to carry out sensitivity analysis to see how robust the recommendation is
• be familiar with commercial software such as MPL, LINDO and EXCEL SOLVER
• be aware of major heuristic techniques and know when and how to apply them

#### Bibliography

• HP Williams (2013) Model Building in Mathematical Programming (5th edition). Wiley. ISBN: 978-1-118-44333-0 (pbk)
• J Kallrath & JM Wilson (1997) Business Optimisation Using Mathematical Programming. Macmillan. ISBN: 0-333-67623-8
• WL Winston (2004) Operations Research - Applications and Algorithms (4th edition). Thomson. ISBN: 978-0534380588
• DR Anderson, DJ Sweeney, TA Williams & M. Wisniewski (2008) An Introduction to Management Science. Cengage Learning. ISBN: 978-1844805952
• E.K. Burke & G. Kendall (eds.) (2005) Search Methodologies: Introductory Tutorials in Optimization and Decision Support Techniques. Springer
• Environmental Pathway: Geoinformatics
 Pathway: Environmental | Category: Statistics module, available to all | Credits: 15

This module introduces students to the fundamental principles of Geographical Information Systems (GIS) and Remote Sensing (RS) and shows how these complementary technologies may be used to capture/derive, manipulate, integrate, analyse and display different forms of spatially-referenced environmental data. The module is highly vocational, with theory-based lectures complemented by hands-on practical sessions using state-of-the-art software (ArcGIS & ERDAS Imagine).

In addition to the subject specific aims, the module provides students with a range of generic skills to synthesise geographical data, develop suitable approaches to problem solving, undertake independent learning (including time management) and present the results of analysis in novel graphical formats.

#### Topics covered will include

• Geoinformatics: definitions, components and the nature of spatial data
• Principles of RS: physical basis, sensors, platforms and systems
• Applications of RS
• Principles of GIS
• Vector GIS
• Raster GIS and spatial modelling
• Geoinformatics project design

#### On successful completion of this module students will be able to

• Recognise fundamental principles and applications of GIS and Remote Sensing
• Appreciate the strong linkages between these disciplines and their fusion to create meaningful spatially-referenced environmental information
• Appraise current and future potential applications
• Use state-of-the-art software packages such as ArcGIS and ERDAS Imagine
• Demonstrate project management skills through completion of a geoinformatics project
• Identify and retrieve spatial data from a variety of different sources
• Visualise, analyse and interpret spatial data using simple and advanced approaches
• Conduct an independent piece of research

#### Bibliography

• DeMers, M.N., 2009. GIS for Dummies
• Heywood, I, Cornelius, S and Carver, S, 2011. An Introduction to Geographical Information Systems (4e). Pearson
• Lillesand, T.M., Kiefer, R.W. and Chipman, J.W, 2008. Remote Sensing and Image Interpretation (6e). Wiley
• Longley, P.A, Goodchild, M.F, Maguire, D.J and Rhind, D.W, 2011. Geographic Information Systems & Science (3e). Wiley
• Environmental Pathway: Approaches in Environmental Data Analytics
 Pathway: Environmental | Category: Statistics module, available to all | Credits: 15

This module introduces students to a range of techniques used in analysing and handling environmental data. The course will include how to discover and understand environmental datasets, how to store and manage them appropriately and how to use statistics to address research questions. Students will learn the importance of managing and analysing data in a robust and reproducible manner, and how to make data available to others for reuse. Students will also gain experience in interpreting and presenting the results of analyses through both written and oral assessments.

#### Topics covered include

• Introduction to environmental data
• Best practice in managing environmental datasets
• Data integration – common problems and solutions
• Principles of exploratory data analysis
• Use of regression and correlation with environmental data
• Reproducibility and data publication
• Introduction to final project
• Data management and analysis plans
• Data presentation and visualisation
• Writing skills and report structure

#### On successful completion of this module students will be able to

• Develop an appropriate plan to manage and analyse environmental datasets
• Integrate multiple environmental datasets to answer complex questions
• Conduct exploratory and inferential statistics to examine patterns in environmental data
• Produce a reproducible analysis of a dataset and demonstrate awareness of the importance of reproducibility
• Analyse a range of environmental datasets to answer environmental management questions using appropriate techniques to ensure correct management, analysis and interpretation of data
• Demonstrate a range of transferable skills including data management and basic statistics
• Explain the results of an environmental data analysis to a range of audiences orally and in writing

#### Bibliography

• Practical Statistics for Environmental and Biological Scientists, J. Townend. Published by Wiley (ISBN 9780471496656) 2002
• Statistics, D. Freedman, R. Pisani & R. Purves. Published by W. W. Norton & Company (ISBN 0393929728) 4th Revised Edition, 2014
• Data management for researchers: organise, maintain and share your data for research success, Briney, K. Published by Pelagic Publishing (ISBN 9781784270124) 2015
• Environmental Pathway: Modelling Environmental Processes
 Pathway: Environmental | Category: Statistics module, available to all | Credits: 15

This module provides an introduction to the basic principles of, and approaches to, computer-aided modelling of environmental processes, with applications to real environmental problems such as catchment modelling, pollutant dispersal in rivers and estuaries, and population dynamics. More generally, the module provides an introduction to dynamic systems modelling, including the role of uncertainty and data in the modelling process.

#### Topics covered include

• Introduction to modelling as a process and as evaluation of scientific hypotheses; approaches to modelling; the role of data and perceptions in the modelling process; the problems of badly defined systems in the context of modelling environmental processes; problems of scale (temporal and spatial) and uncertainty in quantifying environmental systems.
• The concept of dynamic system. First order linear systems, with the Nicholson blowfly dynamics and the Aggregated Dead Zone (ADZ) model of dispersion in a river used as practical case studies. Transfer function models, steady state gain and time constant; serial, parallel and feedback connections of first order systems. Block diagram analysis.
• Muskingum-Cunge, Lag and Route, and General Transfer Function models of flow in a river system
• Second order linear systems with the predator-prey equations and a climate model as practical examples; natural frequency and damping ratio; higher order systems.
• Linear vs. Nonlinear systems – basic introduction.
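
The steady state gain and time constant of a first order linear system, named in the topics above, can be illustrated with a small simulation. This Python sketch is not taken from the module: it assumes the generic form T·dx/dt = G·u − x with a unit step input, uses a simple Euler scheme, and the gain and time constant values are hypothetical.

```python
def first_order_step_response(gain, time_constant, t_end, dt=0.001):
    """Euler simulation of  T * dx/dt = G * u - x  for a unit step u = 1.

    Returns the output x at time t_end, starting from x(0) = 0.
    """
    if time_constant <= 0 or dt <= 0:
        raise ValueError("time_constant and dt must be positive")
    x = 0.0
    for _ in range(round(t_end / dt)):
        x += dt * (gain * 1.0 - x) / time_constant
    return x


# Hypothetical system: steady state gain G = 3, time constant T = 2.
G, T = 3.0, 2.0
x_at_T = first_order_step_response(G, T, t_end=T)         # about 63% of G
x_settled = first_order_step_response(G, T, t_end=10 * T)  # close to G
print(round(x_at_T / G, 2), round(x_settled, 2))
```

After one time constant the output has covered roughly 63% of the distance to its final value, and the final value equals the gain times the input, which is how the two parameters are usually read off a step response.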

Throughout the course, case studies and examples will be used to illustrate the material. Guest lecturers may be invited to contribute, depending on availability.

#### On successful completion of this module students will be able to

• Evaluate the principles and problems of computer aided modelling of environmental systems.
• Use contemporary industry standard numerical software for basic analysis and simulation of environmental systems.
• Communicate with mathematicians and numerical analysts in joint projects involving modelling.
• Understand the way in which simple mathematical concepts can be used to build models of environmental systems
• Undertake some simple modelling tasks, to analyse experimental data and interpret the modelling outcomes

#### Bibliography

The following texts may be useful if read with discretion.

• Young, P.C. (1993) Concise Encyclopaedia of Environmental Systems. Pergamon: Oxford (selected articles)*
• Young, P.C., Parkinson, S. and Lees, M.J. (1996) Simplicity out of complexity: Occam’s Razor revisited* Journal of Applied Statistics, 23, 165-210
• Young, P.C. Recursive Estimation and Time Series Analysis. An Introduction, Springer, 1984
• Bennett, R.J., Chorley, R.J. Environmental Systems, Philosophy, Analysis and Control, Methuen 1980*
• Hardisty, J. et al. Computerised Environmental Modelling, A practical introduction using Excel, Wiley, 1993
• Population Health Pathway: Principles of Epidemiology
 Pathway: Population Health | Category: Statistics module, available to all | Credits: 10

This course introduces the principles of epidemiology and the statistical methods applied in epidemiological studies. It also introduces important concepts related to study design and statistical modelling concepts such as confounding and mediation.

#### Topics covered will include

• The history of epidemiology and the role of statistics therein
• Measures of health and disease including incidence and prevalence
• Traditional approaches to controlling for confounding including matching and stratification
• Epidemiological study design including cohort studies, case-control studies, cross-sectional studies, ecological studies
• Making causal inferences in epidemiology including the use of directed acyclic graphs to describe confounding, collider bias, and mediation
• Properties of parameters such as odds ratios and risk ratio including collapsibility
• Critical appraisal of published epidemiological journal articles including an appreciation of their structure, and strengths and weaknesses

#### On successful completion students will be able to

• Define and calculate appropriate measures of disease incidence and prevalence
• Describe the key statistical issues in the design of ecological studies, case-control studies, cohort studies, and cross-sectional studies
• Discuss and implement strategies for dealing with confounding and mediation
• Define and estimate important parameters such as the risk difference, risk ratio, and odds ratio
• Discuss the strengths and weaknesses of a published epidemiological paper and summarise these for different audiences
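
The risk difference, risk ratio and odds ratio named in the outcomes above can be computed from a 2×2 table of exposure against disease. This Python sketch is illustrative only: the function name, the cell labelling and the counts are hypothetical, and it assumes no cell that appears in a denominator is zero.

```python
def two_by_two_measures(a, b, c, d):
    """Effect measures from a 2x2 table.

    a: exposed cases        b: exposed non-cases
    c: unexposed cases      d: unexposed non-cases
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return {
        "risk_difference": risk_exposed - risk_unexposed,
        "risk_ratio": risk_exposed / risk_unexposed,
        "odds_ratio": (a * d) / (b * c),
    }


# Hypothetical cohort: 30 of 100 exposed and 10 of 100 unexposed develop disease.
measures = two_by_two_measures(30, 70, 10, 90)
for name, value in measures.items():
    print(name, round(value, 3))
```

In this example the odds ratio is further from 1 than the risk ratio, which is typical when the outcome is not rare.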

#### Bibliography

• Clayton D. and Hills M. (1993) Statistical models in epidemiology. Oxford, Oxford University Press
• Rothman K.J., Greenland S. and Lash T.L. (2008) Modern Epidemiology. Lippincott, Williams and Wilkins, US
• Population Health Pathway: Environmental Epidemiology
 Pathway: Population Health | Category: Statistics module, requires Statistics Modules II | Credits: 10

This course aims to introduce students to statistical methods commonly used by epidemiologists and statisticians to investigate the relationship between risk of disease and environmental factors. Specifically, the course concentrates on studies with a spatial component. A number of published studies will be used to illustrate the methods described, and students will learn how to perform similar analyses using the R statistical package. By the end of the course, students should have an awareness of the methodology used in environmental epidemiology, including an appreciation of its limitations, and should be capable of carrying out a number of these analyses themselves.

#### Topics covered will include

• Introduction: Motivating examples for methods in course
• Spatial Point Processes: theory and methods for the analysis of point patterns in two-dimensional space
• Clustering of disease: case-control point-based methods and methods based on counts
• Spatial variation in risk: case-control and point-based methods; generalized additive models
• Disease mapping: investigating variation in risk with count data
• Geographical correlation studies: the ecological fallacy; relation with disease mapping
• Point source methods: Investigation of risk associated with distance from a point or line source, for point and count data
• Geostatistics: introduction to the analysis of geostatistical data. Kriging and spatial prediction

#### On successful completion students will be able to

• Define and give examples of spatial point processes; describe the first and second moments of a point process
• Define, estimate and calculate theoretical K functions for a spatial point process
• Test for spatial clustering of a point pattern using the K function
• Use generalised additive models to construct smooth maps of spatial variation in disease risk and interpret key model outputs
• Use Poisson regression to analyse area-level count data and interpret key model outputs
• Describe what is meant by the ecological fallacy
• Carry out simple analyses of case-control data in relation to a point source
• Describe Gaussian geostatistical models, including a Gaussian process random effect term
• Perform basic analyses of geostatistical data, define and interpret the variogram
• Recognise the difference between point process data, area-level data and geostatistical data
• Describe some practical issues involved in undertaking environmental epidemiology studies
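To make the K function mentioned above concrete, here is a minimal sketch of a naive estimator that ignores edge correction, compared with the value πr² expected under complete spatial randomness. The function name, the toy point pattern, and the unit-square study region are assumptions made for illustration; the module itself uses R, where dedicated packages apply proper edge corrections.

```python
import math


def naive_k(points, r, area):
    """Naive estimate of Ripley's K at distance r, with no edge correction.

    K_hat(r) = area / (n * (n - 1)) * number of ordered pairs (i, j), i != j,
    whose distance is at most r.
    """
    n = len(points)
    close_pairs = 0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i != j and math.hypot(xi - xj, yi - yj) <= r:
                close_pairs += 1
    return area * close_pairs / (n * (n - 1))


# Hypothetical pattern: two tight clusters inside a unit square.
toy_points = [
    (0.10, 0.10), (0.12, 0.11), (0.11, 0.13),
    (0.80, 0.80), (0.82, 0.81), (0.81, 0.83),
]
r = 0.05
k_hat = naive_k(toy_points, r, area=1.0)
k_csr = math.pi * r ** 2  # theoretical K under complete spatial randomness
print(f"K_hat = {k_hat:.3f}, CSR value = {k_csr:.4f}")
```

An estimate well above πr² at small distances suggests clustering; a formal test would compare the estimate with simulated patterns rather than with a single theoretical value.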

#### Bibliography

• P.J. Diggle. Statistical Analysis of Spatial Point Patterns (2nd edition). London: Edward Arnold. 2003
• P. Elliott, M. Martuzzi and G. Shaddick, Spatial statistical methods in environmental epidemiology: a critique. Statistical Methods in Medical Research, 4, 137-159, 1995
• P. Elliott, J. Wakefield, N. Best and D. Briggs (eds), Disease and Exposure Mapping. Oxford University Press, Oxford, 1999
• L. Waller and C.A. Gotway. Applied Spatial Statistics for Public Health Data. New York: Wiley, 2004
• Population Health Pathway: Modelling of Infectious Diseases
 Pathway Population Health Category Statistics module, available to all Credits 10 credits

This module aims to provide students with the necessary knowledge, and analytical and modelling skills to develop and fit mathematical transmission models to understand infection dynamics, explore interventions, and to inform control policy. It will also provide students with the ability to analyse outbreak information, and to implement transmission models using the R programming language. Students will gain experience of handling and linking epidemiological data relevant to infectious disease outbreaks. They will gain hands-on experience of developing transmission models, appropriate to a specific research question or epidemiological application, and of using those models for scenario exploration. Students will also gain experience in communicating and presenting epidemic models and their outputs.

#### Topics covered will include

• Construction of mathematical disease models appropriate to their purpose
• The differences between deterministic and stochastic infectious disease modelling frameworks
• The dynamical behaviour of infectious disease models
• Statistical inference using infectious disease models
• Analysis and critical interpretation of infectious disease data for outbreak analysis
• Communication of disease models and interpretation of their output
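As an illustration of a deterministic transmission model of the kind listed above, the sketch below steps a basic SIR (susceptible, infectious, recovered) model forward with a simple Euler scheme. The function name, the frequency-dependent transmission term, and all parameter values are assumptions for illustration only; the module itself works in R and may use different model structures and solvers.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Deterministic SIR model integrated with fixed Euler steps.

    beta  = transmission rate per day, gamma = recovery rate per day.
    Returns a list of (time, S, I, R) tuples.
    """
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r  # closed population: no births, deaths or migration
    trajectory = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        trajectory.append((step * dt, s, i, r))
    return trajectory


# Illustrative values: basic reproduction number beta / gamma = 3.
run = simulate_sir(beta=0.3, gamma=0.1, s0=999, i0=1, r0=0, days=160)
peak_time, _, peak_infectious, _ = max(run, key=lambda row: row[2])
print(f"Peak of about {peak_infectious:.0f} infectious at day {peak_time:.0f}")
```

A stochastic counterpart would replace the fixed flows with random draws of infection and recovery events, which matters most when the number of infectious individuals is small.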

#### On successful completion students will be able to

• Demonstrate a deep understanding of the role of mathematical modelling in epidemiology
• Take a critical approach to linking sources of epidemiological data required for infectious disease models
• Understand epidemiology of infectious disease and modelling literature
• Interpret modelling studies critically
• Take a responsible approach towards the use of mathematical modelling, and appreciate the ethical and social impacts of research and practice within this subject area

#### Bibliography

• Keeling MJ, Rohani P. Modelling Infectious Diseases in Humans and Animals. Princeton University Press. 2007
• Andersson H, Britton T. Stochastic Epidemic Models and their Statistical Analysis. Lecture Notes in Statistics. Springer. 2000
• Population Health Pathway: Survival and Event History Analysis
 Pathway Population Health Category Statistics module, requires Statistics Modules II Credits 10 credits

This course aims to describe the theory and to develop the practical skills required for the analysis of medical studies leading to the observation of survival times or multiple failure times. By the end of the course students should be able to carry out sophisticated analyses of this type, should be aware of the variety of statistical models and methods now available, and understand the nature and importance of the underlying model assumptions.

In many medical applications interest lies in times to or between events. Examples include time from diagnosis of cancer to death, or times between epileptic seizures. This advanced course begins with a review of standard approaches to the analysis of possibly censored survival data. Survival models and estimation procedures are reviewed, and emphasis is placed on the underlying assumptions, how these might be evaluated through diagnostic methods and how robust the primary conclusions might be to their violation.

The course closes with a description of models and methods for the treatment of multivariate survival data, such as repeated failures, the lifetimes of family members or competing risks. Stratified models, marginal models and frailty models are discussed.

#### Topics covered will include

• Survival data. Censoring. Survival, hazard and cumulative hazard functions. Kaplan-Meier plots. Parametric models and likelihood construction. Cox’s proportional hazards model, partial likelihood, Nelson-Aalen estimators. Survival time prediction
• Diagnostic methods. Schoenfeld and other residuals. Testing the proportional hazards assumption. Detecting changes in covariate effects
• Frailty models and effects. Identifiability and estimation. Competing risks. Marginal models for clustered survival data
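The Kaplan-Meier estimator named above can be sketched in a few lines. The function name and the small data set are hypothetical; in practice the module's analyses would be carried out with standard statistical software rather than hand-written code.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times  = observed follow-up times.
    events = 1 if the event was observed at that time, 0 if censored.
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    idx = 0
    while idx < len(data):
        t = data[idx][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing this time; those censored at t
        # still count as at risk at t.
        while idx < len(data) and data[idx][0] == t:
            deaths += data[idx][1]
            removed += 1
            idx += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Made-up data: five subjects, two of them censored (event = 0).
for t, s in kaplan_meier(times=[2, 3, 3, 5, 8], events=[1, 1, 0, 1, 0]):
    print(f"t = {t}: S(t) = {s:.2f}")
```

The survival estimate only drops at observed event times; censored observations reduce the number at risk without changing the curve at that point.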

#### On successful completion students will be able to

• Apply a range of appropriate statistical techniques to survival and event history data using statistical software
• Accurately interpret the output of statistical analyses using survival models fitted using standard software
• Construct and manipulate likelihood functions from parametric models for censored data
• Identify when particular models are appropriate through the application of diagnostic checks and model building strategies

#### Bibliography

• P. Hougaard, Analysis of Multivariate Survival Data. Springer, 2000
• T.M. Therneau and P.M. Grambsch, Modelling Survival Data: Extending the Cox Model. Springer, 2000
• T.R. Fleming and D.P. Harrington, Counting Processes and Survival Analysis. Wiley, 1991
• Population Health Pathway: Longitudinal Data Analysis
 Pathway Population Health Category Statistics module, requires Statistics Modules II Credits 10 credits

Longitudinal data arise when a time-sequence of measurements is made on a response variable for each of a number of subjects in an experiment or observational study. For example, a patient's blood pressure may be measured daily following administration of one of several medical treatments for hypertension. The practical objective of many longitudinal studies is to find out how the average value of the response varies over time, and how this average response profile is affected by different experimental treatments. This module presents an approach to the analysis of longitudinal data, based on statistical modelling and likelihood methods of parameter estimation and hypothesis testing.

The specific aim of this module is to teach students a modern approach to the analysis of longitudinal data. Upon completion of this course the students should have acquired, from lectures and practical classes, the ability to build statistical models for longitudinal data, and to draw valid conclusions from their models.

#### Topics covered will include

• What is longitudinal data?
• Exploratory and simple analysis strategies
• Normal linear model with correlated errors
• Linear mixed effects models
• Non-normal responses with GLMs
• Dealing with dropout

#### On successful completion students will be able to

• Explain the differences between longitudinal studies and cross-sectional studies
• Select appropriate techniques to explore data
• Compare different approaches to estimation and their usage in the analysis
• Build statistical models for longitudinal data and draw valid conclusions from them
• Express the problems arising in longitudinal studies in mathematical language
• Use computer packages in statistical modelling and analysis of longitudinal data
• Summarise the findings in writing and present them to a wider audience

#### Bibliography

• H. Brown and R. Prescott, Applied Mixed Models in Medicine, Wiley, 1999
• P.J. Diggle, P. Heagerty, K.Y. Liang and S.L. Zeger, Analysis of Longitudinal Data (second edition), Oxford University Press, 2002
• G.M. Fitzmaurice, N. M. Laird and J. H. Ware, Applied Longitudinal Analysis, Wiley Series in Probability and Statistics, 2004
• G. Verbeke and G. Molenberghs, Linear Mixed Models for Longitudinal Data, Springer, 2000
• R. E. Weiss, Modelling longitudinal data, Springer, 2005
• Bioinformatics Pathway: Principles of Epidemiology
 Pathway Bioinformatics Category Statistics module, available to all Credits 10 credits

This course introduces the principles of epidemiology and the statistical methods applied in epidemiological studies. It also introduces important concepts related to study design and statistical modelling concepts such as confounding and mediation.

#### Topics covered will include

• The history of epidemiology and the role of statistics therein
• Measures of health and disease including incidence and prevalence
• Traditional approaches to controlling for confounding including matching and stratification
• Epidemiological study design including cohort studies, case-control studies, cross-sectional studies, ecological studies
• Making causal inferences in epidemiology including the use of directed acyclic graphs to describe confounding, collider bias, and mediation
• Properties of parameters such as odds ratios and risk ratios, including collapsibility
• Critical appraisal of published epidemiological journal articles including an appreciation of their structure, and strengths and weaknesses

#### On successful completion students will be able to

• Define and calculate appropriate measures of disease incidence and prevalence
• Describe the key statistical issues in the design of ecological studies, case-control studies, cohort studies, and cross-sectional studies
• Discuss and implement strategies for dealing with confounding and mediation
• Define and estimate important parameters such as the risk difference, risk ratio, and odds ratio
• Discuss the strengths and weaknesses of a published epidemiological paper and summarise these for different audiences

#### Bibliography

• Clayton D. and Hills M. (1993) Statistical models in epidemiology. Oxford, Oxford University Press
• Rothman K.J., Greenland S. and Lash T.L. (2008) Modern Epidemiology. Lippincott, Williams and Wilkins, US
• Bioinformatics Pathway: Design and Analysis of Clinical Trials
 Pathway Bioinformatics Category Statistics module, available to all Credits 10 credits

This course aims to introduce students to aspects of statistics that are important in the design and analysis of clinical trials.

Clinical trials are planned experiments on human beings designed to assess the relative benefits of one or more forms of treatment. For instance, we might be interested in studying whether aspirin reduces the incidence of pregnancy-induced hypertension; or we may wish to assess whether a new immunosuppressive drug improves the survival rate of transplant recipients. On completion of the module students should understand the basic elements of clinical trials, be able to recognise and use principles of good study design, and be able to analyse and interpret study results to make correct scientific inferences.

#### Topics covered will include

• Clinical trials fundamentals: trial terminology, principles of sound study design and ethics
• Defining and estimating treatment effects: continuous and binary data
• Crossover trials: motivation, design issues and analyses
• Sample size determination: continuous and binary data
• Equivalence and non-inferiority trials
• Systematic reviews and meta-analysis
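As an example of sample size determination for continuous data, the sketch below applies the common normal-approximation formula for a two-arm parallel trial with equal allocation, n per arm = 2σ²(z₁₋α/₂ + z₁₋β)² / δ². The function name and the input values are assumptions for illustration; the module may teach other formulae or refinements (for example, t-based corrections).

```python
import math
from statistics import NormalDist


def sample_size_per_arm(sigma, delta, alpha=0.05, power=0.9):
    """Approximate patients per arm to detect a mean difference delta.

    sigma = common standard deviation of the outcome.
    Uses a two-sided test at level alpha with a normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2
    return math.ceil(n)  # round up to a whole patient


# Illustrative inputs: SD of 10 units, clinically relevant difference of 5 units.
print(sample_size_per_arm(sigma=10, delta=5, power=0.9))
print(sample_size_per_arm(sigma=10, delta=5, power=0.8))
```

Halving the difference to be detected roughly quadruples the required sample size, since δ enters the formula squared.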

#### On successful completion students will be able to

• Understand the basic elements of clinical trials
• Recognise and use principles of good study design, and be able to analyse and interpret study results to make correct scientific inferences
• Determine the different approaches that can be taken in addressing clinical questions related to the effectiveness of treatments and other types of interventions

#### Bibliography

• D.G. Altman, Practical Statistics for Medical Research, Chapman and Hall, 1991
• S. Senn, Cross-over trials in clinical research, Wiley, 1993
• S. Piantadosi, Clinical Trials: A Methodologic Perspective, John Wiley & Sons, 1997
• ICH Harmonised Tripartite Guidelines
• J.N.S. Matthews, Introduction to Randomised Controlled Clinical Trials, Arnold, 2000
• Bioinformatics Pathway: Bioinformatics
 Pathway Bioinformatics Category Statistics module, available to all Credits 15 credits

This course will equip students with a working knowledge of the main themes in bioinformatics. On successful completion, students should be confident and competent in all aspects of bioinformatics that can be executed via the web or on software running on Windows/Mac systems. They will have an understanding of the theoretical algorithms that underpin the various software applications that they use, and be able to perform bioinformatics within their own biological sub-field. More generally, this module also aims to encourage students to access and evaluate information from a variety of sources and to communicate the principles in a way that is well-organised, topical and recognises the limits of current hypotheses. It also aims to equip students with practical techniques including data collection, analysis and interpretation.

#### Topics covered will include

• Reading lists and how to manage reading. Doing a PubMed search
• The foundations of Bioinformatics
• Advanced bioinformatics I: Going deeper into algorithms
• Advanced bioinformatics II: Structural bioinformatics
• Advanced bioinformatics III: Phylogenetics: how do we use sequences to investigate evolution?
• Advanced bioinformatics IV: Detecting natural selection
• Advanced bioinformatics V: Processing deep sequencing data

#### On successful completion students will be able to

• Perform bioinformatics via the web (GenBank, Pfam, UniProt, PDB, SCOP). Use Artemis for genome visualisation
• Download and align sequences, curate sequences, derive statistics on alignments. Use DnaSP for sliding window analysis
• Build phylogenetic trees in MEGA. Use SimPlot for recombination analysis
• Perform structural bioinformatics, including homology modelling via SWISS-MODEL. Use a protein sequence viewer. Use Galaxy for deep sequence assembly
• Build a Bayesian phylogenetic tree with BEAST

#### Bibliography

• Michael Agostino. Practical Bioinformatics. Garland Science. ISBN 978-0-8153-4456-8
• Arthur M Lesk. Introduction to Bioinformatics 4th ed. Oxford Univ Press ISBN 978-0-19-965156-6
• Paul H Dear (ed) Bioinformatics. Scion ISBN 978-1-90-484216-3
• Masatoshi Nei & Sudhir Kumar Molecular evolution and phylogenetics (Available on http://lib.myilibrary.com/Open.aspx?id=83437)
• Drummond AJ and Bouckaert RR. Bayesian Evolutionary Analysis with BEAST. Cambridge University Press ISBN 978-1107019652
• Bioinformatics Pathway: Statistical Genetics and Genomics
 Pathway Bioinformatics Category Statistics module, available to all Credits 10 credits

This module will give students a working knowledge of recent statistical approaches for analysing modern genomic and genetic datasets. Students will learn about significance testing for genetic variants using logistic regression, multiple testing correction using strategies such as Bonferroni and false discovery rate control, quantification of gene expression in RNA-seq data using expectation-maximisation to resolve ambiguous isoforms, differential expression testing using a negative binomial model, and Bayesian network models for gene regulation.
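To illustrate the difference between the two multiple testing strategies mentioned above, the following sketch applies a Bonferroni correction and the Benjamini-Hochberg step-up procedure (one standard way of controlling the false discovery rate) to a small set of p-values. The function names and the p-values are made up for illustration and are not taken from the module.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis when its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Find the largest rank k with p_(k) <= alpha * k / m and reject the
    hypotheses with the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            largest_rank = rank
    rejected = [False] * m
    for idx in order[:largest_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for five genetic variants.
p_values = [0.001, 0.012, 0.025, 0.039, 0.20]
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

With these made-up values Bonferroni rejects only the smallest p-value, whereas Benjamini-Hochberg rejects the four smallest, reflecting the less stringent error criterion it controls.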

#### Topics covered will include

• Introduction to Molecular Biology
• Introduction to Human Genetics Studies
• Genome-wide association studies (QC, analysis, multiple testing correction, population stratification)
• RNA-Seq gene expression analysis
• Differential Gene Expression
• Statistical Models for gene regulation

#### On successful completion students will be able to

• Discuss the key aspects of genetics and genomics
• Define the statistical challenges in the analysis of genetics and genomics data
• Explain Genome-Wide Association Studies (GWAS) and how to find trait markers
• Perform a GWAS analysis and assess the significance of identified risk variants
• Identify differentially expressed genes in RNA-seq gene expression data
• Sketch the process of gene regulation and model it using statistical tools
• Understand the kinds of methods used in statistical genomics and genetics, including their limitations
• Analyse complex genetic and genomic datasets using statistical programming packages
• Perform a literature survey of statistical applications to a novel scientific field

#### Bibliography

• Computing Pathway: Systems Architecture and Integration
 Pathway Computing Category Computing module, available to all Credits 15 credits

In this module we explore the architectural approaches, techniques and technologies that underpin today's global IT infrastructure and particularly large-scale enterprise IT systems. It is one of two complementary modules that comprise the Systems stream of the Data Science MSc, which together provide a broad knowledge and context of systems architecture enabling students to assess new systems technologies, to know where technologies fit in the larger scheme of enterprise systems and state of the art research thinking, and to know what to read to go deeper.

The principal ethos of the module is to focus on the principles, emergent properties and application of systems elements as used in large-scale and high performance systems. Detailed case studies and invited industrial speakers will be used to provide supporting real-world context and a basis for interactive seminar discussions.

#### Topics to be covered will include

• Systems of systems composition
• Scalability concerns
• Systems integration/interoperability
• Software and Infrastructure as a Service (i.e. Cloud computing principles)

Supported by a consideration of emerging issues and implications arising from these new technologies:

• Commercial considerations
• Legal and ethical considerations
• New development and support paradigms, including open sourcing

In addition to the discussion- and seminar-led aspects of the course, we envisage ‘hands-on’ measurement-based coursework that looks empirically at the scalability of a significant technology, e.g. a cloud system such as Amazon EC2.

#### On successful completion of this module students will

• Demonstrate a deep understanding of the architectures and approaches for large-scale systems implementation
• Describe and critically evaluate techniques and paradigms used within enterprise-scale IT systems
• Understand and appreciate the trade-offs, strengths and limitations of systems architectures in principle and practice in modern IT systems
• Computing Pathway: Elements of Distributed Systems
 Pathway Computing Category Computing module, available to all Credits 15 credits

Distributed artificial intelligence is fundamental in contemporary data analysis. Large volumes of data and computation call for multiple computers in problem solving. Being able to understand and use those resources efficiently is an important skill for a data scientist. A distributed approach is also important for fault tolerance and robustness, as the loss of a single component must not significantly compromise the whole system. Additionally, contemporary and future distributed systems go beyond computer clusters and networks. Distributed systems often comprise multiple agents (software, humans and/or robots) that all interact in problem solving. As data scientists, we may have control of the full distributed system, or of only one piece, in which case we have to decide how it must behave in the face of others in order to accomplish our goals.

Therefore, a strong data scientist must go beyond "passive" data analysis. Even a very accurate classification may become useless if it does not lead to high-performing decisions in actual problems. It is fundamental to use data to create systems that are able to behave in an intelligent manner, considering the presence of multiple actors, which may or may not be cooperative with our system. The "data" may be historical information stored in files or databases, as in classical machine learning; or it might be arriving continuously in an "on-line" way; or it might even be the system's own experience. All of this must be used in the creation of intelligent systems.

Hence, in this module we will study how to use multiple agents for creating powerful machine learning systems. Furthermore, we will go beyond data classification, and will study how to take intelligent decisions autonomously given the presence of multiple actors, whether they are cooperative or not.

#### Topics to be covered will include

• Fundamental concepts of multi-agent systems
• Local coordination rules and emergence
• Ensemble Systems
• Decision Theory and Game Theory
• On-line learning
• Multi-agent Reinforcement Learning

#### On successful completion of this module students will be able to

• Understand the difference between single and multi-agent artificial intelligence, including the advantages and challenges of distribution
• Use a computer cluster for experimental work and data analysis
• Solve problems by using loose control, where local individual behaviour leads to complex self-organised systems
• Design systems that intelligently interact with others -- including those outside their control
• Design systems that learn from their own experience in an on-line way
• Improve classification/prediction performance by intelligently using multiple algorithms
• Read and critique research papers

#### Bibliography

The bibliography consists of research papers and course notes, which will be available during the course.