Course Structure

You will learn core skills in the first term, while the second and third terms allow you to shape the degree according to your interests and background.

You will begin with a series of core modules that are studied by all students. These core modules are augmented with other modules depending on your academic background. You can then further tailor the course with Specialism modules - Computing and Statistical Inference - each with its own range of designated pathways.

We encourage you to conclude your studies with a 3-month placement project with an external organisation. These generally attract a stipend of £3,000 and will provide you with the professional experience to stand out from the crowd.

If you would like to apply for one of our Data Science Master's degrees, you need to use Lancaster University's My Applications website.

Term 1: Core Modules

Term 1 provides core data science knowledge and skills training and is divided into five study modules, worth a total of 75 credits - 15 credits per module. You will study three compulsory Common Core data science modules, together with two Core statistics modules that are Specialism-specific and tailored to your academic background - Statistics I or Statistics II.

Statistics Modules I are for students with a degree in Mathematics and/or Statistics. Statistics Modules II are for students with A-level Mathematics or equivalent.

  • Common Core Modules: Data Science Fundamentals SCC460

    This module teaches students how data science is performed within academia and industry (via invited talks), research methods and how different research strategies are applied across disciplines, and data science techniques for processing and analysing data. Students will engage in group project work, based on project briefs provided by industrial speakers, within multi-skilled teams (e.g. computing students, statistics students, environmental science students) in order to apply their data science skills to researching and solving an industrial data science problem.

    Topics covered will include

    • The role of the data scientist and the evolving epistemology of data science
    • The language of research, how to form research questions, writing literature reviews, and variance of research strategies across disciplines
    • Ethics surrounding data collection and re-sharing, and unwanted inferences
    • Identifying potential data sources and the data acquisition processes
    • Defining and quantifying biases, and data preparation (e.g. cleaning, standardisation, etc.)
    • Choosing a potential model for data, understanding model requirements and constraints, specifying model properties a priori, and fitting models
    • Inspection of data and results using plots, and hypothesis and significance tests
    • Writing up and presenting findings

    Learning

    Students will learn through a series of group exercises around research studies and projects related to data science topics. Invited speakers from industry will describe how data science skills are applied to real problems in industry and academia. Students will gain knowledge of:

    • Defining a research question and a hypothesis to be tested, and choosing an appropriate research strategy to test that hypothesis
    • Analysing datasets provided in heterogeneous forms using a range of statistical techniques
    • How to relate potential data sources to a given research question, acquire such data and integrate it together
    • Designing and performing appropriate experiments given a research question
    • Implementing appropriate models for experiments and ensuring that the model is tested in the correct manner
    • Analysing experimental findings and relating these findings back to the original research goal

    Recommended texts and other learning resources

    • O'Neil, C. and Schutt, R. (2013) Doing Data Science: Straight Talk from the Frontline. O'Reilly
    • Trochim, W. (2006) The Research Methods Knowledge Base. Cengage Learning
  • Common Core Modules: Programming for Data Scientists SCC461

    This module is designed both for students who are completely new to programming and for experienced programmers, bringing both groups to a highly skilled level at which they can handle complex data science problems. Beginner students will learn the fundamentals of programming, while experienced students will have the opportunity to sharpen and further develop their programming skills. Students will learn data-processing techniques, including visualisation and statistical data analysis. To provide a broad foundation for handling the most complex data science tasks, we will also cover problem solving and the development of graphical applications.

    In particular, students will gain experience with two very important open source languages: R and Python. R is a leading language for statistical analysis, widely applied in academia and industry to handle a variety of different problems. Being able to program in R gives data scientists access to the best and most up-to-date libraries for handling a variety of classical and state-of-the-art statistical methods. Python, on the other hand, is a general-purpose programming language, also widely used for three main reasons: it is easy to learn, being recommended as a "first" programming language; it allows easy and quick development of applications; and it has a great variety of useful and open libraries. For those reasons, Python has also been widely applied to scientific computing and data analysis. Additionally, Python enables the data scientist to easily develop other kinds of useful applications: for example, searching for optimal decisions given a dataset, graphical applications for data gathering, or even programming Raspberry Pi devices to create sensors or robots for data collection. Therefore, learning these two languages will not only enable students to develop programming skills, but will also give them direct access to two fundamental languages for contemporary data analysis, scientific computing, and general programming.
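
    A minimal R sketch of the style of data processing taught, using the built-in mtcars dataset (chosen here purely for illustration):

    ```r
    # Summarise and visualise a single variable from a built-in dataset
    summary(mtcars$mpg)                  # basic descriptive statistics
    hist(mtcars$mpg,
         main = "Fuel economy of 32 cars",
         xlab = "Miles per gallon")
    ```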

    Additionally, students will gain experience by working through exercise tasks and discussing their work with their peers, thereby fostering interpersonal communication skills. Students who are new to programming will find help from their experienced peers, and experienced programmers will learn how to assist beginners and explain the fundamental concepts to them.

    Topics covered will include

    • Fundamental programming concepts (statements, variables, functions, loops, etc)
    • Data abstraction (modules, classes, objects, etc)
    • Problem-solving
    • Using libraries for developing applications (e.g., SciPy, PyGames)
    • Performing statistical analysis and data visualisation

    On successful completion of this module students will be able to

    • Solve data science problems in an automatic fashion
    • Handle complex data-sets, which cannot be easily analysed "by hand"
    • Use existing libraries and/or develop their own libraries
    • Learn new programming languages, given the background knowledge of two important ones

  • Common Core Modules: Data Mining SCC403

    This module will provide comprehensive coverage of the problems related to data representation, storage, manipulation, retrieval and processing in terms of extracting information from data. It has been designed to provide a fundamental theoretical level of knowledge, and skills in the related laboratory sessions, for this specific aspect of Data Science, which plays an important role in any system and application. In this way it prepares students for the second module on the topic of data as well as for their projects.

    Topics to be covered will include

    • Data Primer: Setting the scene: Big Data, Cloud Computing; The time, storage and computing power compromise: off-line versus on-line
    • Data Representations
    • Storage Paradigms
    • Vector-space models
    • Hierarchical clustering (agglomerative/divisive)
    • k-means (see the sketch after this list)
    • SQL and Relational Data Structures (short refresher)
    • NoSQL: Document stores, graph databases
    • Inference and reasoning
    • Associative and Fuzzy Rules
    • Inference mechanisms
    • Data Processing
    • Clustering
    • Density-based, on-line, evolving
    • Classification
    • Randomness and determinism, frequentist and belief based approaches, probability density, recursive density estimation, averages and moments, important random signals, response of linear systems to random signals, random signal models
    • Discriminative (Linear Discriminant Analysis, Single Perceptron, Multi-layer Perceptron, Learning Vector Classifier, Support Vector Machines), Generative (Naive Bayes)
    • Supervised and unsupervised learning, online and offline systems, adaptive and evolving systems, evolving versus evolutionary systems, normalisation and standardisation
    • Fuzzy Rule-based Classifiers, Regression- or Label-based Classifiers
    • Self-learning Classifiers, evolving Classifiers, dynamic data space partitioning using evolving clustering and data clouds, monitoring the quality of the self-learning system online, evolving multi-model predictive systems
    • Semi-supervised Learning (Self-learning, evolving, Bootstrapping, Expectation-Maximisation, ensemble classifiers)
    • Information Extraction vs Retrieval
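
    As an illustration of the clustering topics above, here is a minimal k-means sketch in base R; the iris dataset and the choice of three clusters are assumptions made purely for this example:

    ```r
    # k-means on the scaled iris measurements; k = 3 is assumed for illustration
    set.seed(1)
    km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
    table(cluster = km$cluster, species = iris$Species)  # compare clusters to known labels
    ```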

    On successful completion of this module students will

    • Demonstrate understanding of the concepts and specific methodologies for data representation and processing and their applications to practical problems
    • Analyse and synthesise effective methods and algorithms for data representation and processing
    • Develop software scripts that implement advanced data representation and processing and demonstrate their impact on performance
    • List, explain and generalise the trade-offs of performance and complexity in designing practical solutions for problems of data representation and processing in terms of storage, time and computing power
  • Statistics Modules I: Statistical Methods and Modelling CFAS440

    The aim of this module is to address the fundamentals of statistics for those who do not have a mathematics and statistics undergraduate degree. Building upon the pre-learning 'mathematics for statistics' module, it is delivered over five weeks via a series of lectures and practicals. Students will develop an understanding of the theory behind core statistical topics: sampling, hypothesis testing, and modelling. They will also put this knowledge into practice by applying it to real data to address research questions.

    The module is an amalgamation of three short courses and is taught in weeks 1-5.
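
    To give a flavour of this style of analysis, here is a minimal sketch in base R of a hypothesis test and a logistic regression, using built-in datasets chosen purely for illustration:

    ```r
    # Two-sample t-test for a difference in means (built-in sleep data)
    t.test(extra ~ group, data = sleep)

    # Logistic regression for a binary response (built-in infert data)
    fit <- glm(case ~ spontaneous + induced, family = binomial, data = infert)
    summary(fit)
    ```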

    Topics covered will include

    • Statistical Methods; commonly used probability distributions, parameter estimation, sampling variability, hypothesis testing, basic measures of bivariate relationships
    • Generalised Linear Models; the general linear model and the least-squares method, logistic regression for binary responses, Poisson regression for count data. More broadly, how to build a flexible linear predictor to capture relationships of interest
    • These short courses are supported by tutorial sessions and office hours

    On successful completion students will be able to

    • Comprehend the mathematical notation used in explaining probability and statistics
    • Demonstrate knowledge of basic principles in probability, statistical distributions, sampling and estimation
    • Make decisions on the appropriate way to test hypotheses, carry out the tests and interpret the results
    • Demonstrate knowledge of the general linear model, the least-squares method of estimation, and the linear predictor, as well as the extensions to generalised linear models for discrete data
    • Decide on the appropriate way to statistically address a research question
    • Carry out said statistical analyses, assessing model results and performance
    • Report their findings in context

    Assessment

    There will be three pieces of coursework:

    • One assessment for Statistical Methods; assessing understanding and application of statistical concepts, and interpretation of results from hypothesis testing
    • Two independently produced reports for Generalised Linear Models; centred on in-depth statistical analyses

    Bibliography

    • Upton, G., & Cook, I. (1996). Understanding statistics. Oxford University Press
    • Rice, J. (2006). Mathematical statistics and data analysis. Cengage Learning
    • Dobson, A. J., & Barnett, A. G. (2008). An Introduction to Generalized Linear Models. CRC Press
    • Fox, J. (2008). Applied regression analysis and generalized linear models. Sage Publications
  • Statistics Modules I: Statistical Inference CFAS406

    This module aims to provide an in-depth understanding of statistics as a general approach to the problem of making valid inferences about relationships from observational and experimental studies. The emphasis will be on the principle of Maximum Likelihood as a unifying theory for estimating parameters. The module is delivered as a combination of lectures and practicals over four weeks.
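
    As a small illustration of maximum likelihood in practice, the sketch below numerically maximises a normal log-likelihood in base R; the simulated data and the log-scale parameterisation of the standard deviation are assumptions made for the example:

    ```r
    # Negative log-likelihood for i.i.d. normal data (sd on the log scale)
    nll <- function(theta, x) {
      -sum(dnorm(x, mean = theta[1], sd = exp(theta[2]), log = TRUE))
    }

    set.seed(1)
    x <- rnorm(100, mean = 5, sd = 2)                  # simulated data for illustration
    fit <- optim(c(0, 0), nll, x = x, hessian = TRUE)  # numerical maximisation
    fit$par                                            # MLEs of the mean and log-sd
    sqrt(diag(solve(fit$hessian)))                     # asymptotic standard errors
    ```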

    Topics covered will include

    • Revision of probability theory and parametric statistical models
    • The properties of statistical hypothesis tests, statistical estimation and sampling distributions
    • Maximum Likelihood Estimation of model parameters
    • Asymptotic distributions of the maximum likelihood estimator and associated statistics for use in hypothesis testing
    • Application of likelihood inference to simple statistical analyses including linear regression and contingency tables

    Learning

    Students will learn by applying the concepts and techniques covered in the module to real data sets. Students will be encouraged to examine issues of substantive interest in these studies. Students will acquire knowledge of:

    • Application of likelihood inference to simple statistical analyses including linear regression
    • The basic principles of probability theory
    • Maximum Likelihood as a theory for estimation and inference
    • The application of the methodology to hypothesis testing for model parameters

    Students will, more generally, develop skills to

    • apply theoretical concepts
    • identify and solve problems

    Bibliography

    • Dobson, A. J. (1983). An Introduction to Statistical Modelling. Chapman and Hall
    • Eliason, S. R. (1993). Maximum Likelihood Estimation: Logic and Practice. Sage Publications
    • Pickles, A. (1984). An introduction to likelihood analysis. Geo Books
    • Pawitan, Y. (2001). In all likelihood: statistical modelling and inference using likelihood. Oxford University Press.
  • Statistics Modules II: Generalised Linear Models MATH552

    Generalised linear models are now among the most frequently used statistical tools of the applied statistician. They extend the ideas of regression analysis to a wider class of problems involving the relationship between a response and one or more explanatory variables. In this course we aim to discuss applications of generalised linear models to a diverse range of practical problems, involving data from areas such as biology, the social sciences and time series, and to explore the theoretical basis of these models.
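
    For illustration, a Poisson GLM can be fitted in R with the glm function; the small count dataset below is the randomised-trial example from R's own glm documentation (after Dobson), used here purely as a sketch:

    ```r
    # Poisson regression for counts
    counts    <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
    outcome   <- gl(3, 1, 9)
    treatment <- gl(3, 3)
    fit <- glm(counts ~ outcome + treatment, family = poisson())
    summary(fit)               # coefficients, deviance, degrees of freedom
    anova(fit, test = "Chisq") # deviance-based model comparison
    ```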

    Topics covered will include

    • We introduce a large family of models, called the generalised linear models (GLMs), that includes the standard linear regression model as a special case and we discuss the theory and application of these models
    • We learn the iteratively reweighted least squares algorithm for the estimation of parameters
    • Formulation of sensible models for the relationship between a response and one or more explanatory variables, taking account of the motivation for data collection
    • We fit and check these models with the statistical package R; produce confidence intervals and tests corresponding to questions of interest; and state conclusions in everyday language

    On successful completion students will be able to

    • Define the components of GLM
    • Express standard models (Gaussian (Normal), Poisson,…) in GLM form
    • Derive relationships between mean and variance and parameters of an exponential family distribution
    • Specify design matrices for given problems
    • Define and interpret model deviance and degrees of freedom
    • Use model deviances to assist in model selection
    • Define deviance and Pearson residuals, and understand how to use them for checking model quality
    • Use R to fit standard (and appropriate) GLMs to data
    • Understand and interpret R output for model selection and diagnosis, and draw appropriate scientific conclusions

    Bibliography

    • P. McCullagh and J. Nelder. Generalized Linear Models, Chapman and Hall, 1999
    • A.J. Dobson, An Introduction to Generalised Linear Models, Chapman and Hall, 1990
  • Statistics Modules II: Likelihood Inference MATH551

    This course considers the idea of statistical models and how the likelihood function, defined to be the probability of the observed data viewed as a function of unknown model parameters, can be used to make inference about those parameters. This inference includes both estimates of the values of these parameters, and measures of the uncertainty surrounding these estimates. We consider single and multi-parameter models, and models which do not assume the data are independent and identically distributed. We also cover computational aspects of likelihood inference that are required in many practical applications, including numerical optimization of likelihood functions and bootstrap methods to estimate uncertainty.
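
    As a sketch of the computational side, the base R example below computes the maximum likelihood estimate of an exponential rate and a percentile bootstrap confidence interval; the simulated data are an assumption made for illustration:

    ```r
    set.seed(1)
    x <- rexp(50, rate = 2)  # simulated data; the true rate is 2
    1 / mean(x)              # MLE of the exponential rate

    # Nonparametric bootstrap: resample the data and recompute the estimate
    boot <- replicate(2000, 1 / mean(sample(x, replace = TRUE)))
    quantile(boot, c(0.025, 0.975))  # 95% percentile bootstrap confidence interval
    ```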

    Topics covered will include

    • Definition of the likelihood function for single and multi-parameter models, and how it is used to calculate point estimates (maximum likelihood estimates)
    • Asymptotic distribution of the maximum likelihood estimator, and the profile deviance, and how these are used to quantify uncertainty in estimates
    • Inter-relationships between parameters, and the definition and use of orthogonality
    • Generalised likelihood ratio statistics, and their use for hypothesis tests
    • Calculating likelihood functions for non-IID models
    • Use of computational methods in R to calculate maximum likelihood estimates and confidence intervals; perform hypothesis tests and calculate bootstrap confidence intervals

    On successful completion students will be able to

    • Understand how to construct statistical models for simple applications
    • Appreciate how information about the unknown parameters is obtained and summarized via the likelihood function
    • Calculate the likelihood function for basic statistical models
    • Evaluate point estimates and make statements about the variability of these estimates
    • Understand the inter-relationships between parameters, and the concept of orthogonality
    • Perform hypothesis tests using the generalised likelihood ratio statistic
    • Use computational methods to calculate maximum likelihood estimates
    • Use computational methods to construct both likelihood-based and bootstrapped confidence intervals, and perform hypothesis tests

    Bibliography

    • A Azzalini. Statistical Inference: Based on the Likelihood. Chapman and Hall. 1996
    • D R Cox and D V Hinkley. Theoretical Statistics. Chapman and Hall. 1974
    • Y Pawitan. In All Likelihood: Statistical Modeling and Inference Using Likelihood. OUP. 2001

Term 2: Optional Modules

Term 2 allows for further specialisation and application in areas in which there is a considerable demand for data scientists. You will study one module (15 credits) strengthening the foundations for your Specialism modules (Statistical Inference or Computing) together with a number of elective modules (30 credits) which can be chosen to form designated pathways:

  • Business Intelligence
  • Bioinformatics
  • Population Health
  • Environmental
  • Societal
  • Computing

The available Specialism and Elective modules (listed below) build upon the Core set and have a prerequisite level of skills and knowledge.

Specialism Modules

In the second term, you will build on your specialism with one of the following modules (15 credits each).

  • Bayesian Inference for Data Science MATH555
    Category Statistics module, requires Statistics Modules II

    This module aims to introduce the Bayesian view of statistics, stressing its philosophical contrasts with classical statistics, its facility for including information other than the data into the analysis, and its coherent approach towards inference and model selection. The module will also introduce students to MCMC (Markov chain Monte Carlo), a computationally intensive method for efficiently applying Bayesian methods to complex models. By the end of the course students should be able to formulate an appropriate prior for a variety of problems; calculate, simulate from and interpret the posterior and the predictive distribution, with or without MCMC as appropriate; and carry out Bayesian model selection using the marginal likelihood. Students should be able to carry out all of this using the programming language R.
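
    As a minimal sketch of the MCMC ideas in base R, the random-walk Metropolis sampler below targets the posterior of a binomial success probability under a flat prior; the data (7 successes in 10 trials) and the proposal scale are assumptions made for illustration:

    ```r
    # Log-posterior for p given y successes in n trials, flat Beta(1, 1) prior
    log_post <- function(p, y = 7, n = 10) {
      if (p <= 0 || p >= 1) return(-Inf)
      dbinom(y, n, p, log = TRUE)
    }

    set.seed(1)
    iters <- 10000
    p <- numeric(iters)
    p[1] <- 0.5
    for (i in 2:iters) {
      prop <- p[i - 1] + rnorm(1, sd = 0.1)  # random-walk proposal
      if (log(runif(1)) < log_post(prop) - log_post(p[i - 1])) {
        p[i] <- prop                         # accept
      } else {
        p[i] <- p[i - 1]                     # reject: stay put
      }
    }
    mean(p[-(1:1000)])  # posterior mean after burn-in (exact answer: 8/12)
    ```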

    Topics covered will include

    • Inference by updating belief
    • The ingredients of Bayesian inference: the prior, the likelihood, the posterior, the predictive and the marginal distribution
    • Methods for formulating the prior
    • Conjugate priors for single parameter models
    • Normal distribution, known and unknown variance, regression
    • Sampling for the posterior and predictive distributions
    • Model checking and model selection
    • Gibbs sampling, data augmentation, hierarchical models
    • The Metropolis-Hastings algorithm, random walk Metropolis, independence sampler

    On successful completion students will be able to

    • Understand the Bayesian statistical framework and its philosophy
    • Demonstrate knowledge of key concepts: the prior, the likelihood, the posterior, the predictive and the marginal distribution
    • Calculate, simulate from and interpret the posterior and the predictive distribution
    • Construct an MCMC algorithm for a variety of statistical models and implement them in R

    Bibliography

    • Hoff, P. (2009) A First Course in Bayesian Statistical Methods. Springer
    • Gamerman, D. and Lopes, H.F. (2006) Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. 2nd Edition. Chapman and Hall
    • Gilks, W.R., Richardson, S. and Spiegelhalter, D. (1996) Markov chain Monte Carlo in Practice. Chapman and Hall
  • Statistical Learning CFAS420
    Category Statistics module, available to all

    The module will provide students with the statistical tools needed to understand the analysis of large data sets, and the statistical background to such tools. It will seek to integrate the various methods used in such analysis into a common modelling framework. An important part of the course focuses on interpretation and on communicating the results to others.
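
    A minimal base R sketch of the sample-splitting idea, fitting a simple classifier on a training sample and assessing it on a validation sample; the iris data and the chosen predictors are assumptions made purely for illustration:

    ```r
    set.seed(1)
    idx   <- sample(nrow(iris), 0.7 * nrow(iris))  # 70/30 training/validation split
    train <- iris[idx, ]
    valid <- iris[-idx, ]

    # Logistic classifier for one species against the rest
    fit <- glm(I(Species == "virginica") ~ Sepal.Length + Petal.Width,
               family = binomial, data = train)

    # Assess predictive performance on the validation sample only
    p <- predict(fit, newdata = valid, type = "response")
    table(predicted = p > 0.5, actual = valid$Species == "virginica")
    ```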

    Topics covered will include

    • Introduction to statistical learning; problems of missing data, biased samples and recency
    • Statistical significance and big data
    • Sample splitting. Calibration, training and validation samples. Entropy and likelihood
    • Unsupervised learning: K-means, PAM and CLARA for big data. Mixture models. Latent class analysis
    • Variable reduction methods and variable selection. The Lasso
    • Classification methods: logistic and multinomial logistic models. Probability cutoffs; the ROC curve; sensitivity and specificity
    • Classification methods: regression trees, random forests and boosted trees
    • Classification methods: neural networks as generalised linear modelling extensions
    • Classification methods: other methods (PRIM)
    • Smoothing models through GAMs
    • Bayesian networks

    Learning

    Students will learn through the application of concepts and techniques covered in the module to real data sets. Students will be encouraged to examine issues of substantive interest in these studies.

    On successful completion students will be able to

    • Understand the need for a statistical basis for data analytics
    • Appreciate that different terminologies used in different data analytic technologies can be integrated through statistical modelling concepts and the idea of likelihood
    • Understand the tradeoff between interpretability and predictive performance
    • Have gained skills in the appropriate choice of statistical learning methods for various forms of real-life problem
    • Build predictions for logistic and multinomial logistic models
    • Choose an appropriate clustering method which has a statistical basis
    • Split big datasets appropriately, and understand that predictive performance should be assessed on the validation sample
    • Carry out a regression tree analysis including pruning, assessing its performance
    • Carry out more complex forms of regression tree ensemble techniques, including random forests
    • Carry out simple neural network analyses, while understanding the need to construct a starting-value strategy

    Assessment

    Two assignments (100%) to be submitted in the form of reports covering all aspects of the module material. The projects involve the analysis of datasets that require the student to investigate a substantive issue using appropriate statistical techniques and to interpret the results.

  • Applied Data Mining SCC413
    Category Computing module, available to all

    This module provides students with up-to-date information on current applications of data in both industry and research. Expanding on the module 'Fundamentals of Data', students will gain a more detailed understanding of how data is processed and applied on a large scale across a variety of different areas.
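
    As a small sketch of the network-analysis side, and assuming the igraph package is installed (an assumption, not part of the module materials), the example below builds a random graph and ranks nodes by betweenness centrality, one common way of identifying key actors in information flow:

    ```r
    library(igraph)  # assumed installed

    set.seed(1)
    g <- sample_gnp(50, 0.1)            # random graph: 50 nodes, edge probability 0.1
    bet <- betweenness(g)               # betweenness centrality of each node
    head(sort(bet, decreasing = TRUE))  # candidate "key actors" in information flow
    ```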

    Topics covered will include

    • The Semantic Web: primer, crawling and spidering Linked Data, open-track large-scale problems (e.g. Billion Triples Challenge), distributed and federated querying, distributed reasoning, ontology alignment
    • The Social Web: primer, user-generated content and crowd-sourced data, social networks (theories, analysis), recommendation (collaborative filtering, content recommendation challenges, and friend recommendation/link prediction)
    • The Scientific Web: from big data to big science, open data, citizen science, and case studies (virtual environmental observatories, collaboration networks)
    • Scalable data processing: primer, scaling the semantic web (scaling distributed reasoning using MapReduce), scaling the social web (collaborative filtering, link prediction), and scalable network analysis for the scientific web

    On successful completion of this module students will be able to

    • Create scalable solutions to problems involving data from the semantic, social and scientific web
    • Process networks and perform network analysis to identify key actors in information flow
    • Understand the current trends of research in the semantic, social and scientific web and what challenges still remain to be solved
    • Demonstrate working knowledge of distributing workloads for scalable applications

Elective Modules

You will also study a number of elective modules (2 or 3) worth a total of 30 credits. You can choose to follow the designated pathways, or you can formulate a bespoke programme of study; self-selecting modules from across pathways subject to pre-requisites and scheduling.

  • Societal Pathway: Multi-Level Models
    Pathway Societal
    Category Statistics module, available to all
    Credits 10 credits

    The aim of this module is to introduce how to analyse data that have a multi-level, hierarchical structure. The mathematical form of multilevel models is described. The models are developed first for continuous outcomes, moving from linear regression to the random intercept model and then the random coefficient model. Multilevel models are then shown for binary and other outcomes. Software implementation is described with the lme4 package in R; some use of MLwiN is also made.
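
    A minimal sketch of the lme4 workflow mentioned above, assuming the package is installed and using its bundled sleepstudy data purely for illustration:

    ```r
    library(lme4)  # assumed installed; sleepstudy ships with the package

    # Random intercept model: each subject gets its own baseline
    m1 <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

    # Random coefficient model: intercept and slope both vary by subject
    m2 <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

    summary(m2)
    anova(m1, m2)  # compare the two models
    ```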

    Topics covered will include

    • The intra-class correlation coefficient
    • Two level random intercept and random coefficient models with continuous outcomes
    • Checking model assumptions and residual diagnostics
    • Models with three or more levels
    • Generalized multilevel models including two-level logistic regression models, multilevel ordinal logistic regression models, and multilevel Poisson regression models
    • Worked examples are shown of fitting such models in statistical software (mainly in R, but also some in MLwiN)
    • Students will also gain insight into the different estimation algorithms available for multilevel models

    On successful completion students will be able to

    • Comprehend the notation used to describe multilevel models
    • Demonstrate knowledge of multilevel models by formulating appropriate models to answer specific questions
    • Demonstrate and understand how to use statistical software to fit multilevel models and how to interpret the relevant output
    • Demonstrate how to perform model diagnostics for such models
    • Interpret the results of fitting multilevel models

    Bibliography

    • Bryk, A. S., and Raudenbush, S. W., (1992) Hierarchical Linear Models, Sage
    • Goldstein, H., (2003) Multilevel Statistical Models. London, Edward Arnold
    • Holmes Finch W, Bolin JE, Kelley K. Multilevel Modeling Using R. Chapman & Hall. 2014
    • Hox, J., (2002) Multilevel Analysis: Techniques and Applications. Mahwah, NJ: Lawrence Erlbaum Associates
    • Longford, N. T., (1993) Random Coefficient Models. Oxford University Press
    • Snijders, T. A. B., and Bosker, R. J., (1999) Multilevel Analysis. An Introduction to Basic and Advanced Multilevel Modelling. London: Sage
  • Societal Pathway: Methods for Missing Data
    Pathway Societal
    Category Statistics module, available to all
    Credits 10 credits

    This module deals with the ubiquitous and often neglected problem of missing data, common in many types of statistical analysis. We survey some ad-hoc strategies for dealing with missing values and show how they can lead to bias and inefficiency. We advocate using a principled approach based on formulating the underlying missing data mechanism. We then look at several principled methods of dealing with missing data. First we present a fully Bayesian approach using WinBUGS. Secondly we create multiply imputed datasets using chained equations and then apply Rubin's rules for combining the analyses of the models. We then do the same using multivariate techniques, rather than chained equations, as the method of multiple imputation. Finally we look at examples where no imputation is needed at all. All of the methods will be illustrated through worked examples using the appropriate tools for exploration and diagnostics. We will also touch on imputation models for hierarchical data analysed with mixed-effects models.
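
    A minimal sketch of multiple imputation by chained equations with the mice package (assumed installed), using its bundled nhanes data; the substantive model is chosen purely for illustration:

    ```r
    library(mice)  # assumed installed; nhanes ships with the package

    imp  <- mice(nhanes, m = 5, seed = 1)   # 5 imputed datasets via chained equations
    fits <- with(imp, lm(chl ~ age + bmi))  # fit the substantive model to each
    pool(fits)                              # combine the results with Rubin's rules
    ```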

    Topics covered will include

    • The missing data mechanisms: Illustration using directed graphical models and exploration of the missingness models using appropriate software
    • A survey of Ad-Hoc methods illustrating their drawbacks
    • Missing data in the covariates or explanatory variables
    • Full Bayesian imputation using WinBUGS to demonstrate the role of the three models (the model for missingness, the imputation model and the substantive model)
    • Multiple imputation using chained equations and multivariate methods
    • Rubin’s rules for combining the modelling of multiply imputed datasets
    • Diagnostics of the imputation process
    • A survey of methods of dealing with missingness in hierarchical datasets

    On successful completion students will be able to

    • Demonstrate mastery of tools for exploring missingness patterns using the VIM and mice software libraries for R
    • Formulate a possible missing data mechanism for a given scenario, and identify cases where the missing data mechanism is ignorable
    • Formulate and differentiate: the model for missingness, the imputation model and the substantive model (the model of interest)
    • Differentiate between sampling and parameter uncertainty, and recognise that the predictive distribution of the missing data incorporates both types of uncertainty
    • Implement some naive methods for dealing with missingness (such as single imputation or listwise deletion), recognise the limitations of each method, and identify situations where their use may be appropriate
    • Explain the differences between a multivariate imputation model and one using chained equations
    • Estimate the between-imputation variability and the within-imputation variability, and combine them in a sensible way to estimate the total variability and the fraction of information lost through missingness

    Bibliography

    • Stef van Buuren, 2012 Flexible Imputation of Missing Data, (Chapman & Hall/CRC Interdisciplinary Statistics Series)
    • James R. Carpenter and Michael G. Kenward , 2013. Multiple Imputation and Its Application (Statistics in Practice). Wiley
  • Societal Pathway: Structural Equation Modelling
    Pathway Societal
    Category Statistics module, available to all
    Credits 10 credits

    This module will introduce participants to latent variables (variables which are not directly measured themselves) and to the use of factor analysis in investigating relationships between latent variables and observed, or measured, variables. These techniques will then be extended into the wider area of structural equation modelling, where complex models involving several latent variables will be introduced.

    The module is aimed at researchers and research students who have experience of statistical modelling (up to linear regression) and hypothesis testing, and who wish to develop techniques to analyse more complex data involving latent variables. The aim of the module is to provide a background of theory with opportunities to apply the techniques in practice, and each session will consist of a lecture/demonstration and a practical. The software packages used will be IBM SPSS and AMOS; no prior knowledge of the structural equation modelling package AMOS will be assumed.
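
    The module itself uses AMOS; for readers who work in R, an equivalent confirmatory factor analysis can be sketched with the lavaan package (an assumption here, not part of the module's toolchain), using its bundled Holzinger-Swineford data:

    ```r
    library(lavaan)  # assumed installed; the module's own software is IBM SPSS/AMOS

    # Confirmatory factor analysis: two latent variables, three indicators each
    model <- '
      visual  =~ x1 + x2 + x3
      textual =~ x4 + x5 + x6
    '
    fit <- cfa(model, data = HolzingerSwineford1939)
    summary(fit, fit.measures = TRUE)  # loadings plus fit indices (CFI, RMSEA, ...)
    ```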

    Topics covered will include

    • introduction to latent variables and measurement error
    • exploratory and confirmatory factor analysis; measurement models
    • structural equation models
    • theoretical issues involved in the development and application of structural equation models

    Learning

    Students will learn through the application of concepts and techniques covered in the module to real data sets. Students will be encouraged to examine issues of substantive interest in these studies.

    On successful completion students will be able to

    • investigate data using factor analysis
    • build and verify appropriate measurement models for latent constructs
    • confirm hypotheses and develop structural equation models
    • apply theoretical concepts
    • identify and solve problems
    • analyse data using appropriate techniques
    • interpret statistical output

    Assessment

    One assignment (100%) to be submitted in the form of two reports covering all aspects of the module material. The projects involve investigating datasets that require the student to investigate a substantive issue using appropriate statistical techniques and interpreting the results.

    Bibliography

    • Byrne, B.M. (2010) Structural Equation Modelling with AMOS: Basic Concepts, Applications and Programming. New York: Routledge
    • Kline, R. B. (2010) Principles and Practices of Structural Equation Modelling. London: The Guilford Press
  • Business Intelligence Pathway: Data Mining for Marketing, Sales and Finance
    Pathway Business Intelligence
    Category Statistics module, available to all
    Credits 10 credits

    The course extends the concepts of statistical model building and the models from the Introductory Statistics module towards methods from machine learning and artificial intelligence.

    Topics covered will include

    • Introduction to Data Mining
    • Data Mining Process: Methods for data exploration & manipulation; Methods for data reduction & feature selection; Evaluating Classification Accuracy
    • Data Mining Methods for Classification: Logistic Regression; Decision Trees; Nearest neighbour classification; Artificial Neural Networks
    • Data Mining applications in Credit Scoring

    On successful completion students will be able to

    • Understand general modelling concepts in relation to complex models
    • Use a wide range of data mining methods to handle data of different types & applications
    • Understand how these methods may be applied in practical management contexts
    • Use & apply SAS Enterprise Miner to deal with complexity and large datasets

    Bibliography

    • Tan, P. N., M. Steinbach, et al. (2005). Introduction to data mining. Boston, Pearson Addison Wesley
    • Berry, M. J. A. and G. Linoff (2000). Mastering data mining: the art and science of customer relationship management. New York, NY, Wiley Computer Publishing
    • Berry, M. J. A. and G. Linoff (2004). Data mining techniques: for marketing, sales, and customer relationship management. Indianapolis, Ind., Wiley Pub
    • Linoff, G. and M. J. A. Berry (2001). Mining the Web: transforming customer data into customer value. New York, John Wiley & Sons
    • Weiss, S. M. and N. Indurkhya (1998). Predictive data mining: a practical guide. San Francisco, Morgan Kaufmann Publishers
  • Business Intelligence Pathway: Forecasting
    Pathway Business Intelligence
    Category Statistics module, available to all
    Credits 10 credits

    The module introduces time series and causal forecasting methods so that students who complete it will be able to prepare methodologically competent, understandable and concisely presented reports for clients. By the end of the course, students should be able to build causal and time series models, assess their accuracy and robustness, and apply them to real-world problem domains.
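
    A minimal base R sketch of the exponential smoothing and ARIMA material, using the built-in AirPassengers series purely for illustration:

    ```r
    # Seasonal (Holt-Winters) exponential smoothing
    hw <- HoltWinters(AirPassengers)
    plot(hw, predict(hw, n.ahead = 12))  # fitted values plus a 12-month forecast

    # Seasonal ARIMA (the classic "airline model") on the log scale
    fit <- arima(log(AirPassengers), order = c(0, 1, 1),
                 seasonal = list(order = c(0, 1, 1), period = 12))
    exp(predict(fit, n.ahead = 12)$pred)  # back-transformed point forecasts
    ```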

    Topics covered will include

    • Introduction to Forecasting in Organisations: Extrapolative vs. Causal Forecasting; History & academic research in Forecasting; Forecasting case studies
    • Data Exploration: Time Series Patterns; Univariate & Multivariate Visualisation; Naïve Forecasting Methods & Averages
    • Exponential Smoothing Methods: Single, Seasonal & Trended Exponential Smoothing; Model Selection; Parameter Selection
    • ARIMA Methods: AR-, MA-, ARMA and ARIMA Models; ARIMA Model specification & estimation; Automatic selection
    • Time Series Regression : Simple & multiple regression on time series; Hypothesis testing; Model evaluation; Diagnostics
    • Time Series Regression: Model specification and constraints; Dummy Variables, Lag, Non-linearities; Stationarity; Building regression models
    • Applications in operations and marketing
    • Judgmental Forecasting: Judgmental methods for forecasting; Biases and heuristics.

    Bibliography

    • Ord K. & Fildes R. (2013), Principles of Business Forecasting, South-Western Cengage Learning
  • Business Intelligence Pathway: Optimisation and Heuristics
    Pathway Business Intelligence
    Category Statistics module, available to all
    Credits 10 credits

    Optimisation, sometimes called mathematical programming, has applications in many fields, including Operational Research, Computer Science, Statistics, Finance, Engineering and the Physical Sciences. Commercial optimisation software is now capable of solving many industrial-scale problems to proven optimality. On the other hand, there are still many practical applications where finding a provably-optimal solution is not computationally viable. In such cases, heuristic methods can allow good solutions to be found within a reasonable computation time.

    The course is designed to enable students to apply optimisation techniques to business problems. Building on the introduction to optimisation in MSCI502 and/or MSCI519, students will be introduced to different problem formulations and algorithmic methods to guide decision making in business and other organisations.
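
    As a small sketch of the linear programming material, assuming the lpSolve package is installed (the module itself uses commercial tools such as MPL, LINDO and Excel Solver), with a made-up two-variable problem:

    ```r
    library(lpSolve)  # assumed installed; not one of the module's commercial solvers

    # Maximise 3x + 2y subject to x + y <= 4 and x + 3y <= 6 (x, y >= 0)
    res <- lp(direction    = "max",
              objective.in = c(3, 2),
              const.mat    = matrix(c(1, 1,
                                      1, 3), nrow = 2, byrow = TRUE),
              const.dir    = c("<=", "<="),
              const.rhs    = c(4, 6))
    res$solution  # optimal (x, y); here (4, 0)
    res$objval    # optimal objective value; here 12
    ```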

    Topics covered will include

    • Linear Programming
    • Non-Linear Programming
    • Integer and Mixed-Integer Programming
    • Dynamic Programming
    • Heuristics

    On successful completion students will be able to

    • know how to formulate problems as mathematical programs and solve them
    • be aware of the power, and the limitations, of optimisation methods
    • be able to carry out sensitivity analysis to see how robust the recommendation is
    • be familiar with commercial software such as MPL, LINDO and EXCEL SOLVER
    • be aware of major heuristic techniques and know when and how to apply them

    Bibliography

    • HP Williams (2013) Model Building in Mathematical Programming (5th edition). Wiley. ISBN: 978-1-118-44333-0 (pbk)
    • J Kallrath & JM Wilson (1997) Business Optimisation Using Mathematical Programming. Macmillan. ISBN: 0-333-67623-8
    • WL Winston (2004) Operations Research - Applications and Algorithms (4th edition). Thomson. ISBN: 978-0534380588
    • DR Anderson, DJ Sweeney, TA Williams & M. Wisniewski (2008) An Introduction to Management Science. Cengage Learning. ISBN: 978-1844805952
    • E.K. Burke & G. Kendall (eds.) (2005) Search Methodologies: Introductory Tutorials in Optimization and Decision Support Techniques. Springer
  • Environmental Pathway: Geoinformatics
    Pathway Environmental
    Category Statistics module, available to all
    Credits 15 credits

    This module introduces students to the fundamental principles of Geographical Information Systems (GIS) and Remote Sensing (RS) and shows how these complementary technologies may be used to capture/derive, manipulate, integrate, analyse and display different forms of spatially-referenced environmental data. The module is highly vocational, with theory-based lectures complemented by hands-on practical sessions using state-of-the-art software (ArcGIS & ERDAS Imagine).

    In addition to the subject specific aims, the module provides students with a range of generic skills to synthesise geographical data, develop suitable approaches to problem solving, undertake independent learning (including time management) and present the results of analysis in novel graphical formats.

    Topics covered will include

    • Geoinformatics: definitions, components and the nature of spatial data
    • Principles of RS: physical basis, sensors, platforms and systems
    • Applications of RS
    • Principles of GIS
    • Vector GIS
    • Raster GIS and spatial modelling
    • Geoinformatics project design
    • Data Integration and Metadata

    On successful completion of this module students will be able to

    • Recognise fundamental principles and applications of GIS and Remote Sensing
    • Appreciate the strong linkages between these disciplines and their fusion to create meaningful spatially-referenced environmental information
    • Appraise current and future potential applications
    • Use state-of-the-art software packages such as ArcGIS and ERDAS Imagine
    • Demonstrate project management skills through completion of a geoinformatics project
    • Identify and retrieve spatial data from a variety of different sources
    • Visualise, analyse and interpret spatial data using simple and advanced approaches
    • Conduct an independent piece of research

    Bibliography

    • DeMers, M.N., 2009. GIS for Dummies
    • Heywood, I, Cornelius, S and Carver, S, 2011. An Introduction to Geographical Information Systems (4e). Pearson
    • Lillesand, T.M., Kiefer, R.W. and Chipman, J.W, 2008. Remote Sensing and Image Interpretation (6e). Wiley
    • Longley, P.A, Goodchild, M.F, Maguire, D.J and Rhind, D.W, 2011. Geographic Information Systems & Science (3e). Wiley
  • Environmental Pathway: Approaches in Environmental Data Analytics
    Pathway Environmental
    Category Statistics module, available to all
    Credits 15 credits

    This module introduces students to a range of techniques used in analysing and handling environmental data. The course will include how to discover and understand environmental datasets, how to store and manage them appropriately and how to use statistics to address research questions. Students will learn the importance of managing and analysing data in a robust and reproducible manner, and how to make data available to others for reuse. Students will also gain experience in interpreting and presenting the results of analyses through both written and oral assessments.

    Topics covered include

    • Introduction to environmental data
    • Best practice in managing environmental datasets
    • Data integration – common problems and solutions
    • Principles of exploratory data analysis
    • Use of regression and correlation with environmental data
    • Reproducibility and data publication
    • Introduction to final project
    • Data management and analysis plans
    • Data presentation and visualisation
    • Writing skills and report structure

    On successful completion of this module students will be able to

    • Develop an appropriate plan to manage and analyse environmental datasets
    • Integrate multiple environmental datasets to answer complex questions
    • Conduct exploratory and inferential statistics to examine patterns in environmental data
    • Produce a reproducible analysis of a dataset and demonstrate awareness of the importance of reproducibility
    • Analyse a range of environmental datasets to answer environmental management questions using appropriate techniques to ensure correct management, analysis and interpretation of data
    • Demonstrate a range of transferable skills including data management and basic statistics
    • Explain the results of an environmental data analysis to a range of audiences orally and in writing

    Bibliography

    • Practical Statistics for Environmental and Biological Scientists, J. Townend. Published by Wiley (ISBN 9780471496656) 2002
    • Statistics, D. Freedman, R. Pisani & R. Purves. Published by W. W. Norton & Company (ISBN 0393929728) 4th Revised Edition, 2014
    • Data management for researchers: organise, maintain and share your data for research success, Briney, K. Published by Pelagic Publishing (ISBN 9781784270124) 2015
  • Environmental Pathway: Modelling Environmental Processes
    Pathway Environmental
    Category Statistics module, available to all
    Credits 15 credits

    This module provides an introduction to the basic principles of, and approaches to, computer-aided modelling of environmental processes, with applications to real environmental problems such as catchment modelling, pollutant dispersal in rivers and estuaries, and population dynamics. More generally, the module provides an introduction to general aspects of dynamic systems modelling, including the role of uncertainty and data in the modelling process.

    Topics covered include

    • Introduction to modelling as a process and as evaluation of scientific hypotheses: approaches to modelling; the role of data and perceptions in the modelling process; the problems of badly defined systems in the context of modelling environmental processes; problems of scale (temporal and spatial) and uncertainty in quantifying environmental systems
    • The concept of dynamic system. First order linear systems, with the Nicholson blowfly dynamics and the Aggregated Dead Zone (ADZ) model of dispersion in a river used as practical case studies. Transfer function models, steady state gain and time constant; serial, parallel and feedback connections of first order systems. Block diagram analysis.
    • Muskingum-Cunge, Lag and Route, and General Transfer Function models of flow in a river system
    • Second order linear systems with the predator-prey equations and a climate model as practical examples; natural frequency and damping ratio; higher order systems.
    • Linear vs. Nonlinear systems – basic introduction.

    Throughout the course, case studies and examples will be used to illustrate the material. Guest lecturers may be invited to contribute, depending on availability.
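
    A minimal base R sketch of a first-order linear system of the kind discussed above; the coefficients and the step input are assumptions made for illustration:

    ```r
    # Discrete-time first-order system: x[t+1] = a * x[t] + b * u[t]
    a <- 0.8                       # determines the time constant
    b <- 0.2                       # steady-state gain is b / (1 - a) = 1 here
    u <- c(rep(0, 5), rep(1, 45))  # step input applied at time step 6
    x <- numeric(50)
    for (t in 1:49) x[t + 1] <- a * x[t] + b * u[t]
    plot(x, type = "l", xlab = "time step", ylab = "response to a step input")
    ```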

    On successful completion of this module students will be able to

    • Evaluate the principles and problems of computer aided modelling of environmental systems.
    • Use contemporary industry standard numerical software for basic analysis and simulation of environmental systems.
    • Communicate with mathematicians and numerical analysts in joint projects involving modelling.
    • Understand the way in which simple mathematical concepts can be used to build models of environmental systems
    • Undertake some simple modelling tasks, to analyse experimental data and interpret the modelling outcomes

    Bibliography

    The following texts may be useful if read with discretion.

    • Young, P.C. (1993) Concise Encyclopaedia of Environmental Systems. Pergamon: Oxford (selected articles)
    • Young, P.C., Parkinson, S. and Lees, M.J. (1996) Simplicity out of complexity: Occam's Razor revisited. Journal of Applied Statistics, 23, 165-210
    • Young, P.C. (1984) Recursive Estimation and Time Series Analysis: An Introduction. Springer
    • Bennett, R.J. and Chorley, R.J. (1980) Environmental Systems: Philosophy, Analysis and Control. Methuen
    • Hardisty, J. et al. (1993) Computerised Environmental Modelling: A Practical Introduction Using Excel. Wiley
  • Population Health Pathway: Principles of Epidemiology
    Pathway Population Health
    Category Statistics module, available to all
    Credits 10 credits

    This course introduces the principles of epidemiology and the statistical methods applied in epidemiological studies. It also introduces important concepts related to study design and statistical modelling concepts such as confounding and mediation.
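
    As a small base R illustration of the basic measures, using a made-up exposure-by-disease table (the numbers are purely hypothetical):

    ```r
    # Hypothetical 2x2 table of exposure against disease
    tab <- matrix(c(30, 70,
                    10, 90),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(exposure = c("exposed", "unexposed"),
                                  disease  = c("yes", "no")))

    risk <- tab[, "yes"] / rowSums(tab)                # risk in each exposure group
    unname(risk["exposed"] / risk["unexposed"])        # risk ratio (3 here)
    (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])  # odds ratio (about 3.86 here)
    ```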

    Topics covered will include

    • The history of epidemiology and the role of statistics therein
    • Measures of health and disease including incidence and prevalence
    • Traditional approaches to controlling for confounding including matching and stratification
    • Epidemiological study design including cohort studies, case-control studies, cross-sectional studies, ecological studies
    • Making causal inferences in epidemiology including the use of directed acyclic graphs to describe confounding, collider bias, and mediation
    • Properties of parameters such as odds ratios and risk ratio including collapsibility
    • Critical appraisal of published epidemiological journal articles including an appreciation of their structure, and strengths and weaknesses

    On successful completion students will be able to

    • Define and calculate appropriate measures of disease incidence and prevalence
    • Describe the key statistical issues in the design of ecological studies, case-control studies, cohort studies, and cross-sectional studies
    • Discuss and implement strategies for dealing with confounding and mediation
    • Define and estimate important parameters such as the risk difference, risk ratio, and odds ratio
    • Discuss the strengths and weaknesses of a published epidemiological paper and summarise these for different audiences

    Bibliography

    • Clayton D. and Hills M. (1993) Statistical models in epidemiology. Oxford, Oxford University Press
    • Rothman K.J., Greenland S. and Lash T.L. (2008) Modern Epidemiology. Lippincott, Williams and Wilkins, US
  • Population Health Pathway: Environmental Epidemiology
    Pathway Population Health
    Category Statistics module, requires Statistics Modules II
    Credits 10 credits

    This course aims to introduce students to statistical methods commonly used by epidemiologists and statisticians to investigate the relationship between risk of disease and environmental factors. Specifically, the course concentrates on studies with a spatial component. A number of published studies will be used to illustrate the methods described, and students will learn how to perform similar analyses using the R statistical package. By the end of the course students should have an awareness of the methodology used in environmental epidemiology, including an appreciation of its limitations, and should be capable of carrying out a number of these analyses themselves.
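
    A minimal sketch of the point pattern material, assuming the spatstat package is installed, with a simulated pattern used purely for illustration:

    ```r
    library(spatstat)  # assumed installed

    set.seed(1)
    X <- rpoispp(100)  # simulated Poisson point pattern on the unit square
    K <- Kest(X)       # estimate the K function
    plot(K)            # compare the estimate with the theoretical Poisson K
    ```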

    Topics covered will include

    • Introduction: Motivating examples for methods in course
    • Spatial Point Processes: theory and methods for the analysis of point patterns in two-dimensional space
    • Clustering of disease: case-control point-based methods and methods based on counts
    • Spatial variation in risk: case-control and point-based methods; generalized additive models
    • Disease mapping: investigating variation in risk with count data
    • Geographical correlation studies: the ecological fallacy; relation with disease mapping
    • Point source methods: Investigation of risk associated with distance from a point or line source, for point and count data
    • Geostatistics: introduction to the analysis of geostatistical data. Kriging and spatial prediction

    On successful completion students will be able to

    • Define and give examples of spatial point processes; describe the first and second moments of a point process
    • Define, estimate and calculate theoretical K functions for a spatial point process
    • Test for spatial clustering of a point pattern using the K function
    • Use generalised additive models to construct smooth maps of spatial variation in disease risk and interpret key model outputs
    • Use Poisson regression to analyse area-level count data and interpret key model outputs
    • Describe what is meant by the ecological fallacy
    • Carry out simple analyses of case-control data in relation to a point source
    • Fit Gaussian geostatistical models including a Gaussian process random effect term
    • Perform basic analyses of geostatistical data, define and interpret the variogram
    • Recognise the difference between point process data, area-level data and geostatistical data
    • Describe some practical issues involved in undertaking environmental epidemiology studies

    Bibliography

    • P.J. Diggle. Statistical Analysis of Spatial Point Patterns (2nd edition). London: Edward Arnold. 2003
    • P. Elliott, M. Martuzzi and G. Shaddick, Spatial statistical methods in environmental epidemiology: a critique. Statistical Methods in Medical Research, 4, 137-159, 1995
    • P. Elliott, J. Wakefield, N. Best and D. Briggs (eds), Disease and Exposure Mapping. Oxford University Press, Oxford, 1999
    • L. Waller and C.A. Gotway. Applied Spatial Statistics for Public Health Data. New York: Wiley, 2004
  • Population Health Pathway: Modelling of Infectious Diseases
    Pathway Population Health
    Category Statistics module, available to all
    Credits 10 credits

    This module aims to provide students with the necessary knowledge, and analytical and modelling skills to develop and fit mathematical transmission models to understand infection dynamics, explore interventions, and to inform control policy. It will also provide students with the ability to analyse outbreak information, and to implement transmission models using the R programming language. Students will gain experience of handling and linking epidemiological data relevant to infectious disease outbreaks. They will gain hands-on experience of developing transmission models, appropriate to a specific research question or epidemiological application, and of using those models for scenario exploration. Students will also gain experience in communicating and presenting epidemic models and their outputs.
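
    A minimal sketch of a deterministic SIR transmission model in R, assuming the deSolve package is installed; the parameter values are assumptions made for illustration:

    ```r
    library(deSolve)  # assumed installed

    # Deterministic SIR model as a system of ODEs
    sir <- function(t, y, parms) {
      with(as.list(c(y, parms)), {
        dS <- -beta * S * I
        dI <-  beta * S * I - gamma * I
        dR <-  gamma * I
        list(c(dS, dI, dR))
      })
    }

    out <- ode(y = c(S = 0.99, I = 0.01, R = 0),
               times = seq(0, 150, by = 1),
               func = sir,
               parms = c(beta = 0.3, gamma = 0.1))  # R0 = beta / gamma = 3
    matplot(out[, "time"], out[, c("S", "I", "R")], type = "l",
            xlab = "day", ylab = "proportion of population")
    ```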

    Topics covered will include

    • Construction of mathematical disease models appropriate to their purpose
    • The differences between deterministic and stochastic infectious disease modelling frameworks
    • The dynamical behaviour of infectious disease models
    • Statistical inference using infectious disease models
    • Analysis and critical interpretation of infectious disease data for outbreak analysis
    • Communication of disease models and interpretation of their output

    On successful completion students will be able to

    • Demonstrate a deep understanding of the role of mathematical modelling in epidemiology
    • Take a critical approach to linking sources of epidemiological data required for infectious disease models
    • Understand the infectious disease epidemiology and modelling literature
    • Interpret modelling studies critically
    • Take a responsible approach towards the use of mathematical modelling, and appreciate the ethical and social impacts of research and practice within this subject area

    Bibliography

    • Keeling MJ, Rohani P. Modeling Infectious Diseases in Humans and Animals. Princeton University Press. 2007
    • Andersson H, Britton T. Stochastic Epidemic Models and their Statistical Analysis. Lecture Notes in Statistics. Springer. 2000
  • Population Health Pathway: Survival and Event History Analysis
    Pathway Population Health
    Category Statistics module, requires Statistics Modules II
    Credits 10 credits

    This course aims to describe the theory and to develop the practical skills required for the analysis of medical studies leading to the observation of survival times or multiple failure times. By the end of the course students should be able to carry out sophisticated analyses of this type, be aware of the variety of statistical models and methods now available, and understand the nature and importance of the underlying model assumptions.

    In many medical applications interest lies in times to or between events. Examples include time from diagnosis of cancer to death, or times between epileptic seizures. This advanced course begins with a review of standard approaches to the analysis of possibly censored survival data. Survival models and estimation procedures are reviewed, and emphasis is placed on the underlying assumptions, how these might be evaluated through diagnostic methods and how robust the primary conclusions might be to their violation.

    The course closes with a description of models and methods for the treatment of multivariate survival data, such as repeated failures, the lifetimes of family members or competing risks. Stratified models, marginal models and frailty models are discussed.
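
    As a minimal sketch of the standard approaches reviewed at the start of the course, the Kaplan-Meier estimator, a Cox proportional hazards fit and a proportional hazards diagnostic can all be obtained with R's survival package. The bundled lung dataset is used purely for illustration and is an assumption, not course material.

      # Illustrative sketch only (assumes R's survival package and its lung dataset)
      library(survival)

      # Kaplan-Meier estimate of the survival function, stratified by sex
      km <- survfit(Surv(time, status) ~ sex, data = lung)
      plot(km, xlab = "days", ylab = "survival probability")

      # Cox proportional hazards model fitted by partial likelihood
      cox <- coxph(Surv(time, status) ~ age + sex, data = lung)
      summary(cox)   # hazard ratios with confidence intervals

      # Schoenfeld-residual test of the proportional hazards assumption
      cox.zph(cox)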

    Topics covered will include

    • Survival data. Censoring. Survival, hazard and cumulative hazard functions. Kaplan-Meier plots. Parametric models and likelihood construction. Cox’s proportional hazards model, partial likelihood, Nelson-Aalen estimators. Survival time prediction
    • Diagnostic methods. Schoenfeld and other residuals. Testing the proportional hazards assumption. Detecting changes in covariate effects
    • Frailty models and effects. Identifiability and estimation. Competing risks. Marginal models for clustered survival data

    On successful completion students will be able to

    • Apply a range of appropriate statistical techniques to survival and event history data using statistical software
    • Accurately interpret the output of statistical analyses using survival models fitted using standard software
    • Construct and manipulate likelihood functions from parametric models for censored data
    • Identify when particular models are appropriate through the application of diagnostic checks and model building strategies

    Bibliography

    • P. Hougaard, Analysis of Multivariate Survival Data. Springer, 2000
    • T.M. Therneau and P.M. Grambsch, Modelling Survival Data: Extending the Cox Model. Springer, 2000
    • T.R. Fleming and D.P. Harrington, Counting Processes and Survival Analysis. Wiley, 1991
  • Population Health Pathway: Longitudinal Data Analysis
    Pathway Population Health
    Category Statistics module, requires Statistics Modules II
    Credits 10 credits

    Longitudinal data arise when a time-sequence of measurements is made on a response variable for each of a number of subjects in an experiment or observational study. For example, a patient's blood pressure may be measured daily following administration of one of several medical treatments for hypertension. The practical objective of many longitudinal studies is to find out how the average value of the response varies over time, and how this average response profile is affected by different experimental treatments. This module presents an approach to the analysis of longitudinal data, based on statistical modelling and likelihood methods of parameter estimation and hypothesis testing.

    The specific aim of this module is to teach students a modern approach to the analysis of longitudinal data. Upon completion of this course the students should have acquired, from lectures and practical classes, the ability to build statistical models for longitudinal data, and to draw valid conclusions from their models.
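
    For illustration, a linear mixed effects model of the kind covered in this module can be fitted with the lme4 package. This is a sketch only, and the bundled sleepstudy data stand in for a real longitudinal study.

      # Illustrative sketch only (assumes the lme4 package and its sleepstudy data)
      library(lme4)

      # Reaction time with a fixed effect of Days and correlated random
      # intercepts and slopes for each Subject
      fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
      summary(fit)   # fixed effects and variance components
      confint(fit)   # profile likelihood confidence intervals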

    Topics covered will include

    • What are longitudinal data?
    • Exploratory and simple analysis strategies
    • Normal linear model with correlated errors
    • Linear mixed effects models
    • Non-normal responses with GLMs
    • Dealing with dropout

    On successful completion students will be able to

    • Explain the differences between longitudinal studies and cross-sectional studies
    • Select appropriate techniques to explore data
    • Compare different approaches to estimation and their usage in the analysis
    • Build statistical models for longitudinal data and to draw valid conclusions from their models
    • Express the problems arising in longitudinal studies in mathematical language
    • Use computer packages in statistical modelling and analysis of longitudinal data
    • Summarise the findings in writing and present them to a wider audience

    Bibliography

    • H. Brown and R. Prescott, Applied Mixed Models in Medicine, Wiley, 1999
    • P.J. Diggle, P. Heagerty, K.Y. Liang and S.L. Zeger, Analysis of Longitudinal Data (second edition), Oxford University Press, 2002
    • G.M. Fitzmaurice, N. M. Laird and J. H. Ware, Applied Longitudinal Analysis, Wiley Series in Probability and Statistics, 2004
    • G. Verbeke and G. Molenberghs, Linear Mixed Models for Longitudinal Data, Springer, 2000
    • R. E. Weiss, Modelling longitudinal data, Springer, 2005
  • Bioinformatics Pathway: Principles of Epidemiology
    Pathway Bioinformatics
    Category Statistics module, available to all
    Credits 10 credits

    This course introduces the principles of epidemiology and the statistical methods applied in epidemiological studies. It also introduces important concepts in study design and statistical modelling, such as confounding and mediation.

    Topics covered will include

    • The history of epidemiology and the role of statistics therein
    • Measures of health and disease including incidence and prevalence
    • Traditional approaches to controlling for confounding including matching and stratification
    • Epidemiological study design including cohort studies, case-control studies, cross-sectional studies, ecological studies
    • Making causal inferences in epidemiology including the use of directed acyclic graphs to describe confounding, collider bias, and mediation
    • Properties of parameters such as odds ratios and risk ratios, including collapsibility (see the worked example after this list)
    • Critical appraisal of published epidemiological journal articles including an appreciation of their structure, and strengths and weaknesses
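
    As a small worked example of these measures (the counts below are invented for illustration), the risk difference, risk ratio and odds ratio can be computed directly in R:

      # Illustrative sketch only; hypothetical cohort of 100 exposed (30 cases)
      # and 100 unexposed (10 cases) subjects
      risk_exposed   <- 30 / 100
      risk_unexposed <- 10 / 100

      risk_difference <- risk_exposed - risk_unexposed             # 0.20
      risk_ratio      <- risk_exposed / risk_unexposed             # 3.0
      odds_ratio      <- (risk_exposed / (1 - risk_exposed)) /
                         (risk_unexposed / (1 - risk_unexposed))   # approximately 3.86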

    On successful completion students will be able to

    • Define and calculate appropriate measures of disease incidence and prevalence
    • Describe the key statistical issues in the design of ecological studies, case-control studies, cohort studies, and cross-sectional studies
    • Discuss and implement strategies for dealing with confounding and mediation
    • Define and estimate important parameters such as the risk difference, risk ratio, and odds ratio
    • Discuss the strengths and weaknesses of a published epidemiological paper and summarise these for different audiences

    Bibliography

    • Clayton D. and Hills M. (1993) Statistical models in epidemiology. Oxford, Oxford University Press
    • Rothman K.J., Greenland S. and Lash T.L. (2008) Modern Epidemiology. Lippincott, Williams and Wilkins, US
  • Bioinformatics Pathway: Design and Analysis of Clinical Trials
    Pathway Bioinformatics
    Category Statistics module, available to all
    Credits 10 credits

    This course aims to introduce students to aspects of statistics that are important in the design and analysis of clinical trials.

    Clinical trials are planned experiments on human beings designed to assess the relative benefits of one or more forms of treatment. For instance, we might be interested in studying whether aspirin reduces the incidence of pregnancy-induced hypertension; or we may wish to assess whether a new immunosuppressive drug improves the survival rate of transplant recipients. On completion of the module students should understand the basic elements of clinical trials, be able to recognise and use principles of good study design, and be able to analyse and interpret study results to make correct scientific inferences.

    Topics covered will include

    • Clinical trials fundamentals: trial terminology, principles of sound study design and ethics
    • Defining and estimating treatment effects: continuous and binary data
    • Crossover trials: motivation, design issues and analyses
    • Sample size determination: continuous and binary data (see the sketch after this list)
    • Equivalence and non-inferiority trials
    • Systematic reviews and meta-analysis
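
    As a small illustration of sample size determination, base R's power.t.test and power.prop.test solve for the number of subjects per arm; the effect sizes below are assumptions chosen only for the example.

      # Illustrative sketch only; effect sizes are invented for the example

      # Continuous endpoint: n per arm to detect a 5-unit mean difference
      # (SD 10, standardised effect 0.5) with 90% power at the 5% level
      power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.9)

      # Binary endpoint: n per arm to detect a fall in event rate from 30% to 20%
      power.prop.test(p1 = 0.3, p2 = 0.2, sig.level = 0.05, power = 0.8)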

    On successful completion students will be able to

    • Understand the basic elements of clinical trials
    • Recognise and use principles of good study design, and be able to analyse and interpret study results to make correct scientific inferences
    • Determine the different approaches that can be taken in addressing clinical questions related to the effectiveness of treatments and other types of interventions

    Bibliography

    • D.G. Altman, Practical Statistics for Medical Research, Chapman and Hall, 1991
    • S. Senn, Cross-over trials in clinical research, Wiley, 1993
    • S. Piantadosi, Clinical Trials: A Methodologic Perspective, John Wiley & Sons, 1997
    • ICH Harmonised Tripartite Guidelines
    • J.N.S. Matthews, Introduction to Randomised Controlled Clinical Trials, Arnold, 2000
  • Bioinformatics Pathway: Bioinformatics
    Pathway Bioinformatics
    Category Statistics module, available to all
    Credits 15 credits

    This course will equip students with a working knowledge of the main themes in bioinformatics. On successful completion, students should be confident and competent in all aspects of bioinformatics that can be executed via the web or on software running on Windows/Mac systems. They will have an understanding of the theoretical algorithms that underpin the various software applications that they use, and will be able to perform bioinformatics within their own biological sub-field. More generally, this module also aims to encourage students to access and evaluate information from a variety of sources and to communicate the principles in a way that is well-organised and topical and that recognises the limits of current hypotheses. It also aims to equip students with practical techniques, including data collection, analysis and interpretation.

    Topics covered will include

    • Reading lists and how to manage reading. Doing a PubMed search
    • The foundations of Bioinformatics
    • Advanced bioinformatics I: Going deeper into algorithms
    • Advanced bioinformatics II: Structural bioinformatics
    • Advanced bioinformatics III: Phylogenetics: how do we use sequences to investigate evolution?
    • Advanced bioinformatics IV: Detecting natural selection
    • Advanced bioinformatics V: Processing deep sequencing data

    On successful completion students will be able to

    • Perform bioinformatics via web resources such as GenBank, Pfam, UniProt, PDB and SCOP; use Artemis for genome visualisation
    • Download and align sequences, curate sequences, derive statistics on alignments. Use DNASp for sliding window analysis
    • Build phylogenetic trees in MEGA; use SimPlot for recombination analysis
    • Perform structural bioinformatics: carry out homology modelling via SwissModel and use a protein sequence viewer; use Galaxy for deep sequence assembly
    • Build a Bayesian phylogenetic tree with BEAST

    Bibliography

    • Michael Agostino. Practical Bioinformatics. Garland Science. ISBN 978-0-8153-4456-8
    • Arthur M Lesk. Introduction to Bioinformatics 4th ed. Oxford Univ Press ISBN 978-0-19-965156-6
    • Paul H Dear (ed) Bioinformatics. Scion ISBN 978-1-90-484216-3
    • Masatoshi Nei & Sudhir Kumar. Molecular Evolution and Phylogenetics (available at http://lib.myilibrary.com/Open.aspx?id=83437)
    • Drummond AJ & Bouckaert RR. Bayesian Evolutionary Analysis with BEAST. Cambridge University Press. ISBN 978-1107019652
  • Bioinformatics Pathway: Statistical Genetics and Genomics
    Pathway Bioinformatics
    Category Statistics module, available to all
    Credits 10 credits

    This module will give students a working knowledge of recent statistical approaches for analysing modern genomic and genetic datasets. Students will learn about significance testing for genetic variants using logistic regression, multiple testing correction using strategies such as Bonferroni and false discovery rate (FDR) control, quantification of gene expression in RNA-seq data using expectation-maximisation to resolve ambiguous isoforms, differential expression testing using a negative binomial model, and Bayesian network models for gene regulation.
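
    As a minimal sketch of per-variant testing with multiple-testing correction (simulated toy data rather than a real GWAS workflow):

      # Illustrative sketch only; genotypes and phenotypes are simulated
      set.seed(1)
      n <- 500
      geno  <- matrix(rbinom(n * 100, size = 2, prob = 0.3), nrow = n)  # 100 SNPs coded 0/1/2
      pheno <- rbinom(n, 1, 0.5)                                        # case/control status

      # Logistic regression p-value for each variant
      pvals <- apply(geno, 2, function(g)
        summary(glm(pheno ~ g, family = binomial))$coefficients["g", "Pr(>|z|)"])

      # Multiple testing correction
      p_bonferroni <- p.adjust(pvals, method = "bonferroni")
      p_fdr        <- p.adjust(pvals, method = "BH")  # Benjamini-Hochberg FDR
      sum(p_fdr < 0.05)                               # variants passing the threshold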

    Topics covered will include

    • Introduction to Molecular Biology
    • Introduction to Human Genetics Studies
    • Genome-wide association studies (QC, analysis, multiple testing correction, population stratification)
    • RNA-Seq gene expression analysis
    • Differential Gene Expression
    • Statistical Models for gene regulation

    On successful completion students will be able to

    • Discuss the key aspects of genetics and genomics
    • Define the statistical challenges in the analysis of genetics and genomics data
    • Explain Genome-Wide Association Studies (GWAS) and how to find trait markers
    • Perform a GWAS analysis and assess the significance of identified risk variants
    • Identify differentially expressed genes in RNA-seq gene expression data
    • Sketch the process of gene regulation and model it using statistical tools
    • Understand the kinds of methods used in statistical genomics and genetics, including their limitations
    • Analyse complex genetic and genomic datasets using statistical programming packages
    • Perform a literature survey of statistical applications to a novel scientific field

  • Computing Pathway: Systems Architecture and Integration
    Pathway Computing
    Category Computing module, available to all
    Credits 15 credits

    In this module we explore the architectural approaches, techniques and technologies that underpin today's global IT infrastructure, particularly large-scale enterprise IT systems. It is one of two complementary modules that comprise the Systems stream of the Data Science MSc. Together they provide a broad knowledge and context of systems architecture, enabling students to assess new systems technologies, to understand where those technologies fit in the larger scheme of enterprise systems and state-of-the-art research thinking, and to know what to read to go deeper.

    The principal ethos of the module is to focus on the principles, emergent properties and application of systems elements as used in large-scale and high performance systems. Detailed case studies and invited industrial speakers will be used to provide supporting real-world context and a basis for interactive seminar discussions.

    Topics covered will include

    • Systems of systems composition
    • Scalability concerns
    • Systems integration/interoperability
    • Software and Infrastructure as a Service (i.e. cloud computing principles)

    Supported by a consideration of emerging issues and implications arising from these new technologies:

    • Commercial considerations
    • Legal and ethical considerations
    • New development and support paradigms, including open sourcing

    In addition to the discussion- and seminar-led aspects of the course, we envisage ‘hands-on’ measurement-based coursework that looks empirically at the scalability of a significant technology, e.g. a cloud system such as Amazon EC2.

    On successful completion of this module students will

    • Demonstrate a deep understanding of the architectures and approaches for large-scale systems implementation
    • Describe and critically evaluate techniques and paradigms used within enterprise-scale IT systems
    • Understand and appreciate the trade-offs, strengths and limitations of systems architectures in principle and practice in modern IT systems
  • Computing Pathway: Elements of Distributed Systems
    Pathway Computing
    Category Computing module, available to all
    Credits 15 credits

    Distributed artificial intelligence is fundamental to contemporary data analysis. Large volumes of data and computation call for multiple computers to cooperate in problem solving, and being able to understand and use those resources efficiently is an important skill for a data scientist. A distributed approach is also important for fault tolerance and robustness, as the loss of a single component must not significantly compromise the whole system. Additionally, contemporary and future distributed systems go beyond computer clusters and networks: they are often composed of multiple agents -- pieces of software, humans and/or robots that all interact in problem solving. As data scientists, we may have control of the full distributed system, or of only one piece, and we have to decide how that piece must behave in the face of others in order to accomplish our goals.

    Therefore, a strong data scientist must go beyond "passive" data analysis. Even a very accurate classifier may be useless if it does not lead to high-performing decisions in actual problems. It is fundamental to use data to create systems that are able to behave in an intelligent manner, considering the presence of multiple actors, which may or may not be cooperative with our system. The "data" may be historical information stored in files or databases, as in classical machine learning; it may arrive continuously in an "on-line" way; or it may even be the system's own experience. All of this must be used in the creation of intelligent systems.

    Hence, in this module we will study how to use multiple agents to create powerful machine learning systems. Furthermore, we will go beyond data classification and study how to make intelligent decisions autonomously in the presence of multiple actors, whether they are cooperative or not.
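
    As a small illustration of on-line learning from a stream of outcomes (a sketch of the classic exponentially weighted forecaster over simulated "experts"; the data and parameters are invented for the example):

      # Illustrative sketch only; experts and outcomes are simulated
      set.seed(42)
      n_steps <- 200
      eta <- 0.5                                    # learning rate
      truth <- rbinom(n_steps, 1, 0.6)              # binary outcomes to predict

      # Three experts that agree with the truth 90%, 70% and 50% of the time
      experts <- sapply(c(0.9, 0.7, 0.5), function(acc)
        ifelse(rbinom(n_steps, 1, acc) == 1, truth, 1 - truth))

      w <- rep(1, ncol(experts))
      for (t in 1:n_steps) {
        forecast <- sum(w * experts[t, ]) / sum(w)  # weighted ensemble prediction
        loss     <- abs(experts[t, ] - truth[t])    # per-expert loss this round
        w        <- w * exp(-eta * loss)            # exponentially downweight mistakes
      }
      round(w / sum(w), 3)  # weight concentrates on the most accurate expert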

    Topics covered will include

    • Fundamental concepts of multi-agent systems
    • Local coordination rules and emergence
    • Ensemble Systems
    • Decision Theory and Game Theory
    • On-line learning
    • Multi-agent Reinforcement Learning

    On successful completion of this module students will be able to

    • Understand the difference between single- and multi-agent artificial intelligence, including the advantages and challenges of distribution
    • Use a computer cluster for experimental work and data analysis
    • Solve problems by using loose control, where local individual behaviour leads to complex self-organised systems
    • Design systems that intelligently interact with others -- including those outside their control
    • Design systems that learn from their own experience in an on-line way
    • Improve classification/prediction performance by intelligently using multiple algorithms
    • Read and critique research papers

    Bibliography

    The bibliography consists of research papers and course notes, which will be available during the course.

Term 3: Dissertation and Work Placement

In the third term, you will complete a substantial dissertation project. The majority of our students base their dissertations on 12-week placement projects hosted by our partner organisations. Host organisations will generally provide you with a bursary of £3,000. Placement projects are usually based on a data science challenge relevant to the host’s activities and will give you the opportunity to gain real-world experience as a data scientist.

You will be supported throughout your project by an academic supervisor and by a company supervisor, who is typically an established data scientist.

We have arranged placement projects for more than 130 students over the last four years of our programme. Our students have undertaken projects at many leading companies including Boots, Fujitsu, The Co-operative Group and AstraZeneca.

If you choose not to work with a partner organisation, you have the opportunity to base your dissertation on either:

  • a research project based at the University - if you are interested in building a career in data science research
  • an enterprise project - if you are interested in starting your own data science business

Student Placement Experiences