Term 1 provides core data science knowledge and skills training and is divided into five study modules, worth a total of 75 credits (15 credits per module). You will study three Common Core data science modules that are compulsory, together with two Core statistics modules that are Specialism-specific and tailored according to your academic background (Statistics I or Statistics II).
Term 1: Core Modules
Common Core Modules

Common Core Modules: Data Science Fundamentals SCC460
This module teaches students how data science is performed within academia and industry (via invited talks), research methods and how different research strategies are applied across disciplines, and data science techniques for processing and analysing data. Students will engage in group project work, based on project briefs provided by industrial speakers, within multi-skilled teams (e.g. computing, statistics and environmental science students) in order to apply their data science skills to researching and solving an industrial data science problem.
Topics covered will include
 The role of the data scientist and the evolving epistemology of data science
 The language of research, how to form research questions, writing literature reviews, and variance of research strategies across disciplines
 Ethics surrounding data collection and re-sharing, and unwanted inferences
 Identifying potential data sources and the data acquisition processes
 Defining and quantifying biases, and data preparation (e.g. cleaning and standardisation)
 Choosing a potential model for data, understanding model requirements and constraints, specifying model properties a priori, and fitting models
 Inspection of data and results using plots, and hypothesis and significance tests
 Writing up and presenting findings
Learning
Students will learn through a series of group exercises around research studies and projects related to data science topics. Invited talks from industry tackling data science problems will be given to teach the students about the application of data science skills in industry and academia. Students will gain knowledge of:
 Defining a research question and a hypothesis to be tested, and choosing an appropriate research strategy to test that hypothesis
 Analysing datasets provided in heterogeneous forms using a range of statistical techniques
 How to relate potential data sources to a given research question, acquire such data and integrate it together
 Designing and performing appropriate experiments given a research question
 Implementing appropriate models for experiments and ensuring that the model is tested in the correct manner
 Analysing experimental findings and relating these findings back to the original research goal
Recommended texts and other learning resources
 O'Neil, C. and Schutt, R. (2013). Doing Data Science: Straight Talk from the Frontline. O'Reilly
 Trochim, W. (2006). The Research Methods Knowledge Base. Cengage Learning

Common Core Modules: Programming for Data Scientists SCC461
This module is designed both for students who are completely new to programming and for experienced programmers, bringing both groups to a level where they can handle complex data science problems. Beginners will learn the fundamentals of programming, while experienced students will have the opportunity to sharpen and further develop their programming skills. Students will learn data-processing techniques, including visualisation and statistical data analysis. To provide a broad foundation for handling the most complex data science tasks, the module also covers problem solving and the development of graphical applications.
In particular, students will gain experience with two very important open-source languages: R and Python. R is a leading language for statistical analysis, widely applied in academia and industry to handle a variety of different problems. Being able to program in R gives data scientists access to well-maintained, up-to-date libraries for a variety of classical and state-of-the-art statistical methods. Python, on the other hand, is a general-purpose programming language, also widely used for three main reasons: it is easy to learn, being recommended as a "first" programming language; it allows easy and quick development of applications; and it has a great variety of useful, open libraries. For those reasons, Python has also been widely applied in scientific computing and data analysis. Additionally, Python enables the data scientist to develop other kinds of useful applications: for example, searching for optimal decisions given a dataset, graphical applications for data gathering, or even programming Raspberry Pi devices to create sensors or robots for data collection. Learning these two languages will therefore not only develop students' programming skills, but also give them direct access to two fundamental languages for contemporary data analysis, scientific computing, and general programming.
Additionally, students will gain experience by working through exercise tasks and discussing their work with their peers, thereby fostering interpersonal communication skills. Students who are new to programming will find help in their experienced peers, and experienced programmers will learn how to assist beginners and explain the fundamental concepts to them.
Topics covered will include
 Fundamental programming concepts (statements, variables, functions, loops, etc.)
 Data abstraction (modules, classes, objects, etc.)
 Problem solving
 Using libraries for developing applications (e.g., SciPy, PyGames)
 Performing statistical analysis and data visualisation
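As a flavour of the statistical analysis taught in the module, the following minimal Python sketch (the dataset is purely hypothetical, not course material) computes basic summary statistics using only the standard library:

```python
import statistics

# Hypothetical sample: daily temperature readings (degrees C)
readings = [12.1, 13.4, 11.8, 14.2, 12.9, 13.7, 12.5]

mean = statistics.mean(readings)
stdev = statistics.stdev(readings)   # sample standard deviation
median = statistics.median(readings)

print(f"mean={mean:.2f}, sd={stdev:.2f}, median={median:.2f}")
```

In practice the module's analyses would use richer libraries (e.g. SciPy, as listed above), but the workflow of loading data, summarising it and reporting results is the same.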
On successful completion of this module students will be able to
 Solve data science problems in an automated fashion
 Handle complex datasets, which cannot be easily analysed "by hand"
 Use existing libraries and/or develop their own libraries
 Learn new programming languages, given the background knowledge of two important ones
Bibliography
 Dalgaard, P. (2008). Introductory Statistics with R. Springer. ISBN-13: 978-0387954752
 Teetor, P. (2011). R Cookbook. O'Reilly Media, 1st edition. ISBN-13: 978-0596809157
 Python Documentation: https://www.python.org/doc/
 SciPy Documentation: https://www.scipy.org/docs.html
 PyGames Documentation: https://www.pygame.org/docs/

Common Core Modules: Data Mining SCC403
This module provides comprehensive coverage of the problems related to data representation, storage, manipulation, retrieval and processing, in terms of extracting information from data. It has been designed to provide a fundamental theoretical level of knowledge, together with skills developed in the related laboratory sessions, for this specific aspect of Data Science, which plays an important role in any system and application. In this way it prepares students both for the second module on the topic of data and for their projects.
Topics to be covered will include
 Data Primer: setting the scene (Big Data, Cloud Computing); the time, storage and computing power trade-off: offline versus online
 Data Representations
 Storage Paradigms
 Vector-space models
 Hierarchical (agglomerative/divisive)
 k-means
 SQL and Relational Data Structures (short refresher)
 NoSQL: Document stores, graph databases
 Inference and reasoning
 Associative and Fuzzy Rules
 Inference mechanisms
 Data Processing
 Clustering
 Density-based, online, evolving
 Classification
 Randomness and determinism, frequentist and belief based approaches, probability density, recursive density estimation, averages and moments, important random signals, response of linear systems to random signals, random signal models
 Discriminative (Linear Discriminant Analysis, Single Perceptron, Multilayer Perceptron, Learning Vector Classifier, Support Vector Machines), Generative (Naive Bayes)
 Supervised and unsupervised learning, online and offline systems, adaptive and evolving systems, evolving versus evolutionary systems, normalisation and standardisation
 Fuzzy Rule-based Classifiers, Regression- or Label-based classifiers
 Self-learning Classifiers, evolving Classifiers, dynamic data space partitioning using evolving clustering and data clouds, monitoring the quality of the self-learning system online, evolving multi-model predictive systems
 Semi-supervised Learning (Self-learning, evolving, Bootstrapping, Expectation-Maximisation, ensemble classifiers)
 Information Extraction vs Retrieval
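To illustrate the clustering topics above, here is a minimal k-means sketch in pure Python (the 1-D data and function name are illustrative, not course material). It alternates the two classic steps, assignment and mean update, until the centres settle:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means for 1-D data: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initialise centres from the data
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

# Two well-separated 1-D groups; centres converge to the group means
data = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
print(kmeans(data, 2))  # → [1.0, 9.0]
```

The density-based, online and evolving variants listed above relax exactly the assumptions this sketch makes: a fixed k, batch access to all points, and spherical (distance-to-mean) clusters.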
On successful completion of this module students will
 Demonstrate understanding of the concepts and specific methodologies for data representation and processing and their applications to practical problems
 Analyse and synthesise effective methods and algorithms for data representation and processing
 Develop software scripts that implement advanced data representation and processing and demonstrate their impact on performance
 List, explain and generalise the tradeoffs of performance and complexity in designing practical solutions for problems of data representation and processing in terms of storage, time and computing power
Statistics Modules
Statistics Modules I is for students with a degree in Mathematics and/or Statistics. Statistics Modules II is for students with A-level Mathematics or equivalent.

Statistics Modules I: Statistical Methods and Modelling CFAS440
The aim of this module is to address the fundamentals of statistics for those who do not have a mathematics and statistics undergraduate degree. Building upon the pre-learning ‘mathematics for statistics’ module, it is delivered over five weeks via a series of lectures and practicals. Students will develop an understanding of the theory behind core statistical topics: sampling, hypothesis testing, and modelling. They will also put this knowledge into practice by applying it to real data to address research questions.
The module is an amalgamation of three short courses and is taught in weeks 1–5.
Topics covered will include
 Statistical Methods; commonly used probability distributions, parameter estimation, sampling variability, hypothesis testing, basic measures of bivariate relationships
 Generalised Linear Models; the general linear model and the least-squares method, logistic regression for binary responses, Poisson regression for count data. More broadly, how to build a flexible linear predictor to capture relationships of interest
 These short courses are supported by tutorial sessions and office hours
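As a small illustration of the least-squares method named above, the following Python sketch (with made-up, noise-free data) fits a simple linear regression y = a + bx using the closed-form estimates:

```python
def least_squares(xs, ys):
    """Ordinary least-squares fit of y = a + b*x (simple linear regression)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # Slope: sum of cross-deviations over sum of squared x-deviations
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
        / sum((x - xbar) ** 2 for x in xs)
    a = ybar - b * xbar  # intercept from the fitted means
    return a, b

# Noise-free data on the line y = 2 + 3x, so the fit recovers it exactly
xs = [0, 1, 2, 3, 4]
ys = [2, 5, 8, 11, 14]
print(least_squares(xs, ys))  # → (2.0, 3.0)
```

The generalised linear models covered later in the module replace this closed form with iterative fitting, but the linear predictor a + bx is the same building block.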
On Successful completion, students will be able to
 Comprehend the mathematical notation used in explaining probability and statistics
 Demonstrate knowledge of basic principles in probability, statistical distributions, sampling and estimation
 Make decisions on the appropriate way to test a hypothesis, carry out the test and interpret the results
 Demonstrate knowledge of the general linear model, the leastsquares method of estimation, and the linear predictor. As well the extensions to generalised linear models for discrete data
 Decide on the appropriate way to statistically address a research question
 Carry out said statistical analyses, assessing model results and performance
 Report their findings in context
Assessment
There will be three pieces of coursework:
 One assessment for Statistical Methods; assessing understanding and application of statistical concepts, and interpretation of results from hypothesis testing.
 Two independently produced reports for Generalized Linear Models; centred on indepth statistical analyses.
Bibliography
 Upton, G., & Cook, I. (1996). Understanding statistics. Oxford University Press
 Rice, J. (2006). Mathematical statistics and data analysis. Cengage Learning
 Dobson, A. J., & Barnett, A. G. (2008). An Introduction to Generalized Linear Models. CRC Press
 Fox, J. (2008). Applied regression analysis and generalized linear models. Sage Publications

Statistics Modules I: Statistical Inference CFAS406
This module aims to provide an in-depth understanding of statistics as a general approach to the problem of making valid inferences about relationships from observational and experimental studies. The emphasis will be on the principle of Maximum Likelihood as a unifying theory for estimating parameters. The module is delivered as a combination of lectures and practicals over four weeks.
Topics covered will include
 Revision of probability theory and parametric statistical models
 The properties of statistical hypothesis tests, statistical estimation and sampling distributions
 Maximum Likelihood Estimation of model parameters
 Asymptotic distributions of the maximum likelihood estimator and associated statistics for use in hypothesis testing
 Application of likelihood inference to simple statistical analyses including linear regression and contingency tables
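To give a flavour of Maximum Likelihood Estimation on a one-parameter model, here is a short Python sketch (the data are hypothetical) for an Exponential(rate) model, whose MLE has the closed form 1 divided by the sample mean:

```python
import math

# Hypothetical waiting-time data, modelled as Exponential(rate)
data = [0.5, 1.2, 0.3, 2.0, 0.8, 1.1]

def log_likelihood(rate, xs):
    """Exponential log-likelihood: n*log(rate) - rate*sum(xs)."""
    return len(xs) * math.log(rate) - rate * sum(xs)

# Closed-form maximum likelihood estimate: 1 / sample mean
mle = len(data) / sum(data)

# Sanity check: the log-likelihood at the MLE beats nearby candidate rates
assert all(log_likelihood(mle, data) >= log_likelihood(mle + d, data)
           for d in (-0.1, 0.1))
print(f"MLE rate = {mle:.4f}")
```

Most models in the module have no such closed form, which is where the asymptotic theory and numerical methods listed above come in.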
Learning
Students will learn through the application of concepts and techniques covered in the module by application to real data sets. Students will be encouraged to examine issues of substantive interest in these studies. Students will acquire knowledge of:
 Application of likelihood inference to simple statistical analyses including linear regression
 The basic principles of probability theory
 Maximum Likelihood as a theory for estimation and inference
 The application of the methodology to hypothesis testing for model parameters
Students will, more generally, develop skills to
 apply theoretical concepts
 identify and solve problems
Bibliography
 Dobson, A. J. (1983). An Introduction to Statistical Modelling. Chapman and Hall
 Eliason, S. R. (1993). Maximum Likelihood Estimation: Logic and Practice. Sage Publications
 Pickles, A. (1984). An introduction to likelihood analysis. Geo Books
 Pawitan, Y. (2001). In all likelihood: statistical modelling and inference using likelihood. Oxford University Press.

Statistics Modules II: Generalised Linear Models MATH552
Generalised linear models are now one of the most frequently used statistical tools of the applied statistician. They extend the ideas of regression analysis to a wider class of problems involving the relationship between a response and one or more explanatory variables. In this course we aim to discuss applications of generalised linear models to a diverse range of practical problems, involving data from areas such as biology, the social sciences and time series, and to explore the theoretical basis of these models.
Topics covered will include
 We introduce a large family of models, called generalised linear models (GLMs), that includes the standard linear regression model as a special case, and we discuss the theory and application of these models
 We learn the iteratively reweighted least squares (IRLS) algorithm for the estimation of parameters
 Formulation of sensible models for the relationship between a response and one or more explanatory variables, taking into account the motivation for data collection
 We fit and check these models with the statistical package R; produce confidence intervals and tests corresponding to questions of interest; and state conclusions in everyday language
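The IRLS algorithm mentioned above can be sketched for the simplest possible case, an intercept-only Poisson GLM with log link (illustrative Python, not the course's R-based treatment): each step solves a weighted least-squares problem with weights w = mu and working response z = eta + (y - mu)/mu.

```python
import math

def poisson_irls_intercept(ys, iters=25):
    """IRLS for an intercept-only Poisson GLM with log link.

    Each iteration refits a weighted least-squares problem with
    weights w = mu and working response z_i = eta + (y_i - mu)/mu.
    """
    beta = 0.0                        # starting value for the intercept
    for _ in range(iters):
        mu = math.exp(beta)           # fitted mean under the log link
        w = mu                        # Poisson weight: Var(Y) = mu
        z = [beta + (y - mu) / mu for y in ys]
        # Weighted least-squares solution (weights are equal here,
        # so this reduces to the mean of the working responses)
        beta = sum(w * zi for zi in z) / (w * len(ys))
    return beta

counts = [2, 3, 1, 4, 2, 3]
beta0 = poisson_irls_intercept(counts)
# For the intercept-only model the MLE satisfies exp(beta0) = mean(y)
print(math.exp(beta0))  # → 2.5
```

With explanatory variables the scalar update becomes a full weighted least-squares solve, which is exactly what R's `glm` performs internally.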
On successful completion students will be able to
 Define the components of GLM
 Express standard models (Gaussian (Normal), Poisson,…) in GLM form
 Derive relationships between mean and variance and parameters of an exponential family distribution
 Specify design matrices for given problems
 Define and interpret model deviance and degrees of freedom
 Use model deviances to assist in model selection
 Define deviance and Pearson residuals, and understand how to use them for checking model quality
 Use R to fit standard (and appropriate) GLMs to data
 Understand and interpret R output for model selection and diagnosis, and draw appropriate scientific conclusions
Bibliography
 P. McCullagh and J. Nelder. Generalized Linear Models, Chapman and Hall, 1999
 A.J. Dobson, An Introduction to Generalised Linear Models, Chapman and Hall, 1990

Statistics Modules II: Likelihood Inference MATH551
This course considers the idea of statistical models and how the likelihood function, defined to be the probability of the observed data viewed as a function of unknown model parameters, can be used to make inference about those parameters. This inference includes both estimates of the values of these parameters and measures of the uncertainty surrounding these estimates. We consider single- and multi-parameter models, and models which do not assume the data are independent and identically distributed. We also cover computational aspects of likelihood inference that are required in many practical applications, including numerical optimisation of likelihood functions and bootstrap methods to estimate uncertainty.
Topics covered will include
 Definition of the likelihood function for single- and multi-parameter models, and how it is used to calculate point estimates (maximum likelihood estimates)
 Asymptotic distribution of the maximum likelihood estimator, and the profile deviance, and how these are used to quantify uncertainty in estimates
 Interrelationships between parameters, and the definition and use of orthogonality
 Generalised likelihood ratio statistics, and their use for hypothesis tests
 Calculating likelihood functions for non-IID models
 Use of computational methods in R to calculate maximum likelihood estimates and confidence intervals; perform hypothesis tests and calculate bootstrap confidence intervals
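As an illustration of the bootstrap confidence intervals listed above, this Python sketch (hypothetical data) builds a 95% percentile interval for a sample mean by resampling with replacement:

```python
import random
import statistics

# Hypothetical sample; we bootstrap a 95% confidence interval for its mean
sample = [4.1, 5.3, 3.8, 6.0, 4.7, 5.1, 4.4, 5.8, 4.9, 5.5]

rng = random.Random(42)
boot_means = sorted(
    statistics.mean(rng.choices(sample, k=len(sample)))  # resample with replacement
    for _ in range(2000)
)
# Percentile interval: the 2.5th and 97.5th percentiles of the bootstrap distribution
lower, upper = boot_means[49], boot_means[1949]
print(f"95% bootstrap CI for the mean: ({lower:.2f}, {upper:.2f})")
```

The likelihood-based intervals covered in the course come instead from the profile deviance; the bootstrap is the model-light alternative when asymptotic results are doubtful.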
On successful completion students will be able to
 Understand how to construct statistical models for simple applications
 Appreciate how information about the unknown parameters is obtained and summarized via the likelihood function
 Calculate the likelihood function for basic statistical models
 Evaluate point estimates and make statements about the variability of these estimates
 Understand the interrelationships between parameters, and the concept of orthogonality
 Perform hypothesis tests using the generalised likelihood ratio statistic
 Use computational methods to calculate maximum likelihood estimates
 Use computational methods to construct both likelihood-based and bootstrapped confidence intervals, and perform hypothesis tests
Bibliography
 A Azzalini. Statistical Inference: Based on the Likelihood. Chapman and Hall. 1996
 D R Cox and D V Hinkley. Theoretical Statistics. Chapman and Hall. 1974
 Y Pawitan. In All Likelihood: Statistical Modeling and Inference Using Likelihood. OUP. 2001