
Term 1: Core Modules

Term 1 provides core data science knowledge and skills training and is divided into five study modules, worth a total of 75 credits (15 credits per module). You will study three compulsory Common Core data science modules, together with two Core statistics modules that are Specialism-specific and tailored to your academic background: Statistics I or Statistics II.

Common Core Modules

  • Common Core Modules: Data Science Fundamentals SCC460

    This module teaches students how data science is performed within academia and industry (via invited talks), covers research methods and how different research strategies are applied across different disciplines, and introduces data science techniques for processing and analysing data. Students will engage in group project work, based on project briefs provided by industrial speakers, within multi-skilled teams (e.g. computing students, statistics students, environmental science students) in order to apply their data science skills to researching and solving an industrial data science problem.

    Topics covered will include

    • The role of the data scientist and the evolving epistemology of data science
    • The language of research, how to form research questions, writing literature reviews, and variance of research strategies across disciplines
    • Ethics surrounding data collection and re-sharing, and unwanted inferences
    • Identifying potential data sources and the data acquisition processes
    • Defining and quantifying biases, and data preparation (e.g. cleaning, standardisation, etc.)
    • Choosing a potential model for data, understanding model requirements and constraints, specifying model properties a priori, and fitting models
    • Inspection of data and results using plots, and hypothesis and significance tests (a brief sketch follows this list)
    • Writing up and presenting findings
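
    As an informal illustration of the plotting and significance-testing point above (not part of the module materials), the short Python sketch below inspects two synthetic samples with a histogram and runs a two-sample t-test; the data and all names in it are invented for the example.

      # Informal illustration only: inspect two synthetic samples with a plot,
      # then run a two-sample significance test.
      import numpy as np
      import matplotlib.pyplot as plt
      from scipy import stats

      rng = np.random.default_rng(0)
      group_a = rng.normal(loc=10.0, scale=2.0, size=50)   # synthetic measurements
      group_b = rng.normal(loc=11.0, scale=2.0, size=50)

      # Inspect the data visually before testing
      plt.hist([group_a, group_b], bins=15, label=["group A", "group B"])
      plt.legend()
      plt.xlabel("measurement (synthetic)")
      plt.savefig("groups.png")  # write to file rather than assuming a display

      # Two-sample t-test: is the difference in means statistically significant?
      result = stats.ttest_ind(group_a, group_b)
      print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")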

    Learning

    Students will learn through a series of group exercises around research studies and projects related to data science topics. Invited talks from industry practitioners tackling data science problems will teach students about the application of data science skills in industry and academia. Students will gain knowledge of:

    • Defining a research question and a hypothesis to be tested, and choosing an appropriate research strategy to test that hypothesis
    • Analysing datasets provided in heterogeneous forms using a range of statistical techniques
    • How to relate potential data sources to a given research question, acquire such data and integrate it together
    • Designing and performing appropriate experiments given a research question
    • Implementing appropriate models for experiments and ensuring that the model is tested in the correct manner
    • Analysing experimental findings and relating these findings back to the original research goal

    Recommended texts and other learning resources

    • O'Neil, C., and Schutt, R. (2013). Doing Data Science: Straight Talk from the Frontline. O'Reilly
    • Trochim, W. (2006). The Research Methods Knowledge Base. Cengage Learning
  • Common Core Modules: Programming for Data Scientists SCC461

    This module is designed both for students who are completely new to programming and for experienced programmers, bringing both groups to a highly skilled level for handling complex data science problems. Beginner students will learn the fundamentals of programming, while experienced students will have the opportunity to sharpen and further develop their programming skills. Students will learn data-processing techniques, including visualisation and statistical data analysis. To provide a broad foundation for handling the most complex data science tasks, we will also cover problem solving and the development of graphical applications.

    In particular, students will gain experience with two very important open-source languages: R and Python. R is a leading language for statistical analysis, widely applied in academia and industry to handle a variety of different problems. Being able to program in R gives data scientists access to the best and most up-to-date libraries for handling a variety of classical and state-of-the-art statistical methods. Python, on the other hand, is a general-purpose programming language that is also widely used, for three main reasons: it is easy to learn, being recommended as a "first" programming language; it allows easy and quick development of applications; and it has a great variety of useful and open libraries. For those reasons, Python has also been widely applied to scientific computing and data analysis. Additionally, Python enables the data scientist to easily develop other kinds of useful applications: for example, searching for optimal decisions given a dataset, graphical applications for data gathering, or even programming Raspberry Pi devices in order to create sensors or robots for data collection. Learning these two languages will therefore not only enable students to develop programming skills, but will also give them direct access to two fundamental languages for contemporary data analysis, scientific computing, and general programming.
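
    To give a flavour of the kind of data processing involved (this is an illustrative sketch rather than module material), the Python fragment below loads a tabular dataset, computes summary statistics and draws a quick plot; the file name and column names are hypothetical.

      # Illustrative sketch only: the file "measurements.csv" and its columns
      # ("site", "temperature", "value") are hypothetical examples.
      import pandas as pd
      import matplotlib.pyplot as plt

      df = pd.read_csv("measurements.csv")          # load a tabular dataset
      print(df.describe())                          # summary statistics per column
      print(df.groupby("site")["value"].mean())     # a simple grouped aggregate

      df.plot.scatter(x="temperature", y="value")   # quick visual inspection
      plt.savefig("scatter.png")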

    Additionally, students will gain experience by working through exercise tasks and discussing their work with their peers, thereby fostering interpersonal communication skills. Students who are new to programming will find help from their experienced peers, and experienced programmers will learn how to assist beginners and explain the fundamental concepts to them.

    Topics covered will include

    • Fundamental programming concepts (statements, variables, functions, loops, etc)
    • Data abstraction (modules, classes, objects, etc)
    • Problem-solving
    • Using libraries for developing applications (e.g., SciPy, Pygame)
    • Performing statistical analysis and data visualisation

    On successful completion of this module students will be able to

    • Solve data science problems in an automated fashion
    • Handle complex datasets which cannot easily be analysed "by hand"
    • Use existing libraries and/or develop their own libraries
    • Learn new programming languages, given the background knowledge of two important ones


  • Common Core Modules: Data Mining SCC403

    This module provides comprehensive coverage of the problems related to data representation, storage, manipulation, retrieval and processing, in terms of extracting information from the data. It has been designed to provide a fundamental level of theoretical knowledge, together with practical skills developed in the related laboratory sessions, for this specific aspect of Data Science, which plays an important role in any system and application. In this way it prepares students for the second module on the topic of data, as well as for their projects.

    Topics to be covered will include

    • Data Primer: Setting the scene: Big Data, Cloud Computing; The time, storage and computing power compromise: off-line versus on-line
    • Data Representations
    • Storage Paradigms
    • Vector-space models
    • Hierarchical (agglomerative/divisive)
    • k-means (see the sketch after this list)
    • SQL and Relational Data Structures (short refresher)
    • NoSQL: Document stores, graph databases
    • Inference and reasoning
    • Associative and Fuzzy Rules
    • Inference mechanisms
    • Data Processing
    • Clustering
    • Density-based, on-line, evolving
    • Classification
    • Randomness and determinism, frequentist and belief based approaches, probability density, recursive density estimation, averages and moments, important random signals, response of linear systems to random signals, random signal models
    • Discriminative (Linear Discriminant Analysis, Single Perceptron, Multi-layer Perceptron, Learning Vector Classifier, Support Vector Machines), Generative (Naive Bayes)
    • Supervised and unsupervised learning, online and offline systems, adaptive and evolving systems, evolving versus evolutionary systems, normalisation and standardisation
    • Fuzzy Rule-based Classifiers, Regression- or Label-based classifiers
    • Self-learning Classifiers, evolving Classifiers, dynamic data space partitioning using evolving clustering and data clouds, monitoring the quality of the self-learning system online, evolving multi-model predictive systems
    • Semi-supervised Learning (Self-learning, evolving, Bootstrapping, Expectation-Maximisation, ensemble classifiers)
    • Information Extraction vs Retrieval
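
    To illustrate one of the clustering topics above, the following minimal Python sketch runs k-means on synthetic two-dimensional data using scikit-learn; the library choice and data are assumptions made purely for illustration, not the module's prescribed toolkit.

      # Minimal k-means illustration on synthetic 2-D data (scikit-learn).
      import numpy as np
      from sklearn.cluster import KMeans

      rng = np.random.default_rng(1)
      # Three synthetic "blobs" of points around different centres
      data = np.vstack([
          rng.normal([0, 0], 0.5, size=(100, 2)),
          rng.normal([4, 4], 0.5, size=(100, 2)),
          rng.normal([0, 4], 0.5, size=(100, 2)),
      ])

      km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
      print("cluster centres:\n", km.cluster_centers_)
      print("first ten labels:", km.labels_[:10])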

    On successful completion of this module students will

    • Demonstrate understanding of the concepts and specific methodologies for data representation and processing and their applications to practical problems
    • Analyse and synthesise effective methods and algorithms for data representation and processing
    • Develop software scripts that implement advanced data representation and processing and demonstrate their impact on performance
    • List, explain and generalise the trade-offs of performance and complexity in designing practical solutions for problems of data representation and processing in terms of storage, time and computing power

Statistics Modules

Statistics Modules I is for students with A-level Mathematics or equivalent. Statistics Modules II is for students with a degree in Mathematics and/or Statistics.

  • Statistics Modules I: Statistical Methods and Modelling CFAS440

    The aim of this module is to address the fundamentals of statistics for those who do not have a mathematics and statistics undergraduate degree. Building upon the pre-learning ‘mathematics for statistics’ module, it is delivered over five weeks via a series of lectures and practicals. Students will develop an understanding of the theory behind core statistical topics: sampling, hypothesis testing, and modelling. They will also put this knowledge into practice by applying it to real data to address research questions.

    The module is an amalgamation of three short courses and is taught in weeks 1-5.

    Topics covered will include

    • Statistical Methods; commonly used probability distributions, parameter estimation, sampling variability, hypothesis testing, basic measures of bivariate relationships
    • Generalised Linear Models; the general linear model and the least-squares method, logistic regression for binary responses, Poisson regression for count data. More broadly, how to build a flexible linear predictor to capture relationships of interest (a brief illustration follows this list)
    • These short courses are supported by tutorial sessions and office hours
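
    As a hedged illustration of the generalised linear models topic above (not part of the module materials, which may use different software), the short Python sketch below fits a logistic regression to synthetic binary-response data with statsmodels.

      # Illustration only: a logistic regression (binomial-family GLM) fitted
      # to synthetic binary data with statsmodels.
      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(2)
      x = rng.normal(size=200)
      p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))      # true log-odds rise with x
      y = rng.binomial(1, p)

      X = sm.add_constant(x)                      # design matrix: intercept + slope
      result = sm.GLM(y, X, family=sm.families.Binomial()).fit()
      print(result.summary())                     # estimates, standard errors, tests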

    On successful completion, students will be able to

    • Comprehend the mathematical notation used in explaining probability and statistics
    • Demonstrate knowledge of basic principles in probability, statistical distributions, sampling and estimation
    • Make decisions on the appropriate way to test a hypothesis, carry out the test and interpret the results
    • Demonstrate knowledge of the general linear model, the least-squares method of estimation, and the linear predictor, as well as the extensions to generalised linear models for discrete data
    • Decide on the appropriate way to statistically address a research question
    • Carry out said statistical analyses, assessing model results and performance
    • Report their findings in context

    Assessment

    There will be three pieces of coursework:

    • One assessment for Statistical Methods; assessing understanding and application of statistical concepts, and interpretation of results from hypothesis testing.
    • Two independently produced reports for Generalised Linear Models; centred on in-depth statistical analyses.

    Bibliography

    • Upton, G., & Cook, I. (1996). Understanding statistics. Oxford University Press
    • Rice, J. (2006). Mathematical statistics and data analysis. Cengage Learning
    • Dobson, A. J., & Barnett, A. G. (2008). An Introduction to Generalized Linear Models. CRC Press
    • Fox, J. (2008). Applied regression analysis and generalized linear models. Sage Publications
  • Statistics Modules I: Statistical Inference CFAS406

    This module aims to provide an in-depth understanding of statistics as a general approach to the problem of making valid inferences about relationships from observational and experimental studies. The emphasis will be on the principle of Maximum Likelihood as a unifying theory for estimating parameters. The module is delivered as a combination of lectures and practicals over four weeks.

    Topics covered will include

    • Revision of probability theory and parametric statistical models
    • The properties of statistical hypothesis tests, statistical estimation and sampling distributions
    • Maximum Likelihood Estimation of model parameters (see the sketch after this list)
    • Asymptotic distributions of the maximum likelihood estimator and associated statistics for use in hypothesis testing
    • Application of likelihood inference to simple statistical analyses including linear regression and contingency tables
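
    As an informal example (not module material), the minimal Python sketch below illustrates maximum likelihood estimation on synthetic data by numerically minimising a negative log-likelihood; the exponential model and all names are chosen purely for illustration.

      # Minimal illustration of maximum likelihood estimation: fit the rate of
      # an exponential distribution by minimising the negative log-likelihood.
      import numpy as np
      from scipy.optimize import minimize_scalar

      rng = np.random.default_rng(3)
      data = rng.exponential(scale=2.0, size=500)     # synthetic sample, true rate 0.5

      def neg_log_lik(rate):
          # Exponential log-likelihood: n*log(rate) - rate*sum(x)
          return -(len(data) * np.log(rate) - rate * data.sum())

      fit = minimize_scalar(neg_log_lik, bounds=(1e-6, 10), method="bounded")
      print("numerical MLE of rate:", fit.x)
      print("closed-form MLE      :", 1 / data.mean())   # analytic answer for comparison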

    Learning

    Students will learn by applying the concepts and techniques covered in the module to real data sets. Students will be encouraged to examine issues of substantive interest in these studies. Students will acquire knowledge of:

    • Application of likelihood inference to simple statistical analyses including linear regression
    • The basic principles of probability theory
    • Maximum Likelihood as a theory for estimation and inference
    • The application of the methodology to hypothesis testing for model parameters

    Students will, more generally, develop skills to

    • apply theoretical concepts
    • identify and solve problems

    Bibliography

    • Dobson, A. J. (1983). An Introduction to Statistical Modelling. Chapman and Hall
    • Eliason, S. R. (1993). Maximum Likelihood Estimation: Logic and Practice. Sage Publications
    • Pickles, A. (1984). An introduction to likelihood analysis. Geo Books
    • Pawitan, Y. (2001). In all likelihood: statistical modelling and inference using likelihood. Oxford University Press.
  • Statistics Modules II: Generalised Linear Models MATH552

    Generalised linear models are now one of the most frequently used statistical tools of the applied statistician. They extend the ideas of regression analysis to a wider class of problems involving the relationship between a response and one or more explanatory variables. In this course we aim to discuss applications of generalised linear models to a diverse range of practical problems, involving data from areas such as biology, the social sciences and time series, and to explore the theoretical basis of these models.

    Topics covered will include

    • We introduce a large family of models, called the generalised linear models (GLMs), that includes the standard linear regression model as a special case and we discuss the theory and application of these models
    • We learn the iteratively reweighted least squares (IRLS) algorithm for the estimation of parameters (a brief sketch follows this list)
    • Formulation of sensible models for the relationship between a response and one or more explanatory variables, taking into account the motivation for data collection
    • We fit and check these models with the statistical package R; produce confidence intervals and tests corresponding to questions of interest; and state conclusions in everyday language
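
    As a rough illustration of the IRLS idea mentioned above, the sketch below fits a Poisson GLM with a log link to synthetic data by iteratively reweighted least squares; it is written in Python purely for illustration, whereas the module itself fits such models in R.

      # Sketch of iteratively reweighted least squares (IRLS) for a Poisson GLM
      # with log link, on synthetic data.
      import numpy as np

      rng = np.random.default_rng(4)
      n = 500
      X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + covariate
      beta_true = np.array([0.3, 0.8])
      y = rng.poisson(np.exp(X @ beta_true))

      beta = np.zeros(2)                      # starting value
      for _ in range(25):
          eta = X @ beta                      # linear predictor
          mu = np.exp(eta)                    # mean under the log link
          W = mu                              # Poisson working weights
          z = eta + (y - mu) / mu             # working response
          XtW = X.T * W                       # X' diag(W)
          beta = np.linalg.solve(XtW @ X, XtW @ z)   # weighted least squares update

      print("IRLS estimate:", beta)           # should be close to beta_true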

    On successful completion students will be able to

    • Define the components of GLM
    • Express standard models (Gaussian (Normal), Poisson,…) in GLM form
    • Derive relationships between mean and variance and parameters of an exponential family distribution
    • Specify design matrices for given problems
    • Define and interpret model deviance and degrees of freedom
    • Use model deviances to assist in model selection
    • Define deviance and Pearson residuals, and understand how to use them for checking model quality
    • Use R to fit standard (and appropriate) GLMs to data
    • Understand and interpret R output for model selection and diagnosis, and draw appropriate scientific conclusions

    Bibliography

    • P. McCullagh and J. Nelder. Generalized Linear Models, Chapman and Hall, 1999
    • A.J. Dobson, An Introduction to Generalised Linear Models, Chapman and Hall, 1990
  • Statistics Modules II: Likelihood Inference MATH551

    This course considers the idea of statistical models and how the likelihood function, defined to be the probability of the observed data viewed as a function of unknown model parameters, can be used to make inference about those parameters. This inference includes both estimates of the values of these parameters, and measures of the uncertainty surrounding these estimates. We consider single and multi-parameter models, and models which do not assume the data are independent and identically distributed. We also cover computational aspects of likelihood inference that are required in many practical applications, including numerical optimization of likelihood functions and bootstrap methods to estimate uncertainty.

    Topics covered will include

    • Definition of the likelihood function for single and multi-parameter models, and how it is used to calculate point estimates (maximum likelihood estimates)
    • Asymptotic distribution of the maximum likelihood estimator and the profile deviance, and how these are used to quantify uncertainty in estimates
    • Inter-relationships between parameters, and the definition and use of orthogonality
    • Generalised likelihood ratio statistics, and their use for hypothesis tests
    • Calculating likelihood functions for non-IID models
    • Use of computational methods in R to calculate maximum likelihood estimates and confidence intervals, perform hypothesis tests, and calculate bootstrap confidence intervals (a brief sketch follows this list)
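
    As an informal illustration of the bootstrap point above, the short Python sketch below computes a percentile bootstrap confidence interval for a mean on synthetic data; the module itself carries out such computations in R, so this is a sketch of the idea rather than the prescribed method.

      # Illustration of a percentile bootstrap confidence interval for a mean,
      # using synthetic data.
      import numpy as np

      rng = np.random.default_rng(5)
      data = rng.exponential(scale=2.0, size=100)    # synthetic skewed sample

      boot_means = np.array([
          rng.choice(data, size=data.size, replace=True).mean()
          for _ in range(2000)                       # resample with replacement
      ])
      lower, upper = np.percentile(boot_means, [2.5, 97.5])
      print(f"mean = {data.mean():.2f}, 95% bootstrap CI = ({lower:.2f}, {upper:.2f})")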

    On successful completion students will be able to

    • Understand how to construct statistical models for simple applications
    • Appreciate how information about the unknown parameters is obtained and summarized via the likelihood function
    • Calculate the likelihood function for basic statistical models
    • Evaluate point estimates and make statements about the variability of these estimates
    • Understand the inter-relationships between parameters, and the concept of orthogonality
    • Perform hypothesis tests using the generalised likelihood ratio statistic
    • Use computational methods to calculate maximum likelihood estimates
    • Use computational methods to construct both likelihood-based and bootstrapped confidence intervals, and perform hypothesis tests

    Bibliography

    • A Azzalini. Statistical Inference: Based on the Likelihood. Chapman and Hall. 1996
    • D R Cox and D V Hinkley. Theoretical Statistics. Chapman and Hall. 1974
    • Y Pawitan. In All Likelihood: Statistical Modeling and Inference Using Likelihood. OUP. 2001