Term 1
In term 1 you will develop your core data science knowledge and skills. It is divided into six compulsory modules which span the breadth of data science, from the fundamentals of statistics and programming in Python and R to modern machine learning and artificial intelligence. These core modules provide a strong foundation for the specialist pathways in term 2.
Core modules
Data Science Fundamentals
This module teaches students how data science is performed in academia and industry (via invited talks), introduces research methods and how different research strategies are applied across disciplines, and covers data science techniques for processing and analysing data. Students will engage in group project work, based on project briefs provided by industrial speakers, within multi-skilled teams (e.g. computing, statistics and environmental science students), applying their data science skills to researching and solving an industrial data science problem.
Topics covered will include
- The role of the data scientist and the evolving epistemology of data science
- The language of research, how to form research questions, writing literature reviews, and variance of research strategies across disciplines
- Ethics surrounding data collection and re-sharing, and unwanted inferences
- Identifying potential data sources and the data acquisition processes
- Defining and quantifying biases, and data preparation (e.g. cleaning, standardisation, etc.)
- Choosing a potential model for data, understanding model requirements and constraints, specifying model properties a priori, and fitting models
- Inspection of data and results using plots, and hypothesis and significance tests
- Writing up and presenting findings
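As a small taste of the hypothesis and significance testing covered above, the following is a minimal sketch in Python, assuming SciPy is available; the two groups, their means and the 5% threshold are purely illustrative synthetic choices, not module material.

```python
# Illustrative two-sample significance test on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=5.0, scale=1.0, size=100)   # e.g. control measurements
group_b = rng.normal(loc=6.0, scale=1.0, size=100)   # e.g. treatment measurements

# Welch's t-test: does not assume the two groups have equal variances
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis of equal means at the 5% level")
```

Inspecting the two samples with a plot (e.g. overlaid histograms) before testing, as the topics above suggest, is the usual companion step.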
Learning
Students will learn through a series of group exercises built around research studies and projects on data science topics. Invited talks from industry practitioners tackling data science problems will show students how data science skills are applied in industry and academia. Students will gain knowledge of:
- Defining a research question and a hypothesis to be tested, and choosing an appropriate research strategy to test that hypothesis
- Analysing datasets provided in heterogeneous forms using a range of statistical techniques
- How to relate potential data sources to a given research question, acquire such data and integrate it together
- Designing and performing appropriate experiments given a research question
- Implementing appropriate models for experiments and ensuring that the model is tested in the correct manner
- Analysing experimental findings and relating these findings back to the original research goal
Recommended texts and other learning resources
- O'Neil, C. and Schutt, R. (2013) Doing Data Science: Straight Talk from the Frontline. O'Reilly
- Trochim, W. (2006) The Research Methods Knowledge Base. Cengage Learning
Programming for Data Scientists
This module is designed both for students who are completely new to programming and for experienced programmers, bringing both groups to a skill level suited to complex data science problems. Beginners will learn the fundamentals of programming, while experienced students will have the opportunity to sharpen and further develop their skills. Students will learn data-processing techniques, including visualisation and statistical data analysis. To provide a broad foundation for handling the most complex data science tasks, the module also covers problem solving and the development of graphical applications.
In particular, students will gain experience with two important open-source languages: R and Python. R is a leading language for statistical analysis, widely applied in academia and industry across a variety of problems. Programming in R gives data scientists access to mature, actively maintained libraries for a wide range of classical and state-of-the-art statistical methods. Python, on the other hand, is a general-purpose programming language that is also widely used, for three main reasons: it is easy to learn, and often recommended as a first programming language; it allows easy and quick development of applications; and it has a great variety of useful open libraries. For these reasons, Python is also widely applied in scientific computing and data analysis. Additionally, Python enables the data scientist to develop other kinds of useful applications: for example, searching for optimal decisions given a dataset, graphical applications for data gathering, or even programming Raspberry Pi devices to create sensors or robots for data collection. Learning these two languages will therefore not only develop students' programming skills, but also give them direct access to two fundamental languages for contemporary data analysis, scientific computing and general programming.
Additionally, students will gain experience by working through exercise tasks and discussing their work with their peers, thereby fostering interpersonal communication skills. Students who are new to programming will find help from their experienced peers, and experienced programmers will learn how to assist beginners and explain fundamental concepts to them.
Topics covered will include
- Fundamental programming concepts (statements, variables, functions, loops, etc)
- Data abstraction (modules, classes, objects, etc)
- Problem-solving
- Using libraries for developing applications (e.g. SciPy, Pygame)
- Performing statistical analysis and data visualisation
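To illustrate the last topic, a minimal sketch of descriptive statistics and a least-squares fit in Python, assuming NumPy and SciPy; the synthetic data and the linear model are illustrative choices, not part of the syllabus.

```python
# Summary statistics and an ordinary least-squares fit on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)  # noisy linear relationship

print(f"mean(y) = {y.mean():.2f}, std(y) = {y.std(ddof=1):.2f}")

# Fit y = slope * x + intercept and inspect the strength of the relationship
fit = stats.linregress(x, y)
print(f"slope = {fit.slope:.2f}, intercept = {fit.intercept:.2f}, r = {fit.rvalue:.2f}")
```

A scatter plot of `x` against `y` with the fitted line overlaid (e.g. via matplotlib) would complete the visualisation side of this topic.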
On successful completion of this module, students will be able to
- Solve data science problems in an automated fashion
- Handle complex datasets that cannot easily be analysed "by hand"
- Use existing libraries and/or develop their own libraries
- Learn new programming languages, given the background knowledge of two important ones
Bibliography
- Dalgaard, P. (2008) Introductory Statistics with R. Springer. ISBN-13: 978-0387954752
- Teetor, P. (2011) R Cookbook. O'Reilly Media, 1st edition. ISBN-13: 978-0596809157
- Python documentation: https://www.python.org/doc/
- SciPy documentation: https://www.scipy.org/
- Pygame documentation: https://www.pygame.org/docs/
Data Mining
This module provides comprehensive coverage of the problems of data representation, storage, manipulation, retrieval and processing, in terms of extracting information from data. It is designed to provide a fundamental theoretical level of knowledge, together with practical skills developed in the related laboratory sessions, in this specific aspect of data science, which plays an important role in any system and application. In this way it prepares students for the second module on the topic of data as well as for their projects.
Topics to be covered will include
- Data primer, setting the scene: big data, cloud computing; the time, storage and computing power trade-off: offline versus online processing
- Data Representations
- Storage Paradigms
- Vector-space models
- Hierarchical (agglomerative/divisive)
- k-means
- SQL and Relational Data Structures (short refresher)
- NoSQL: Document stores, graph databases
- Inference and reasoning
- Associative and Fuzzy Rules
- Inference mechanisms
- Data Processing
- Clustering
- Density-based, online, evolving
- Classification
- Randomness and determinism, frequentist and belief based approaches, probability density, recursive density estimation, averages and moments, important random signals, response of linear systems to random signals, random signal models
- Discriminative (Linear Discriminant Analysis, Single Perceptron, Multi-layer Perceptron, Learning Vector Classifier, Support Vector Machines), Generative (Naive Bayes)
- Supervised and unsupervised learning, online and offline systems, adaptive and evolving systems, evolving versus evolutionary systems, normalisation and standardisation
- Fuzzy rule-based classifiers, regression- or label-based classifiers
- Self-learning Classifiers, evolving Classifiers, dynamic data space partitioning using evolving clustering and data clouds, monitoring the quality of the self-learning system online, evolving multi-model predictive systems
- Semi-supervised Learning (Self-learning, evolving, Bootstrapping, Expectation-Maximisation, ensemble classifiers)
- Information Extraction vs Retrieval
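As one concrete example from the clustering topics above, a minimal sketch of Lloyd's k-means algorithm in plain NumPy; the two synthetic point clouds and the choice of k are illustrative assumptions, not laboratory material.

```python
# Lloyd's k-means: alternate between assigning points to their nearest
# centroid and moving each centroid to the mean of its assigned points.
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids on k randomly chosen data points
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: centroids move to the mean of their members
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

rng = np.random.default_rng(1)
cloud_a = rng.normal(loc=[0, 0], scale=0.5, size=(100, 2))
cloud_b = rng.normal(loc=[5, 5], scale=0.5, size=(100, 2))
centroids, labels = kmeans(np.vstack([cloud_a, cloud_b]), k=2)
print(centroids)
```

For two well-separated clouds like these, the algorithm recovers the two group centres; the density-based and evolving clustering methods listed above relax k-means' need to fix the number of clusters in advance.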
On successful completion of this module students will
- Demonstrate understanding of the concepts and specific methodologies for data representation and processing and their applications to practical problems
- Analyse and synthesise effective methods and algorithms for data representation and processing
- Develop software scripts that implement advanced data representation and processing and demonstrate their impact on performance
- List, explain and generalise the trade-offs of performance and complexity in designing practical solutions for problems of data representation and processing in terms of storage, time and computing power
Statistical Learning
This module provides an introduction to statistical learning.
Topics to be covered will include
- Big data
- Missing data
- Biased samples and recency
- Likelihood and cross-validation
On successful completion of this module students will
- Understand cross-validation and the splitting of samples into calibration, training and validation sets.
- Be able to handle regression problems for large datasets via variable-reduction methods such as the lasso and elastic net.
- Understand a variety of classification methods, including logistic and multinomial logistic models, regression trees, random forests, bagging and boosting.
- Examine classification methods, culminating in neural networks presented as extensions of generalised linear modelling.
- Understand clustering of big data using K-means, PAM and CLARA, followed by mixture models and latent class analysis.
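The sample-splitting idea in the first outcome can be sketched in a few lines of Python. Ridge regression stands in here for the penalised regressions mentioned above (it is a close cousin of the lasso and elastic net with a convenient closed form in plain NumPy); the data, the split and the penalty grid are illustrative assumptions.

```python
# Choose a penalty by fitting on a training set and scoring on a held-out
# validation set; ridge regression has the closed form (X'X + lam I)^-1 X'y.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.0]                      # only 3 informative coefficients
y = X @ beta + rng.normal(scale=0.5, size=n)

train, valid = slice(0, 150), slice(150, 200)    # simple train/validation split

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

errors = {}
for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    b = ridge_fit(X[train], y[train], lam)
    errors[lam] = np.mean((y[valid] - X[valid] @ b) ** 2)

best_lam = min(errors, key=errors.get)
print(f"best penalty: {best_lam}")
```

Full cross-validation repeats this with several rotating validation folds rather than a single split, averaging the validation errors before picking the penalty.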
Statistical Fundamentals I
This module provides an introduction, at graduate level, to two core areas which are essential building blocks to further advanced study of statistical modelling, methodology and theory. The areas that will be covered are statistical inference using maximum likelihood and generalised linear models (GLMs). Building on an undergraduate level understanding of mathematics, statistics (hypothesis testing and linear regression) and probability (univariate discrete and continuous distributions; expectations, variances and covariances; the multivariate normal distribution), this module will motivate the need for a generic method for model fitting and then demonstrate how maximum likelihood provides a solution to this. Following on from this, GLMs, a widely and routinely used family of statistical models, will be introduced as an extension of the linear regression model.
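A small sketch of maximum likelihood at work, assuming SciPy is available: the Poisson rate is estimated by numerically maximising the log-likelihood, and the answer agrees with the analytic MLE (the sample mean). The data are synthetic and the example is illustrative, not module material.

```python
# Numerical maximum likelihood for a Poisson rate.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(0)
counts = rng.poisson(lam=4.0, size=500)

def neg_log_likelihood(rate):
    # Poisson log-likelihood: sum over observations of k*log(rate) - rate - log(k!)
    return -np.sum(counts * np.log(rate) - rate - gammaln(counts + 1))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 20.0), method="bounded")
print(f"numerical MLE = {result.x:.3f}, sample mean = {counts.mean():.3f}")
```

The same generic recipe (write down the likelihood, maximise it) is what makes maximum likelihood the unifying method for fitting GLMs and the more advanced models built on them.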
Statistical Foundations I
This module will motivate the use of statistical modelling as a tool for making inferences about a population given a sample of data. Students will be introduced to the basic terminology of statistical modelling, and the similarities and differences between statistical and machine learning approaches will be discussed to lay the foundations for the development of both over the remaining core modules. Students will cover the concepts of sampling uncertainty, statistical inference and model fitting, with sampling uncertainty used to motivate the need for standard errors and confidence intervals. Once the core concepts have been established, linear regression and generalised linear models will be introduced as essential statistical modelling tools. An understanding of these models will be gained through implementation in the statistical software package R.
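The module works in R, but the link between sampling uncertainty and confidence intervals can be sketched in a few lines of any language; here, Python with NumPy. We draw many samples from a known population and check how often the nominal 95% interval for the mean actually covers the true mean. The population parameters and sample size are illustrative assumptions.

```python
# Repeated sampling: roughly 95% of nominal 95% intervals cover the truth.
import numpy as np

rng = np.random.default_rng(0)
true_mean, n, n_repeats = 10.0, 50, 2000
covered = 0
for _ in range(n_repeats):
    sample = rng.normal(loc=true_mean, scale=2.0, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)          # standard error of the mean
    lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
    covered += lo <= true_mean <= hi
print(f"coverage: {covered / n_repeats:.3f}")      # close to the nominal 0.95
```

This repeated-sampling picture is exactly the motivation the module gives for standard errors and confidence intervals before moving on to linear regression and GLMs.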