
Data analysis: Statistical learning and visualization (FMSF90F)

20 January - 21 March 2025, 7.5 ECTS

– Published 9 December 2024

This is a PhD course in applied statistical learning, i.e. using statistical techniques such as modelling and prediction to analyse real datasets and to draw correct interpretations and conclusions.

The course begins with an overview of basic data wrangling and visualisation, with a focus on the ability to identify and illustrate important features in data. Then important methods in statistical learning are introduced. Emphasis is given to supervised learning for regression and classification, and issues arising when fitting and evaluating flexible models are extensively covered. The course also includes an introduction to unsupervised learning.

The course consists of three modules, each taught through lectures, supervised computer labs and a written assignment. Participants are encouraged to apply their data wrangling and modelling skills in an in-class data science competition on Kaggle. Finally, a project serves as a synthesis of the full course content. For that project, we encourage PhD students to analyse data from their own research, if it can be meaningfully done with the methods taught in the course. The course uses R (statistical computing software, www.r-project.org) for practicals and supervision. More information and access to the course plan are available here.

Course content

  • Visualisation and basic data handling (import/export, cleaning, transforming and summarising data)

  • Supervised learning: regression and decision tree methods for classification and regression problems (LASSO and ridge regression, random forest, XGBoost)

  • Performance evaluation, model selection and validation (including bootstrap and cross-validation)

  • Introduction to unsupervised learning (clustering and principal components analysis)

Prerequisites

  • Basic statistics course
  • Some programming experience
  • Access to a laptop computer on which you can install R and RStudio

Schedule

20 January - 21 March 2025

  • 13 lectures
  • 12 practicals in R

Preliminary schedule (note that, due to the size of the class, we offer two identical instances of each practical session)

Examination

  • 3 module assignments (with peer review)
  • 1 final project (written report + oral presentation)

Examination is through three peer-reviewed coding assignments, one for each module, and a final project that draws on the entire course. The final project includes an oral presentation to the class. For the final project, we encourage PhD students to analyse data from their own research, if it can be meaningfully done with the methods taught in the course.

Textbooks

  • Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, “An Introduction to Statistical Learning with Applications in R”, 2nd ed., Springer, 2021, ISBN: 978-1-0716-1417-4. Available as e-book.
  • Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science. O'Reilly Media, Inc. Available as e-book.

Teachers

Linda Hartman & Dmytro Perepolkin (Mathematical Statistics)

Registration

Places are limited, so early registration is recommended. The registration deadline is 8 January 2025.

Please fill in the registration form to register for the course.