In eight weeks we will dive into statistical learning and visualization in R. We will focus on supervised learning techniques and data wrangling in the context of data analysis and inference, as well as the connection to research philosophy. During every lecture we will treat a different theoretical aspect. Following each lecture there will be a computer lab exercise that connects the statistical theory to practice, as well as a Q&A meeting (Wednesdays @ 11am - Dalton 500 8.27) wherein you can pose any and all questions that remain unanswered about the course materials, theory, practice and your practical assignments.
The final grade is computed as follows:
Graded part | Weight |
---|---|
Assignment 1 | 25 % |
Assignment 2 | 25 % |
Exam (BYOD) | 50 % |
To develop the skills needed to complete the assignments and the presentations, R exercises must be completed and submitted via the course Blackboard page. These exercises are not graded, but students must complete them to pass the course.
In order to pass the course, the final grade must be 5.5 or higher, your contribution to the course should be sufficient, all R exercises should be handed in, and all assignments and the final exam must have a passing grade. Otherwise, additional work is required concerning the assignments and/or exercises you have failed.
Week # | Focus | Practical | Materials | Prof |
---|---|---|---|---|
1 | Data wrangling with R and the grammar of graphics | tidyverse: filter(), select(), join(), pivot(), dbplyr, ggplot(): geoms, aesthetics, scales, themes | R4DS | DG |
2 | Exploratory data analysis | Histograms, density plots, boxplots, etc. | R4DS FIMD Ch1 | MC |
3 | Statistical learning: regression | lm(), glm(), knn() | ISLR | MC |
4 | Statistical learning: classification | glm(), trees, lda() | ISLR | EJvK |
5 | Classification model evaluation | prop.table(), pROC(), etc. | ISLR | EJvK |
6 | Nonlinear models | R formulas advanced: I(), splines, e.g., bs() | ISLR | MC |
7 | Bagging, boosting, random forest and support vector machines | randomforest, xgboost | ISLR | DG |
8 | Benchmarking | mlr3verse | ISLR mlr3-book | DG |
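To give an impression of the functions listed for weeks 3 to 5, here is a minimal sketch that fits a linear and a logistic regression and tabulates the classification results. It uses the built-in `mtcars` data, which is an assumption for illustration and not necessarily data used in the course. The 0.5 cut-off is likewise an arbitrary choice.

```r
# Illustrative only: mtcars ships with base R and is not course material.
data(mtcars)

# Week 3: linear regression of fuel efficiency on weight and horsepower.
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit_lm)

# Week 4: logistic regression for transmission type
# (am: 0 = automatic, 1 = manual).
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

# Week 5: turn predicted probabilities into classes.
# The 0.5 cut-off is an assumption made for this sketch.
pred_prob  <- predict(fit_glm, type = "response")
pred_class <- factor(as.integer(pred_prob > 0.5), levels = c(0, 1))
conf_mat   <- table(observed = mtcars$am, predicted = pred_class)

# Row-wise proportions: share of each observed class predicted as 0 or 1.
prop.table(conf_mat, margin = 1)
```

Note that the model is evaluated on the same data it was fitted on, so the proportions are optimistic. A held-out test set or cross-validation gives a more honest estimate.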
Supervised learning is such an integral part of contemporary data science that you will most likely use it dozens of times a day without knowing it. In this class you will learn about the most effective supervised learning techniques and acquire the skills to make them work for you. We will not only discuss the theoretical underpinnings of supervised learning, but also focus on the skills and experience needed to rapidly apply these techniques to new problems.
During this course, participants will actively learn how to apply the main statistical methods in data analysis and how to use machine learning algorithms and visualization techniques. The course has a strongly practical focus: rather than concentrating on the mathematics and background of the discussed techniques, you will gain hands-on experience in using them on real data and in interpreting the results. This course provides a broad introduction to supervised learning and visualization. Topics include:
Students will learn to adopt these techniques in their way of thinking about analysis problems. We will consider statistical learning techniques in the context of estimation, testing and prediction. Students will learn to adopt these techniques in their way of thinking about statistical inference, which will help them to quantify the uncertainty and measure the accuracy of statistical estimates. Students will develop fundamental R programming skills and will gain experience with tidyverse, visualizing data with ggplot2 and performing basic data wrangling with dplyr. This course makes students better equipped for a further career (e.g. junior researcher or research assistant) or education in research, such as a (research) Master program or a PhD.
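As a small, hedged illustration of what basic data wrangling with dplyr and plotting with ggplot2 look like, consider the sketch below. It uses the `mpg` data set that ships with ggplot2. The choice of data, variables and vehicle classes is an assumption made for this example and is not taken from the course materials.

```r
library(dplyr)
library(ggplot2)

# Wrangle: keep a few columns, filter rows, and summarise by group.
# The mpg data set is bundled with ggplot2 (illustrative choice).
mpg_summary <- mpg %>%
  select(manufacturer, class, displ, hwy) %>%
  filter(class %in% c("compact", "suv")) %>%
  group_by(class) %>%
  summarise(mean_hwy = mean(hwy), n = n())

mpg_summary

# Visualize: map variables to aesthetics, add a geom, adjust labels and theme.
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  labs(x = "Engine displacement (litres)",
       y = "Highway miles per gallon",
       colour = "Vehicle class") +
  theme_minimal()
```

Typing `p` in the console (or calling `print(p)`) draws the plot. Building the plot as an object first makes it easy to add layers or change the theme later.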
Students will form groups to work on two assignments. Students will need to perform calculations and program code for these assignments. All work needs to be combined in an easily understandable, self-contained and insightful RStudio project and must be submitted to the course Blackboard page. Each assignment will be graded.
Students will be evaluated on the following aspects:
In this course, skills and knowledge are evaluated on these separate occasions:
After taking this course students can understand innovations in statistical markup, statistical simulation and reproducible research. Students are also able to approach challenges from different professional viewpoints. They have gained experience in marking up a professional manuscript and designing a state-of-the-art statistical archive in an open source repository.
Dear all,
This semester you will participate in the Supervised Learning & Visualization course at Utrecht University. To realize a steeper learning curve, we will use some functionality that is not part of the base installation for R. Many of you are already familiar with R. The guide below serves as a point of departure for those who are not. The following steps guide you through installing both R and some of the necessary packages.
We look forward to seeing you all,
David Goretzko, Maarten Cruyff, and Erik-Jan van Kesteren
Bring a laptop computer to the course and make sure that you have full write access and administrator rights to the machine. We will explore programming and compiling in this course, which means that you need full access to your machine. Some corporate laptops come with limited access for their users. I therefore advise you to bring a personal laptop computer to the workgroup meetings.
R
R can be obtained here. We won’t use R directly in the course, but rather call R through RStudio. Therefore it needs to be installed.
RStudio Desktop
RStudio is an Integrated Development Environment (IDE). It can be obtained as stand-alone software here. The free and open source RStudio Desktop version is sufficient.
Execute the following lines of code in the console window:
install.packages(c("ggplot2", "tidyverse", "magrittr", "micemd", "jomo", "pan",
                   "lme4", "knitr", "rmarkdown", "plotly", "shiny",
                   "devtools", "boot", "class", "car", "MASS", "ggplot2movies",
                   "ISLR", "DAAG", "mice", "mlr3verse"),
                 dependencies = TRUE)
If you are not sure where to execute code, use the following figure to identify the console:
Just copy and paste the installation command and press the return key. When asked, type Yes in the console and press the return key.
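Once the installation has finished, you may want to check that the packages can actually be loaded. The following is a minimal sketch using only base R. The vector `pkgs` lists just a handful of the packages from the installation command as an example, so extend it as you see fit.

```r
# A few of the packages from the installation command (illustrative subset).
pkgs <- c("ggplot2", "tidyverse", "magrittr", "mice", "mlr3verse")

# requireNamespace() returns TRUE if a package can be loaded,
# without attaching it to the search path.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of packages that still need to be installed (empty if all went well).
missing_pkgs <- names(installed)[!installed]

if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All checked packages are installed.")
}
```

If any packages are reported as missing, rerunning `install.packages()` for just those names is usually enough.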
If all fails and you have insufficient rights to your machine, the following web-based service will offer a solution.
RStudio environment there.
R and RStudio there. You may need to install packages for new sessions during the course.
Naturally, you will need internet access to use these services.
If you want to use version control for your group work, you might want to check out GitHub. It is not mandatory to use it, but if you want to give it a try, you can study this GitHub tutorial. You can also use GitHub Desktop to handle your commits and merge requests. Don’t make your workflow too complicated, though!
Please find the Lecture Slides here
This week’s practical is split into two parts - practical 1 on data wrangling and basics in R and practical 2 on data visualization with ggplot2. The answers are also given to you. Use these answers to help yourself when you’re stuck.
Hand in your answers to both practicals separately via Blackboard.
dplyr and the tidyverse is very informative:
The following lecture by Dewey Dunnington is quite informative, also for those who already (think they) know the ggplot2 package.
The following links may be useful
This week’s slides:
titanic data in R, but I found this data set and its labeling to be more informative.
This week’s practical is on exploratory data analysis. Code solutions are also given to you. Use these answers to help yourself when you’re stuck.
Hand in via Blackboard.
Study the following materials
The following links may be useful
ggplot’s reference page
This week’s practical is on regression. Code solutions are also given to you. Use these answers to help yourself when you’re stuck.
Hand in via Blackboard.
This week’s practical is on classification. Code solutions are also given to you. Use these answers to help yourself when you’re stuck.
Hand in via Blackboard.
NO NEED TO HAND IN, BUT STILL USEFUL: I also give you the following practical from last year. Please view and study the practical up to and including Exercise 10. There is no need to hand anything in. The regularization (glmnet) exercises 11-19 are not specifically covered in this course, but you may of course use the methods in your Assignment 2 if you wish. Let us know if you have any questions.
This week’s practical is on classification. Code solutions are also given to you. Use these answers to help yourself when you’re stuck.
Hand in via blackboard.
This week’s practical is on nonlinear regression. Code solutions are also given to you. Use these answers to help yourself when you’re stuck.
Hand in via blackboard.
This week’s practical is on nonlinear regression. Code solutions are also given to you. Use these answers to help yourself when you’re stuck.
Hand in via blackboard.
This week’s practical is about conducting ML benchmarks with mlr3. Code solutions are also given to you. Use these answers to help yourself when you’re stuck.
Hand in via Blackboard.
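To give a rough idea of what a benchmark with mlr3 can look like, here is a minimal sketch, assuming the mlr3verse package is installed. The task (`sonar`), the two learners, the 3-fold cross-validation and the accuracy measure are all choices made for this illustration and are not taken from the practical itself.

```r
library(mlr3verse)

set.seed(123)  # make the resampling splits reproducible

# A built-in example task (binary classification); illustrative choice.
task <- tsk("sonar")

# Two learners to compare: a featureless baseline and a classification tree.
learners <- lrns(c("classif.featureless", "classif.rpart"))

# 3-fold cross-validation as the resampling strategy.
resampling <- rsmp("cv", folds = 3)

# Cross all tasks, learners and resamplings into a benchmark design, then run it.
design <- benchmark_grid(tasks = task, learners = learners,
                         resamplings = resampling)
bmr <- benchmark(design)

# Aggregate performance per learner using classification accuracy.
results <- bmr$aggregate(msr("classif.acc"))
as.data.frame(results)[, c("learner_id", "classif.acc")]
```

Including a featureless baseline is a useful habit: a learner that does not clearly beat it has probably not learned much from the features.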
Recap: ISLR Chapters 4, 5, 8 and 9
The following links may be useful:
especially: