Today, we will learn how to conduct an ML benchmark experiment in R using the mlr3 ecosystem.
In this practical, we will use a data set (drugs.csv) containing information on personality profiles of drug users and non-users. The data set has been pre-processed to facilitate its use in this exercise; the original data can be retrieved from the UCI Machine Learning Repository.
The following variables are available in the data set:
- Age: Age of the participant (standardized)
- Gender: Gender (standardized: -0.48246 = male, 0.48246 = female)
- Nscore: Standardized neuroticism score
- Escore: Standardized extraversion score
- Oscore: Standardized openness score
- Ascore: Standardized agreeableness score
- Cscore: Standardized conscientiousness score
- Impulsivity: Standardized impulsivity score
- SS: Standardized sensation seeking score
- User: Indicator whether the participant is a user/non-user
The goal is to predict whether someone is a user or non-user based on all other variables using the mlr3verse. Further information about the functionalities of mlr3 can be found at https://mlr3book.mlr-org.com/.
library(mlr3verse)
library(tidyverse)
library(ggplot2)
library(psych)
1. Get an impression of the data by looking at its structure and creating some descriptive statistics to get an idea of which covariates may be indicative of drug usage.
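For example, a minimal sketch (it assumes drugs.csv is stored in the project directory; adjust the path if necessary):
drugs <- read_csv("drugs.csv")                 # import the data
str(drugs)                                     # structure of the data
psych::describe(drugs)                         # overall descriptive statistics
psych::describeBy(drugs, group = drugs$User)   # descriptives per user group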
2. To further explore the data we work with, create some interesting data visualizations that show whether there are interesting patterns in the data.
Hint: Think about adding a color aesthetic for the variable User.
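A minimal sketch of two such plots (the choice of variables is just an example):
ggplot(drugs, aes(x = Impulsivity, y = SS, colour = factor(User))) +
  geom_point(alpha = 0.6)
ggplot(drugs, aes(x = Nscore, fill = factor(User))) +
  geom_density(alpha = 0.4)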
4. Create a classification task in mlr3 defining the objective for the ML benchmark.
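A sketch of the task definition (assuming the data object is called drugs; the name task_drugs is just illustrative):
drugs$User <- as.factor(drugs$User)            # the target must be a factor
task_drugs <- as_task_classif(drugs, target = "User")
task_drugs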
5. Set up a complex nested resampling strategy by defining an inner and an outer resampling. The inner resampling should be a simple train-test split (70-30). Choose a four-fold CV for the outer resampling.
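In mlr3 notation, this could look as follows (object names are illustrative):
inner_resampling <- rsmp("holdout", ratio = 0.7)   # 70-30 train-test split
outer_resampling <- rsmp("cv", folds = 4)          # four-fold CV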
6. We want to tune some hyperparameters of two ML algorithms. Define a tuning scheme (i.e., a tuner and a terminator). For simplicity (and to save computation time), use random search with 200 iterations. Define a performance measure that should be optimized during hyperparameter tuning. Using the accuracy as a performance measure makes sense, as most algorithms are trained to optimize the accuracy/MMCE (you can choose a different measure, though, for example if you want to focus more on one of the possible classification errors).
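A sketch of such a tuning scheme:
tuner      <- tnr("random_search")          # random search
terminator <- trm("evals", n_evals = 200)   # stop after 200 evaluations
measure    <- msr("classif.acc")            # accuracy as tuning criterion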
7. Set up a list of learners that should be compared in this benchmark. Use a baseline ("featureless") learner, a logistic regression model, a random forest (use the ranger implementation as it is the fastest), an SVM, and an XGBoost model (a gradient boosting algorithm).
Note: Set predict_type = "prob" for each algorithm to obtain predicted probabilities, which allows us to calculate a broader range of evaluation metrics.
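A sketch of such a list (the SVM and XGBoost learners will be wrapped in AutoTuners in the next step; type = "C-classification" is set here so that the cost parameter can be tuned later):
learners <- list(
  lrn("classif.featureless", predict_type = "prob"),
  lrn("classif.log_reg",     predict_type = "prob"),
  lrn("classif.ranger",      predict_type = "prob"),
  lrn("classif.svm",         predict_type = "prob", type = "C-classification"),
  lrn("classif.xgboost",     predict_type = "prob")
)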
8. We now want to tune hyperparameters of the XGBoost and the SVM. Define the parameter set that should be tuned and set up the respective modeling pipeline using the AutoTuner function. For the XGBoost, we want to tune the number of trees (nrounds) between 1 and 500, the learning rate (eta) between 0.001 and 0.2, and the maximum tree depth (max_depth) between 1 and 10. For the SVM, we want to tune the kernel function (i.e., choose between radial and linear) and the cost parameter between 0 and 50.
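A sketch using auto_tuner() together with the objects defined in the previous steps (the names at_xgb and at_svm are just illustrative):
search_space_xgb <- ps(
  nrounds   = p_int(lower = 1, upper = 500),
  eta       = p_dbl(lower = 0.001, upper = 0.2),
  max_depth = p_int(lower = 1, upper = 10)
)
at_xgb <- auto_tuner(
  tuner        = tuner,
  learner      = lrn("classif.xgboost", predict_type = "prob"),
  resampling   = inner_resampling,
  measure      = measure,
  search_space = search_space_xgb,
  terminator   = terminator
)
search_space_svm <- ps(
  kernel = p_fct(levels = c("radial", "linear")),
  cost   = p_dbl(lower = 0, upper = 50)
)
at_svm <- auto_tuner(
  tuner        = tuner,
  learner      = lrn("classif.svm", predict_type = "prob", type = "C-classification"),
  resampling   = inner_resampling,
  measure      = measure,
  search_space = search_space_svm,
  terminator   = terminator
)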
9. To exemplify how to integrate a complex modeling pipeline in a benchmark, we want to test how a logistic regression model with variable selection performs in comparison to the logistic regression model that uses the full feature set. Therefore, create a filter that selects the two most relevant features based on the training data and create a so-called graph learner to include the complete modeling pipeline in the resampling procedure.
Note: Use the F-statistic of an ANOVA to select the features that are most strongly related to the outcome. You can simply create a filter with the flt() function (https://mlr3filters.mlr-org.com/reference/mlr_filters.html).
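A sketch of such a graph learner (the object names are illustrative):
graph_log_reg <- po("filter", filter = flt("anova"), filter.nfeat = 2) %>>%
  lrn("classif.log_reg", predict_type = "prob")
lr_filtered <- as_learner(graph_log_reg)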
10. Set up and run the benchmark experiment. Make sure to parallelize the execution to save computation time.
Note: Set a seed to make the results reproducible!
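A sketch (the seed value is an arbitrary choice; the tuned learners replace the plain SVM and XGBoost from the earlier list):
set.seed(1234)                    # arbitrary seed for reproducibility
future::plan("multisession")      # parallelize over available cores
design <- benchmark_grid(
  tasks       = task_drugs,
  learners    = c(learners[1:3], list(at_svm, at_xgb, lr_filtered)),
  resamplings = outer_resampling
)
bmr <- benchmark(design)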
11. Now aggregate the results using different performance metrics: accuracy, sensitivity, specificity, and AUC. Also make sure to calculate not only the mean over the four folds but also the standard deviation to get an idea of how stable the performance estimates are.
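A sketch of the aggregation (dplyr is used for the fold-wise summary because the tidyverse is already loaded):
measures <- msrs(c("classif.acc", "classif.sensitivity",
                   "classif.specificity", "classif.auc"))
bmr$aggregate(measures)             # mean performance over the outer folds
bmr$score(measures) %>%             # per-fold scores ...
  group_by(learner_id) %>%          # ... summarized per learner
  summarise(across(starts_with("classif."),
                   list(mean = mean, sd = sd)))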
12. Use the autoplot function from the mlr3viz package to create a boxplot to visualize the accuracy of each learner in the benchmark experiment.
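For example (a sketch; the boxplot is the default plot type for a benchmark result):
autoplot(bmr, measure = msr("classif.acc"))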
When you have finished the practical, enclose all files of the project (i.e., all .R and/or .Rmd files, including the one with your answers, and the .Rproj file) in a zip file, and hand in the zip here. Do so before next week's lecture.