Today, we will learn how to conduct an ML benchmark experiment in R using the mlr3 ecosystem.
In this practical, we will use a data set (drugs.csv) containing information on personality profiles of drug users and non-users. The data set has been pre-processed to facilitate its use in this exercise; the original data can be retrieved from the UCI Machine Learning Repository.
The following variables are available in the data set:
- Age: Age of the participant (standardized)
- Gender: Gender (standardized: -0.48246 = male, 0.48246 = female)
- Nscore: Standardized neuroticism score
- Escore: Standardized extraversion score
- Oscore: Standardized openness score
- Ascore: Standardized agreeableness score
- Cscore: Standardized conscientiousness score
- Impulsivity: Standardized impulsivity score
- SS: Standardized sensation seeking score
- User: Indicator whether the participant is a user/non-user
The goal is to predict whether someone is a user or non-user based on all other variables using the mlr3verse. Further information about the functionalities of mlr3 can be found at https://mlr3book.mlr-org.com/.
library(mlr3verse)
library(tidyverse)
library(ggplot2)
library(psych)
1. Get an impression of the data by looking at its structure and creating some descriptive statistics to get an idea of which covariates may be indicative of drug usage.
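For example, a minimal sketch (it assumes drugs.csv is stored in the project directory; adjust the path if necessary):
drugs <- read_csv("drugs.csv")                 # import the data
str(drugs)                                     # structure of the data
psych::describe(drugs)                         # overall descriptive statistics
psych::describeBy(drugs, group = drugs$User)   # descriptives per user group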
2. To further explore the data we work with, create some interesting data visualizations that show whether there are interesting patterns in the data.
Hint: Think about adding a color aesthetic for the variable User.
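A minimal sketch of two such plots (the choice of variables is just an example):
ggplot(drugs, aes(x = Impulsivity, y = SS, colour = factor(User))) +
  geom_point(alpha = 0.6)
ggplot(drugs, aes(x = Nscore, fill = factor(User))) +
  geom_density(alpha = 0.4)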
4. Create a classification task in mlr3 defining the objective for the ML benchmark.
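A sketch of the task definition (assuming the data object is called drugs; the name task_drugs is just illustrative):
drugs$User <- as.factor(drugs$User)            # the target must be a factor
task_drugs <- as_task_classif(drugs, target = "User")
task_drugs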
5. Set up a complex nested resampling strategy by defining an inner and an outer resampling. The inner resampling should be a simple train-test split (70-30). Choose a four-fold CV for the outer resampling.
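In mlr3 notation, this could look as follows (object names are illustrative):
inner_resampling <- rsmp("holdout", ratio = 0.7)   # 70-30 train-test split
outer_resampling <- rsmp("cv", folds = 4)          # four-fold CV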
6. We want to tune some hyperparameters of two ML algorithms. Define a tuning scheme (i.e., a tuner and a terminator). For simplicity (and to save computation time), use random search with 200 iterations. Define a performance measure that should be optimized during hyperparameter tuning. Using the accuracy as a performance measure makes sense, as most algorithms are trained to optimize the accuracy/MMCE (you can choose a different measure, though, for example if you want to focus more on one of the possible classification errors).
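A sketch of such a tuning scheme:
tuner      <- tnr("random_search")          # random search
terminator <- trm("evals", n_evals = 200)   # stop after 200 evaluations
measure    <- msr("classif.acc")            # accuracy as tuning criterion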
7. Set up a list of learners that should be compared in this benchmark. Use a baseline ("featureless") learner, a logistic regression model, a random forest (use the ranger implementation as it is the fastest), an SVM, and an XGBoost model (a gradient boosting algorithm).
Note: Set predict_type = "prob" for each algorithm to obtain predicted probabilities, which allows us to calculate a broader range of evaluation metrics.
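A sketch of such a list (the SVM and XGBoost learners will be wrapped in AutoTuners in the next step; type = "C-classification" is set here so that the cost parameter can be tuned later):
learners <- list(
  lrn("classif.featureless", predict_type = "prob"),
  lrn("classif.log_reg",     predict_type = "prob"),
  lrn("classif.ranger",      predict_type = "prob"),
  lrn("classif.svm",         predict_type = "prob", type = "C-classification"),
  lrn("classif.xgboost",     predict_type = "prob")
)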
8. We now want to tune hyperparameters of the XGBoost and the SVM. Define the parameter set that should be tuned and set up the respective modeling pipeline using the AutoTuner function. For the XGBoost, we want to tune the number of trees (nrounds) between 1 and 500, the learning rate (eta) between 0.001 and 0.2, and the maximum tree depth (max_depth) between 1 and 10. For the SVM, we want to tune the kernel function (i.e., choose between radial and linear) and the cost parameter between 0 and 50.
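A sketch using auto_tuner() together with the objects defined in the previous steps (the names at_xgb and at_svm are just illustrative):
search_space_xgb <- ps(
  nrounds   = p_int(lower = 1, upper = 500),
  eta       = p_dbl(lower = 0.001, upper = 0.2),
  max_depth = p_int(lower = 1, upper = 10)
)
at_xgb <- auto_tuner(
  tuner        = tuner,
  learner      = lrn("classif.xgboost", predict_type = "prob"),
  resampling   = inner_resampling,
  measure      = measure,
  search_space = search_space_xgb,
  terminator   = terminator
)
search_space_svm <- ps(
  kernel = p_fct(levels = c("radial", "linear")),
  cost   = p_dbl(lower = 0, upper = 50)
)
at_svm <- auto_tuner(
  tuner        = tuner,
  learner      = lrn("classif.svm", predict_type = "prob", type = "C-classification"),
  resampling   = inner_resampling,
  measure      = measure,
  search_space = search_space_svm,
  terminator   = terminator
)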
9. To exemplify how to integrate a complex modeling pipeline in a benchmark, we want to test how a logistic regression model with variable selection performs in comparison to the logistic regression model that uses the full feature set. Therefore, create a filter that selects the two most relevant features based on the training data and create a so-called graph learner to include the complete modeling pipeline in the resampling procedure.
Note: Use the F-statistic of an ANOVA to select the features that are most strongly related to the outcome. You can simply create a filter with the flt() function (https://mlr3filters.mlr-org.com/reference/mlr_filters.html).
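A sketch of such a graph learner (the object names are illustrative):
graph_log_reg <- po("filter", filter = flt("anova"), filter.nfeat = 2) %>>%
  lrn("classif.log_reg", predict_type = "prob")
lr_filtered <- as_learner(graph_log_reg)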
10. Set up and run the benchmark experiment. Make sure to parallelize the execution to save computation time.
Note: Set a seed to make the results reproducible!
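A sketch (the seed value is an arbitrary choice; the tuned learners replace the plain SVM and XGBoost from the earlier list):
set.seed(1234)                    # arbitrary seed for reproducibility
future::plan("multisession")      # parallelize over available cores
design <- benchmark_grid(
  tasks       = task_drugs,
  learners    = c(learners[1:3], list(at_svm, at_xgb, lr_filtered)),
  resamplings = outer_resampling
)
bmr <- benchmark(design)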
11. Now aggregate the results using different performance metrics: accuracy, sensitivity, specificity, and AUC. Also make sure to calculate not only the mean over the four folds but also the standard deviation to get an idea of how stable the performance estimates are.
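A sketch of the aggregation (dplyr is used for the fold-wise summary because the tidyverse is already loaded):
measures <- msrs(c("classif.acc", "classif.sensitivity",
                   "classif.specificity", "classif.auc"))
bmr$aggregate(measures)             # mean performance over the outer folds
bmr$score(measures) %>%             # per-fold scores ...
  group_by(learner_id) %>%          # ... summarized per learner
  summarise(across(starts_with("classif."),
                   list(mean = mean, sd = sd)))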
12. Use the autoplot function from the mlr3viz package to create a boxplot to visualize the accuracy of each learner in the benchmark experiment.
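For example (a sketch; the boxplot is the default plot type for a benchmark result):
autoplot(bmr, measure = msr("classif.acc"))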
When you have finished the practical, enclose all files of the project (i.e., all .R and/or .Rmd files, including the one with your answers, and the .Rproj file) in a zip file, and hand in the zip here. Do so before next week's lecture.