Supervised Learning and Visualization

This lecture

  1. Resampling (Recap + Nested Resampling)
  2. Hyperparameter Tuning
  3. Model Classes
  4. Benchmarking

Resampling

Recap: Cross-Validation (CV)

  • Random partitioning of the data into \(k\) equally sized parts (usually 5 or 10)
  • \(k-1\) parts serve as training set, remaining fold is test set
  • Each part is used as the test set once
  • The estimated prediction error is averaged over all folds
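
For illustration, a minimal base R sketch of \(k\)-fold CV (assuming a data frame dat with a numeric outcome y; the linear model is only a placeholder for any learner):

# k-fold CV by hand: random partition, train on k-1 parts, test on the remaining fold
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat))) # random partition into k parts
cv_mse <- sapply(1:k, function(i) {
  fit <- lm(y ~ ., data = dat[folds != i, ])              # train on k-1 parts
  mean((dat$y[folds == i] -
          predict(fit, newdata = dat[folds == i, ]))^2)   # test MSE on fold i
})
mean(cv_mse) # estimated prediction error averaged over all folds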

Cross-Validation (CV)

Repeated Cross-Validation (RCV)

  • same as CV, but multiple runs with different seeds (e.g., \(10 \times 10\)-fold CV)
  • different seeds lead to different fold structures
  • averaging performance estimates over repetitions stabilizes results (especially in combination with small sample sizes) and can be helpful when using them to perform a significance test (is model A better than model B?)
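
Continuing the sketch above, repeated CV simply reruns the procedure with different seeds and averages the results (again assuming the data frame dat with outcome y):

# repeated CV: rerun k-fold CV with different seeds and average
rcv_mse <- sapply(1:10, function(r) {
  set.seed(r)                                       # different seed = different folds
  folds <- sample(rep(1:k, length.out = nrow(dat)))
  mean(sapply(1:k, function(i) {
    fit <- lm(y ~ ., data = dat[folds != i, ])
    mean((dat$y[folds == i] - predict(fit, newdata = dat[folds == i, ]))^2)
  }))
})
mean(rcv_mse) # averaging over repetitions stabilizes the estimate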

Nested Resampling

When tuning hyperparameters, a more complex resampling strategy is necessary:

  • inner resampling is used to select hyperparameters and pre-process the data (select variables, impute missing data, scale variables, etc.)
  • outer resampling is used to evaluate the full data analysis pipeline

This is necessary to obtain “fair”, i.e., not overoptimistic, performance estimates.
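
A compact mlr3 preview of this idea (the full example follows later in this lecture; learner, search space, and budget are purely illustrative, and task is assumed to be a classification task object as created in the mlr3 section): the AutoTuner performs the inner resampling for hyperparameter selection, while resample() evaluates the whole pipeline with the outer resampling.

library(mlr3verse)
at <- AutoTuner$new(
  learner      = lrn("classif.ranger"),
  resampling   = rsmp("cv", folds = 5),           # inner resampling (tuning)
  measure      = msr("classif.acc"),
  search_space = ps(mtry = p_int(lower = 1, upper = 10)),
  terminator   = trm("evals", n_evals = 50),
  tuner        = tnr("random_search"))
rr <- resample(task, at, rsmp("cv", folds = 5))   # outer resampling (evaluation)
rr$aggregate(msr("classif.acc"))                  # "fair" performance estimate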

Nested Resampling

Hyperparameter Tuning (Example: Tree-based Methods)

Decision Trees

  • Recursive binary splitting to find subsets that are homogeneous regarding the outcome:
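
A hedged rpart sketch of such a tree (dat and y as above; parameter values are illustrative, not tuned):

library(rpart)
tree <- rpart(y ~ ., data = dat,
              control = rpart.control(minsplit = 20, # min. observations in a node to attempt a split
                                      maxdepth = 5,  # maximum tree depth
                                      cp = 0.01))    # complexity parameter: min. relative improvement required for a split
plot(tree); text(tree) # inspect the recursive binary splits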

Random Forests

  • Aggregation of multiple decision trees
  • Combination of Bagging (Bootstrap Aggregation) and variable selection
  • Ensemble of de-correlated deep trees promises better out-of-sample performance:

Hyperparameters of Random Forest

  • Minimum number of observations in a node to attempt a split
  • Maximum tree depth
  • Minimum impurity reduction
  • Minimum number of observations in leaf nodes
  • Number of trees
  • Number of predictors considered for each split (often called mtry)
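
Most of these hyperparameters can, for example, be set directly in ranger (a sketch with illustrative, untuned values; dat and y as above):

library(ranger)
rf <- ranger(y ~ ., data = dat,
             num.trees     = 500, # number of trees
             mtry          = 3,   # predictors considered for each split
             min.node.size = 5,   # minimum node size to attempt a split
             max.depth     = 0)   # 0 = unlimited tree depth (default)
rf$prediction.error # out-of-bag estimate of the prediction error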

Hyperparameters of Random Forest

  • Although random forests usually work relatively well with default settings, tuning may be beneficial to improve accuracy (classification) or precision (regression)

  • Remember: Tuning and selection of hyperparameters needs to be separated from performance evaluation (inner resampling strategy within modeling pipeline)!

Hyperparameter Tuning - Grid Search

Hyperparameter Tuning - Grid Search

mtry   ntrees   MSE
10     100      0.584
15     100      0.506
20     100      0.429
10     250      0.536
15     250      0.428
20     250      0.525
10     500      0.322
15     500      0.476
20     500      0.409
  • The hyperparameter setting with the best performance (here: the smallest MSE) is selected
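
A sketch of how such a grid could be evaluated with ranger, here using the out-of-bag error for brevity (in practice the inner resampling described above would be used; dat and y as before, with at least 20 predictors so that the mtry values from the table are valid):

library(ranger)
grid <- expand.grid(mtry = c(10, 15, 20), num.trees = c(100, 250, 500))
grid$mse <- apply(grid, 1, function(p) {
  ranger(y ~ ., data = dat, mtry = p["mtry"],
         num.trees = p["num.trees"])$prediction.error # OOB MSE per setting
})
grid[which.min(grid$mse), ] # setting with the smallest estimated MSE is selected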

Hyperparameter Tuning - Random Search

Hyperparameter Tuning - More Advanced Approaches

  • Model-based/ Bayesian Optimization (MBO)
  • (Generalized) Simulated Annealing
  • Iterative Racing:
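
In mlr3 (introduced below), these strategies are available as tuners, assuming the respective backend packages (mlr3mbo, GenSA, irace) are installed:

library(mlr3tuning)
library(mlr3mbo)   # registers the "mbo" tuner
tnr("mbo")   # model-based / Bayesian optimization
tnr("gensa") # generalized simulated annealing (wraps the GenSA package)
tnr("irace") # iterative racing (wraps the irace package)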

Tunability of Algorithms

Some ML algorithms are more strongly affected by hyperparameter settings than others, i.e., they have a higher tunability, which means that their hyperparameters have to be tuned to obtain good predictions (Probst et al., 2019).

  • the random forest is an example of a good “off-the-shelf” algorithm
  • SVMs require careful tuning of the kernel function
  • XGBoost, with its many hyperparameters, usually requires extensive tuning

Benchmarks

Which Model Should We Choose?

In contrast to classical statistical modeling, machine learning is usually applied in a more exploratory and purely data-driven manner. That is, we do not know beforehand how the outcome is related to the predictors, and therefore we also do not know which algorithm is going to work best for a specific data set/ prediction task.

Model Classes

  • (Generalized) linear models + polynomial regression
  • Generalized additive models (inter alia using splines)
  • Support vector machines
  • k-nearest neighbors
  • Tree-based methods (bagging, boosting)
  • Neural networks
  • …

Flexibility-Interpretability Trade-Off

Which Model Should We Choose?

  • Idea: Conducting a small benchmark experiment to compare different algorithms

  • All models are trained on the same data (e.g., using the same folds in CV)

  • Hyperparameter tuning should be “fair”:

    • some algorithms are good off-the-shelf algorithms that work reasonably well with default parameters (e.g., random forests), while for other algorithms it is almost impossible to come up with appropriate defaults (e.g., support vector machines)
    • if one model is extensively tuned and another model is not, the question remains whether the model with highly optimized hyperparameters is actually better
  • Always include a meaningful baseline for comparison, e.g., a featureless learner predicting the outcome's mean (regression task) or mode (classification task)

Benchmark Studies

  • Often a good idea to compare a linear model (e.g., LASSO) to one or more complex models

  • If the more complex model does not outperform a simpler, more interpretable model, the latter is clearly preferable

  • Sometimes it makes sense to look at different performance measures in a benchmark experiment:

    • e.g., while Model A shows the highest accuracy, Model B might be more sensitive (higher true positive rate/ recall), which could be preferable in a situation where a rare event (e.g., a disease) should be predicted
    • sometimes a tie between two models regarding a primary outcome can be broken using a secondary performance metric

ML in R - The mlr3verse

  • So far, we have used different packages to train different models (e.g., rpart, ranger); mlr3 provides an ecosystem with wrapper functions for the most common algorithms

  • It also facilitates:

    - resampling (inner and outer resampling strategy) and performance evaluation
    - data pre-processing through graph-learners (i.e., creating a data analysis pipeline)
    - hyperparameter tuning with state-of-the-art tuning schemes (model-based optimization, iterative racing, etc.)
    - model comparisons via benchmarking
    - the implementation of cost-sensitive learning (e.g., Sterner, Pargent, & Goretzko, 2023)
    - model stacking
    - interpretable machine learning
    - visualization
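
The examples on the following slides assume that this ecosystem has been loaded:

# install.packages("mlr3verse") # once, if not yet installed
library(mlr3verse) # attaches mlr3, mlr3learners, mlr3tuning, mlr3pipelines, mlr3viz, ...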

ML Benchmarks in mlr3

Create a task (i.e., what should be predicted and which data is used):

# here classification task
task <- TaskClassif$new(id = "ID",
  backend = data, # data set
  target = "Outcome", 
  positive = "Positive Class") # only in binary classification task

Define resampling scheme (inner and outer resampling), e.g.,:

# define resampling using pre-specified resampling concepts such as RCV and CV
res_outer <- rsmp("repeated_cv", repeats = 3, folds = 5)
res_inner <- rsmp("cv", folds = 5)

ML Benchmarks in mlr3

Define tuning, if necessary:

# define tuning, e.g., using random search with 1000 iterations
terminator <- trm("evals", n_evals = 1000)
tuner <- tnr("random_search")

Define the performance measure for the inner resampling, i.e., the measure that is used to select hyperparameter settings:

# define inner performance measure, e.g., accuracy
mes_acc <- msr("classif.acc")

ML Benchmarks in mlr3

Define algorithms (mlr3: learners) that should be tested:

# define learners, use featureless learner as baseline;
# here: classical logistic regression (simple model), random forest ("ranger")
# and xgboost
baseline <- lrn("classif.featureless",  predict_type = "prob")
logreg <- lrn("classif.log_reg", predict_type = "prob")
ranger <- lrn("classif.ranger", predict_type = "prob")
xgboost <- lrn("classif.xgboost", predict_type = "prob")

ML Benchmarks in mlr3

Define hyperparameters that should be tuned and create modeling pipeline for learners whose hyperparameters need tuning:

# Example: xgboost for which two hyperparameters are tuned:
param_set_xgb <- ps(
  nrounds = p_int(lower = 1, upper = 500),
  eta = p_dbl(lower = 0.001, upper = 0.01))

xgb_tuned <- AutoTuner$new(
  learner = xgboost, # basic xgboost learner (see above)
  resampling = res_inner, # inner resampling to evaluate different param sets
  measure = mes_acc, # performance measure to select parameters (see above)
  search_space = param_set_xgb, # parameter range that is tested
  terminator = terminator, # termination criterion
  tuner = tuner) # tuning scheme (here: random search)
# new id to find model in output 
# (necessary if multiple xgbs with different tuning schemes are evaluated) 
xgb_tuned$id <- "classif.xgboost.tuned" 

ML Benchmarks in mlr3

Set up the benchmark grid with all learners and tasks (here only one task; in principle, you could benchmark different algorithms on several tasks, which would shift the focus of the experiment from the specific tasks toward the methodological differences between the algorithms and tested pipelines):

# Create benchmark grid defining the task(s), learners and the 
# resampling strategy used for performance evaluation (i.e., outer resampling)
grid <- benchmark_grid(tasks = list(task),
                       learners = list(baseline, logreg, ranger, xgb_tuned),
                       resamplings = list(res_outer)
                       )

Run the benchmark experiment:

results <- benchmark(grid)

ML Benchmarks in mlr3

Analyze the results using different performance metrics, e.g., accuracy, sensitivity, and specificity, and obtain standard deviations for all metrics to get an idea of how stable/unstable the aggregated performance estimates are:

mes_list <- list(
  msr("classif.sensitivity"),
  msr("classif.sensitivity", id = "classif.sensitivity.sd", aggregator = sd),
  msr("classif.specificity"),
  msr("classif.specificity", id = "classif.specificity.sd", aggregator = sd),
  msr("classif.acc"),
  msr("classif.acc", id = "classif.acc.sd", aggregator = sd)
  )

results$aggregate(mes_list)

Quick visualization of the results:

mlr3viz::autoplot(results, measure = msr("classif.acc"), type = "boxplot")

ML Benchmarks in mlr3

  • Load the complete mlr3verse to get full access to all wrapper functions for a variety of algorithms (mlr3extralearners), tools for visualizing results (mlr3viz), and many more (advanced) tools for conducting ML experiments in R:

    • Very convenient (all-in-one framework)
    • Easy to extend (R6 classes)
    • Better control over train-test separation (avoid data leakage!)
    • Most comprehensive ecosystem in R that is constantly updated
  • More detailed descriptions and tutorials to deal with more advanced problems can be found under https://mlr3.mlr-org.com/

  • Other popular frameworks include tidymodels (https://www.tidymodels.org/), caret (https://topepo.github.io/caret/), and h2o (https://cran.r-project.org/web/packages/h2o/index.html)