- Resampling (Recap + Nested Resampling)
- Hyperparameter Tuning
- Model Classes
- Benchmarking
Supervised Learning and Visualization
When tuning hyperparameters, a more complex resampling strategy is necessary:
This is necessary to obtain “fair”, i.e., not overly optimistic, performance estimates
Although random forests usually work relatively well with default settings, tuning may be beneficial to improve accuracy (classification) or prediction precision (regression)
Remember: Tuning and selection of hyperparameters need to be separated from performance evaluation (inner resampling strategy within the modeling pipeline)!
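The following minimal sketch illustrates this separation with a nested scheme in base R. It assumes a regression data frame `dat` with outcome `y`, at least ten predictors, and the ranger package; the inner selection step uses the out-of-bag error as a simple stand-in for a full inner cross-validation, and all candidate values are illustrative:

```r
library(ranger)
set.seed(1)

K <- 5                                            # outer 5-fold CV for performance evaluation
fold <- sample(rep(1:K, length.out = nrow(dat)))  # assumed data frame `dat`
outer_mse <- numeric(K)

for (k in 1:K) {
  train <- dat[fold != k, ]
  test  <- dat[fold == k, ]

  # inner step: select mtry on the training part only (OOB error as a
  # stand-in for an inner resampling; candidate values are illustrative)
  cands <- c(2, 5, 10)
  oob   <- sapply(cands, function(m)
    ranger(y ~ ., data = train, mtry = m, num.trees = 250)$prediction.error)

  fit <- ranger(y ~ ., data = train,
                mtry = cands[which.min(oob)], num.trees = 250)
  outer_mse[k] <- mean((predict(fit, data = test)$predictions - test$y)^2)
}

mean(outer_mse)  # "fair" performance estimate of the tuned modeling pipeline
```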
Example: grid search for a random forest, crossing candidate values for mtry (number of candidate variables per split) and ntrees (number of trees):

mtry | ntrees |
---|---|
10 | 100 |
15 | 100 |
20 | 100 |
10 | 250 |
15 | 250 |
20 | 250 |
10 | 500 |
15 | 500 |
20 | 500 |
Each candidate configuration is then evaluated (e.g., via an inner cross-validation), here using the mean squared error (MSE):

mtry | ntrees | MSE |
---|---|---|
10 | 100 | 0.584 |
15 | 100 | 0.506 |
20 | 100 | 0.429 |
10 | 250 | 0.536 |
15 | 250 | 0.428 |
20 | 250 | 0.525 |
10 | 500 | 0.322 |
15 | 500 | 0.476 |
20 | 500 | 0.409 |
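A plain grid search over these configurations could look as follows; a sketch assuming the ranger package and a regression data frame `dat` with outcome `y` and at least 20 predictors, again using the out-of-bag error as a quick stand-in for an inner resampling:

```r
library(ranger)

grid <- expand.grid(mtry = c(10, 15, 20), ntrees = c(100, 250, 500))

# evaluate every combination and store its (OOB) mean squared error
grid$mse <- vapply(seq_len(nrow(grid)), function(i) {
  ranger(y ~ ., data = dat,
         mtry = grid$mtry[i], num.trees = grid$ntrees[i])$prediction.error
}, numeric(1))

grid[which.min(grid$mse), ]  # best (mtry, ntrees) combination
```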
It is easy to see that grid search is only feasible if few values of a few selected hyperparameters are tested
The computational demand increases drastically when many hyperparameters (see, e.g., XGBoost) are to be optimized and many candidate values (and their combinations) have to be evaluated
Random search is an alternative with a limited budget (it can be more efficient, especially when some hyperparameters do not strongly impact model performance)
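In mlr3tuning (used below), grid search and random search are interchangeable tuner objects, so a budget-limited random search is a drop-in replacement; a sketch, assuming mlr3tuning (or the mlr3verse) is loaded and with an illustrative budget:

```r
tuner_grid   <- tnr("grid_search", resolution = 5)  # full grid, 5 values per hyperparameter
tuner_random <- tnr("random_search")                # samples configurations at random
budget       <- trm("evals", n_evals = 100)         # stop after 100 evaluated configurations
```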
Some ML algorithms are more strongly affected by hyperparameter settings than others - i.e., they have a higher tunability, which means that their hyperparameters need to be tuned to obtain good predictions (Probst et al., 2019).
In contrast to classical statistical modeling, machine learning is usually applied in a more exploratory, purely data-driven manner: we do not know beforehand how the outcome is related to the predictors, and therefore we also do not know which algorithm will work best for a specific data set / prediction task.
Idea: Conducting a small benchmark experiment to compare different algorithms
All models are trained on the same data (e.g., using the same folds in CV)
Hyperparameter tuning should be “fair”, i.e., carried out within the (inner) resampling rather than on the data used for performance evaluation
Always include a meaningful baseline for comparison, e.g., a featureless learner predicting the outcome's mean (regression task) or mode (classification task)
It is often a good idea to compare a simple linear model (e.g., the LASSO) to one or several more complex models
If the more complex model does not outperform a simpler, more interpretable model, the latter is clearly favourable
Sometimes it makes sense to look at several performance measures in a benchmark experiment (e.g., accuracy, sensitivity, and specificity for classification)
So far, we have used different packages to train different models (e.g., rpart, ranger); mlr3 provides an ecosystem with wrapper functions for the most common algorithms
It also facilitates:
- resampling (inner and outer resampling strategy) and performance evaluation
- data pre-processing through graph-learners (i.e., creating a data analysis pipeline)
- hyperparameter tuning with state-of-the-art tuning schemes (model-based optimization, iterative racing, etc.)
- model comparisons via benchmarking
- the implementation of cost-sensitive learning (e.g., Sterner, Pargent, & Goretzko, 2023)
- model stacking
- interpretable machine learning
- visualization
Create a task (i.e., what should be predicted and which data is used):
```r
# here: classification task
task <- TaskClassif$new(id = "ID",
                        backend = data,       # data set
                        target = "Outcome",
                        positive = "Positive Class")  # only in binary classification tasks
```
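For illustration, the same constructor with a concrete data set; a sketch using a two-class subset of the built-in iris data (so that a positive class can be specified), assuming mlr3 is loaded:

```r
iris2 <- iris[iris$Species != "setosa", ]   # keep two classes only
iris2$Species <- droplevels(iris2$Species)

task <- TaskClassif$new(id = "iris2", backend = iris2,
                        target = "Species", positive = "versicolor")
```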
Define resampling scheme (inner and outer resampling), e.g.,:
```r
# define resampling using pre-specified resampling concepts such as RCV and CV
res_outer <- rsmp("repeated_cv", repeats = 3, folds = 5)
res_inner <- rsmp("cv", folds = 5)
```
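A quick sanity check of the number of train/test splits implied by these choices:

```r
res_outer$iters  # 3 repeats x 5 folds = 15 outer evaluation splits
res_inner$iters  # 5 inner splits per tuning evaluation
```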
Define tuning, if necessary:
```r
# define tuning, e.g., using random search with 1000 iterations
terminator <- trm("evals", n_evals = 1000)
tuner <- tnr("random_search")
```
Define the performance measure for the inner resampling, i.e., the measure used to select hyperparameter settings:
```r
# define inner performance measure, e.g., accuracy
mes_acc <- msr("classif.acc")
```
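All available measures are collected in the measure dictionary, which can be browsed to pick a suitable key:

```r
# list all registered performance measures (keys such as "classif.acc", "classif.auc", ...)
as.data.table(mlr_measures)
```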
Define algorithms (mlr3: learners) that should be tested:
```r
# define learners, use featureless learner as baseline;
# here: classical logistic regression (simple model), random forest ("ranger")
# and xgboost
baseline <- lrn("classif.featureless", predict_type = "prob")
logreg   <- lrn("classif.log_reg", predict_type = "prob")
ranger   <- lrn("classif.ranger", predict_type = "prob")
xgboost  <- lrn("classif.xgboost", predict_type = "prob")
```
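The penalized linear model mentioned above could be added in the same way; a sketch assuming the mlr3learners and glmnet packages are installed:

```r
# LASSO-type logistic regression with internally cross-validated penalty
lasso <- lrn("classif.cv_glmnet", alpha = 1, predict_type = "prob")
```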
Define the hyperparameters that should be tuned and create a modeling pipeline for learners whose hyperparameters need tuning:
```r
# Example: xgboost for which two hyperparameters are tuned
param_set_xgb <- ps(
  nrounds = p_int(lower = 1, upper = 500),
  eta     = p_dbl(lower = 0.001, upper = 0.01))

xgb_tuned <- AutoTuner$new(
  learner      = xgboost,        # basic xgboost learner (see above)
  resampling   = res_inner,      # inner resampling to evaluate different param sets
  measure      = mes_acc,        # performance measure to select parameters (see above)
  search_space = param_set_xgb,  # parameter range that is tested
  terminator   = terminator,     # termination criterion
  tuner        = tuner)          # tuning scheme (here: random search)

# new id to find model in output
# (necessary if multiple xgbs with different tuning schemes are evaluated)
xgb_tuned$id <- "classif.xgboost.tuned"
```
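The random forest could be tuned analogously; a sketch with illustrative parameter ranges (mtry must not exceed the number of features), where the resulting `rf_tuned` learner would replace or complement `ranger` in the benchmark grid below:

```r
param_set_rf <- ps(
  mtry      = p_int(lower = 2, upper = 20),
  num.trees = p_int(lower = 100, upper = 500))

rf_tuned <- AutoTuner$new(
  learner      = ranger,         # basic ranger learner (see above)
  resampling   = res_inner,
  measure      = mes_acc,
  search_space = param_set_rf,
  terminator   = terminator,
  tuner        = tuner)
rf_tuned$id <- "classif.ranger.tuned"
```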
Set up the benchmark grid with all learners and tasks (here only one task; in principle, you could benchmark different algorithms on several tasks, which would shift the focus of the benchmark experiment from the specific task to the methodological differences between the algorithms and tested pipelines):
```r
# Create benchmark grid defining the task(s), learners and the
# resampling strategy used for performance evaluation (i.e., outer resampling)
grid <- benchmark_grid(tasks = list(task),
                       learners = list(baseline, logreg, ranger, xgb_tuned),
                       resamplings = list(res_outer))
```
Run the benchmark experiment:

```r
results <- benchmark(grid)
```
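Benchmarking parallelizes over resampling iterations via the future framework; optionally, set a parallel plan before calling benchmark() (assumes the future package is installed):

```r
future::plan("multisession")  # run resampling iterations in parallel sessions
```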
Analyze the results using different performance metrics, e.g., accuracy, sensitivity, and specificity, and obtain standard deviations for all metrics to get an idea of how stable/unstable the aggregated performance estimates are:
```r
mes_list <- list(
  msr("classif.sensitivity"),
  msr("classif.sensitivity", id = "classif.sensitivity.sd", aggregator = sd),
  msr("classif.specificity"),
  msr("classif.specificity", id = "classif.specificity.sd", aggregator = sd),
  msr("classif.acc"),
  msr("classif.acc", id = "classif.acc.sd", aggregator = sd)
)
results$aggregate(mes_list)
```
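Besides the aggregated values, the scores of the individual (outer) resampling iterations, which the standard deviations above summarize, can be inspected directly:

```r
# per-iteration scores, e.g., accuracy per outer train/test split
results$score(msr("classif.acc"))
```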
Quick visualization of the results:
```r
mlr3viz::autoplot(results, measure = msr("classif.acc"), type = "boxplot")
```
Load the complete mlr3verse to get full access to wrapper functions for a variety of algorithms (e.g., via mlr3learners and mlr3extralearners), tools for visualizing results (mlr3viz), and many more (advanced) tools for conducting ML experiments in R:
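```r
# attaches the core mlr3 packages (mlr3, mlr3learners, mlr3tuning, mlr3viz,
# mlr3pipelines, ...); mlr3extralearners has to be installed separately
library(mlr3verse)
```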
More detailed descriptions and tutorials for more advanced problems can be found at https://mlr3.mlr-org.com/
Other popular frameworks include tidymodels (https://www.tidymodels.org/), caret (https://topepo.github.io/caret/), and h2o (https://cran.r-project.org/web/packages/h2o/index.html)