Introduction

Today, we will learn how to use different ensemble methods in R, recap how to evaluate the performance of these methods, and learn how we can substantively interpret the model output.

In this practical, we will work with the ILPD (Indian Liver Patient Dataset) from the UCI Machine Learning Repository (you can find the data here). This data set contains data on 414 liver disease patients and 165 non-patients. In general, medical researchers have two distinct goals when doing research: (1) to classify people in their waiting room as either patients or non-patients, and (2) to gain insight into the factors that are associated with the disease. In this practical, we will look at both aspects.

In this practical, we will use the tidyverse, magrittr, psych, caret, gbm, xgboost, data.table and ggforce libraries.

library(tidyverse)
library(magrittr)
library(psych)
library(caret)
library(gbm)
library(xgboost)
library(data.table)
library(ggforce)

First, we specify a seed and load the training data. We will use this data to make inferences and to train a prediction model.

set.seed(45)
df <- readRDS("data/train_disease.RDS")

1. Get an impression of the data by looking at the structure of the data and creating some descriptive statistics.

2. To further explore the data we work with, create some interesting data visualizations that show whether there are interesting patterns in the data.

Hint: Think about adding a color aesthetic for the variable Disease.

3. Briefly reflect on the difference between bagging, random forests, and boosting.
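As a starting point for this reflection, the sketch below builds a bagged classifier by hand: each tree is fit on a bootstrap sample of the rows, and the trees then vote. It uses rpart and a small simulated data set (the object toy and its variables x1 to x3 are hypothetical stand-ins, not the ILPD data). A random forest would additionally consider only a random subset of the predictors at each split, and boosting would fit the trees sequentially, each one focusing on the errors of the previous ones.

```r
library(rpart)

# Simulated stand-in data (hypothetical, not the ILPD data)
set.seed(45)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$Disease <- factor(ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(n) > 0, "Disease", "Healthy"),
                      levels = c("Healthy", "Disease"))

# Bagging: fit one classification tree per bootstrap sample of the rows
n_trees <- 25
trees <- lapply(seq_len(n_trees), function(b) {
  boot_rows <- sample(n, replace = TRUE)
  rpart(Disease ~ ., data = toy[boot_rows, ], method = "class")
})

# Every tree predicts a class for every person (rows = persons, columns = trees)
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, newdata = toy, type = "class"))
})

# The bagged prediction is the majority vote across the trees
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
bagged_pred <- factor(bagged_pred, levels = levels(toy$Disease))

# Accuracy on the training data (an optimistic estimate)
mean(bagged_pred == toy$Disease)
```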

We are going to apply different machine learning models using the caret package.

4. Apply bagging to the training data, to predict the outcome Disease, using the caret library.

Note. We first specify the internal validation settings, like so:

cvcontrol <- trainControl(method = "repeatedcv", 
                          number = 10,
                          allowParallel = TRUE)

These settings can be inserted within the train function from the caret package. Make sure to use the treebag method, to specify cvcontrol as the trControl argument and to set importance = TRUE.
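A minimal sketch of such a call is shown below, on simulated stand-in data rather than the training data (the objects toy and bag_toy are hypothetical names). It assumes the packages that caret needs for the treebag method are installed.

```r
library(caret)

# Simulated stand-in data (hypothetical, not the ILPD data)
set.seed(45)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$Disease <- factor(ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(n) > 0, "Disease", "Healthy"),
                      levels = c("Healthy", "Disease"))

# Internal validation settings, as specified in the practical
cvcontrol <- trainControl(method = "repeatedcv",
                          number = 10,
                          allowParallel = TRUE)

# Bagged trees, evaluated with the cross-validation settings above
bag_toy <- train(Disease ~ .,
                 data = toy,
                 method = "treebag",
                 trControl = cvcontrol,
                 importance = TRUE)

# Printing the object shows the cross-validated performance
bag_toy
```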

5. Interpret the variable importance measure using the varImp function on the trained model object.

6. Create training set predictions based on the bagged model, and use the confusionMatrix() function from the caret package to assess its performance.

Hint: You will have to create predictions based on the trained model for the training data, and evaluate these against the observed values of the training data.
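The small example below illustrates how confusionMatrix() expects its input: the predicted classes go in the data argument and the observed classes in the reference argument, both as factors with the same levels. The two vectors are made up for illustration only.

```r
library(caret)

# Made-up observed and predicted classes for six persons (illustration only)
class_levels <- c("Healthy", "Disease")
observed  <- factor(c("Healthy", "Healthy", "Disease", "Disease", "Disease", "Healthy"),
                    levels = class_levels)
predicted <- factor(c("Healthy", "Disease", "Disease", "Disease", "Healthy", "Healthy"),
                    levels = class_levels)

# Predictions first, observed values second; "Disease" is treated as the positive class
cm <- confusionMatrix(data = predicted, reference = observed, positive = "Disease")

cm$table                     # cross-table of predicted versus observed classes
cm$overall["Accuracy"]       # 4 of the 6 predictions are correct
cm$byClass["Sensitivity"]    # 2 of the 3 observed Disease cases are detected
```

With a trained caret model, the predicted classes would typically come from predict() applied to the model object and the data.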

7. Now ask for the output of the bagged model. Explain why the performance estimates under both approaches differ.

We will now follow the same approach, but rather than bagging, we will train a random forest on the training data.

8. Fit a random forest to the training data to predict the outcome Disease, using the caret library.

Note. Use the same cvcontrol settings as in the previous model.
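A sketch of the corresponding call, again on simulated stand-in data (toy and rf_toy are hypothetical names), is given below. It assumes the randomForest package is installed, which caret uses for the rf method.

```r
library(caret)

# Simulated stand-in data (hypothetical, not the ILPD data)
set.seed(45)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$Disease <- factor(ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(n) > 0, "Disease", "Healthy"),
                      levels = c("Healthy", "Disease"))

cvcontrol <- trainControl(method = "repeatedcv",
                          number = 10,
                          allowParallel = TRUE)

# Random forest; caret tunes mtry, the number of predictors tried at each split
rf_toy <- train(Disease ~ .,
                data = toy,
                method = "rf",
                trControl = cvcontrol,
                importance = TRUE)

# Variable importance of the trained model
rf_imp <- varImp(rf_toy)
rf_imp
```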

9. Again, interpret the variable importance measure using the varImp function on the trained model object. Do you draw the same conclusions as under the bagged model?

10. Print the output of the random forest model. Are we doing better than with the bagged model?

11. Now, fit a boosting model using the caret library to predict disease status.

Hint: Use gradient boosting (the gbm method in caret).
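The sketch below shows what such a call could look like on simulated stand-in data (toy and gbm_toy are hypothetical names). The argument verbose = FALSE is passed on to gbm to suppress the iteration log, and the relative influence is taken from the final model without plotting it.

```r
library(caret)
library(gbm)

# Simulated stand-in data (hypothetical, not the ILPD data)
set.seed(45)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$Disease <- factor(ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(n) > 0, "Disease", "Healthy"),
                      levels = c("Healthy", "Disease"))

cvcontrol <- trainControl(method = "repeatedcv",
                          number = 10,
                          allowParallel = TRUE)

# Gradient boosting; caret tunes the number of trees and the tree depth
gbm_toy <- train(Disease ~ .,
                 data = toy,
                 method = "gbm",
                 trControl = cvcontrol,
                 verbose = FALSE)

# Relative influence of each predictor in the final model (normalized to sum to 100)
rel_inf <- summary(gbm_toy$finalModel, plotit = FALSE)
rel_inf
```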

12. Again, interpret the variable importance measure. You will have to call for summary() on the model object you just created. Compare the output to the previously obtained variable importance measures.

13. Print the output of the gradient boosting model. Are we doing better than with the bagged and random forest models?

For now, we will continue with extreme gradient boosting, although we will use a different procedure.

We will use xgboost to train a binary classification model, and create some visualizations to obtain additional insight into our model. We will create the visualizations using SHAP (SHapley Additive exPlanations) values, which are a measure of the importance of the variables in the model. More specifically, SHAP values indicate the influence of each input variable on the prediction for each person. Essentially, for each person, they indicate how much the prediction changes when that person's value on the variable is taken into account, compared with when it is not.
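To make this concrete, the sketch below computes SHAP values directly with xgboost on simulated stand-in data (toy_x, toy_y and fit are hypothetical names, and the settings differ from the ones used later in this practical). It relies on the predcontrib argument of predict(). As far as we are aware, for the binary:logistic objective xgboost reports these contributions on the log-odds scale rather than the probability scale, which is why the check below compares them with the margin predictions.

```r
library(xgboost)

# Simulated stand-in data (hypothetical, not the ILPD data)
set.seed(45)
n <- 200
toy_x <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy_y <- as.numeric(toy_x[, "x1"] + 0.5 * toy_x[, "x2"] + rnorm(n) > 0)

dtrain <- xgb.DMatrix(data = toy_x, label = toy_y)
fit <- xgb.train(params = list(objective = "binary:logistic",
                               max_depth = 3,
                               eta = 0.3,
                               nthread = 1),
                 data = dtrain,
                 nrounds = 10)

# One row per person; one column per feature plus a final baseline column
shap <- predict(fit, dtrain, predcontrib = TRUE)

# For each person, the contributions add up to the prediction on the log-odds scale
margin <- predict(fit, dtrain, outputmargin = TRUE)
max(abs(rowSums(shap) - margin))

# Mean absolute SHAP value per feature as a global importance measure
shap_features <- shap[, seq_len(ncol(toy_x)), drop = FALSE]
colMeans(abs(shap_features))
```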

14. Download the file shap.R from this Github repository.

Note. There are multiple ways to do this, of which the simplest is to run the following code.

library(devtools)
source_url("https://github.com/pablo14/shap-values/blob/master/shap.R?raw=TRUE")

Alternatively, you could go to the file shap.R and copy and paste the code into your current project, or fork and clone the repository if you want to adjust the functions it defines.

15. Specify your model as follows, and use it to create predictions on the training data.

train_x <- model.matrix(Disease ~ ., df)[,-1]
train_y <- as.numeric(df$Disease) - 1
xgboost_train <- xgboost(data = train_x,
                         label = train_y, 
                         max.depth = 10,
                         eta = 1,
                         nthread = 4,
                         nrounds = 4,
                         objective = "binary:logistic",
                         verbose = 2)



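# Convert predicted probabilities to classes with a 0.5 cut-off.
# Assumption: the second level of df$Disease (coded 1 in train_y) is "Disease".
pred <- tibble(Disease = predict(xgboost_train, newdata = train_x)) %>%
  mutate(Disease = factor(ifelse(Disease < 0.5, 1, 2),
                          levels = c(1, 2),  # keep both classes even if only one is predicted
                          labels = c("Healthy", "Disease")))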

table(pred$Disease, df$Disease)

16. First, calculate the SHAP rank scores for all variables in the data, and create a variable importance plot using these values. Interpret the plot.

17. Plot the SHAP values for every individual for every feature and interpret them.

18. Verify which of the models you created in this practical performs best on the test data.

Hand-in

When you have finished the practical,

enclose all files of the project (i.e. all .R and/or .Rmd files including the one with your answers, and the .Rproj file) in a zip file, and
hand in the zip here. Do so before next week’s lecture.
