Today, we will learn how to use different ensemble methods in R, recap on how to evaluate the performance of the methods, and learn how we can substantively interpret the model output.
In this practical we will work with the ILPD (Indian Liver Patient Dataset) from the UCI Machine Learning Repository (you can find the data here). This data set contains data on 414 liver disease patients and 165 non-patients. In general, medical researchers have two distinct goals when doing research: (1) to be able to classify people in their waiting room as either patients or non-patients, and (2) to gain insight into the factors that are associated with the disease. In this practical, we will look at both aspects.
In this practical, we will use the tidyverse, magrittr, psych, GGally and caret libraries.
library(tidyverse)
library(magrittr)
library(psych)
library(caret)
library(gbm)
library(xgboost)
library(data.table)
library(ggforce)
First, we specify a seed and load the training data. We will use this data to make inferences and to train a prediction model.
set.seed(45)
df <- readRDS("data/train_disease.RDS")
1. Get an impression of the data by looking at the structure of the data and creating some descriptive statistics.
2. To further explore the data we work with, create some interesting data visualizations that show whether there are interesting patterns in the data.
Hint: Think about adding a color aesthetic for the variable Disease.
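As a rough illustration of questions 1 and 2, the sketch below inspects and plots a small simulated data set. The data frame `toy` and its columns `Age` and `Bilirubin` are made up for this example and are not the variables of the practical; with the real data you would use `df` and its own column names instead.

```r
library(ggplot2)

# Simulated stand-in for the training data (hypothetical column names)
set.seed(45)
n <- 200
toy <- data.frame(Age = rnorm(n, 45, 15), Bilirubin = rexp(n))
toy$Disease <- factor(ifelse(toy$Bilirubin + rnorm(n, sd = 0.5) > 1, "Disease", "Healthy"),
                      levels = c("Healthy", "Disease"))

str(toy)      # variable types and the first few values
summary(toy)  # basic descriptive statistics per variable

# Scatter plot with the outcome mapped to the colour aesthetic
p <- ggplot(toy, aes(x = Age, y = Bilirubin, colour = Disease)) +
  geom_point(alpha = 0.6) +
  theme_minimal()
p
```

Mapping the outcome to colour makes it easy to see which predictors separate the two groups and which do not.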
3. Briefly reflect on the difference between bagging, random forests, and boosting.
We are going to apply different machine learning models using the caret package.
4. Apply bagging to the training data, to predict the outcome Disease, using the caret library.
Note. We first specify the internal validation settings, like so:
cvcontrol <- trainControl(method = "repeatedcv",
                          number = 10,
                          allowParallel = TRUE)
These settings can be passed to the train function from the caret package. Make sure to use the treebag method, to specify cvcontrol as the trControl argument and to set importance = TRUE.
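The call described above could look roughly as follows. This is only a sketch on simulated data: `toy`, its column names and the object name `bag_toy` are assumptions made for illustration, and in the practical the `data` argument would be `df`.

```r
library(caret)

# Simulated stand-in for the training data (hypothetical column names)
set.seed(45)
n <- 200
toy <- data.frame(Age = rnorm(n, 45, 15), Bilirubin = rexp(n))
toy$Disease <- factor(ifelse(toy$Bilirubin + rnorm(n, sd = 0.5) > 1, "Disease", "Healthy"),
                      levels = c("Healthy", "Disease"))

# Same internal validation settings as in the note above
cvcontrol <- trainControl(method = "repeatedcv", number = 10, allowParallel = TRUE)

# Bagged classification trees, evaluated with cross-validation
bag_toy <- train(Disease ~ ., data = toy, method = "treebag",
                 trControl = cvcontrol, importance = TRUE)

bag_toy          # cross-validated accuracy and kappa
varImp(bag_toy)  # variable importance, by default scaled to 0-100
```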
5. Interpret the variable importance measure using the varImp function on the trained model object.
6. Create training set predictions based on the bagged model, and use the confusionMatrix() function from the caret package to assess its performance.
Hint: You will have to create predictions based on the trained model for the training data, and evaluate these against the observed values of the training data.
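To show how `confusionMatrix()` is called, here is a minimal sketch with two hand-made factors instead of real model output. The vectors `observed` and `predicted` are invented for this example; in the practical the predictions would come from `predict()` on the trained model and the observed values from `df$Disease`.

```r
library(caret)

lv <- c("Healthy", "Disease")

# Hand-made observed and predicted classes for eight people
observed  <- factor(c("Healthy", "Healthy", "Healthy", "Disease",
                      "Disease", "Disease", "Disease", "Disease"), levels = lv)
predicted <- factor(c("Healthy", "Healthy", "Disease", "Disease",
                      "Disease", "Disease", "Disease", "Healthy"), levels = lv)

# First argument: predictions, second argument: observed (reference) values.
# By default the first factor level is treated as the positive class,
# so we set it to "Disease" explicitly.
cm <- confusionMatrix(predicted, observed, positive = "Disease")
cm

cm$overall[["Accuracy"]]    # 6 of 8 correct: 0.75
cm$byClass[["Sensitivity"]] # 4 of 5 diseased people detected: 0.8
```

Setting `positive` matters for sensitivity and specificity, because these are defined relative to the class of interest.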
7. Now ask for the output of the bagged model. Explain why the performance under both approaches differs.
We will now follow the same approach, but rather than bagging, we will train a random forest on the training data.
8. Fit a random forest to the training data to predict the outcome Disease, using the caret library.
Note. Use the same cvcontrol settings as in the previous model.
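A random forest can be requested in the same way by switching the method. The sketch below again uses simulated data; `toy`, its column names and `rf_toy` are hypothetical, and `method = "rf"` assumes that the randomForest package is installed.

```r
library(caret)

# Simulated stand-in for the training data (hypothetical column names)
set.seed(45)
n <- 200
toy <- data.frame(Age = rnorm(n, 45, 15), Bilirubin = rexp(n),
                  Albumin = rnorm(n, 3, 0.8))
toy$Disease <- factor(ifelse(toy$Bilirubin + rnorm(n, sd = 0.5) > 1, "Disease", "Healthy"),
                      levels = c("Healthy", "Disease"))

cvcontrol <- trainControl(method = "repeatedcv", number = 10, allowParallel = TRUE)

# Random forest; caret tunes mtry (number of predictors tried per split)
rf_toy <- train(Disease ~ ., data = toy, method = "rf",
                trControl = cvcontrol, importance = TRUE)

rf_toy$bestTune  # value of mtry selected by cross-validation
varImp(rf_toy)   # variable importance of the final forest
```

Unlike bagging, the random forest has a tuning parameter, so the printed model shows one row of cross-validated results per candidate value of `mtry`.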
9. Again, interpret the variable importance measure using the varImp function on the trained model object. Do you draw the same conclusions as under the bagged model?
10. Print the model output from the random forest. Are we doing better than with the bagged model?
11. Now, fit a boosting model using the caret library to predict disease status.
Hint: Use gradient boosting (the gbm method in caret).
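A gradient boosting fit through caret might be sketched as below. As before, `toy`, its column names and `gbm_toy` are assumptions for illustration only; `verbose = FALSE` is passed on to gbm to suppress its iteration log.

```r
library(caret)
library(gbm)

# Simulated stand-in for the training data (hypothetical column names)
set.seed(45)
n <- 200
toy <- data.frame(Age = rnorm(n, 45, 15), Bilirubin = rexp(n),
                  Albumin = rnorm(n, 3, 0.8))
toy$Disease <- factor(ifelse(toy$Bilirubin + rnorm(n, sd = 0.5) > 1, "Disease", "Healthy"),
                      levels = c("Healthy", "Disease"))

cvcontrol <- trainControl(method = "repeatedcv", number = 10, allowParallel = TRUE)

# Gradient boosting; caret tunes the number of trees and the tree depth
gbm_toy <- train(Disease ~ ., data = toy, method = "gbm",
                 trControl = cvcontrol, verbose = FALSE)

gbm_toy$bestTune                            # selected tuning parameters
summary(gbm_toy$finalModel, plotit = FALSE) # relative influence per variable
```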
12. Again, interpret the variable importance measure. You will have to call summary() on the model object you just created. Compare the output to the previously obtained variable importance measures.
13. Print the model output from our gradient boosting procedure. Are we doing better than with the bagged and random forest model?
For now, we will continue with extreme gradient boosting, although we will use a different procedure.
We will use xgboost to train a binary classification model, and create some visualizations to obtain additional insight into our model. We will create the visualizations using SHAP (SHapley Additive exPlanations) values, which are a measure of the importance of the variables in the model. In fact, SHAP values indicate the influence of each input variable on the predicted probability for each person. Essentially, they give an indication of the difference between the predicted probability with and without that variable, for each person’s score on that variable.
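To make the idea of per-person contributions concrete, the sketch below asks xgboost itself for SHAP values via the `predcontrib` argument of `predict()`. This is not the `shap.R` code used later in the practical, and the simulated predictors `Age` and `Bilirubin` are made up. Note that for a `binary:logistic` model these contributions are on the log-odds scale, not the probability scale.

```r
library(xgboost)

# Simulated predictors and 0/1 outcome (hypothetical variable names)
set.seed(45)
n <- 200
x <- cbind(Age = rnorm(n, 45, 15), Bilirubin = rexp(n))
y <- as.numeric(x[, "Bilirubin"] + rnorm(n, sd = 0.5) > 1)

dtrain <- xgb.DMatrix(data = x, label = y)
bst <- xgb.train(params = list(objective = "binary:logistic",
                               max_depth = 3, eta = 0.3),
                 data = dtrain, nrounds = 10)

# One row per person, one column per predictor, plus a final baseline column
shap <- predict(bst, dtrain, predcontrib = TRUE)

# Per person, the contributions add up to the predicted log-odds
margin <- predict(bst, dtrain, outputmargin = TRUE)
max(abs(rowSums(shap) - margin))

# Mean absolute contribution per predictor: a simple global importance measure
colMeans(abs(shap[, seq_len(ncol(x)), drop = FALSE]))
```

Averaging the absolute contributions over people gives a ranking of the variables, while the individual rows show how each variable pushed a specific person's prediction up or down.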
14. Download the file shap.R from this GitHub repository.
Note. There are multiple ways to do this, of which the simplest is to run the following code.
library(devtools)
source_url("https://github.com/pablo14/shap-values/blob/master/shap.R?raw=TRUE")
Alternatively, you could simply go to the file shap.R and copy-and-paste the code into the current repository. However, you could also fork and clone the repository, to make adjustments to the functions that are already created.
15. Specify your model as follows, and use it to create predictions on the training data.
# Predictor matrix (dummy-coded, intercept column dropped) and 0/1 outcome
train_x <- model.matrix(Disease ~ ., df)[,-1]
train_y <- as.numeric(df$Disease) - 1
xgboost_train <- xgboost(data = train_x,
                         label = train_y,
                         max.depth = 10,
                         eta = 1,
                         nthread = 4,
                         nrounds = 4,
                         objective = "binary:logistic",
                         verbose = 2)
# Classify the predicted probabilities at a 0.5 cut-off
pred <- tibble(Disease = predict(xgboost_train, newdata = train_x)) %>%
  mutate(Disease = factor(ifelse(Disease < 0.5, 1, 2),
                          labels = c("Healthy", "Disease")))
# Rows: predicted class, columns: observed class
table(pred$Disease, df$Disease)
16. First, calculate the SHAP rank scores for all variables in the data, and create a variable importance plot using these values. Interpret the plot.
17. Plot the SHAP values for every individual for every feature and interpret them.
18. Verify which of the models you created in this practical performs best on the test data.
When you have finished the practical, enclose all files of the project (i.e. all .R and/or .Rmd files including the one with your answers, and the .Rproj file) in a zip file, and hand in the zip here. Do so before next week’s lecture.