Today, we will learn how to use different ensemble methods in R, recap how to evaluate the performance of these methods, and learn how we can substantively interpret the model output.
In this practical we will work with the ILPD (Indian Liver Patient Dataset) from the UCI Machine Learning Repository (you can find the data here). This data set contains data on 414 liver disease patients and 165 non-patients. In general, medical researchers have two distinct goals when doing research: (1) to be able to classify people in their waiting room as either patients or non-patients, and (2) to get insight into the factors that are associated with the disease. In this practical, we will look at both aspects.
In this practical, we will use the tidyverse, magrittr, psych, GGally, and caret libraries, along with the gbm, xgboost, data.table, and ggforce packages loaded below.
library(tidyverse)
library(magrittr)
library(psych)
library(caret)
library(gbm)
library(xgboost)
library(data.table)
library(ggforce)
First, we specify a seed and load the training data. We will use this data to make inferences and to train a prediction model.
set.seed(45)
df <- readRDS("data/train_disease.RDS")
1. Get an impression of the data by looking at the structure of the data and creating some descriptive statistics.
head(df)
tail(df)
df %>%
  select(-c(Gender, Disease)) %>%
  describeBy(df$Disease, fast = TRUE)
##
## Descriptive statistics by group
## group: Healthy
## vars n mean sd min max range se
## Age 1 360 45.81 15.60 10.0 90.0 80.0 0.82
## Total_Bilirubin 2 360 4.05 7.15 0.4 75.0 74.6 0.38
## Direct_Bilirubin 3 360 1.86 3.16 0.1 19.7 19.6 0.17
## Alkaline_Phosphotase 4 360 309.47 250.37 63.0 2110.0 2047.0 13.20
## Alamine_Aminotransferase 5 360 97.19 200.17 12.0 1680.0 1668.0 10.55
## Aspartate_Aminotransferase 6 360 133.66 324.59 11.0 4929.0 4918.0 17.11
## Total_Protiens 7 360 6.47 1.11 2.7 9.6 6.9 0.06
## Albumin 8 360 3.07 0.80 0.9 5.5 4.6 0.04
## Ratio_Albumin_Globulin 9 360 0.92 0.32 0.3 2.8 2.5 0.02
## ------------------------------------------------------------
## group: Disease
## vars n mean sd min max range se
## Age 1 140 41.51 16.98 4.00 85.0 81.00 1.43
## Total_Bilirubin 2 140 1.14 1.01 0.50 7.3 6.80 0.09
## Direct_Bilirubin 3 140 0.40 0.53 0.10 3.6 3.50 0.04
## Alkaline_Phosphotase 4 140 220.56 149.76 90.00 1580.0 1490.00 12.66
## Alamine_Aminotransferase 5 140 33.31 24.50 10.00 181.0 171.00 2.07
## Aspartate_Aminotransferase 6 140 39.79 34.79 10.00 285.0 275.00 2.94
## Total_Protiens 7 140 6.53 1.07 3.70 9.2 5.50 0.09
## Albumin 8 140 3.34 0.80 1.40 5.0 3.60 0.07
## Ratio_Albumin_Globulin 9 140 1.04 0.30 0.37 1.9 1.53 0.03
# It is immediately clear that there are substantial differences between
# the diseased and non-diseased groups in the data.
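It can also help to check how unbalanced the outcome is, since the majority-class proportion returns later as the "No Information Rate" in the confusion matrices. A quick (optional) check:
# Class balance of the outcome in the training data.
df %>%
  count(Disease) %>%
  mutate(prop = n / sum(n))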
2. To further explore the data we work with, create some data visualizations that show whether there are interesting patterns in the data.
Hint: Think about adding a color aesthetic for the variable Disease.
df %>%
  select(-Gender) %>%
  pivot_longer(where(is.numeric)) %>%
  ggplot(aes(x = value, col = Disease, fill = Disease)) +
  geom_boxplot(alpha = 0.8) +
  facet_wrap(~name, scales = "free") +
  scale_color_brewer(palette = "Paired") +
  scale_fill_brewer(palette = "Paired") +
  theme_minimal()
df %>%
  select(-Gender) %>%
  pivot_longer(where(is.numeric)) %>%
  ggplot(aes(x = value, col = Disease, fill = Disease)) +
  geom_density(alpha = 0.8) +
  facet_wrap(~name, scales = "free") +
  scale_color_brewer(palette = "Paired") +
  scale_fill_brewer(palette = "Paired") +
  theme_minimal()
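As an optional extra, the GGally package mentioned in the introduction can show pairwise relationships and class overlap in a single figure. A minimal sketch, restricted to a handful of columns to keep the plot readable (the column selection below is purely illustrative):
# Optional: pairwise plots coloured by disease status, using GGally.
library(GGally)
df %>%
  select(Disease, Age, Total_Bilirubin, Albumin, Ratio_Albumin_Globulin) %>%
  ggpairs(mapping = aes(colour = Disease))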
prop.table(table(df$Gender, df$Disease), margin = 1) %>%
  as.data.frame() %>%
  select(Gender = Var1, Disease = Var2, `Relative Frequency` = Freq) %>%
  ggplot(aes(y = `Relative Frequency`, x = Gender, col = Disease, fill = Disease)) +
  geom_col(alpha = 0.8, position = "dodge") +
  scale_fill_brewer(palette = "Paired") +
  scale_color_brewer(palette = "Paired") +
  theme_minimal()
## There are some differences between the distributions for the two
## Disease categories, but the differences do not seem to be dramatic.
## Additionally, the liver disease is relatively more common among
## women than among men.
3. Briefly reflect on the difference between bagging, random forests, and boosting.
## Bagging: fit a classification tree to each of N bootstrap samples of the
##          training data, and aggregate the predictions of all trees (e.g.,
##          by majority vote, or by averaging predicted probabilities).
##          Note: the out-of-bag data can serve as an internal validation set.
## Random forest: as in bagging, classification trees are trained on bootstrap
##          samples of the data. However, at each split the trees may only
##          consider a random subset of the features, which decorrelates them.
## Boosting: we build decision trees sequentially. Given the current model,
##          we fit a (small) tree on the residuals of the current model,
##          rather than on the outcome Y itself.
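To make the difference between bagging and random forests concrete: a random forest that is allowed to consider all predictors at every split is equivalent to bagging. A short, optional sketch with the randomForest package (which is also the engine behind caret's 'rf' method used later); mtry controls how many predictors are sampled at each split.
# Optional illustration (assumes the randomForest package is installed):
# bagging corresponds to mtry = number of predictors; a random forest uses fewer.
library(randomForest)
p      <- ncol(df) - 1                                                   # number of predictors
bagged <- randomForest(Disease ~ ., data = df, mtry = p)                 # bagging
forest <- randomForest(Disease ~ ., data = df, mtry = floor(sqrt(p)))    # random forest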
We are going to apply different machine learning models using the caret package.
4. Apply bagging to the training data to predict the outcome Disease, using the caret library.
Note. We first specify the internal validation settings, like so:
cvcontrol <- trainControl(method = "repeatedcv",
                          number = 10,
                          allowParallel = TRUE)
These settings can be inserted within the train function from the caret package. Make sure to use the treebag method, to specify cvcontrol as the trControl argument, and to set importance = TRUE.
bag_train <- train(Disease ~ .,
                   data = df,
                   method = 'treebag',
                   trControl = cvcontrol,
                   importance = TRUE)
5. Interpret the variable importance measure using the varImp function on the trained model object.
bag_train %>%
  varImp() %>%
  plot()
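If you prefer numbers over a plot, the scores behind this figure can also be extracted from the varImp object (by default caret scales them to run from 0 to 100):
# Numeric importance scores underlying the plot above.
bag_imp <- varImp(bag_train)
bag_imp$importance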
6. Create training set predictions based on the bagged model, and use the confusionMatrix() function from the caret package to assess its performance.
Hint: You will have to create predictions based on the trained model for the training data, and evaluate these against the observed values of the training data.
confusionMatrix(predict(bag_train, type = "raw"),
                df$Disease)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Disease
## Healthy 360 0
## Disease 0 140
##
## Accuracy : 1
## 95% CI : (0.9926, 1)
## No Information Rate : 0.72
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.00
## Specificity : 1.00
## Pos Pred Value : 1.00
## Neg Pred Value : 1.00
## Prevalence : 0.72
## Detection Rate : 0.72
## Detection Prevalence : 0.72
## Balanced Accuracy : 1.00
##
## 'Positive' Class : Healthy
##
## We have achieved perfect training set performance. However, this shows
## nothing more than that we have been able to train the model. We need to
## evaluate our model on data that was not used for training (e.g., through
## cross-validation or a hold-out test set) to get an honest performance estimate.
7. Now ask for the output of the bagged model. Explain why the performance estimates under both approaches differ.
bag_train
## Bagged CART
##
## 500 samples
## 10 predictor
## 2 classes: 'Healthy', 'Disease'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 450, 450, 450, 450, 450, 450, ...
## Resampling results:
##
## Accuracy Kappa
## 0.678 0.1311671
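## The confusion matrix in the previous question evaluates the model on the very
## same data it was trained on, which yields a (deceptively) perfect accuracy.
## The accuracy shown here is instead estimated through 10-fold cross-validation,
## so it reflects performance on held-out data and is therefore much lower.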
We will now follow the same approach, but rather than bagging, we will train a random forest on the training data.
8. Fit a random forest to the training data to predict the outcome Disease, using the caret library.
Note. Use the same cvcontrol settings as in the previous model.
rf_train <- train(Disease ~ .,
                  data = df,
                  method = 'rf',
                  trControl = cvcontrol,
                  importance = TRUE)
9. Again, interpret the variable importance measure using the varImp function on the trained model object. Do you draw the same conclusions as under the bagged model?
rf_train %>%
  varImp() %>%
  plot()
## The random forest model indicates that other variables are more important,
## as compared to the bagged model.
10. Output the model output from the random forest. Are we doing better than with the bagged model?
rf_train
## Random Forest
##
## 500 samples
## 10 predictor
## 2 classes: 'Healthy', 'Disease'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 450, 450, 450, 450, 450, 450, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.704 0.1618056
## 6 0.688 0.1540873
## 10 0.690 0.1719310
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Yes, the most accurate model indicates that we do just slightly better than
## with the bagged model. However, this might well be due to chance.
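Whether such a small difference is meaningful can be inspected more directly with caret's resamples() function, which collects the fold-by-fold results of several train() objects. A short, optional sketch:
# Compare the cross-validated performance of bagging and the random forest.
model_comparison <- resamples(list(bagging = bag_train, random_forest = rf_train))
summary(model_comparison)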
11. Now, fit a boosting model using the caret library to predict disease status.
Hint: Use gradient boosting (the gbm method in caret).
gbm_train <- train(Disease ~ .,
                   data = df,
                   method = "gbm",
                   verbose = FALSE,
                   trControl = cvcontrol)
12. Again, interpret the variable importance measure. You will have to call summary() on the model object you just created. Compare the output to the previously obtained variable importance measures.
summary(gbm_train)
13. Output the model output from our gradient boosting procedure. Are we doing better than with the bagged and random forest model?
gbm_train
## Stochastic Gradient Boosting
##
## 500 samples
## 10 predictor
## 2 classes: 'Healthy', 'Disease'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 450, 450, 450, 450, 450, 450, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.708 0.1117547
## 1 100 0.712 0.1700777
## 1 150 0.708 0.1609721
## 2 50 0.694 0.1084389
## 2 100 0.680 0.1058856
## 2 150 0.674 0.1042949
## 3 50 0.694 0.1252223
## 3 100 0.694 0.1545466
## 3 150 0.688 0.1280794
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100, interaction.depth =
## 1, shrinkage = 0.1 and n.minobsinnode = 10.
# Yes, our best model is doing slightly better than the previous two models.
# However, this might still be random variation.
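If you suspect that boosting could do better with other settings, note that caret only tried a small default grid of hyperparameters here. You can search a larger grid yourself via the tuneGrid argument; the grid below is purely illustrative, not a recommended configuration.
# Optional: a custom tuning grid for gbm (illustrative values only).
gbm_grid <- expand.grid(interaction.depth = 1:3,
                        n.trees = c(50, 100, 150, 200),
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = 10)
gbm_tuned <- train(Disease ~ .,
                   data = df,
                   method = "gbm",
                   verbose = FALSE,
                   trControl = cvcontrol,
                   tuneGrid = gbm_grid)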
For now, we will continue with extreme gradient boosting, although we will use a different procedure.
We will use xgboost to train a binary classification model, and create some visualizations to obtain additional insight into our model. We will create the visualizations using SHAP (SHapley Additive exPlanations) values, which are a measure of the importance of the variables in the model. In fact, SHAP values indicate the influence of each input variable on the predicted probability for each person. Essentially, they give an indication of the difference between the predicted probability with and without that variable, for each person's score.
14. Download the file shap.R from this GitHub repository.
Note. There are multiple ways to do this, of which the simplest is to run the following code.
library(devtools)
source_url("https://github.com/pablo14/shap-values/blob/master/shap.R?raw=TRUE")
Alternatively, you could simply go to the file shap.R and copy-and-paste the code into your current project. However, you could also fork and clone the repository, to make adjustments to the functions that are already created.
15. Specify your model as follows, and use it to create predictions on the training data.
train_x <- model.matrix(Disease ~ ., df)[,-1]
train_y <- as.numeric(df$Disease) - 1
xgboost_train <- xgboost(data = train_x,
                         label = train_y,
                         max.depth = 10,
                         eta = 1,
                         nthread = 4,
                         nrounds = 4,
                         objective = "binary:logistic",
                         verbose = 2)

pred <- tibble(Disease = predict(xgboost_train, newdata = train_x)) %>%
  mutate(Disease = factor(ifelse(Disease < 0.5, 1, 2),
                          labels = c("Healthy", "Disease")))
table(pred$Disease, df$Disease)
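Before turning to the shap.R helper functions, note that xgboost can also return SHAP contributions directly from predict() with predcontrib = TRUE (assuming a reasonably recent xgboost version). For each person, the contributions, including the "BIAS" baseline column, add up to the predicted log-odds; a small sanity check:
# SHAP contributions straight from xgboost (one column per feature plus "BIAS").
shap_contrib <- predict(xgboost_train, newdata = train_x, predcontrib = TRUE)
head(rowSums(shap_contrib))                                           # per-person log-odds
head(predict(xgboost_train, newdata = train_x, outputmargin = TRUE))  # should match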
16. First, calculate the SHAP rank scores for all variables in the data, and create a variable importance plot using these values. Interpret the plot.
shap_results <- shap.score.rank(xgboost_train,
                                X_train = train_x,
                                shap_approx = FALSE)
## make SHAP score by decreasing order
var_importance(shap_results)
17. Plot the SHAP values for every individual for every feature, and interpret them.
shap_long <- shap.prep(shap = shap_results,
                       X_train = train_x)
plot.shap.summary(shap_long)
xgb.plot.shap(train_x, features = colnames(train_x), model = xgboost_train, n_col = 3)
## The first plot shows, for example, that those with a high value for
## Direct_Bilirubin have a lower probability of being diseased. Also,
## those with a higher age have a lower probability of being diseased,
## while those with a higher Albumin have a higher probability of being diseased.
## The second set of plots displays the marginal relationships between the SHAP
## values and the predictors. This conveys the same information, but in greater
## detail. The interpretation may be a bit tricky for the inexperienced data analyst.
18. Verify which of the models you created in this practical performs best on the test data.
test <- readRDS("data/test_disease.RDS")
bag_test <- predict(bag_train, newdata = test)
rf_test <- predict(rf_train, newdata = test)
gbm_test <- predict(gbm_train, newdata = test)
xgb_test <- predict(xgboost_train, newdata = model.matrix(Disease ~ ., test)[,-1])
xgb_test <- factor(ifelse(xgb_test < 0.5, 1, 2), levels = c(1, 2),
                   labels = c("Healthy", "Disease"))
list(bag_test,
     rf_test,
     gbm_test,
     xgb_test) %>%
  map(~ confusionMatrix(.x, test$Disease))
## [[1]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Disease
## Healthy 48 16
## Disease 6 9
##
## Accuracy : 0.7215
## 95% CI : (0.6093, 0.8165)
## No Information Rate : 0.6835
## P-Value [Acc > NIR] : 0.27598
##
## Kappa : 0.2788
##
## Mcnemar's Test P-Value : 0.05501
##
## Sensitivity : 0.8889
## Specificity : 0.3600
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.6000
## Prevalence : 0.6835
## Detection Rate : 0.6076
## Detection Prevalence : 0.8101
## Balanced Accuracy : 0.6244
##
## 'Positive' Class : Healthy
##
##
## [[2]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Disease
## Healthy 52 18
## Disease 2 7
##
## Accuracy : 0.7468
## 95% CI : (0.6364, 0.838)
## No Information Rate : 0.6835
## P-Value [Acc > NIR] : 0.1374222
##
## Kappa : 0.2934
##
## Mcnemar's Test P-Value : 0.0007962
##
## Sensitivity : 0.9630
## Specificity : 0.2800
## Pos Pred Value : 0.7429
## Neg Pred Value : 0.7778
## Prevalence : 0.6835
## Detection Rate : 0.6582
## Detection Prevalence : 0.8861
## Balanced Accuracy : 0.6215
##
## 'Positive' Class : Healthy
##
##
## [[3]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Disease
## Healthy 54 20
## Disease 0 5
##
## Accuracy : 0.7468
## 95% CI : (0.6364, 0.838)
## No Information Rate : 0.6835
## P-Value [Acc > NIR] : 0.1374
##
## Kappa : 0.2547
##
## Mcnemar's Test P-Value : 2.152e-05
##
## Sensitivity : 1.0000
## Specificity : 0.2000
## Pos Pred Value : 0.7297
## Neg Pred Value : 1.0000
## Prevalence : 0.6835
## Detection Rate : 0.6835
## Detection Prevalence : 0.9367
## Balanced Accuracy : 0.6000
##
## 'Positive' Class : Healthy
##
##
## [[4]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Disease
## Healthy 44 14
## Disease 10 11
##
## Accuracy : 0.6962
## 95% CI : (0.5825, 0.7947)
## No Information Rate : 0.6835
## P-Value [Acc > NIR] : 0.4577
##
## Kappa : 0.2663
##
## Mcnemar's Test P-Value : 0.5403
##
## Sensitivity : 0.8148
## Specificity : 0.4400
## Pos Pred Value : 0.7586
## Neg Pred Value : 0.5238
## Prevalence : 0.6835
## Detection Rate : 0.5570
## Detection Prevalence : 0.7342
## Balanced Accuracy : 0.6274
##
## 'Positive' Class : Healthy
##
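To make the comparison easier to read, you can also pull only the accuracies out of the four confusion matrices; a small convenience sketch (the names below just label the models in the order used above):
# Test-set accuracy per model, extracted from confusionMatrix()'s overall statistics.
list(bagging = bag_test, random_forest = rf_test, boosting = gbm_test, xgboost = xgb_test) %>%
  map_dbl(~ confusionMatrix(.x, test$Disease)$overall["Accuracy"])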
When you have finished the practical, enclose all files of the project (i.e., all .R and/or .Rmd files, including the one with your answers, and the .Rproj file) in a zip file, and hand in the zip here. Do so before next week's lecture.