Today, we will learn how to use different ensemble methods in R, recap how to evaluate the performance of these methods, and learn how we can substantively interpret the model output.
In this practical we will work with the ILPD (Indian Liver Patient Dataset) from the UCI Machine Learning Repository (you can find the data here). This data set contains data on 414 liver disease patients and 165 non-patients. In general, medical researchers have two distinct goals when doing research: (1) to be able to classify people in their waiting room as either patients or non-patients, and (2) to get insight into the factors that are associated with the disease. In this practical, we will look at both aspects.
In this practical, we will use the tidyverse, magrittr, psych, GGally, and caret libraries, along with the gbm, xgboost, data.table, and ggforce packages loaded below.
library(tidyverse)
library(magrittr)
library(psych)
library(caret)
library(gbm)
library(xgboost)
library(data.table)
library(ggforce)
First, we specify a seed and load the training data. We will use this data to make inferences and to train a prediction model.
set.seed(45)
df <- readRDS("data/train_disease.RDS")
1. Get an impression of the data by looking at the structure of the data and creating some descriptive statistics.
head(df)
tail(df)
df %>%
  select(-c(Gender, Disease)) %>%
  describeBy(df$Disease, fast = TRUE)
##
## Descriptive statistics by group
## group: Healthy
## vars n mean sd min max range se
## Age 1 360 45.81 15.60 10.0 90.0 80.0 0.82
## Total_Bilirubin 2 360 4.05 7.15 0.4 75.0 74.6 0.38
## Direct_Bilirubin 3 360 1.86 3.16 0.1 19.7 19.6 0.17
## Alkaline_Phosphotase 4 360 309.47 250.37 63.0 2110.0 2047.0 13.20
## Alamine_Aminotransferase 5 360 97.19 200.17 12.0 1680.0 1668.0 10.55
## Aspartate_Aminotransferase 6 360 133.66 324.59 11.0 4929.0 4918.0 17.11
## Total_Protiens 7 360 6.47 1.11 2.7 9.6 6.9 0.06
## Albumin 8 360 3.07 0.80 0.9 5.5 4.6 0.04
## Ratio_Albumin_Globulin 9 360 0.92 0.32 0.3 2.8 2.5 0.02
## ------------------------------------------------------------
## group: Disease
## vars n mean sd min max range se
## Age 1 140 41.51 16.98 4.00 85.0 81.00 1.43
## Total_Bilirubin 2 140 1.14 1.01 0.50 7.3 6.80 0.09
## Direct_Bilirubin 3 140 0.40 0.53 0.10 3.6 3.50 0.04
## Alkaline_Phosphotase 4 140 220.56 149.76 90.00 1580.0 1490.00 12.66
## Alamine_Aminotransferase 5 140 33.31 24.50 10.00 181.0 171.00 2.07
## Aspartate_Aminotransferase 6 140 39.79 34.79 10.00 285.0 275.00 2.94
## Total_Protiens 7 140 6.53 1.07 3.70 9.2 5.50 0.09
## Albumin 8 140 3.34 0.80 1.40 5.0 3.60 0.07
## Ratio_Albumin_Globulin 9 140 1.04 0.30 0.37 1.9 1.53 0.03
# It is immediately clear that there are substantial differences between
# the diseased and non-diseased groups in the data.
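It can also help to check how unbalanced the outcome is, since the majority-class proportion returns later as the "No Information Rate" in the confusion matrices. A quick (optional) check:
# Class balance of the outcome in the training data.
df %>%
  count(Disease) %>%
  mutate(prop = n / sum(n))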
2. To further explore the data we work with, create some data visualizations that show whether there are interesting patterns in the data.
Hint: Think about adding a color aesthetic for the variable Disease.
df %>%
  select(-Gender) %>%
  pivot_longer(where(is.numeric)) %>%
  ggplot(aes(x = value, col = Disease, fill = Disease)) +
  geom_boxplot(alpha = 0.8) +
  facet_wrap(~name, scales = "free") +
  scale_color_brewer(palette = "Paired") +
  scale_fill_brewer(palette = "Paired") +
  theme_minimal()
df %>%
  select(-Gender) %>%
  pivot_longer(where(is.numeric)) %>%
  ggplot(aes(x = value, col = Disease, fill = Disease)) +
  geom_density(alpha = 0.8) +
  facet_wrap(~name, scales = "free") +
  scale_color_brewer(palette = "Paired") +
  scale_fill_brewer(palette = "Paired") +
  theme_minimal()
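As an optional extra, the GGally package mentioned in the introduction can show pairwise relationships and class overlap in a single figure. A minimal sketch, restricted to a handful of columns to keep the plot readable (the column selection below is purely illustrative):
# Optional: pairwise plots coloured by disease status, using GGally.
library(GGally)
df %>%
  select(Disease, Age, Total_Bilirubin, Albumin, Ratio_Albumin_Globulin) %>%
  ggpairs(mapping = aes(colour = Disease))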
prop.table(table(df$Gender, df$Disease), margin = 1) %>%
  as.data.frame() %>%
  select(Gender = Var1, Disease = Var2, `Relative Frequency` = Freq) %>%
  ggplot(aes(y = `Relative Frequency`, x = Gender, col = Disease, fill = Disease)) +
  geom_col(alpha = 0.8, position = "dodge") +
  scale_fill_brewer(palette = "Paired") +
  scale_color_brewer(palette = "Paired") +
  theme_minimal()
## There are some differences between the distributions for the two
## Disease categories, but the differences do not seem to be dramatic.
## Additionally, the liver disease is relatively more common among
## women than among men.
3. Briefly reflect on the difference between bagging, random forests, and boosting.
## Bagging: fit a classification tree to each of N bootstrap samples of the
##          training data, and aggregate the predictions of all trees (e.g.,
##          by majority vote, or by averaging predicted probabilities).
##          Note: the out-of-bag data can serve as an internal validation set.
## Random forest: as in bagging, classification trees are trained on bootstrap
##          samples of the data. However, at each split the trees may only
##          consider a random subset of the features, which decorrelates them.
## Boosting: we build decision trees sequentially. Given the current model,
##          we fit a (small) tree on the residuals of the current model,
##          rather than on the outcome Y itself.
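To make the difference between bagging and random forests concrete: a random forest that is allowed to consider all predictors at every split is equivalent to bagging. A short, optional sketch with the randomForest package (which is also the engine behind caret's 'rf' method used later); mtry controls how many predictors are sampled at each split.
# Optional illustration (assumes the randomForest package is installed):
# bagging corresponds to mtry = number of predictors; a random forest uses fewer.
library(randomForest)
p      <- ncol(df) - 1                                                   # number of predictors
bagged <- randomForest(Disease ~ ., data = df, mtry = p)                 # bagging
forest <- randomForest(Disease ~ ., data = df, mtry = floor(sqrt(p)))    # random forest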
We are going to apply different machine learning models using the caret package.
4. Apply bagging to the training data to predict the outcome Disease, using the caret library.
Note. We first specify the internal validation settings, like so:
cvcontrol <- trainControl(method = "repeatedcv",
                          number = 10,
                          allowParallel = TRUE)
These settings can be inserted within the train function from the caret package. Make sure to use the treebag method, to specify cvcontrol as the trControl argument, and to set importance = TRUE.
bag_train <- train(Disease ~ .,
                   data = df,
                   method = 'treebag',
                   trControl = cvcontrol,
                   importance = TRUE)
5. Interpret the variable importance measure using the varImp function on the trained model object.
bag_train %>%
  varImp() %>%
  plot()
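If you prefer numbers over a plot, the scores behind this figure can also be extracted from the varImp object (by default caret scales them to run from 0 to 100):
# Numeric importance scores underlying the plot above.
bag_imp <- varImp(bag_train)
bag_imp$importance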
6. Create training set predictions based on the bagged model, and use the confusionMatrix() function from the caret package to assess its performance.
Hint: You will have to create predictions based on the trained model for the training data, and evaluate these against the observed values of the training data.
confusionMatrix(predict(bag_train, type = "raw"),
                df$Disease)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Disease
## Healthy 360 0
## Disease 0 140
##
## Accuracy : 1
## 95% CI : (0.9926, 1)
## No Information Rate : 0.72
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.00
## Specificity : 1.00
## Pos Pred Value : 1.00
## Neg Pred Value : 1.00
## Prevalence : 0.72
## Detection Rate : 0.72
## Detection Prevalence : 0.72
## Balanced Accuracy : 1.00
##
## 'Positive' Class : Healthy
##
## We have achieved perfect training set performance. However, this shows
## nothing more than that we have been able to train the model. We need to
## evaluate our model on data that was not used for training (e.g., through
## cross-validation or a hold-out test set) to get an honest performance estimate.
7. Now ask for the output of the bagged model. Explain why the performance estimates under both approaches differ.
bag_train
## Bagged CART
##
## 500 samples
## 10 predictor
## 2 classes: 'Healthy', 'Disease'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 450, 450, 450, 450, 450, 450, ...
## Resampling results:
##
## Accuracy Kappa
## 0.678 0.1311671
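## The confusion matrix in the previous question evaluates the model on the very
## same data it was trained on, which yields a (deceptively) perfect accuracy.
## The accuracy shown here is instead estimated through 10-fold cross-validation,
## so it reflects performance on held-out data and is therefore much lower.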
We will now follow the same approach, but rather than bagging, we will train a random forest on the training data.
8. Fit a random forest to the training data to predict the outcome Disease, using the caret library.
Note. Use the same cvcontrol settings as in the previous model.
rf_train <- train(Disease ~ .,
                  data = df,
                  method = 'rf',
                  trControl = cvcontrol,
                  importance = TRUE)
9. Again, interpret the variable importance measure using the varImp function on the trained model object. Do you draw the same conclusions as under the bagged model?
rf_train %>%
  varImp() %>%
  plot()
## The random forest model indicates that other variables are more important,
## as compared to the bagged model.
10. Output the model output from the random forest. Are we doing better than with the bagged model?
rf_train
## Random Forest
##
## 500 samples
## 10 predictor
## 2 classes: 'Healthy', 'Disease'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 450, 450, 450, 450, 450, 450, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.704 0.1618056
## 6 0.688 0.1540873
## 10 0.690 0.1719310
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Yes, the most accurate model indicates that we do just slightly better than
## with the bagged model. However, this might well be due to chance.
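Whether such a small difference is meaningful can be inspected more directly with caret's resamples() function, which collects the fold-by-fold results of several train() objects. A short, optional sketch:
# Compare the cross-validated performance of bagging and the random forest.
model_comparison <- resamples(list(bagging = bag_train, random_forest = rf_train))
summary(model_comparison)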
11. Now, fit a boosting model using the caret library to predict disease status.
Hint: Use gradient boosting (the gbm method in caret).
gbm_train <- train(Disease ~ .,
                   data = df,
                   method = "gbm",
                   verbose = FALSE,
                   trControl = cvcontrol)
12. Again, interpret the variable importance measure. You will have to call summary() on the model object you just created. Compare the output to the previously obtained variable importance measures.
summary(gbm_train)
13. Output the model output from our gradient boosting procedure. Are we doing better than with the bagged and random forest model?
gbm_train
## Stochastic Gradient Boosting
##
## 500 samples
## 10 predictor
## 2 classes: 'Healthy', 'Disease'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 450, 450, 450, 450, 450, 450, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.708 0.1117547
## 1 100 0.712 0.1700777
## 1 150 0.708 0.1609721
## 2 50 0.694 0.1084389
## 2 100 0.680 0.1058856
## 2 150 0.674 0.1042949
## 3 50 0.694 0.1252223
## 3 100 0.694 0.1545466
## 3 150 0.688 0.1280794
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100, interaction.depth =
## 1, shrinkage = 0.1 and n.minobsinnode = 10.
# Yes, our best model is doing slightly better than the previous two models.
# However, this might still be random variation.
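If you suspect that boosting could do better with other settings, note that caret only tried a small default grid of hyperparameters here. You can search a larger grid yourself via the tuneGrid argument; the grid below is purely illustrative, not a recommended configuration.
# Optional: a custom tuning grid for gbm (illustrative values only).
gbm_grid <- expand.grid(interaction.depth = 1:3,
                        n.trees = c(50, 100, 150, 200),
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = 10)
gbm_tuned <- train(Disease ~ .,
                   data = df,
                   method = "gbm",
                   verbose = FALSE,
                   trControl = cvcontrol,
                   tuneGrid = gbm_grid)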
For now, we will continue with extreme gradient boosting, although we will use a different procedure.
We will use xgboost to train a binary classification model, and create some visualizations to obtain additional insight into our model. We will create the visualizations using SHAP (SHapley Additive exPlanations) values, which are a measure of the importance of the variables in the model. In fact, SHAP values indicate the influence of each input variable on the predicted probability for each person. Essentially, they give an indication of the difference between the predicted probability with and without that variable, for each person's score.
14. Download the file shap.R from this GitHub repository.
Note. There are multiple ways to do this, of which the simplest is to run the following code.
library(devtools)
source_url("https://github.com/pablo14/shap-values/blob/master/shap.R?raw=TRUE")
Alternatively, you could simply go to the file shap.R and copy-and-paste the code into your current project. However, you could also fork and clone the repository, to make adjustments to the functions that are already created.
15. Specify your model as follows, and use it to create predictions on the training data.
train_x <- model.matrix(Disease ~ ., df)[,-1]
train_y <- as.numeric(df$Disease) - 1
xgboost_train <- xgboost(data = train_x,
                         label = train_y,
                         max.depth = 10,
                         eta = 1,
                         nthread = 4,
                         nrounds = 4,
                         objective = "binary:logistic",
                         verbose = 2)

pred <- tibble(Disease = predict(xgboost_train, newdata = train_x)) %>%
  mutate(Disease = factor(ifelse(Disease < 0.5, 1, 2),
                          labels = c("Healthy", "Disease")))
table(pred$Disease, df$Disease)
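Before turning to the shap.R helper functions, note that xgboost can also return SHAP contributions directly from predict() with predcontrib = TRUE (assuming a reasonably recent xgboost version). For each person, the contributions, including the "BIAS" baseline column, add up to the predicted log-odds; a small sanity check:
# SHAP contributions straight from xgboost (one column per feature plus "BIAS").
shap_contrib <- predict(xgboost_train, newdata = train_x, predcontrib = TRUE)
head(rowSums(shap_contrib))                                           # per-person log-odds
head(predict(xgboost_train, newdata = train_x, outputmargin = TRUE))  # should match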
16. First, calculate the SHAP rank scores for all variables in the data, and create a variable importance plot using these values. Interpret the plot.
shap_results <- shap.score.rank(xgboost_train,
                                X_train = train_x,
                                shap_approx = FALSE)
## make SHAP score by decreasing order
var_importance(shap_results)
17. Plot the SHAP values for every individual for every feature, and interpret them.
shap_long <- shap.prep(shap = shap_results,
                       X_train = train_x)
plot.shap.summary(shap_long)
xgb.plot.shap(train_x, features = colnames(train_x), model = xgboost_train, n_col = 3)
## The first plot shows, for example, that those with a high value for
## Direct_Bilirubin have a lower probability of being diseased. Also,
## those with a higher age have a lower probability of being diseased,
## while those with a higher Albumin have a higher probability of being diseased.
## The second set of plots displays the marginal relationships between the SHAP
## values and the predictors. This conveys the same information, but in greater
## detail. The interpretation may be a bit tricky for the inexperienced data analyst.
18. Verify which of the models you created in this practical performs best on the test data.
test <- readRDS("data/test_disease.RDS")
bag_test <- predict(bag_train, newdata = test)
rf_test <- predict(rf_train, newdata = test)
gbm_test <- predict(gbm_train, newdata = test)
xgb_test <- predict(xgboost_train, newdata = model.matrix(Disease ~ ., test)[,-1])
xgb_test <- factor(ifelse(xgb_test < 0.5, 1, 2), levels = c(1, 2),
                   labels = c("Healthy", "Disease"))
list(bag_test,
     rf_test,
     gbm_test,
     xgb_test) %>%
  map(~ confusionMatrix(.x, test$Disease))
## [[1]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Disease
## Healthy 48 16
## Disease 6 9
##
## Accuracy : 0.7215
## 95% CI : (0.6093, 0.8165)
## No Information Rate : 0.6835
## P-Value [Acc > NIR] : 0.27598
##
## Kappa : 0.2788
##
## Mcnemar's Test P-Value : 0.05501
##
## Sensitivity : 0.8889
## Specificity : 0.3600
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.6000
## Prevalence : 0.6835
## Detection Rate : 0.6076
## Detection Prevalence : 0.8101
## Balanced Accuracy : 0.6244
##
## 'Positive' Class : Healthy
##
##
## [[2]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Disease
## Healthy 52 18
## Disease 2 7
##
## Accuracy : 0.7468
## 95% CI : (0.6364, 0.838)
## No Information Rate : 0.6835
## P-Value [Acc > NIR] : 0.1374222
##
## Kappa : 0.2934
##
## Mcnemar's Test P-Value : 0.0007962
##
## Sensitivity : 0.9630
## Specificity : 0.2800
## Pos Pred Value : 0.7429
## Neg Pred Value : 0.7778
## Prevalence : 0.6835
## Detection Rate : 0.6582
## Detection Prevalence : 0.8861
## Balanced Accuracy : 0.6215
##
## 'Positive' Class : Healthy
##
##
## [[3]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Disease
## Healthy 54 20
## Disease 0 5
##
## Accuracy : 0.7468
## 95% CI : (0.6364, 0.838)
## No Information Rate : 0.6835
## P-Value [Acc > NIR] : 0.1374
##
## Kappa : 0.2547
##
## Mcnemar's Test P-Value : 2.152e-05
##
## Sensitivity : 1.0000
## Specificity : 0.2000
## Pos Pred Value : 0.7297
## Neg Pred Value : 1.0000
## Prevalence : 0.6835
## Detection Rate : 0.6835
## Detection Prevalence : 0.9367
## Balanced Accuracy : 0.6000
##
## 'Positive' Class : Healthy
##
##
## [[4]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Disease
## Healthy 44 14
## Disease 10 11
##
## Accuracy : 0.6962
## 95% CI : (0.5825, 0.7947)
## No Information Rate : 0.6835
## P-Value [Acc > NIR] : 0.4577
##
## Kappa : 0.2663
##
## Mcnemar's Test P-Value : 0.5403
##
## Sensitivity : 0.8148
## Specificity : 0.4400
## Pos Pred Value : 0.7586
## Neg Pred Value : 0.5238
## Prevalence : 0.6835
## Detection Rate : 0.5570
## Detection Prevalence : 0.7342
## Balanced Accuracy : 0.6274
##
## 'Positive' Class : Healthy
##
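To make the comparison easier to read, you can also pull only the accuracies out of the four confusion matrices; a small convenience sketch (the names below just label the models in the order used above):
# Test-set accuracy per model, extracted from confusionMatrix()'s overall statistics.
list(bagging = bag_test, random_forest = rf_test, boosting = gbm_test, xgboost = xgb_test) %>%
  map_dbl(~ confusionMatrix(.x, test$Disease)$overall["Accuracy"])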
When you have finished the practical, enclose all files of the project (i.e., all .R and/or .Rmd files, including the one with your answers, and the .Rproj file) in a zip file, and hand in the zip here. Do so before next week's lecture.