In this practical, you will learn how to handle many variables with regression by using variable selection techniques, and how to tune hyperparameters for these techniques. This practical has been derived from chapter 6 of ISLR.
One of the packages we are going to use is glmnet
. For
this, you will probably need to install.packages("glmnet")
before running the library()
functions.
To get replicable results, it is always wise to set a seed when relying on random processes.
Our goal for today is to use the Hitters
dataset from
the ISLR
package to predict Salary
.
baseball
from the
Hitters
dataset where you remove the baseball players for
which the Salary
is missing. How many baseball players are
left?## [1] 263
baseball_train
(50%),
baseball_valid
(30%), and baseball_test
(20%)
datasets.split <- c(rep("train", 132), rep("valid", 79), rep("test", 52))
baseball <- baseball %>% mutate(split = sample(split))
baseball_train <- baseball %>% filter(split == "train")
baseball_valid <- baseball %>% filter(split == "valid")
baseball_test <- baseball %>% filter(split == "test")
lm_mse()
with as
inputs (1) a formula, (2) a training dataset, and (3) a test dataset
which outputs the mse on the test dataset for predictions from a linear
model.Start like this:
lm_mse <- function(formula, train_data, valid_data) {
y_name <- as.character(formula)[2]
y_true <- valid_data[[y_name]]
lm_fit <- lm(formula, train_data)
y_pred <- predict(lm_fit, newdata = valid_data)
mean((y_true - y_pred)^2)
}
Salary ~ Hits + Runs
, using baseball_train
and
baseball_valid
.## [1] 142631.1
We have pre-programmed a function for you to generate as a character
vector all formulas with a set number of p
variables. You can load the function into your environment by
sourcing the .R
file it is written in:
You can use it like so:
Hitters
dataset. colnames()
may be of
help. Note that Salary
is not a predictor
variable.Salary
and 3 predictors from the Hitters
data. Assign this to a
variable called formulas
. There should be 969 elements in
this vector.## [1] 969
for loop
to find the best set of 3
predictors in the Hitters
dataset based on MSE. Use the
baseball_train
and baseball_valid
datasets.# Initialise a vector we will fill with MSE values
mses <- rep(0, 969)
# loop over all the formulas
for (i in 1:969) {
mses[i] <- lm_mse(as.formula(formulas[i]), baseball_train, baseball_valid)
}
# select the formula with the lowest MSE
best_3_preds <- formulas[which.min(mses)]
# Generate formulas
formulas_1 <- generate_formulas(p = 1, x_vars = x_vars, y_var = "Salary")
formulas_2 <- generate_formulas(p = 2, x_vars = x_vars, y_var = "Salary")
formulas_4 <- generate_formulas(p = 4, x_vars = x_vars, y_var = "Salary")
# Initialise a vector we will fill with MSE values
mses_1 <- rep(0, length(formulas_1))
mses_2 <- rep(0, length(formulas_2))
mses_4 <- rep(0, length(formulas_4))
# loop over all the formulas
for (i in 1:length(formulas_1)) {
mses_1[i] <- lm_mse(as.formula(formulas_1[i]), baseball_train, baseball_valid)
}
for (i in 1:length(formulas_2)) {
mses_2[i] <- lm_mse(as.formula(formulas_2[i]), baseball_train, baseball_valid)
}
for (i in 1:length(formulas_4)) {
mses_4[i] <- lm_mse(as.formula(formulas_4[i]), baseball_train, baseball_valid)
}
# Compare mses
min(mses_1)
min(mses_2)
min(mses)
min(mses_4)
# min(mses_4) is lowest of them all!
# So let's see which model that is
formulas_4[which.min(mses_4)]
## [1] 111606.5
## [1] 102359.1
## [1] 97916.01
## [1] 97461.05
## [1] "Salary ~ Years + CAtBat + CHits + PutOuts"
baseball_test
.# Estimate model and calculate mse
lm_best <- lm(Salary ~ Walks + CAtBat + CHits + CRBI, baseball_train)
mse <- function(y_true, y_pred) mean((y_true - y_pred)^2)
mse(baseball_test$Salary, predict(lm_best, newdata = baseball_test))
## [1] 103973.4
# create a plot
tibble(
y_true = baseball_test$Salary,
y_pred = predict(lm_best, newdata = baseball_test)
) %>%
ggplot(aes(x = y_pred, y = y_true)) +
geom_abline(slope = 1, intercept = 0, lty = 2) +
geom_point() +
theme_minimal()
Through enumerating all possibilities, we have selected the best subset of at most 4 non-interacting predictors for the prediction of baseball salaries. This method works well for few predictors, but the computational cost of enumeration increases quickly to the point where it is infeasible to enumerate all combinations of variables:
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
glmnet
is a package that implements efficient (quick!)
algorithms for LASSO and ridge regression, among other things.
glmnet
. We are
going to perform a linear regression with normal (gaussian) error terms.
What format should our data be in?# We need to input a predictor matrix x and a response (outcome) variable y,
# as well as a family = "gaussian"
Again, we will try to predict baseball salary, this time using all the available variables and using the LASSO penalty to perform subset selection. For this, we first need to generate an input matrix.
x_train
looks like
what you would expect.## (Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat
## -Alan Ashby 1 315 81 7 24 38 39 14 3449
## -Andre Dawson 1 496 141 20 65 78 37 11 5628
## -Andres Galarraga 1 321 87 10 39 42 30 2 396
## -Alfredo Griffin 1 594 169 4 74 51 35 11 4408
## -Al Newman 1 185 37 1 23 8 21 2 214
## -Argenis Salazar 1 298 73 0 24 24 7 3 509
## CHits CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts
## -Alan Ashby 835 69 321 414 375 1 1 632
## -Andre Dawson 1575 225 828 838 354 1 0 200
## -Andres Galarraga 101 12 48 46 33 1 0 805
## -Alfredo Griffin 1133 19 501 336 194 0 1 282
## -Al Newman 42 1 30 9 24 1 0 76
## -Argenis Salazar 108 0 41 37 12 0 1 121
## Assists Errors NewLeagueN
## -Alan Ashby 43 10 1
## -Andre Dawson 11 3 1
## -Andres Galarraga 40 4 1
## -Alfredo Griffin 421 25 0
## -Al Newman 127 7 0
## -Argenis Salazar 283 9 0
The model.matrix()
function takes a dataset and a
formula and outputs the predictor matrix where the categorical variables
have been correctly transformed into dummy variables, and it adds an
intercept. It is used internally by the lm()
function as
well!
glmnet()
, perform a LASSO regression with
the generated x_train
as the predictor matrix and
Salary
as the response variable. Set the
lambda
parameter of the penalty to 15. NB: Remove the
intercept column from the x_matrix
– glmnet
adds an intercept internally.result <- glmnet(x = x_train[, -1], # X matrix without intercept
y = baseball_train$Salary, # Salary as response
family = "gaussian", # Normally distributed errors
alpha = 1, # LASSO penalty
lambda = 15) # Penalty value
beta
element of the list generated by the
glmnet()
function. Which variables have been selected? You
may use the coef()
function.## [1] "(Intercept)" "Hits" "Runs" "Walks" "CHmRun"
## [6] "CRuns" "CRBI" "DivisionW" "PutOuts" "Assists"
## [11] "NewLeagueN"
baseball_valid
data. Use the
predict()
function for this! What is the MSE on the
validation set?x_valid <- model.matrix(Salary ~ ., data = baseball_valid %>% select(-split))[, -1]
y_pred <- as.numeric(predict(result, newx = x_valid))
tibble(Predicted = y_pred, Observed = baseball_valid$Salary) %>%
ggplot(aes(x = Predicted, y = Observed)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, lty = 2) +
theme_minimal() +
labs(title = "Predicted versus observed salary")
## [1] 131955.2
Like many methods of analysis, regularised regression has a
tuning parameter. In the previous section, we’ve set this
parameter to 15. The lambda
parameter changes the strength
of the shrinkage in glmnet()
. Changing the tuning parameter
will change the predictions, and thus the MSE. In this section, we will
select the tuning parameter based on out-of-sample MSE.
lambda
value. What is different
about the object that is generated? Hint: use the coef()
and plot()
methods on the resulting object.result_nolambda <- glmnet(x = x_train[, -1], y = baseball_train$Salary,
family = "gaussian", alpha = 1)
# This object contains sets of coefficients for different values of lambda,
# i.e., different models ranging from an intercept-only model (very high
# lambda) to almost no shrinkage (very low lambda).
plot(result_nolambda)
For deciding which value of lambda to choose, we could work similarly
to what we have don in the best subset selection section before.
However, the glmnet
package includes another method for
this task: cross validation.
cv.glmnet
function to determine the
lambda
value for which the out-of-sample MSE is lowest
using 15-fold cross validation. As your dataset, you may use the
training and validation sets bound together with bind_rows(). What is
the best lambda value?x_cv <- model.matrix(Salary ~ ., bind_rows(baseball_train, baseball_valid)[, -21])[, -1]
result_cv <- cv.glmnet(x = x_cv, y = c(baseball_train$Salary, baseball_valid$Salary), nfolds = 15)
best_lambda <- result_cv$lambda.min
best_lambda
## [1] 1.774744
# the MSE is high with very small values of lambda (no shrinkage) and
# with very large values of lambda (intercept-only model).
# introducing a bit of bias lowers the variance relatively strongly
# (fewer variables in the model) and therefore the MSE is reduced.
predict()
method directly on the object
you just created to predict new salaries for the baseball players in the
baseball_test
dataset using the best lambda value you just
created (hint: you need to use the s
argument, look at
?predict.cv.glmnet
for help). Create another
predicted-observed scatter plot.x_test <- model.matrix(Salary ~ ., data = baseball_test %>% select(-split))[, -1]
y_pred <- as.numeric(predict(result_cv, newx = x_test, s = best_lambda))
tibble(Predicted = y_pred, Observed = baseball_test$Salary) %>%
ggplot(aes(x = Predicted, y = Observed)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, lty = 2) +
theme_minimal() +
labs(title = "Predicted versus observed salary: LASSO with cv tuning")
## [1] 107147.6
baseball_train
and
baseball_valid
# create this new training dataset
train_data <- bind_rows(baseball_train, baseball_valid)[, -21]
# generate predictions from the models
y_pred_ols <- predict(lm(Salary ~ ., data = train_data), newdata = baseball_test)
y_pred_sub <- predict(lm(Salary ~ Runs + CHits + Division + PutOuts, data = train_data),
newdata = baseball_test)
# these two use x_cv and x_test from the previous exercises
y_pred_las <- as.numeric(predict(glmnet(x_cv, train_data$Salary, lambda = 50), newx = x_test))
y_pred_cv <- as.numeric(predict(result_cv, newx = x_test, s = best_lambda))
# Calculate MSEs
mses <- c(
mse(baseball_test$Salary, y_pred_ols),
mse(baseball_test$Salary, y_pred_sub),
mse(baseball_test$Salary, y_pred_las),
mse(baseball_test$Salary, y_pred_cv)
)
# Create a plot
tibble(Method = as_factor(c("lm", "subset", "lasso", "cv_las")), MSE = mses) %>%
ggplot(aes(x = Method, y = MSE, fill = Method)) +
geom_bar(stat = "identity", col = "black") +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "Comparison of test set MSE for different prediction methods") +
scale_fill_viridis_d() # different colour scale
When you have finished the practical,
enclose all files of the project (i.e. all .R
and/or
.Rmd
files including the one with your answers, and the
.Rproj
file) in a zip file, and
hand in the zip here. Do so before next week’s lecture.