In this practical, you will learn how to perform regression analysis, how to plot with confidence and prediction intervals, how to calculate MSE, perform train-test splits, and write a function for cross validation.
Just like in the practical at the end of chapter 3 of the ISLR book, we will use the `Boston` dataset, which is in the `MASS` package that comes with `R`.
## Regression in R
Regression is performed through the `lm()` function. It requires two arguments: a `formula` and `data`. A formula is a specific type of object that can be constructed like so:
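```r
# the object name `some_formula` is a placeholder; any valid name works
some_formula <- outcome ~ predictor_1 + predictor_2
```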
You can read it as “the outcome variable is a function of predictors 1 and 2”. As with other objects, you can check its class and even convert it to other classes, such as a character vector:

```r
class(some_formula)
## [1] "formula"

as.character(some_formula)
## [1] "~"                         "outcome"
## [3] "predictor_1 + predictor_2"
```
You can estimate a linear model using `lm()` by specifying the outcome variable and the predictors in a formula and by inputting the dataset these variables should be taken from.

1. Create a linear model object called `lm_ses` using the formula `medv ~ lstat` and the `Boston` dataset.

You have now trained a regression model with `medv` (housing value) as the outcome/dependent variable and `lstat` (socio-economic status) as the predictor / independent variable.
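For reference, fitting this model could look as follows (a minimal sketch of one possible answer; it assumes the `MASS` package, which contains the `Boston` data, is loaded):

```r
library(MASS)  # provides the Boston dataset

lm_ses <- lm(formula = medv ~ lstat, data = Boston)
```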
Remember that a regression estimates \(\beta_0\) (the intercept) and \(\beta_1\) (the slope) in the following equation:
\[\boldsymbol{y} = \beta_0 + \beta_1\cdot \boldsymbol{x}_1 + \boldsymbol{\epsilon}\]
2. Use the function `coef()` to extract the intercept and slope from the `lm_ses` object. Interpret the slope coefficient.

3. Use `summary()` to get a summary of the `lm_ses` object. What do you see? You can use the help file `?summary.lm`.
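For example:

```r
coef(lm_ses)     # named vector with the intercept and the lstat slope
summary(lm_ses)  # coefficients, standard errors, R-squared, and more
```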
We now have a model object `lm_ses` that represents the formula
\[\text{medv}_i = 34.55 - 0.95 \cdot \text{lstat}_i + \epsilon_i\]
With this object, we can predict a new `medv` value by inputting its `lstat` value. The `predict()` method enables us to do this for the `lstat` values in the original dataset.
4. Save the predicted y values to a variable called `y_pred`.
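One way to do this (a sketch):

```r
y_pred <- predict(lm_ses)  # predictions for the lstat values in Boston
```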
5. Create a scatter plot with `y_pred` mapped to the x position and the true y value (`Boston$medv`) mapped to the y position. What do you see? What would this plot look like if the fit were perfect?

We can also generate predictions from new data using the `newdata` argument in the `predict()` method. For that, we need to prepare a data frame with new values for the original predictors.
6. Use the `seq()` function to generate a sequence of 1000 equally spaced values from 0 to 40. Store this vector in a data frame (using `data.frame()` or `tibble()`) with `lstat` as the column name. Name the data frame `pred_dat`.
7. Use the newly created data frame as the `newdata` argument to a `predict()` call for `lm_ses`. Store it in a variable named `y_pred_new`.
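Together, these two steps could look like this (a sketch of one possible answer; `tibble()` assumes the tidyverse is loaded):

```r
library(tidyverse)

pred_dat   <- tibble(lstat = seq(0, 40, length.out = 1000))
y_pred_new <- predict(lm_ses, newdata = pred_dat)
```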
## Plotting lm() in ggplot
A good way of understanding your model is by visualising it. We are going to walk through the construction of a plot with a fit line and prediction / confidence intervals from an `lm` object.
8. Create a scatter plot from the `Boston` dataset with `lstat` mapped to the x position and `medv` mapped to the y position. Store the plot in an object called `p_scatter`.

Now we’re going to add a prediction line to this plot.
9. Add the vector `y_pred_new` to the `pred_dat` data frame with the name `medv`.
10. Add a `geom_line()` to `p_scatter`, with `pred_dat` as the `data` argument. What does this line represent?

11. The `interval` argument can be used to generate confidence or prediction intervals. Create a new object called `y_pred_95` using `predict()` (again with the `pred_dat` data) with the `interval` argument set to “confidence”. What is in this object?

12. Create a data frame with 4 columns: `medv`, `lstat`, `lower`, and `upper`.
13. Add a `geom_ribbon()` to the plot with the data frame you just made. The ribbon geom requires three aesthetics: `x` (`lstat`, already mapped), `ymin` (`lower`), and `ymax` (`upper`). Add the ribbon below the `geom_line()` and the `geom_point()` from before to make sure those remain visible. Give it a nice colour and clean up the plot, too!
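Putting these steps together, the final plot could be built like this (a sketch of one possible answer; the object name `gg_pred` is an arbitrary choice):

```r
# predictions with a 95% confidence interval
y_pred_95 <- predict(lm_ses, newdata = pred_dat, interval = "confidence")

gg_pred <- tibble(
  lstat = pred_dat$lstat,
  medv  = y_pred_95[, "fit"],
  lower = y_pred_95[, "lwr"],
  upper = y_pred_95[, "upr"]
)

ggplot(gg_pred, aes(x = lstat, y = medv)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "lightblue") + # ribbon first, so it stays below
  geom_point(data = Boston, colour = "grey40") +                     # the observations
  geom_line(colour = "darkblue") +                                   # the regression line
  theme_minimal()
```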
14. Write a function called `mse()` that takes in two vectors (true y values and predicted y values) and outputs the mean square error.

Start like so:
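```r
# skeleton to start from; fill in the body yourself
mse <- function(y_true, y_pred) {
  # your formula here
}
```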
Wikipedia may help for the formula.
15. Make sure your `mse()` function works correctly by running the following code.

```r
mse(1:10, 10:1)
## [1] 33
```
You have now calculated the mean squared length of the dashed lines below.
16. Calculate the mean square error of the `lm_ses` model. Use the `medv` column as `y_true` and use the `predict()` method to generate `y_pred`.
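A sketch of this calculation, using the function you just wrote:

```r
mse(y_true = Boston$medv, y_pred = predict(lm_ses))
```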
You have calculated the mean squared length of the dashed lines in the plot below.
Now we will use the `sample()` function to randomly select observations from the `Boston` dataset to go into a training, test, and validation set. The training set will be used to fit our model, the validation set will be used to calculate the out-of-sample prediction error during model building, and the test set will be used to estimate the true out-of-sample MSE.
17. The `Boston` dataset has 506 observations. Use `c()` and `rep()` to create a vector with 253 times the word “train”, 152 times the word “validation”, and 101 times the word “test”. Call this vector `splits`.
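This vector could be created like so (a sketch):

```r
splits <- c(rep("train", 253), rep("validation", 152), rep("test", 101))
```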
18. Use the function `sample()` to randomly order this vector and add it to the `Boston` dataset using `mutate()`. Assign the newly created dataset to a variable
called `boston_master`.

19. Now use `filter()` to create a training, validation, and test set from the `boston_master` data. Call these datasets `boston_train`, `boston_valid`, and `boston_test`.
We will set aside the `boston_test` dataset for now.
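A sketch of these two steps (one possible answer, assuming the tidyverse is loaded):

```r
boston_master <- Boston %>% mutate(splits = sample(splits))

boston_train <- boston_master %>% filter(splits == "train")
boston_valid <- boston_master %>% filter(splits == "validation")
boston_test  <- boston_master %>% filter(splits == "test")
```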
20. Train a linear regression model called `model_1` using the training dataset. Use the formula `medv ~ lstat` like in the first `lm()` exercise. Use `summary()` to check that this object is as you expect.

21. Calculate the MSE with this object. Save this value as `model_1_mse_train`.
22. Now calculate the MSE on the validation set and assign it to a variable called `model_1_mse_valid`. Hint: use the `newdata` argument in `predict()`.

This is the estimated out-of-sample mean squared error.
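These three steps could be sketched as follows (one possible answer, reusing the `mse()` function from before):

```r
model_1 <- lm(medv ~ lstat, data = boston_train)

model_1_mse_train <- mse(boston_train$medv, predict(model_1))
model_1_mse_valid <- mse(boston_valid$medv, predict(model_1, newdata = boston_valid))
```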
23. Create a second model `model_2` for the train data which includes `age` and `tax` as predictors. Calculate the train and validation MSE.

This is an advanced exercise. Some components we have seen before in this and previous practicals, but some things will be completely new. Try to complete it by yourself, but don’t worry if you get stuck. If you don’t know about for loops in `R`, read up on those before you start the exercise.
Use help in this order:
You may also just read the answer and try to understand what happens in each step.
24. Write a function that performs k-fold cross-validation for linear models.

Inputs:

- `formula`: a formula just as in the `lm()` function
- `dataset`: a data frame
- `k`: the number of folds for cross validation

Outputs:

- the mean squared error of the model, averaged over the `k` validation folds
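A minimal sketch of such a function (one possible implementation; it reuses the `mse()` function from before, and the name `cv_lm` is an arbitrary choice):

```r
cv_lm <- function(formula, dataset, k) {
  n <- nrow(dataset)

  # randomly assign each row to one of k folds
  fold <- sample(rep(1:k, length.out = n))

  # name of the outcome variable, e.g. "medv" for medv ~ lstat
  y_name <- as.character(formula)[2]

  mses <- numeric(k)
  for (i in 1:k) {
    train <- dataset[fold != i, ]
    valid <- dataset[fold == i, ]
    fit   <- lm(formula, data = train)
    mses[i] <- mse(valid[[y_name]], predict(fit, newdata = valid))
  }

  # cross-validated estimate: average MSE over the folds
  mean(mses)
}
```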
25. Use your function to perform cross-validation with a linear model with formula `medv ~ lstat + age + tax`. Compare it to a model with formula `medv ~ lstat + I(lstat^2) + age + tax`.
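Usage could then look like this (the choice of 10 folds here is arbitrary):

```r
cv_lm(formula = medv ~ lstat + age + tax, dataset = Boston, k = 10)
cv_lm(formula = medv ~ lstat + I(lstat^2) + age + tax, dataset = Boston, k = 10)
```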
When you have finished the practical, enclose all files of the project (i.e. all `.R` and/or `.Rmd` files, including the one with your answers, and the `.Rproj` file) in a zip file, and hand in the zip here. Do so before next week’s lecture.