In this practical, we will learn about nonlinear extensions to regression using basis functions and how to create, visualise, and interpret them. Parts of it are adapted from the practicals in ISLR chapter 7.
One of the packages we are going to use is splines, which is included in the standard R installation. The other packages (e.g. ISLR, tidyverse, and cowplot) may need an install.packages() call before running the library() functions below.
library(MASS)
library(splines)
library(ISLR)
library(tidyverse)
First, we specify a seed as usual.
set.seed(45)
Median housing prices in Boston (medv in the Boston dataset) do not have a linear relation with the proportion of low-SES households (lstat). Today we are going to focus exclusively on prediction.
Boston %>%
ggplot(aes(x = lstat, y = medv)) +
geom_point() +
theme_minimal()
First, we need a way of visualising the predictions. Create a function called pred_plot() that takes as input an lm object and outputs the above plot, but with a prediction line generated from the model object using the predict() method.
pred_plot <- function(model) {
# First create predictions for all values of lstat
x_pred <- seq(min(Boston$lstat), max(Boston$lstat), length.out = 500)
y_pred <- predict(model, newdata = tibble(lstat = x_pred))
# Then create a ggplot object with a line based on those predictions
Boston %>%
ggplot(aes(x = lstat, y = medv)) +
geom_point() +
geom_line(data = tibble(lstat = x_pred, medv = y_pred), linewidth = 1, col = "blue") +
theme_minimal()
}
Create a linear regression model called lin_mod which models medv as a function of lstat. Check whether your prediction plot works by running pred_plot(lin_mod). Do you see anything out of the ordinary with the predictions?
lin_mod <- lm(medv ~ lstat, data = Boston)
pred_plot(lin_mod)
# the predicted median housing value drops below 0 for high values of lstat.
The first extension to linear regression is polynomial regression, with basis functions \(b_j(x_i) = x_i^j\) (ISLR, p. 270).
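To make the basis functions concrete, here is a minimal sketch (not part of the original practical; x and X_poly are just illustrative names) that builds the first three basis columns by hand:
# the cubic polynomial basis b_j(x) = x^j, constructed manually
x <- Boston$lstat
X_poly <- cbind(b1 = x, b2 = x^2, b3 = x^3)
head(X_poly)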
Create a third-degree polynomial regression model called pn3_mod, where you add the second and third-degree polynomial terms I(lstat^2) and I(lstat^3) to the formula. Create a pred_plot() with this model.
pn3_mod <- lm(medv ~ lstat + I(lstat^2) + I(lstat^3), data = Boston)
pred_plot(pn3_mod)
The function poly() can automatically generate a matrix which contains columns with polynomial basis function outputs. What matrix does poly() return when you input the vector 1:5 with degree = 3 and raw = TRUE?
poly(1:5, degree = 3, raw = TRUE)
## 1 2 3
## [1,] 1 1 1
## [2,] 2 4 8
## [3,] 3 9 27
## [4,] 4 16 64
## [5,] 5 25 125
## attr(,"degree")
## [1] 1 2 3
## attr(,"class")
## [1] "poly" "matrix"
# these are the original column (1:5), then squared, and then cubed.
Use the poly() function directly inside the model formula to create a third-degree polynomial regression predicting medv using lstat. Compare the prediction plot to the previous prediction plot you made. What happens if you change the poly() function to raw = FALSE?
pn3_mod2 <- lm(medv ~ poly(lstat, 3, raw = TRUE), data = Boston)
pred_plot(pn3_mod2)
# The plot is exactly the same as the previous prediction plot
# By default, with raw = FALSE, poly() computes an orthogonal polynomial.
# This does not change the fitted values, but it lets you see whether a certain
# order in the polynomial significantly improves the regression over the lower
# orders. You can check this with summary(pn3_mod2).
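If you want to convince yourself that the fitted values are indeed identical under both bases, a quick sketch (pn3_raw and pn3_orth are illustrative names):
# the raw and orthogonal bases span the same space, so the
# fitted values agree up to numerical precision
pn3_raw  <- lm(medv ~ poly(lstat, 3, raw = TRUE),  data = Boston)
pn3_orth <- lm(medv ~ poly(lstat, 3, raw = FALSE), data = Boston)
all.equal(unname(fitted(pn3_raw)), unname(fitted(pn3_orth)))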
Another basis function we can use is a step function. For example, we
can split the lstat
variable into two groups based on its
median and take the average of these groups to predict
medv
.
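These group means can also be computed directly; a short sketch of what the step-function model will estimate (low_lstat is an illustrative name):
# the step-function predictions are just the per-group means of medv
Boston %>%
  group_by(low_lstat = lstat <= median(lstat)) %>%
  summarise(mean_medv = mean(medv))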
Create a model called pw2_mod with one predictor: I(lstat <= median(lstat)). Create a pred_plot with this model. Use the coefficients in coef(pw2_mod) to find out what the predicted value for a low-lstat neighbourhood is.
pw2_mod <- lm(medv ~ I(lstat <= median(lstat)), data = Boston)
pred_plot(pw2_mod)
coef(pw2_mod)
## (Intercept) I(lstat <= median(lstat))TRUE
## 16.67747 11.71067
# the predicted value for low-lstat neighbourhoods is 16.68 + 11.71 = 28.39
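Equivalently, you can let R do the arithmetic:
# intercept (the high-lstat group mean) plus the TRUE dummy coefficient
sum(coef(pw2_mod))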
Use the cut() function in the formula to generate a piecewise regression model called pw5_mod that contains 5 equally spaced sections. Again, plot the result using pred_plot.
pw5_mod <- lm(medv ~ cut(lstat, 5), data = Boston)
pred_plot(pw5_mod)
Note that the sections generated by cut()
are equally
spaced in terms of lstat
, but they do not have equal
amounts of data. In fact, the last section has only 9 data points to
work with:
table(cut(Boston$lstat, 5))
##
## (1.69,8.98] (8.98,16.2] (16.2,23.5] (23.5,30.7] (30.7,38]
## 183 183 94 37 9
Create a model called pwq_mod where the sections are not equally spaced, but have equal amounts of training data. Hint: use the quantile() function.
brks <- c(-Inf, quantile(Boston$lstat, probs = c(.2, .4, .6, .8)), Inf)
pwq_mod <- lm(medv ~ cut(lstat, brks), data = Boston)
pred_plot(pwq_mod)
table(cut(Boston$lstat, brks))
##
## (-Inf,6.29] (6.29,9.53] (9.53,13.3] (13.3,18.1] (18.1, Inf]
## 102 101 101 101 101
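As an aside, instead of padding with -Inf and Inf you could pass all six observed quantiles (0% through 100%) to cut() with include.lowest = TRUE, so the minimum value is not dropped from the first interval; a sketch:
brks2 <- quantile(Boston$lstat, probs = seq(0, 1, 0.2))
table(cut(Boston$lstat, brks2, include.lowest = TRUE))
The -Inf/Inf version used above has the advantage that new observations outside the training range still fall into a section.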
Combining piecewise regression with polynomial regression, we can write a function that creates a matrix based on a piecewise cubic basis function:
To understand what this function does exactly, you can create the vec and knots variables yourself, for example vec <- 1:20 and knots <- 2, and try out the lines of the function separately.
piecewise_cubic_basis <- function(vec, knots = 1) {
# If there is only one section, just return the 3rd order polynomial
if (knots == 0) return(poly(vec, degree = 3, raw = TRUE))
# cut the vector
cut_vec <- cut(vec, breaks = knots + 1)
# initialise a matrix for the piecewise polynomial
out <- matrix(nrow = length(vec), ncol = 0)
# loop over the levels of the cut vector
for (lvl in levels(cut_vec)) {
# temporary vector
tmp <- vec
# set all values to 0 except the current section
tmp[cut_vec != lvl] <- 0
# add the polynomial based on this vector to the matrix
out <- cbind(out, poly(tmp, degree = 3, raw = TRUE))
}
# return the piecewise polynomial matrix
out
}
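To see what the function produces, you can run it on a small vector as suggested above:
vec <- 1:20
knots <- 2
# 2 knots -> 3 sections, each contributing 3 polynomial columns
dim(piecewise_cubic_basis(vec, knots))  # 20 rows, 9 columns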
Create piecewise cubic models with 1, 2, and 3 knots (pc1_mod - pc3_mod) using this piecewise cubic basis function. Compare them using the pred_plot() function.
pc1_mod <- lm(medv ~ piecewise_cubic_basis(lstat, 1), data = Boston)
pc2_mod <- lm(medv ~ piecewise_cubic_basis(lstat, 2), data = Boston)
pc3_mod <- lm(medv ~ piecewise_cubic_basis(lstat, 3), data = Boston)
pred_plot(pc1_mod)
pred_plot(pc2_mod)
pred_plot(pc3_mod)
We’re now going to take out the discontinuities from the piecewise cubic models by creating splines. First, we will manually create a cubic spline with 1 knot at the median by constructing a truncated power basis as per ISLR page 273, equation 7.10.
Create a new dataset called boston_tpb with the columns medv and lstat from the Boston dataset.
boston_tpb <- Boston %>% as_tibble() %>% select(medv, lstat)
Use mutate to add squared and cubed versions of the lstat variable to this dataset.
boston_tpb <- boston_tpb %>% mutate(lstat2 = lstat^2, lstat3 = lstat^3)
Use mutate to add a column lstat_tpb to this dataset which is 0 below the median and has value (lstat - median(lstat))^3 above the median. Tip: you may want to use ifelse() within your mutate() call.
boston_tpb <- boston_tpb %>%
mutate(lstat_tpb = ifelse(lstat > median(lstat), (lstat - median(lstat))^3, 0))
Now we have created a complete truncated power basis for a cubic spline fit.
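As a quick sanity check (optional, not part of the basis construction itself), the truncated term should be exactly zero at or below the knot:
boston_tpb %>%
  summarise(all_zero_at_knot = all(lstat_tpb[lstat <= median(lstat)] == 0))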
Now, create a linear model tpb_mod using the lm() function. How many predictors are in the model? How many degrees of freedom does this model have?
tpb_mod <- lm(medv ~ lstat + lstat2 + lstat3 + lstat_tpb, data = boston_tpb)
summary(tpb_mod)
##
## Call:
## lm(formula = medv ~ lstat + lstat2 + lstat3 + lstat_tpb, data = boston_tpb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5106 -3.0547 -0.7488 2.1178 27.1383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.514246 3.054303 21.450 < 2e-16 ***
## lstat -10.324275 1.089886 -9.473 < 2e-16 ***
## lstat2 0.856576 0.116108 7.377 6.75e-13 ***
## lstat3 -0.025484 0.003810 -6.688 6.05e-11 ***
## lstat_tpb 0.026582 0.004291 6.194 1.22e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.206 on 501 degrees of freedom
## Multiple R-squared: 0.6822, Adjusted R-squared: 0.6796
## F-statistic: 268.8 on 4 and 501 DF, p-value: < 2.2e-16
# This model has 4 predictors plus an intercept: 5 estimated parameters in
# total, matching the 4 + K = 5 degrees of freedom of a cubic spline with
# K = 1 knot (see ISLR page 273).
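You can count the estimated parameters directly:
# intercept + 4 basis terms = 5 parameters
length(coef(tpb_mod))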
The bs() function from the splines package does, in a single function call, all the work we have just done by hand.
Create a cubic spline model bs1_mod with a knot at the median using the bs() function. Compare its predictions to those of the tpb_mod using the predict() function on both models.
bs1_mod <- lm(medv ~ bs(lstat, knots = median(lstat)), data = Boston)
summary(bs1_mod)
##
## Call:
## lm(formula = medv ~ bs(lstat, knots = median(lstat)), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5106 -3.0547 -0.7488 2.1178 27.1383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.085 1.549 32.34 <2e-16 ***
## bs(lstat, knots = median(lstat))1 -24.362 2.331 -10.45 <2e-16 ***
## bs(lstat, knots = median(lstat))2 -31.781 1.927 -16.49 <2e-16 ***
## bs(lstat, knots = median(lstat))3 -44.421 3.488 -12.73 <2e-16 ***
## bs(lstat, knots = median(lstat))4 -35.831 3.092 -11.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.206 on 501 degrees of freedom
## Multiple R-squared: 0.6822, Adjusted R-squared: 0.6796
## F-statistic: 268.8 on 4 and 501 DF, p-value: < 2.2e-16
# Comparing the predictions from the two models: negligible absolute difference
mean(abs(predict(bs1_mod) - predict(tpb_mod)))
## [1] 2.940854e-13
Create a prediction plot from the bs1_mod object using the pred_plot() function.
pred_plot(bs1_mod)
Note that this line fits very well, but at the right end of the plot the curve slopes up. Theoretically, this is unexpected: always pay attention to which predictions you are making and whether that behaviour is in line with your expectations.
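If you want to see this behaviour explicitly, you could predict just beyond the observed range; the values below are arbitrary, and bs() will warn that x values beyond the boundary knots may give an ill-conditioned basis:
# the largest observed lstat is about 38, so 41 lies outside the range
predict(bs1_mod, newdata = tibble(lstat = c(35, 38, 41)))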
The last extension we will look at is the natural spline. This works in the same way as the cubic spline, but with the additional constraint that the function is required to be linear at the boundaries. The ns() function from the splines package generates the basis representation for a natural spline.
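Like bs() and poly(), ns() simply returns a basis matrix with one column per degree of freedom; for example:
# one row per observation in Boston (506), one column per df (3)
dim(ns(Boston$lstat, df = 3))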
Create a natural spline model (ns3_mod) with 3 degrees of freedom using the ns() function. Plot it, and compare it to the bs1_mod.
ns3_mod <- lm(medv ~ ns(lstat, df = 3), data = Boston)
pred_plot(ns3_mod)
# Still a good fit but with linear ends.
Plot lin_mod, pn3_mod, pw5_mod, pc3_mod, bs1_mod, and ns3_mod and give them nice titles by adding + ggtitle("My title") to the plot. You may use the function plot_grid() from the package cowplot to put your plots in a grid.
library(cowplot)
plot_grid(
pred_plot(lin_mod) + ggtitle("Linear regression"),
pred_plot(pn3_mod) + ggtitle("Polynomial"),
pred_plot(pw5_mod) + ggtitle("Piecewise constant"),
pred_plot(pc3_mod) + ggtitle("Piecewise cubic"),
pred_plot(bs1_mod) + ggtitle("Cubic spline"),
pred_plot(ns3_mod) + ggtitle("Natural spline")
)
Use 12-fold cross-validation to determine which of the 6 methods (lin, pn3, pw5, pc3, bs1, and ns3) has the lowest out-of-sample MSE.
# first create an mse function
mse <- function(y_true, y_pred) mean((y_true - y_pred)^2)
# add a split column (12 folds) to the Boston dataset so we can cross-validate
boston_cv <- Boston %>% mutate(split = sample(rep(1:12, length.out = nrow(Boston))))
# prepare an output matrix with 12 slots per method for mse values
output_matrix <- matrix(nrow = 12, ncol = 6)
colnames(output_matrix) <- c("lin", "pn3", "pw5", "pc3", "bs1", "ns3")
# loop over the splits, run each method, and return the mse values
for (i in 1:12) {
train <- boston_cv %>% filter(split != i)
test <- boston_cv %>% filter(split == i)
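# use fixed breaks spanning -Inf to Inf, so every test observation
# falls inside a known section regardless of the training range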
brks <- c(-Inf, 7, 15, 22, Inf)
lin_mod <- lm(medv ~ lstat, data = train)
pn3_mod <- lm(medv ~ poly(lstat, 3), data = train)
pw5_mod <- lm(medv ~ cut(lstat, brks), data = train)
pc3_mod <- lm(medv ~ piecewise_cubic_basis(lstat, 3), data = train)
bs1_mod <- lm(medv ~ bs(lstat, knots = median(lstat)), data = train)
ns3_mod <- lm(medv ~ ns(lstat, df = 3), data = train)
output_matrix[i, ] <- c(
mse(test$medv, predict(lin_mod, newdata = test)),
mse(test$medv, predict(pn3_mod, newdata = test)),
mse(test$medv, predict(pw5_mod, newdata = test)),
mse(test$medv, predict(pc3_mod, newdata = test)),
mse(test$medv, predict(bs1_mod, newdata = test)),
mse(test$medv, predict(ns3_mod, newdata = test))
)
}
# this is the comparison of the methods
colMeans(output_matrix)
## lin pn3 pw5 pc3 bs1 ns3
## 38.76726 29.36707 37.95038 29.22791 27.51021 28.68858
# we can show it graphically too
tibble(names = as_factor(colnames(output_matrix)),
mse = colMeans(output_matrix)) %>%
ggplot(aes(x = names, y = mse, fill = names)) +
geom_bar(stat = "identity") +
theme_minimal() +
scale_fill_viridis_d(guide = "none") +
labs(
x = "Method",
y = "Mean squared error",
title = "Comparing regression method prediction performance"
)
When you have finished the practical, enclose all files of the project (i.e. all .R and/or .Rmd files, including the one with your answers, and the .Rproj file) in a zip file, and hand in the zip here. Do so before next week's lecture.