In this practical, we will build upon the topics of the last two
weeks, and deepen your understanding of various data wrangling and
visualization techniques, using the tidyverse
functionalities. Additionally, we briefly introduce the concept of
missing data and how to visualize relations between the data and
missingness.
Don’t forget to open a project file called
03_Exploratory.Rproj
and to create your .Rmd
or .R
file to work in.
boys
dataToday, we will work with the boys
data from the
R
package mice
, containing information from a
sample of 748 boys. Information about this data can be obtained by
running the code ?mice::boys
.
When working with a new data set, it is always a good idea to familiarize yourself with the structure of the data.
It seems like the data is sorted on the variable age
. To
verify this, we can test our intuition explicitly.
## [1] TRUE
Indeed, the data is sorted on age
. In R
,
there is no is.sorted
function, but we can test whether the
data is sorted by testing whether the data is not unsorted.
Additionally, you saw that the data contains 9 variables (age, hgt, wgt, bmi, hc, gen, phb, tv, reg), and that the data suffers from missingness. To get a further understanding of the data, we can ask for a summary of all variables.
## age hgt wgt bmi
## Min. : 0.035 Min. : 50.00 Min. : 3.14 Min. :11.77
## 1st Qu.: 1.581 1st Qu.: 84.88 1st Qu.: 11.70 1st Qu.:15.90
## Median :10.505 Median :147.30 Median : 34.65 Median :17.45
## Mean : 9.159 Mean :132.15 Mean : 37.15 Mean :18.07
## 3rd Qu.:15.267 3rd Qu.:175.22 3rd Qu.: 59.58 3rd Qu.:19.53
## Max. :21.177 Max. :198.00 Max. :117.40 Max. :31.74
## NA's :20 NA's :4 NA's :21
## hc gen phb tv reg
## Min. :33.70 G1 : 56 P1 : 63 Min. : 1.00 north: 81
## 1st Qu.:48.12 G2 : 50 P2 : 40 1st Qu.: 4.00 east :161
## Median :53.00 G3 : 22 P3 : 19 Median :12.00 west :239
## Mean :51.51 G4 : 42 P4 : 32 Mean :11.89 south:191
## 3rd Qu.:56.00 G5 : 75 P5 : 50 3rd Qu.:20.00 city : 73
## Max. :65.00 NA's:503 P6 : 41 Max. :25.00 NA's : 3
## NA's :46 NA's:503 NA's :522
Next to looking at the numbers, it is often insightful to visualize
some of the univariate distributions of the data, for example
age
and gen
(genital Tanner stage).
1. Create a histogram of the variable age
using
the function geom_histogram()
.
boys %>%
ggplot(aes(x = age)) +
geom_histogram(fill = "dark green") +
theme_minimal() +
labs(title = "Distribution of age")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Most boys are relatively young, while there is
# also a substantial group of boys between the age
# of 10 and 20 years.
2. Create a bar chart of the variable gen
using
the function geom_bar()
.
boys %>%
ggplot(aes(x = gen)) +
geom_bar(fill = "dark green") +
theme_minimal() +
labs(title = "Distribution of genital Tanner stage")
Now we know that there is a substantial amount of missing data, it is
time to assess how severe the missingness problem is. One way to do this
is by asking for the missing data pattern, using the function
md.pattern()
from mice
.
## age reg wgt hgt bmi hc gen phb tv
## 223 1 1 1 1 1 1 1 1 1 0
## 19 1 1 1 1 1 1 1 1 0 1
## 1 1 1 1 1 1 1 1 0 1 1
## 1 1 1 1 1 1 1 0 1 0 2
## 437 1 1 1 1 1 1 0 0 0 3
## 43 1 1 1 1 1 0 0 0 0 4
## 16 1 1 1 0 0 1 0 0 0 5
## 1 1 1 1 0 0 0 0 0 0 6
## 1 1 1 0 1 0 1 0 0 0 5
## 1 1 1 0 0 0 1 1 1 1 3
## 1 1 1 0 0 0 0 1 1 1 4
## 1 1 1 0 0 0 0 0 0 0 7
## 3 1 0 1 1 1 1 0 0 0 4
## 0 3 4 20 21 46 503 503 522 1622
The md.pattern
shows that there are 223 cases without
any missing values, 525 observations with at least one value
missing, and that there are 1622 missing values in total in the data,
such that 24% of the data is missing.
Also, it becomes clear that the variables gen
(genital
Tanner stage), phb
(pubic hair) and tv
(testicular volume) have most missing values.
Now we now that there is a substantial amount of missing information
in the boys
data, we can more closely focus on the missing
data patterns. A first step could be to test whether missingness in the
variables gen
, phb
and tv
depends
on someones age.
3. Create a missingness indicator for the variables
gen
, phb
and tv
.
4. Assess whether missingness in the variables
gen
, phb
and tv
is related to
someones age.
Hint: Use the group_by()
and
summarize()
functions from the dplyr
library.
# And we see that those with a missing value on the variables
# of interest have a substantial lower age than those with an
# observed value.
Although such differences can be useful, they rarely tell the complete story about the missingness. A useful strategy to assess missing data patterns is visualization, as figures generally tell more than just plain numbers.
5. Create a histogram for the variable age
,
faceted by whether or not someone has a missing value on
gen
.
Hint: Use the function facet_wrap
(see
help(facet_wrap)
) to create two separate histograms for
both groups.
boys_mis %>%
ggplot(aes(x = age)) +
geom_histogram(fill = "dark green") +
facet_wrap(~gen_mis) +
theme_minimal()
# Now we see what we couldn't have seen before. Boys
# with an observed value on gen all are at least seven
# years old, while those with a missing value on gen
# are far more often between 0 and 5 years old.
Now we know somewhat more about the missingness problem in the
boys
dataset, we can continue visualizing other
relationships in the data. The next step is to visualize the
relationship between age
and bmi
.
6. Create a scatterplot with age
on the x-axis
and bmi
on the y-axis, using the function
geom_point()
.
boys_mis %>%
ggplot(aes(x = age, y = bmi)) +
geom_point() +
theme_minimal() +
labs(title = "Scatter plot of age versus bmi")
## Warning: Removed 21 rows containing missing values (`geom_point()`).
Although we already know how missingness in the variable
gen
is related to age
, we do not know whether
gen
is also related to bmi
.
7. Add a colour aesthetic to the previous plot using the
missingness indicator of the variable gen
.
boys_mis %>%
ggplot(aes(x = age, y = bmi, col = gen_mis)) +
geom_point() +
theme_minimal() +
scale_color_viridis_d() +
labs(title = "Scatter plot of age versus bmi")
## Warning: Removed 21 rows containing missing values (`geom_point()`).
# Again, we clearly see that younger boys generally have
# a missing value on gen, while there does not seem to be
# much of a relation between missingness on gen and bmi
Usually, we would, at this point, handle the missingness problem accordingly. Yet, the scope of this course does not allow for a thorough module on dealing with missing data. If you are interested in this topic, or if you will need it during future (course)work, there are plenty of online resources. The short list of resources displayed below should get you going.
Description | Author & Title | Link |
---|---|---|
Missing data theory | Van Buuren - Flexible Imputation of Missing Data | Click |
Missing data theory (somewhat more technical) | Little & Rubin - Statistical Analysis with Missing Data | Click |
Tutorials on dealing with missing data | Vink & Van Buuren | Click (see
Vignettes ). |
In the remainder of this practical, we will ignore the missing data problems, and focus on visualizing (relations in) the data.
8. Visualize the relationship between reg
(region) and age
using a boxplot.
boys_mis %>%
ggplot(aes(x = reg, y = age)) +
geom_boxplot(fill = "dark green") +
theme_minimal() +
labs(title = "Boxplot of age by region.")
# Boys in the northerns region seem a little older
# than the rest, boys with an NA on region seem to
# be much younger than the rest, while there does
# not appear to be much of a difference in the
# remaining categories.
9. Create a density plot of age, splitting the densities by
gen
using the fill
aesthetic.
boys_mis %>%
ggplot(aes(x = age, fill = gen)) +
geom_density(alpha = 0.7) +
theme_minimal() +
scale_fill_brewer() +
labs(title = "Density of age by genital Tanner stage")
# You'll see a clear relation between gen and age,
# which is not really surprising, because physical
# development is usually driven by aging.
10. Create a diverging bar chart for hgt
in the
boys
data set, that displays for every age
year that year’s mean height in deviations from the overall average
hgt
.
Hint: You will have to create a categorical age variable for
every year, and center the variable hgt
such that it
reflects that person’s difference from the mean height in the entire
data.
boys %>%
mutate(Age = cut(age, 0:22, labels = paste0(0:21, " years")),
Height = hgt - mean(hgt, na.rm = TRUE)) %>%
group_by(Age) %>%
summarize(Height = mean(Height, na.rm = TRUE)) %>%
mutate(color = ifelse(Height > 0, "Above average", "Below average")) %>%
ggplot(aes(x = Height, y = Age, fill = color)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
theme(legend.title = element_blank())
So far, we have predominantly focused our visualization on variables in the data. We can easily extend this to visualizing models that are constructed from the data. Similarly to looking at visualizations of variables, visualizations of models can help interpreting the results and the fit of the model, by making things obvious that would be obscure when looking at the raw numbers.
For this exercise, we are going to work with the elastic band data
sets from the package DAAG
, providing information on the
stretch of a rubber band. For more information on the data, call
help(elastic1)
or help(elastic2)
.
11. Load the data elastic1
and
elastic2
and bind the data frames together using the
function bind_rows()
and add a grouping variable indicating
whether an observation comes from elastic1
or from
elastic2
.
12. Create a scatterplot mapping stretch
on the
x-axis and distance
on the y-axis, and map the just created
group indicator as the color aesthetic.
elastic %>%
ggplot(aes(x = stretch, y = distance, col = Set)) +
geom_point() +
scale_color_brewer(palette = "Set1") +
theme_minimal() +
labs(title = "Elastic bands data")
13. Recreate the previous plot, but now assess whether the results of the two data sets appear consistent by adding a linear regression line.
Hint: Use the function geom_smooth()
.
elastic %>%
ggplot(aes(x = stretch, y = distance, col = Set)) +
geom_point() +
geom_smooth(method = "lm") +
scale_color_brewer(palette = "Set1") +
theme_minimal() +
labs(title = "Elastic bands data")
# The results seem very consistent: Data set elastic2 has
# more observations over a larger range, but both sets
# result in roughly the same regression line. Data set
# elastic1 seems to have an odd-one-out value.
14. For each of the data sets elastic1
and
elastic2
, fit a regression model with
y = distance
on x = stretch
using
lm(y ~ x, data)
.
15. For both of the previously created fitted models, determine the fitted values and the standard errors of the fitted values, and the proportion explained variance \(R^2\).
Hint: Check out the predict
function in
R
and notice the se.fit
argument, as well as
the summary
function.
## $fit
## 1 2 3 4 5 6 7
## 183.1429 235.7143 196.2857 209.4286 170.0000 156.8571 222.5714
##
## $se.fit
## [1] 6.586938 10.621119 5.891537 6.586938 8.331891 10.621119 8.331891
##
## $df
## [1] 5
##
## $residual.scale
## [1] 15.58754
## $fit
## 1 2 3 4 5 6 7 8
## 77.58333 196.58333 137.08333 166.83333 256.08333 226.33333 107.33333 226.33333
## 9
## 285.83333
##
## $se.fit
## [1] 6.740293 3.520003 4.358744 3.635444 5.060323 4.064550 5.453165 4.064550
## [9] 6.296773
##
## $df
## [1] 7
##
## $residual.scale
## [1] 10.44202
# Note that fit1 (based on elastic1) has a larger
# residual standard deviation (i.e., $residual.scale).
# Check R2
fit1 %>% summary()
##
## Call:
## lm(formula = distance ~ stretch, data = elastic1)
##
## Residuals:
## 1 2 3 4 5 6 7
## -0.1429 -18.7143 -7.2857 -1.4286 8.0000 -6.8571 26.4286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -119.143 70.943 -1.679 0.15391
## stretch 6.571 1.473 4.462 0.00663 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.59 on 5 degrees of freedom
## Multiple R-squared: 0.7992, Adjusted R-squared: 0.7591
## F-statistic: 19.91 on 1 and 5 DF, p-value: 0.006631
##
## Call:
## lm(formula = distance ~ stretch, data = elastic2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0833 -7.0833 -0.5833 5.1667 20.1667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -100.9167 15.6102 -6.465 0.000345 ***
## stretch 5.9500 0.3148 18.899 2.89e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.44 on 7 degrees of freedom
## Multiple R-squared: 0.9808, Adjusted R-squared: 0.978
## F-statistic: 357.2 on 1 and 7 DF, p-value: 2.888e-07
## [1] 0.7992446
## [1] 0.9807775
# Note that the model based on elastic2 has smaller standard
# errors and a much larger R2. This is due to the larger
# range of values in elastic2 and the absence of an outlier.
16. Study the residual versus leverage plots for both models.
Hint: Use plot(which = 5)
on the fitted
objects.
# For elastic1, case 2 has the largest influence on
# estimation. However, it is not the case with the
# largest residual:
fit1$residuals
## 1 2 3 4 5 6
## -0.1428571 -18.7142857 -7.2857143 -1.4285714 8.0000000 -6.8571429
## 7
## 26.4285714
17. Use the elastic2
variable
stretch
to obtain predictions on the model fitted on
elastic1
.
18. Now make a scatterplot to investigate similarity between
the predicted values and the observed values for
elastic2
.
pred_dat <-
data.frame(distance = pred,
stretch = elastic2$stretch) %>%
bind_rows(Predicted = .,
Observed = elastic2,
.id = "Predicted")
pred_dat %>%
ggplot(aes(stretch, distance, col = Predicted)) +
geom_point() +
geom_smooth(method = "lm") +
scale_color_brewer(palette = "Set1") +
theme_minimal() +
labs(title = "Predicted and observed distances")
# The predicted values are very similar to the observed
# values. However, they do not strictly follow the straight
# line because there is some modeling error: we use elastic1's
# model to predict elastic2's distance [error source 1] and
# we compare those predictions to elastic2's observed distance
# [error source 2]. However, if you consider the modeling,
# these predictions are very accurate and have high
# correlations with the observed values:
cor(elastic2$distance, pred)
## [1] 0.9903421
When you have finished the practical,
enclose all files of the project (i.e. all .R
and/or
.Rmd
files including the one with your answers, and the
.Rproj
file) in a zip file or folder, and
hand in the zip here. Do so before next week’s lecture.