Introduction

In this practical, we will build upon the topics of the last two weeks, and deepen your understanding of various data wrangling and visualization techniques, using the tidyverse functionalities. Additionally, we briefly introduce the concept of missing data and how to visualize relations between the data and missingness.

Don’t forget to open a project file called 03_Exploratory.Rproj and to create your .Rmd or .R file to work in.

library(tidyverse)
library(magrittr)
library(mice)
library(DAAG)

The `boys` data

Today, we will work with the boys data from the R package mice, containing information from a sample of 748 boys. Information about this data can be obtained by running the code ?mice::boys.

When working with a new data set, it is always a good idea to familiarize yourself with the structure of the data.

head(boys)

tail(boys)

It seems like the data is sorted on the variable age. To verify this, we can test our intuition explicitly.

!is.unsorted(boys$age)

## [1] TRUE

Indeed, the data is sorted on age. In R, there is no is.sorted function, but we can test whether the data is sorted by testing whether the data is not unsorted.

Additionally, you saw that the data contains 9 variables (age, hgt, wgt, bmi, hc, gen, phb, tv, reg), and that the data suffers from missingness. To get a further understanding of the data, we can ask for a summary of all variables.

summary(boys)

##       age              hgt              wgt              bmi       
##  Min.   : 0.035   Min.   : 50.00   Min.   :  3.14   Min.   :11.77  
##  1st Qu.: 1.581   1st Qu.: 84.88   1st Qu.: 11.70   1st Qu.:15.90  
##  Median :10.505   Median :147.30   Median : 34.65   Median :17.45  
##  Mean   : 9.159   Mean   :132.15   Mean   : 37.15   Mean   :18.07  
##  3rd Qu.:15.267   3rd Qu.:175.22   3rd Qu.: 59.58   3rd Qu.:19.53  
##  Max.   :21.177   Max.   :198.00   Max.   :117.40   Max.   :31.74  
##                   NA's   :20       NA's   :4        NA's   :21     
##        hc          gen        phb            tv           reg     
##  Min.   :33.70   G1  : 56   P1  : 63   Min.   : 1.00   north: 81  
##  1st Qu.:48.12   G2  : 50   P2  : 40   1st Qu.: 4.00   east :161  
##  Median :53.00   G3  : 22   P3  : 19   Median :12.00   west :239  
##  Mean   :51.51   G4  : 42   P4  : 32   Mean   :11.89   south:191  
##  3rd Qu.:56.00   G5  : 75   P5  : 50   3rd Qu.:20.00   city : 73  
##  Max.   :65.00   NA's:503   P6  : 41   Max.   :25.00   NA's :  3  
##  NA's   :46                 NA's:503   NA's   :522

Next to looking at the numbers, it is often insightful to visualize some of the univariate distributions of the data, for example age and gen (genital Tanner stage).

1. Create a histogram of the variable age using the function geom_histogram().

boys %>%
  ggplot(aes(x = age)) +
  geom_histogram(fill = "dark green") +
  theme_minimal() +
  labs(title = "Distribution of age")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Most boys are relatively young, while there is
# also a substantial group of boys between the age
# of 10 and 20 years.

2. Create a bar chart of the variable gen using the function geom_bar().

boys %>%
  ggplot(aes(x = gen)) +
  geom_bar(fill = "dark green") +
  theme_minimal() +
  labs(title = "Distribution of genital Tanner stage")

Assessing missing data

Now we know that there is a substantial amount of missing data, it is time to assess how severe the missingness problem is. One way to do this is by asking for the missing data pattern, using the function md.pattern() from mice.

md.pattern(boys)

##     age reg wgt hgt bmi hc gen phb  tv     
## 223   1   1   1   1   1  1   1   1   1    0
## 19    1   1   1   1   1  1   1   1   0    1
## 1     1   1   1   1   1  1   1   0   1    1
## 1     1   1   1   1   1  1   0   1   0    2
## 437   1   1   1   1   1  1   0   0   0    3
## 43    1   1   1   1   1  0   0   0   0    4
## 16    1   1   1   0   0  1   0   0   0    5
## 1     1   1   1   0   0  0   0   0   0    6
## 1     1   1   0   1   0  1   0   0   0    5
## 1     1   1   0   0   0  1   1   1   1    3
## 1     1   1   0   0   0  0   1   1   1    4
## 1     1   1   0   0   0  0   0   0   0    7
## 3     1   0   1   1   1  1   0   0   0    4
##       0   3   4  20  21 46 503 503 522 1622

The md.pattern shows that there are 223 cases without any missing values, 525 observations with at least one value missing, and that there are 1622 missing values in total in the data, such that 24% of the data is missing.

Also, it becomes clear that the variables gen (genital Tanner stage), phb (pubic hair) and tv (testicular volume) have most missing values.

Now we now that there is a substantial amount of missing information in the boys data, we can more closely focus on the missing data patterns. A first step could be to test whether missingness in the variables gen, phb and tv depends on someones age.

3. Create a missingness indicator for the variables gen, phb and tv.

boys_mis <- boys %>%
  mutate(gen_mis = is.na(gen),
         phb_mis = is.na(phb),
         tv_mis  = is.na(tv))

4. Assess whether missingness in the variables gen, phb and tv is related to someones age.

Hint: Use the group_by() and summarize() functions from the dplyr library.

boys_mis %>%
  group_by(gen_mis) %>%
  summarize(age = mean(age))

boys_mis %>%
  group_by(phb_mis) %>%
  summarize(age = mean(age))

boys_mis %>%
  group_by(tv_mis) %>%
  summarize(age = mean(age))

# And we see that those with a missing value on the variables
# of interest have a substantial lower age than those with an
# observed value.

Although such differences can be useful, they rarely tell the complete story about the missingness. A useful strategy to assess missing data patterns is visualization, as figures generally tell more than just plain numbers.

5. Create a histogram for the variable age, faceted by whether or not someone has a missing value on gen.

Hint: Use the function facet_wrap (see help(facet_wrap)) to create two separate histograms for both groups.

boys_mis %>%
  ggplot(aes(x = age)) +
  geom_histogram(fill = "dark green") + 
  facet_wrap(~gen_mis) +
  theme_minimal()

# Now we see what we couldn't have seen before. Boys
# with an observed value on gen all are at least seven
# years old, while those with a missing value on gen
# are far more often between 0 and 5 years old.

Now we know somewhat more about the missingness problem in the boys dataset, we can continue visualizing other relationships in the data. The next step is to visualize the relationship between age and bmi.

6. Create a scatterplot with age on the x-axis and bmi on the y-axis, using the function geom_point().

boys_mis %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Scatter plot of age versus bmi")

## Warning: Removed 21 rows containing missing values (`geom_point()`).

# Note that ggplot gives the warning that 21 rows are
# removed due to missingness in bmi.

Although we already know how missingness in the variable gen is related to age, we do not know whether gen is also related to bmi.

7. Add a colour aesthetic to the previous plot using the missingness indicator of the variable gen.

boys_mis %>%
  ggplot(aes(x = age, y = bmi, col = gen_mis)) +
  geom_point() +
  theme_minimal() +
  scale_color_viridis_d() +
  labs(title = "Scatter plot of age versus bmi")

## Warning: Removed 21 rows containing missing values (`geom_point()`).

# Again, we clearly see that younger boys generally have 
# a missing value on gen, while there does not seem to be
# much of a relation between missingness on gen and bmi

Usually, we would, at this point, handle the missingness problem accordingly. Yet, the scope of this course does not allow for a thorough module on dealing with missing data. If you are interested in this topic, or if you will need it during future (course)work, there are plenty of online resources. The short list of resources displayed below should get you going.

Description	Author & Title	Link
Missing data theory	Van Buuren - Flexible Imputation of Missing Data	Click
Missing data theory (somewhat more technical)	Little & Rubin - Statistical Analysis with Missing Data	Click
Tutorials on dealing with missing data	Vink & Van Buuren	Click (see `Vignettes`).

Visualizing the boys data

In the remainder of this practical, we will ignore the missing data problems, and focus on visualizing (relations in) the data.

8. Visualize the relationship between reg (region) and age using a boxplot.

boys_mis %>%
  ggplot(aes(x = reg, y = age)) +
  geom_boxplot(fill = "dark green") +
  theme_minimal() +
  labs(title = "Boxplot of age by region.")

# Boys in the northerns region seem a little older
# than the rest, boys with an NA on region seem to
# be much younger than the rest, while there does 
# not appear to be much of a difference in the 
# remaining categories.

9. Create a density plot of age, splitting the densities by gen using the fill aesthetic.

boys_mis %>%
  ggplot(aes(x = age, fill = gen)) +
  geom_density(alpha = 0.7) +
  theme_minimal() +
  scale_fill_brewer() +
  labs(title = "Density of age by genital Tanner stage")

# You'll see a clear relation between gen and age, 
# which is not really surprising, because physical 
# development is usually driven by aging.

10. Create a diverging bar chart for hgt in the boys data set, that displays for every age year that year’s mean height in deviations from the overall average hgt.

Hint: You will have to create a categorical age variable for every year, and center the variable hgt such that it reflects that person’s difference from the mean height in the entire data.

boys %>%
  mutate(Age = cut(age, 0:22, labels = paste0(0:21, " years")),
         Height = hgt - mean(hgt, na.rm = TRUE)) %>%
  group_by(Age) %>%
  summarize(Height = mean(Height, na.rm = TRUE)) %>%
  mutate(color = ifelse(Height > 0, "Above average", "Below average")) %>%
  ggplot(aes(x = Height, y = Age, fill = color)) +
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  theme(legend.title = element_blank())

# We can clearly see that the average height in the 
# group is reached just before age 7.

Regression visualization

So far, we have predominantly focused our visualization on variables in the data. We can easily extend this to visualizing models that are constructed from the data. Similarly to looking at visualizations of variables, visualizations of models can help interpreting the results and the fit of the model, by making things obvious that would be obscure when looking at the raw numbers.

For this exercise, we are going to work with the elastic band data sets from the package DAAG, providing information on the stretch of a rubber band. For more information on the data, call help(elastic1) or help(elastic2).

11. Load the data elastic1 and elastic2 and bind the data frames together using the function bind_rows() and add a grouping variable indicating whether an observation comes from elastic1 or from elastic2.

elastic <- bind_rows("Elastic1" = elastic1,
                     "Elastic2" = elastic2,
                     .id = "Set")

12. Create a scatterplot mapping stretch on the x-axis and distance on the y-axis, and map the just created group indicator as the color aesthetic.

elastic %>%
  ggplot(aes(x = stretch, y = distance, col = Set)) +
  geom_point() +
  scale_color_brewer(palette = "Set1") +
  theme_minimal() +
  labs(title = "Elastic bands data")

13. Recreate the previous plot, but now assess whether the results of the two data sets appear consistent by adding a linear regression line.

Hint: Use the function geom_smooth().

elastic %>%
  ggplot(aes(x = stretch, y = distance, col = Set)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_color_brewer(palette = "Set1") +
  theme_minimal() +
  labs(title = "Elastic bands data")

# The results seem very consistent: Data set elastic2 has 
# more observations over a larger range, but both sets 
# result in roughly the same regression line. Data set 
# elastic1 seems to have an odd-one-out value.

14. For each of the data sets elastic1 and elastic2, fit a regression model with y = distance on x = stretch using lm(y ~ x, data).

fit1 <- lm(distance ~ stretch, elastic1)
fit2 <- lm(distance ~ stretch, elastic2)

15. For both of the previously created fitted models, determine the fitted values and the standard errors of the fitted values, and the proportion explained variance \(R^2\).

Hint: Check out the predict function in R and notice the se.fit argument, as well as the summary function.

# Check fitted values

fit1 %>% predict(se.fit = TRUE)

## $fit
##        1        2        3        4        5        6        7 
## 183.1429 235.7143 196.2857 209.4286 170.0000 156.8571 222.5714 
## 
## $se.fit
## [1]  6.586938 10.621119  5.891537  6.586938  8.331891 10.621119  8.331891
## 
## $df
## [1] 5
## 
## $residual.scale
## [1] 15.58754

fit2 %>% predict(se.fit = TRUE)

## $fit
##         1         2         3         4         5         6         7         8 
##  77.58333 196.58333 137.08333 166.83333 256.08333 226.33333 107.33333 226.33333 
##         9 
## 285.83333 
## 
## $se.fit
## [1] 6.740293 3.520003 4.358744 3.635444 5.060323 4.064550 5.453165 4.064550
## [9] 6.296773
## 
## $df
## [1] 7
## 
## $residual.scale
## [1] 10.44202

# Note that fit1 (based on elastic1) has a larger
# residual standard deviation (i.e., $residual.scale).

# Check R2

fit1 %>% summary()

## 
## Call:
## lm(formula = distance ~ stretch, data = elastic1)
## 
## Residuals:
##        1        2        3        4        5        6        7 
##  -0.1429 -18.7143  -7.2857  -1.4286   8.0000  -6.8571  26.4286 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -119.143     70.943  -1.679  0.15391   
## stretch        6.571      1.473   4.462  0.00663 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.59 on 5 degrees of freedom
## Multiple R-squared:  0.7992, Adjusted R-squared:  0.7591 
## F-statistic: 19.91 on 1 and 5 DF,  p-value: 0.006631

fit2 %>% summary()

## 
## Call:
## lm(formula = distance ~ stretch, data = elastic2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.0833  -7.0833  -0.5833   5.1667  20.1667 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -100.9167    15.6102  -6.465 0.000345 ***
## stretch        5.9500     0.3148  18.899 2.89e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.44 on 7 degrees of freedom
## Multiple R-squared:  0.9808, Adjusted R-squared:  0.978 
## F-statistic: 357.2 on 1 and 7 DF,  p-value: 2.888e-07

# Or directly
fit1 %>% summary() %$% r.squared

## [1] 0.7992446

fit2 %>% summary() %$% r.squared

## [1] 0.9807775

# Note that the model based on elastic2 has smaller standard
# errors and a much larger R2. This is due to the larger
# range of values in elastic2 and the absence of an outlier.

16. Study the residual versus leverage plots for both models.

Hint: Use plot(which = 5) on the fitted objects.

fit1 %>% plot(which = 5)

fit2 %>% plot(which = 5)

# For elastic1, case 2 has the largest influence on
# estimation. However, it is not the case with the 
# largest residual:
fit1$residuals

##           1           2           3           4           5           6 
##  -0.1428571 -18.7142857  -7.2857143  -1.4285714   8.0000000  -6.8571429 
##           7 
##  26.4285714

# As we can see, case 7 has the largest residual.

17. Use the elastic2 variable stretch to obtain predictions on the model fitted on elastic1.

pred <- predict(fit1, newdata = elastic2)

18. Now make a scatterplot to investigate similarity between the predicted values and the observed values for elastic2.

pred_dat <- 
  data.frame(distance = pred, 
             stretch  = elastic2$stretch) %>%
  bind_rows(Predicted = .,
            Observed  = elastic2, 
            .id = "Predicted")

pred_dat %>%
  ggplot(aes(stretch, distance, col = Predicted)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  scale_color_brewer(palette = "Set1") +
  theme_minimal() +
  labs(title = "Predicted and observed distances")

# The predicted values are very similar to the observed
# values. However, they do not strictly follow the straight
# line because there is some modeling error: we use elastic1's
# model to predict elastic2's distance [error source 1] and
# we compare those predictions to elastic2's observed distance
# [error source 2]. However, if you consider the modeling,
# these predictions are very accurate and have high
# correlations with the observed values:

cor(elastic2$distance, pred)

## [1] 0.9903421

Hand-in

When you have finished the practical,

enclose all files of the project (i.e. all .R and/or .Rmd files including the one with your answers, and the .Rproj file) in a zip file or folder, and
hand in the zip here. Do so before next week’s lecture.