This is the first programming practical. If you haven’t yet done so,
open the project file 01_Data_Wrangling.Rproj
in RStudio.
You can choose to write the answers to your exercises in either an
.R
file or in an .Rmd
file. Example answer
files are provided in the project directory
(example_answers.Rmd
and example_answers.R
).
You can open these from the files pane and use them as a starting point.
While working through the exercises, write down your code in one of
these files. Use proper style and provide comments so you can read it
back later and still understand what is happening.
The practicals always start with the packages we are going to use. Be
sure to run these lines in your session to load their functions before
you continue. If there are packages that you have not yet installed,
first install them with install.packages()
.
There are several data types in R
. Here is a table with
the most common ones:
Type | Short | Example |
---|---|---|
Integer | int | 0, 1, 2, 3, -4, -5 |
Numeric / Double | dbl | 0.1, -2.5, 123.456 |
Character | chr | “dav is a cool course” |
Logical | lgl | TRUE / FALSE |
Factor | fctr | low, medium, high |
The class()
function can give you an idea about what
type of data each variable contains.
R
and inspect their
data types using the class()
function. Try to guess
beforehand what their types will be!object_1 <- 1:5
object_2 <- 1L:5L
object_3 <- "-123.456"
object_4 <- as.numeric(object_2)
object_5 <- letters[object_1]
object_6 <- as.factor(rep(object_5, 2))
object_7 <- c(1, 2, 3, "4", "5", "6")
The factor data type is special to R
and uncommon in
other programming languages. It is used to represent categorical
variables with fixed possible values. For example, when there is a
multiple choice question with 5 possible choices (a to e) and 10
students answer the question, we may get a result as in
object_6
.
Vectors can have only a single data type. Note that the first three
elements in object_7
have been converted. We can convert to
different data types using the as.<class>()
functions.
object_7
back to a vector of numbers
using the as.numeric()
functionA list is a collection of objects. The elements may have names, but it is not necessary. Each element of a list can have a different data type.
objects
containing object 1
to 7 using the list()
function.You can select elements of a list using its name
(objects$elementname
) or using its index
(objects[[1]]
for the first element).
A special type of list is the data.frame
. It is the same
as a list, but each element is forced to have the same length and a
name. The elements of a data.frame
are the columns of a
dataset. In the tidyverse, data.frame
s are called
tibble
s.
object_1
,
object_2
, and object_5
using the
data.frame()
functionJust like a list, the columns in a data frame (the variables in a
dataset) can be accessed using their name df$columnname
or
their index df[[1]]
. Additionally, the tenth row can be
selected using df[10, ]
, the second column using
df[, 2]
and cell number 10, 2 can be accessed using
df[10, 2]
. This is because data frames also behave like the
matrix
data type in addition to the list
type.
ncol()
and nrow()
. Try them
out!We are going to use a dataset from Kaggle - the Google play store
apps data by user lava18
. We have downloaded it into the
data folder already from https://www.kaggle.com/lava18/google-play-store-apps
(downloaded on 2018-09-28).
Tidyverse contains many data loading functions – each for their own
file type – in the packages readr
(default file types),
readxl
(excel files), and haven
(external file
types such as from SPSS or Stata). The most common file type is
csv
, which is what we use here.
read_csv()
to import the file
“googleplaystore.csv” and store it in a variable called
apps
.Above I load the data directly from the online source. Naturally, this is most efficient in terms of data storage and energy conservation. For reasons of simplicity, I will adopt a workflow below that would mimic a local scenario. In that scenario I have a RStudio project, with a data subfolder where my data are stored. If you’d like to run the code below, then either grab the archive from the course website on GitHub (see the address above) or create your own data folder relative to the root of you project or RMD file.
If necessary, use the help files. These import functions from the
tidyverse are fast and safe: they display informative errors if anything
goes wrong. read_csv()
also displays a message with
information on how each column is imported: which variable type each
column gets.
head()
to look at the first
few rows of the apps
datasetreadxl
). Also
try out the function tail()
and View()
(with a
capital V).summary()
function. What is the range of
the grades achieved by the students?dplyr
verbsThe tidyverse package dplyr
contains functions to
transform, rearrange, and filter data frames.
The first verb is filter()
, which selects rows from a
data frame. Chapter 5 of
R4DS states that to use filtering effectively, you have to know how
to select the observations that you want using the comparison operators.
R provides the standard suite: >
, >=
,
<
, <=
, !=
(not equal), and
==
(equal).
When you’re starting out with R, the easiest mistake to make is to
use =
instead of ==
when testing for
equality.
filter()
(especially
the examples) and show the students with a grade lower than
5.5If you are unsure how to proceed, read Section 5.2.2 from R4DS.
The second verb is arrange()
, which sorts a data frame
by one or more columns.
The third verb is select()
, which selects columns of
interest.
student_number
and
programme
columns from the students datasetWith mutate()
you can compute new columns and transform
existing columns as functions of the columns in your dataset. For
example, we may create a new logical column in the students dataset to
indicate whether a student has passed or failed:
Now, the students dataset has an extra column named “pass”.
You can also transform existing columns with the
mutate()
function. For example, we may want to transform
the programme column to an actual programme name according to this
table:
Code | Name |
---|---|
A | Science |
B | Social Science |
mutate()
and recode()
to
change the codes in the programme column of the students dataset to
their names. Store the result in a variable called
students_recoded
Chapter 5 of R4DS neatly summarises the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:
filter()
).arrange()
).select()
).mutate()
).Cleaning data files and extracting the most useful information is essential to any downstream steps such as plotting or analysis. Make sure you know exactly which variable types are in your tibbles / data frames!
A very useful feature in tidyverse is the pipe %>%
.
The pipe inputs the result from the left-hand side as the first argument
of the right-hand side function:
filter(students, grade > 5.5)
becomes
students %>% filter(grade > 5.5)
. With the pipe, a
set of processing steps becomes a neatly legible data processing
pipeline!
Different tasks we have performed on the students dataset can be done in one pipeline like so:
students_dataset <-
read_xlsx("data/students.xlsx") %>%
mutate(prog = recode(programme, "A" = "Science", "B" = "Social Science")) %>%
filter(grade > 5.5) %>%
arrange(programme, -grade) %>%
select(student_number, prog, grade)
students_dataset
In one statement, we have loaded the dataset from disk, recoded the programme variable, filtered only students that pass, reordered the rows and selected the relevant columns only. We did not need to save intermediate results or nest functions deeply.
mutate
and parse_number()
, (c) shows only apps
with more than 500 000 000 downloads, (d) orders them by rating (best on
top), and (e) shows only the relevant columns (you can choose which are
relevant, but select at least the Rating
and
Category
variables). Save the result under the name
popular_apps
.If you find duplicates, you may need to use
distinct(App, .keep_all = TRUE)
as the last step in your
pipeline to remove duplicate app names. Tip:
ctrl/cmd + shift + M
inserts a pipe operator in
RStudio.
We have now seen how we can transform and clean our datasets. The
next step is to start exploring the dataset by computing relevant
summary statistics, such as means, ranges, variances, differences, etc.
We have already used the function summary()
which comes
with R
, but dplyr
has extra summary
functionality in the form of the summarise()
(or
summarize()
) verb.
An example to get the mean grade of the students_dataset
we made earlier is below:
students_dataset %>%
summarise(
mean = mean(grade),
variance = var(grade),
min = min(grade),
max = max(grade)
)
The summarise()
function works with any function that
takes a vector of numbers and outputs a single number. For example, we
can create our own Median
Absolute Deviation (MAD) function:
By itself, the summarise()
function is not very useful;
we can also simply use the summary()
function or directly
enter the vector we are interested in as an argument to the functions:
mad(students_dataset$grade)
= 0.5908503. The power of
summarise()
is in its combined use with the
group_by()
function, which makes it easy to make grouped
summaries:
students_dataset %>%
group_by(prog) %>%
summarise(
mean = mean(grade),
variance = var(grade),
min = min(grade),
max = max(grade)
)
When you have finished the practical,
enclose all files of the project (i.e. all .R
and/or
.Rmd
files including the one with your answers, and the
.Rproj
file) in a zip file, and
hand in the zip here. Do so before next week’s lecture.