Supervised Learning and Visualization

This lecture

  1. Course pages
  2. Course overview
  3. Introduction to SLV
  4. Data Wrangling
  5. Data manipulation
  6. Basic analyses (linear regression, correlation & t-test)
  7. Pipes
  8. Wrap-up

Procedural stuff

  • If there is anything important - contact me!

  • The on-location lectures will not be recorded.

    • If you are ill, ask your classmates to cover for you.
  • If you feel that you are stuck, ask your classmates, ask me, ask the other lecturers. Ask a lot! Ask questions during/after the lectures and in the Q&A sessions!

    • You are most likely not the only one with that question. You are simply the bravest or the first.
    • Do not contact us via private chat or e-mail for content-related questions.

Course pages

Course overview

Team

Topics

Week | Focus | Practical | Materials | Prof
1 | Data wrangling with R and the grammar of graphics | tidyverse: filter(), select(), join(), pivot(), dbplyr; ggplot(): geoms, aesthetics, scales, themes | R4DS | DG
2 | Exploratory data analysis | Histograms, density plots, boxplots, etc. | R4DS; FIMD Ch. 1 | MC
3 | Statistical learning: regression | lm(), glm(), knn() | ISLR | DG
4 | Statistical learning: classification | glm(), trees, lda() | ISLR | EJvK
5 | Classification model evaluation | prop.table(), pROC(), etc. | ISLR | EJvK
6 | Nonlinear models | Advanced R formulas: I(), splines, e.g., bs() | ISLR | MC
7 | Bagging, boosting, random forests and support vector machines | randomForest, xgboost | ISLR | MC
8 | Benchmarking | mlr3verse | ISLR; mlr3-book | DG

Course Setup

Each week we have the following:

  • 1 Lecture on Monday @ 9am in Dalton 500 8.27
  • 1 Practical (not graded). Must be submitted to pass; hand in the practical before the next lecture.
  • 1 combined workgroup and Q&A session in Dalton 500 8.27
  • Course materials to study. See the corresponding week on the course page.

Twice we have:

  • Group assignments
  • The assignment is made in teams of 3-4 students.
  • Each assignment counts towards 25% of the total grade and must be > 5.5 to pass.

Once we have:

  • Individual exam (one part theory [paper-pencil], one part programming)
  • BYOD: so charge and bring your laptop.
  • 50% of total grade. Must be > 5.5 to pass.

Groups

We will form groups on Wednesday Sept 11!

Introduction to SLV

Introduction SLV

  1. What are statistical learning and visualization?
  2. How does it connect to data analysis?
  3. Why do we need the above?
  4. What types of analyses and learning are there?

Some example questions

  • Who will win the election?
  • Is the climate changing?
  • Why are women underrepresented in STEM degrees?
  • What is the best way to prevent heart failure?
  • Who is at risk of crushing debt?
  • Is this matter undergoing a phase transition?
  • What kind of topics are popular on Twitter?
  • How familiar are incoming DAV students with several DAV topics?

Goals in data analysis

  • Description:
    What happened?
  • Prediction:
    What will happen?
  • Explanation:
    Why did/does something happen?
  • Prescription:
    What should we do?

Modes in data analysis

  • Exploratory:
    Mining for interesting patterns or results
  • Confirmatory:
    Testing hypotheses

Some examples

             | Exploratory                | Confirmatory
Description  | EDA; unsupervised learning | Correlation analysis
Prediction   | Supervised learning        | Theoretical modeling
Explanation  | Visual mining              | Causal inference
Prescription | Personalised medicine      | A/B testing

In this course

  • Exploratory Data Analysis:
    Describing interesting patterns: use graphs and summaries to understand subgroups, detect anomalies, and understand the data.
    Examples: boxplot, five-number summary, histograms, missing data plots, …

  • Supervised learning:
    Regression: predict continuous labels from other values.
    Examples: linear regression, generalized additive model, regression trees,…
    Classification: predict discrete labels from other values.
    Examples: logistic regression, support vector machines, classification trees, …



Exploratory Data Analysis workflow

Data analysis

How do you think that data analysis relates to:

  • “Data analytics”?
  • “Data modeling”?
  • “Machine learning”?
  • “Statistical learning”?
  • “Statistics”?
  • “Data science”?
  • “Data mining”?
  • “Knowledge discovery”?

Explanation

People from different fields (such as statistics, computer science, information science, industry) have different goals and different standard approaches.

  • We often use the same techniques.
  • We just use different terms to highlight different aspects of so-called data analysis.
  • The terms on the previous slide are not exact synonyms.
  • But to most people they carry the same analytic intentions.

In this course we emphasize drawing insights that help us understand the data.

Some examples

How wages differ

The origin of cholera

Predicting the outcome of elections

Google Flu Trends

Identifying Brontës from Austen

Measles Vaccines

Data wrangling

Wrangling in the pipeline

Data wrangling is the process of transforming and mapping data from one “raw” data form into another format.

  • The process is often iterative
  • The goal is to add purpose and value to the data in order to maximize the downstream analytical gains

Source: R4DS

Core ideas

  • Discovering: The first step of data wrangling is to gain a better understanding of the data: different data are worked with and organized in different ways.
  • Structuring: The next step is to organize the data. Raw data are typically unorganized, and much of them may not be useful for the end product. This step makes computation and analysis easier later on.
  • Cleaning: Cleaning takes many forms, for example harmonizing dates that are formatted inconsistently, removing outliers that would skew results, or standardizing how missing values are coded. This step is important for assuring the overall quality of the data.
  • Enriching: At this step, determine whether additional data that could easily be added would benefit the data set.
  • Validating: This step is similar to structuring and cleaning. Use repeated sequences of validation rules to assure data consistency, quality, and security. An example of a validation rule is confirming the accuracy of fields by cross-checking the data.
  • Publishing: Prepare the data set for use downstream, whether by users or by software. Be sure to document the steps and logic applied during wrangling.

Source: Trifacta
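
To make these steps concrete, here is a minimal dplyr sketch of such a pipeline; the file name and the variable names (measurements.csv, hgt_cm) are hypothetical and purely illustrative:

library(dplyr)
library(readr)

raw <- read_csv("measurements.csv")         # Discovering: read the raw data
glimpse(raw)                                # ... and inspect its structure

clean <- raw %>%
  rename(height = hgt_cm) %>%               # Structuring: consistent column names
  filter(!is.na(height), height < 250) %>%  # Cleaning: drop missing and implausible values
  mutate(height_m = height / 100)           # Enriching: derive a new variable

stopifnot(all(clean$height_m > 0))          # Validating: a simple consistency rule
write_csv(clean, "measurements_clean.csv")  # Publishing: store the result for downstream use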

Data Manipulation

R

We use the following packages

library(MASS)     # for the cats data
library(dplyr)    # data manipulation
library(haven)    # in/exporting data
library(magrittr) # pipes
library(ggplot2)  # Plotting device

Loading packages

library(mice)     # missing data 
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind

Data manipulation

The cats data

head(cats)
##   Sex Bwt Hwt
## 1   F 2.0 7.0
## 2   F 2.0 7.4
## 3   F 2.0 9.5
## 4   F 2.1 7.2
## 5   F 2.1 7.3
## 6   F 2.1 7.6
str(cats)
## 'data.frame':    144 obs. of  3 variables:
##  $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bwt: num  2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
##  $ Hwt: num  7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...

How to select only Female cats?

fem.cats <- cats[cats$Sex == "F", ]
dim(fem.cats)
## [1] 47  3
head(fem.cats)
##   Sex Bwt Hwt
## 1   F 2.0 7.0
## 2   F 2.0 7.4
## 3   F 2.0 9.5
## 4   F 2.1 7.2
## 5   F 2.1 7.3
## 6   F 2.1 7.6

How to select only heavy cats?

heavy.cats <- cats[cats$Bwt > 3, ]
dim(heavy.cats)
## [1] 36  3
head(heavy.cats)
##     Sex Bwt  Hwt
## 109   M 3.1  9.9
## 110   M 3.1 11.5
## 111   M 3.1 12.1
## 112   M 3.1 12.5
## 113   M 3.1 13.0
## 114   M 3.1 14.3

How to select only heavy cats?

heavy.cats <- subset(cats, Bwt > 3)
dim(heavy.cats)
## [1] 36  3
head(heavy.cats)
##     Sex Bwt  Hwt
## 109   M 3.1  9.9
## 110   M 3.1 11.5
## 111   M 3.1 12.1
## 112   M 3.1 12.5
## 113   M 3.1 13.0
## 114   M 3.1 14.3

More flexible: dplyr

filter(cats, Bwt > 2, Bwt < 2.2, Sex == "F")
##   Sex Bwt Hwt
## 1   F 2.1 7.2
## 2   F 2.1 7.3
## 3   F 2.1 7.6
## 4   F 2.1 8.1
## 5   F 2.1 8.2
## 6   F 2.1 8.3
## 7   F 2.1 8.5
## 8   F 2.1 8.7
## 9   F 2.1 9.8

Working with factors

class(cats$Sex)
## [1] "factor"
levels(cats$Sex)
## [1] "F" "M"

Working with factors

levels(cats$Sex) <- c("Female", "Male")
table(cats$Sex)
## 
## Female   Male 
##     47     97
head(cats)
##      Sex Bwt Hwt
## 1 Female 2.0 7.0
## 2 Female 2.0 7.4
## 3 Female 2.0 9.5
## 4 Female 2.1 7.2
## 5 Female 2.1 7.3
## 6 Female 2.1 7.6

Releveling factors

lm(Hwt ~ Bwt + Sex, data = cats)
## 
## Call:
## lm(formula = Hwt ~ Bwt + Sex, data = cats)
## 
## Coefficients:
## (Intercept)          Bwt      SexMale  
##     -0.4150       4.0758      -0.0821
cats$Sex <- relevel(cats$Sex, ref = "Male")
lm(Hwt ~ Bwt + Sex, data = cats)
## 
## Call:
## lm(formula = Hwt ~ Bwt + Sex, data = cats)
## 
## Coefficients:
## (Intercept)          Bwt    SexFemale  
##     -0.4970       4.0758       0.0821

Sorting

order(cats$Bwt)
##   [1]   1   2   3  48  49   4   5   6   7   8   9  10  11  12  50  13  14  15
##  [19]  16  17  18  51  52  53  54  55  56  57  58  19  20  21  22  23  24  25
##  [37]  26  27  28  29  30  59  31  32  33  34  60  61  62  63  64  35  36  65
##  [55]  66  67  68  69  70  71  72  37  38  39  73  74  75  76  77  78  40  41
##  [73]  42  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  43
##  [91]  44  45  95  96  97  98  99  46  47 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
sorted.cats <- cats[order(cats$Bwt), ]
head(sorted.cats)
##       Sex Bwt Hwt
## 1  Female 2.0 7.0
## 2  Female 2.0 7.4
## 3  Female 2.0 9.5
## 48   Male 2.0 6.5
## 49   Male 2.0 6.5
## 4  Female 2.1 7.2

Better sorting: arrange()

sorted.cats2 <- arrange(cats, Bwt)
head(sorted.cats)
##       Sex Bwt Hwt
## 1  Female 2.0 7.0
## 2  Female 2.0 7.4
## 3  Female 2.0 9.5
## 48   Male 2.0 6.5
## 49   Male 2.0 6.5
## 4  Female 2.1 7.2
head(sorted.cats2)
##      Sex Bwt Hwt
## 1 Female 2.0 7.0
## 2 Female 2.0 7.4
## 3 Female 2.0 9.5
## 4   Male 2.0 6.5
## 5   Male 2.0 6.5
## 6 Female 2.1 7.2

Better sorting: arrange()

cats.2 <- arrange(cats, Bwt, desc(Hwt))
head(cats.2, n = 10)
##       Sex Bwt  Hwt
## 1  Female 2.0  9.5
## 2  Female 2.0  7.4
## 3  Female 2.0  7.0
## 4    Male 2.0  6.5
## 5    Male 2.0  6.5
## 6    Male 2.1 10.1
## 7  Female 2.1  9.8
## 8  Female 2.1  8.7
## 9  Female 2.1  8.5
## 10 Female 2.1  8.3

Better sorting: arrange()

cats.2 <- arrange(cats, desc(Hwt), Bwt)
head(cats.2, n = 10)
##     Sex Bwt  Hwt
## 1  Male 3.9 20.5
## 2  Male 3.5 17.2
## 3  Male 3.8 16.8
## 4  Male 3.5 15.7
## 5  Male 3.5 15.6
## 6  Male 3.3 15.4
## 7  Male 3.6 15.0
## 8  Male 3.3 14.9
## 9  Male 3.6 14.8
## 10 Male 3.8 14.8

Basic analyses

Correlation

cor(cats[, -1])
##           Bwt       Hwt
## Bwt 1.0000000 0.8041274
## Hwt 0.8041274 1.0000000

With [, -1] we exclude the first column

Correlation

cor.test(cats$Bwt, cats$Hwt)
## 
##  Pearson's product-moment correlation
## 
## data:  cats$Bwt and cats$Hwt
## t = 16.119, df = 142, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7375682 0.8552122
## sample estimates:
##       cor 
## 0.8041274

What do we conclude?

Correlation

plot(cats$Bwt, cats$Hwt)

T-test

Test the null hypothesis that the difference in mean heart weight between male and female cats is 0

t.test(formula = Hwt ~ Sex, data = cats)
## 
##  Welch Two Sample t-test
## 
## data:  Hwt by Sex
## t = 6.5179, df = 140.61, p-value = 1.186e-09
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  1.477352 2.763753
## sample estimates:
##   mean in group Male mean in group Female 
##            11.322680             9.202128

T-test

plot(formula = Hwt ~ Sex, data = cats)

Pipes

This is a pipe:

boys <- 
  read_sav("boys.sav") %>%
  head()

It effectively replaces head(read_sav("boys.sav")).

Why are pipes useful?

Let’s assume that we want to load data, change a variable, filter cases and select columns. Without a pipe, this could look like

boys  <- read_sav("boys.sav")
boys2 <- transform(boys, hgt = hgt / 100)
boys3 <- filter(boys2, age > 15)
boys4 <- subset(boys3, select = c(hgt, wgt, bmi))

With the pipe:

boys <-
  read_sav("boys.sav") %>%
  transform(hgt = hgt/100) %>%
  filter(age > 15) %>%
  subset(select = c(hgt, wgt, bmi))

Benefit: a single object in memory that is easy to interpret

With pipes

Your code becomes more readable:

  • data operations are structured from left to right instead of from the inside out
  • nested function calls are avoided
  • local variables and copied objects are avoided
  • it is easy to add steps to the sequence

The boys data

boys %>% head()
##     hgt  wgt   bmi
## 1 1.880 91.6 25.91
## 2 1.806 58.0 17.78
## 3 1.855 62.7 18.22
## 4 1.785 54.1 16.97
## 5 1.725 64.0 21.50
## 6 1.781 74.5 23.48

What do pipes do:

  • f(x) becomes x %>% f()
rnorm(10) %>% mean()
## [1] 0.2707896
  • f(x, y) becomes x %>% f(y)
boys %>% cor(use = "pairwise.complete.obs")
##           hgt       wgt       bmi
## hgt 1.0000000 0.6100784 0.1758781
## wgt 0.6100784 1.0000000 0.8841304
## bmi 0.1758781 0.8841304 1.0000000
  • h(g(f(x))) becomes x %>% f %>% g %>% h
boys %>% subset(select = wgt) %>% na.omit() %>% max()
## [1] 117.4

More pipe stuff

The standard %>% pipe

The %$% pipe

The %T>% pipe
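
Since the accompanying illustrations are not reproduced here, a quick sketch of how these magrittr pipes differ, using the cats data from before:

library(magrittr)

# %>%  passes the left-hand side as the first argument of the next call
cats %>% summary()

# %$%  exposes the columns of the left-hand side by name
cats %$% cor(Bwt, Hwt)
## [1] 0.8041274

# %T>% passes the left-hand side on unchanged; useful for side effects such as plotting
cats %T>% plot() %>% summary()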

The role of . in a pipe

In a %>% b(arg1, arg2, arg3), a will become arg1. With . we can change this.

set.seed(123)
1:5 %>%
  mean() %>%
  rnorm(10)
## [1]  9.439524  9.769823 11.558708

VS

set.seed(123)
1:5 %>% 
  mean() %>%
  rnorm(n = 10, mean = .)
##  [1] 2.439524 2.769823 4.558708 3.070508 3.129288 4.715065 3.460916 1.734939
##  [9] 2.313147 2.554338

The . can be used as a placeholder in the pipe.

Placeholder example

Remember: sample() takes a random sample from a vector

sample(x = c(1, 1, 2, 3, 5, 8), size = 2)
## [1] 1 3

Sample 3 positions from the alphabet and show the position and the letter

set.seed(123)
1:26 %>%
  sample(3) %>%
  paste(., LETTERS[.])
## [1] "15 O" "19 S" "14 N"

Debugging pipelines

If you don’t know what’s going on, run each statement separately!

set.seed(123)
1:26
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26
set.seed(123)
1:26 %>% 
  sample(3)
## [1] 15 19 14
set.seed(123)
1:26 %>%
  sample(3) %>%
  paste(., LETTERS[.])
## [1] "15 O" "19 S" "14 N"

Performing a t-test in a pipe

cats %$%
  t.test(Hwt ~ Sex)
## 
##  Welch Two Sample t-test
## 
## data:  Hwt by Sex
## t = 6.5179, df = 140.61, p-value = 1.186e-09
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  1.477352 2.763753
## sample estimates:
##   mean in group Male mean in group Female 
##            11.322680             9.202128

is the same as

t.test(Hwt ~ Sex, data = cats)

Storing a t-test from a pipe

cats.test <- 
  cats %$%
  t.test(Bwt ~ Sex)

cats.test
## 
##  Welch Two Sample t-test
## 
## data:  Bwt by Sex
## t = 8.7095, df = 136.84, p-value = 8.831e-15
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  0.4177242 0.6631268
## sample estimates:
##   mean in group Male mean in group Female 
##             2.900000             2.359574

Overwriting with a pipe

boys %<>% 
  arrange(desc(bmi)) 
head(boys)
##     hgt   wgt   bmi
## 1 1.923 117.4 31.74
## 2 1.740  94.9 31.34
## 3 1.825 102.0 30.62
## 4 1.943 113.0 29.93
## 5 1.809  94.4 28.84
## 6 1.808  93.8 28.69

Visualization

Plotting

  • hist(): histogram
  • plot(): R’s plotting device
  • barplot(): bar plot function
  • boxplot(): box plot function
  • density(): function that calculates the density
  • ggplot(): ggplot’s plotting device
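
Most of these are demonstrated below; barplot() is not, so as a small example, a bar plot of the number of cats per sex could be made with:

barplot(table(cats$Sex), main = "Bar plot", xlab = "Sex", ylab = "Count")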

Why visualise?

  • We can process a lot of information quickly with our eyes
  • Plots give us information about
    • Distribution / shape
    • Irregularities
    • Assumptions
    • Intuitions
  • Summary statistics, correlations, parameters, model tests, p-values do not tell the whole story

ALWAYS plot your data!

Why visualise?

Source: Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. American Statistician. 27 (1): 17–21.

Why visualise?

What we will do

  • A few plots in base graphics in R
  • Plotting with ggplot2 graphics

Plots

Histogram

hist(boys$hgt, main = "Histogram", xlab = "Height")

Density

dens <- density(boys$hgt, na.rm = TRUE)
plot(dens, main = "Density plot", xlab = "Height", bty = "L")

Scatter plot

plot(x = boys$hgt, y = boys$wgt, main = "Scatter plot", 
     xlab = "Height", ylab = "Weight", bty = "L")

Box plot

boxplot(boys$hgt ~ boys$reg, main = "Boxplot", 
        xlab = "Region", ylab = "Height")

Box plot II

boxplot(hgt ~ reg, boys,  main = "Boxplot", xlab = "Region", 
        ylab = "Height", lwd = 2, notch = TRUE, col = rainbow(5))

A lot can be done in base R!

boys %>% md.pattern(rotate.names = TRUE) # from mice

##     age reg wgt hgt bmi hc gen phb  tv     
## 223   1   1   1   1   1  1   1   1   1    0
## 19    1   1   1   1   1  1   1   1   0    1
## 1     1   1   1   1   1  1   1   0   1    1
## 1     1   1   1   1   1  1   0   1   0    2
## 437   1   1   1   1   1  1   0   0   0    3
## 43    1   1   1   1   1  0   0   0   0    4
## 16    1   1   1   0   0  1   0   0   0    5
## 1     1   1   1   0   0  0   0   0   0    6
## 1     1   1   0   1   0  1   0   0   0    5
## 1     1   1   0   0   0  1   1   1   1    3
## 1     1   1   0   0   0  0   1   1   1    4
## 1     1   1   0   0   0  0   0   0   0    7
## 3     1   0   1   1   1  1   0   0   0    4
##       0   3   4  20  21 46 503 503 522 1622

Many R classes have a plot() method

result <- lm(age ~ wgt, data = boys)
plot(result, which = 1)

Many R classes have a plot() method

result <- lm(age ~ wgt, data = boys)
plot(result, which = 2)

Many R classes have a plot() method

result <- lm(age ~ wgt, data = boys)
plot(result, which = 5)

Neat! But what if we want more control?

ggplot2

What is ggplot2?

Layered plotting based on the book The Grammar of Graphics by Leland Wilkinson.

With ggplot2 you

  1. provide the data
  2. define how to map variables to aesthetics
  3. state which geometric object to display
  4. (optional) edit the overall theme of the plot

ggplot2 then takes care of the details

An example: scatterplot

1: Provide the data

boys %>%
  ggplot()

2: map variable to aesthetics

boys %>%
  ggplot(aes(x = age, y = bmi))

3: state which geometric object to display

boys %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point()

An example: scatterplot

Why this syntax?

Create the plot

gg <- 
  boys %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point(col = "dark green")

Add another layer (smooth fit line)

gg <- gg + 
  geom_smooth(col = "dark blue")

Give it some labels and a nice look

gg <- gg + 
  labs(x = "Age", y = "BMI", title = "BMI trend for boys") +
  theme_minimal()

Why this syntax?

plot(gg)

Why this syntax?

Aesthetics

  • x
  • y
  • size
  • colour
  • fill
  • opacity (alpha)
  • linetype

Aesthetics

gg <- 
  boys %>% 
  filter(!is.na(reg)) %>% 
  
  ggplot(aes(x      = age, 
             y      = bmi, 
             size   = hc, 
             colour = reg)) +
  
  geom_point(alpha = 0.5) +
  
  labs(title  = "BMI trend for boys",
       x      = "Age", 
       y      = "BMI", 
       size   = "Head circumference",
       colour = "Region") +
  theme_minimal()

Aesthetics

plot(gg)

Geoms

  • geom_point
  • geom_bar
  • geom_line
  • geom_smooth
  • geom_histogram
  • geom_boxplot
  • geom_density
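
The example plots for these geoms are not reproduced here; as a minimal sketch (using the bmi and hgt columns of the boys data and the packages loaded above), a histogram and a boxplot could be built like this:

boys %>%
  ggplot(aes(x = bmi)) +
  geom_histogram(bins = 30)

boys %>%
  ggplot(aes(y = hgt)) +
  geom_boxplot()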

Geoms: Bar

Geoms: Line

Geoms: Smooth

Geoms: Boxplot

Geoms: Density

Changing the Style: Themes

  • Themes determine the overall appearance of your plot
  • Standard themes, e.g., theme_minimal(), theme_classic(), theme_bw(), …
  • Extra libraries with additional themes, e.g., ggthemes
  • Customize your own theme using the options of theme()
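
For instance, the gg object built in the aesthetics example above could be restyled, either with a ready-made theme or with custom theme() options (a small sketch):

gg + theme_bw()

gg + theme(legend.position  = "bottom",
           panel.grid.minor = element_blank())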

Changing the Style: Themes

Helpful link in RStudio