Chapter 3

As people who work in “data science” note, about 90% of data science is getting data into a form that can be analyzed. In this chapter we will work with some tools that help with data wrangling.

Chapter 3 is an introduction to the use of Wickham’s dplyr which is a major part of the tidyverse. All of the things that dplyr helps us do can be carried out using base R — after all, dplyr is written in R. However, the formatting of commands when using dplyr makes the wrangling process far less painful.

Prerequisites

To recreate the examples presented in the text we need to load the package tidyverse that contains the dplyr package and the flight data from the nycflights13 package.

  ### p_load (from the pacman package) installs a package if needed, then loads it
  if (!require(pacman)) { install.packages("pacman"); library(pacman) }
#detach(package:nycflights13)
  p_load(nycflights13)
#detach(package:tidyverse)
  p_load(tidyverse)

Beware of conflicts. As noted above, the stats::filter and stats::lag functions are “replaced” by the dplyr versions. This is not a problem if we understand that the functions that are masked are no longer the default functions.
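
If we ever need one of the masked base functions, we can still call it by its full name. A small sketch:

  ### Call the masked base-R versions explicitly
  stats::filter(1:10, rep(1/3, 3))   # the moving-average filter from stats, not dplyr::filter
  stats::lag(ts(1:5), 1)             # the time-series lag from stats, not dplyr::lag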

nycflights13

We now have access to the flights data that is part of the nycflights13 package. The flights data frame contains 19 variables on 336,776 flights that departed New York City in 2013. To see what is in the flights data frame, we enter its name.

  flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Note that this is a “tibble”, which is a tweaked data frame. That is a good thing, too: had this been a plain data frame, printing it would have dumped all of the variables and observations to the screen. A safer method would have been to use base R’s head.

  head(flights)
## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

The listing of variables includes variable types — something you dealt with in CS 110. The text says that there are chr, dbl, int, and dttm. We will also see lgl, fctr, and date. In other displays (e.g., str), instead of dttm we see S3: POSIXct, which is another (more flexible) description of the same date-time format — wait for lubridate.

Since R is object oriented, it generally handles types sensibly. We can also override a variable’s storage type.

  ### Create local copy 
  dep_delay <- flights$dep_delay
  ### Determine type
  storage.mode(dep_delay)
## [1] "double"
  ### Compute summary statistics
  summary(dep_delay)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -43.00   -5.00   -2.00   12.64   11.00 1301.00    8255
  ### Convert to integer
  storage.mode(dep_delay) <- "integer"
  ### Check conversion
  storage.mode(dep_delay)
## [1] "integer"
  ### Compute summary statistics
  summary(dep_delay)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -43.00   -5.00   -2.00   12.64   11.00 1301.00    8255
  ### Remove local copy
  rm(dep_delay)

dplyr

dplyr is similar to SQL. The basic functions presented in the text will take care of many of the problems that we encounter. Used with group_by and a few “joins”, we should be set.
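
For those who know SQL, here is a rough correspondence (my mapping, not the book’s):

  ### Rough dplyr-to-SQL correspondence
  ###   select()                  ~ SELECT
  ###   filter()                  ~ WHERE
  ###   arrange()                 ~ ORDER BY
  ###   group_by() + summarize()  ~ GROUP BY with aggregate functions
  ###   inner_join(), left_join() ~ INNER JOIN, LEFT JOIN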

Filtering rows — filter()

The dplyr function filter subsets a data frame (or tibble) by selecting only those observations that satisfy a set of constraints. The arguments to filter are a data frame followed by comparison expressions (booleans).

We can subset the flights tibble by date. Pick a birthday, say my dog Oliver’s — July 5. We can look to see which flights took place on 7/5.

  filter(flights, month == 7, day == 5)
## # A tibble: 822 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     5       38           2359        39      416            350
##  2  2013     7     5      106           2245       141      336            135
##  3  2013     7     5      500            500         0      633            640
##  4  2013     7     5      531            536        -5      804            806
##  5  2013     7     5      538            540        -2      836            840
##  6  2013     7     5      538            545        -7      812            816
##  7  2013     7     5      542            545        -3      954            921
##  8  2013     7     5      553            600        -7      835            851
##  9  2013     7     5      553            600        -7      635            655
## 10  2013     7     5      554            600        -6      821            834
## # ... with 812 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Note that filter responds by returning a tibble with 822 rows and 19 columns. While this is nice, we tend to want to store a copy, so we should assign the result.

  ### Just assign the subset to jul5
  flights_jul5 <- filter(flights, month == 7, day == 5)
  dim(flights_jul5)
## [1] 822  19
  ### Assign and list the subset using surrounding parentheses
  (flights_jul5 <- filter(flights, month == 7, day == 5))
## # A tibble: 822 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     5       38           2359        39      416            350
##  2  2013     7     5      106           2245       141      336            135
##  3  2013     7     5      500            500         0      633            640
##  4  2013     7     5      531            536        -5      804            806
##  5  2013     7     5      538            540        -2      836            840
##  6  2013     7     5      538            545        -7      812            816
##  7  2013     7     5      542            545        -3      954            921
##  8  2013     7     5      553            600        -7      835            851
##  9  2013     7     5      553            600        -7      635            655
## 10  2013     7     5      554            600        -6      821            834
## # ... with 812 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
  rm(flights_jul5)

Comparisons

The text discusses comparison operators. As in CS 110, the typical R comparisons are:

  • == equal to (note the two equal signs: comparison rather than assignment)
  • < less than
  • <= less than or equal to
  • > greater than
  • >= greater than or equal to
  • != not equal to
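
One caution that applies in any language: == can surprise us on floating point values. dplyr supplies near for approximate comparison. A quick sketch:

  ### Floating point comparison can fail where the math says "equal"
  sqrt(2)^2 == 2        # FALSE, thanks to finite precision
  near(sqrt(2)^2, 2)    # TRUE, near() allows for rounding error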

Logical Operators

The logical operators that R uses are similar to those you saw in CS 110.

  • & and
  • | or
  • ! not

Combining “not”, “and”, and “or” allows us to subset using just a few statements.

The book discusses De Morgan’s laws. In the lingo of R we can express these as !(A & B) = (!A) | (!B) = !A | !B and, alternatively, !(A | B) = !A & !B.
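
Both forms select the same rows. A hedged check using the flights data (the 120-minute cutoff is arbitrary):

  ### Two equivalent ways to keep flights with no long delays (De Morgan)
  a <- filter(flights, !(arr_delay > 120 | dep_delay > 120))
  b <- filter(flights, !(arr_delay > 120), !(dep_delay > 120))
  identical(a, b)   # should be TRUE
  rm(a, b)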

Missing Values

Like it or not, observations often contain missing values. In R these are indicated by NA. NAs can be created during import or through computation. Note that NA is not the same as Inf or NaN.

  x <- NA
  x
## [1] NA
  3*x + 2
## [1] NA
  y <- NA
  x < y
## [1] NA
  x == y
## [1] NA
  1/0
## [1] Inf
  log(-1)
## Warning in log(-1): NaNs produced
## [1] NaN

Filters typically drop observations where the comparison evaluates to FALSE or NA. You can keep NAs by using is.na().

  x <- c(1, NA, 4, 3)
  y <- c("b", "a", NA, "d")
  (dat <- tibble(x,y))
## # A tibble: 4 x 2
##       x y    
##   <dbl> <chr>
## 1     1 b    
## 2    NA a    
## 3     4 <NA> 
## 4     3 d
  rm(x, y)
  filter(dat, x > 1)
## # A tibble: 2 x 2
##       x y    
##   <dbl> <chr>
## 1     4 <NA> 
## 2     3 d
  filter(dat, is.na(x) | is.na(y))
## # A tibble: 2 x 2
##       x y    
##   <dbl> <chr>
## 1    NA a    
## 2     4 <NA>
  filter(dat, !is.na(x) & !is.na(y))
## # A tibble: 2 x 2
##       x y    
##   <dbl> <chr>
## 1     1 b    
## 2     3 d

Arranging Rows — arrange()

Base R has sort and order, which can be used to rearrange the rows of a data frame or tibble. The tidyverse contains the arrange function, which makes it easier to reorder rows without scrambling the correspondence across columns. The desc() helper sorts a column in descending order.

  arrange(dat, x)
## # A tibble: 4 x 2
##       x y    
##   <dbl> <chr>
## 1     1 b    
## 2     3 d    
## 3     4 <NA> 
## 4    NA a
  arrange(dat, desc(y))
## # A tibble: 4 x 2
##       x y    
##   <dbl> <chr>
## 1     3 d    
## 2     1 b    
## 3    NA a    
## 4     4 <NA>
  rm(dat)

Note that NAs are always sorted to the end, even when using desc().

Selecting Columns — select()

The select function allows us to subset a dataset by selecting those variables in which we are interested. In larger datasets, this can help us use storage more efficiently.

Note: filter lets us reduce row dimension and select lets us reduce column dimension.

  ### Make some data
  x <- rep(1:20, each=3)
  x <- matrix(x, ncol=20)
  colnames(x) <- letters[1:20]
  colnames(x)
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t"
  x <- data.frame(x)
  
  ### Now play with subsetting of columns
  ### 
  ### We can select by naming columns
  select(x, a, d, c, g)
##   a d c g
## 1 1 4 3 7
## 2 1 4 3 7
## 3 1 4 3 7
  ### Or by listing column values
  select(x, 1, 4, 3, 7)
##   a d c g
## 1 1 4 3 7
## 2 1 4 3 7
## 3 1 4 3 7
  ### Alternative listing
  indices <- c(1, 4, 3, 7)
  select(x, indices)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(indices)` instead of `indices` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
##   a d c g
## 1 1 4 3 7
## 2 1 4 3 7
## 3 1 4 3 7
  rm(indices)

A really nice feature of R is the ability to exclude rows and columns by using negative indices. select allows us to use this feature.

  indices <- 4:17
  select(x, -indices)
##   a b c  r  s  t
## 1 1 2 3 18 19 20
## 2 1 2 3 18 19 20
## 3 1 2 3 18 19 20
  ### Or..
  select(x, -4:17)
## Warning in x:y: numerical expression has 19 elements: only the first used
##   a b c d e f g h i  j  k  l  m  n  o  p  q
## 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 3 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
  ### Oops, note that
  -4:17
##  [1] -4 -3 -2 -1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
  ### so try
  select(x, -(4:17))
##   a b c  r  s  t
## 1 1 2 3 18 19 20
## 2 1 2 3 18 19 20
## 3 1 2 3 18 19 20
  rm(indices)

There are some interesting helper functions that work with select. These functions will be useful when we have data that contain repeated measurements (e.g., X1, X2, …, X5).

  ### Rename the columns to show selection methods
  colnames(x) <- paste0(rep(c("X","Y","XX","YY"), each=5), rep(c(1:5), 4))
  colnames(x)
##  [1] "X1"  "X2"  "X3"  "X4"  "X5"  "Y1"  "Y2"  "Y3"  "Y4"  "Y5"  "XX1" "XX2"
## [13] "XX3" "XX4" "XX5" "YY1" "YY2" "YY3" "YY4" "YY5"
  ### Now select columns by pattern
  select(x, starts_with("X"))
##   X1 X2 X3 X4 X5 XX1 XX2 XX3 XX4 XX5
## 1  1  2  3  4  5  11  12  13  14  15
## 2  1  2  3  4  5  11  12  13  14  15
## 3  1  2  3  4  5  11  12  13  14  15
  select(x, starts_with("YY"))
##   YY1 YY2 YY3 YY4 YY5
## 1  16  17  18  19  20
## 2  16  17  18  19  20
## 3  16  17  18  19  20
  select(x, ends_with("1"))
##   X1 Y1 XX1 YY1
## 1  1  6  11  16
## 2  1  6  11  16
## 3  1  6  11  16
  select(x, num_range("Y", c(1,3,5)))
##   Y1 Y3 Y5
## 1  6  8 10
## 2  6  8 10
## 3  6  8 10

The use of matches will be discussed later.
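
As a quick preview, matches selects columns whose names match a regular expression:

  ### matches() selects columns via a regular expression
  select(x, matches("^X[0-9]$"))   # X1 through X5, but not XX1 through XX5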

Rename

rename is a “variant” of select that does not affect the other column/variable names. While we could rename variables as we did above, rename removes the need to keep track of column indices.

  ### Change the names of the columns ending in 5 the old school way
  cnames <- colnames(x)
  names(x)[c(5,10,15,20)] <- c("W1","W2","W3","W4")
  names(x)
##  [1] "X1"  "X2"  "X3"  "X4"  "W1"  "Y1"  "Y2"  "Y3"  "Y4"  "W2"  "XX1" "XX2"
## [13] "XX3" "XX4" "W3"  "YY1" "YY2" "YY3" "YY4" "W4"
  ### Reset the column names
  (colnames(x) <- cnames)
##  [1] "X1"  "X2"  "X3"  "X4"  "X5"  "Y1"  "Y2"  "Y3"  "Y4"  "Y5"  "XX1" "XX2"
## [13] "XX3" "XX4" "XX5" "YY1" "YY2" "YY3" "YY4" "YY5"
  ### Change the names of the columns ending in 3 the rename way
  rename(x, A1 = X3, A2 = Y3, A3 = XX3, A4 = YY3)
##   X1 X2 A1 X4 X5 Y1 Y2 A2 Y4 Y5 XX1 XX2 A3 XX4 XX5 YY1 YY2 A4 YY4 YY5
## 1  1  2  3  4  5  6  7  8  9 10  11  12 13  14  15  16  17 18  19  20
## 2  1  2  3  4  5  6  7  8  9 10  11  12 13  14  15  16  17 18  19  20
## 3  1  2  3  4  5  6  7  8  9 10  11  12 13  14  15  16  17 18  19  20
  names(x)
##  [1] "X1"  "X2"  "X3"  "X4"  "X5"  "Y1"  "Y2"  "Y3"  "Y4"  "Y5"  "XX1" "XX2"
## [13] "XX3" "XX4" "XX5" "YY1" "YY2" "YY3" "YY4" "YY5"
  rm(cnames, x)

Mutate

The mutate function allows us to create new variables. This, too, can be done using base R. However, mutate removes the need to prefix every variable with the data frame’s name, shortening the code and adding readability.

We first look at base R.

  ### Make some data
  x <- c(1, NA, 4, 3)
  y <- c("b", "a", NA, "d")
  (dat <- tibble(x,y))
## # A tibble: 4 x 2
##       x y    
##   <dbl> <chr>
## 1     1 b    
## 2    NA a    
## 3     4 <NA> 
## 4     3 d
  rm(x, y)
  ### Create several new variables from x and y
  dat$logx <- log(dat$x)
  dat$x2 <- dat$x^2
  dat$y1 <- lag(dat$y)
  dat$divx <- 47 %/% dat$x
  dat$modx <- 47 %% dat$x
  dat$xeven <- (dat$x %% 2) == 0
  dat$xcumsum <- cumsum(dat$x)
  dat
## # A tibble: 4 x 9
##       x y      logx    x2 y1     divx  modx xeven xcumsum
##   <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <lgl>   <dbl>
## 1     1 b      0        1 <NA>     47     0 FALSE       1
## 2    NA a     NA       NA b        NA    NA NA         NA
## 3     4 <NA>   1.39    16 a        11     3 TRUE       NA
## 4     3 d      1.10     9 <NA>     15     2 FALSE      NA

mutate shortens the above.

  ### Retrieve the original dat
  (dat <- select(dat, x, y))
## # A tibble: 4 x 2
##       x y    
##   <dbl> <chr>
## 1     1 b    
## 2    NA a    
## 3     4 <NA> 
## 4     3 d
  mutate(dat,
           lnx = log(x),
           x2 = x^2,
           y1 = lag(y),
           divx = 47 %/% x,
           modx = 47 %% x,
           xeven = (x %% 2) == 0,
           xcumsum = cumsum(x)
        )
## # A tibble: 4 x 9
##       x y       lnx    x2 y1     divx  modx xeven xcumsum
##   <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <lgl>   <dbl>
## 1     1 b      0        1 <NA>     47     0 FALSE       1
## 2    NA a     NA       NA b        NA    NA NA         NA
## 3     4 <NA>   1.39    16 a        11     3 TRUE       NA
## 4     3 d      1.10     9 <NA>     15     2 FALSE      NA
  rm(dat)
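
If we only want to keep the newly created variables, transmute is a variant of mutate that drops the originals. A small sketch using flights (the variable names gain and speed are mine):

  ### transmute keeps only the new variables
  transmute(flights,
            gain = dep_delay - arr_delay,      # minutes made up in the air
            speed = distance / air_time * 60   # average speed in mph
           )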

summarize

As indicated in the text, summarize is really only helpful when used with group_by — which is explained in a later section. For now, we note that summarize allows us to get summary information from a tibble or data frame.

The text says that it will cover na.rm=TRUE soon. The short description is that rm refers to remove: any observation that is NA will be removed from the computation. Note that this means the effective sample size may be reduced.

  ### Create some data
  set.seed(47)
  z <- rnorm(1000, 0, 1)
  xbar <- rep(-4:5, each=100)
  s <- rep(1:5, 200)
  y <- z * s
  x <- y + xbar
  dat <- tibble(x, y, z, xbar, s)
  ### Find the means of all three variables
  summarize(dat, xmean = mean(x, na.rm=TRUE), ymean = mean(y, na.rm=TRUE), zmean = mean(z, na.rm=TRUE))
## # A tibble: 1 x 3
##   xmean  ymean  zmean
##   <dbl>  <dbl>  <dbl>
## 1 0.558 0.0584 0.0279
  ### Find mean, std dev, and n for x then z
  summarize(dat, xmean = mean(x, na.rm=TRUE), xsd = sd(x, na.rm=TRUE), n=n())
## # A tibble: 1 x 3
##   xmean   xsd     n
##   <dbl> <dbl> <int>
## 1 0.558  4.39  1000
  summarize(dat, zmean = mean(z, na.rm=TRUE), zsd = sd(z, na.rm=TRUE), n=n())
## # A tibble: 1 x 3
##    zmean   zsd     n
##    <dbl> <dbl> <int>
## 1 0.0279  1.05  1000
  rm(z, s, y, x, xbar)

As noted above, we can also summarize by group.

  ### Group the dat data by xbar (the mean) and s (the std dev)
  dat_by_xbar_s <- group_by(dat, xbar, s)
  ### Compute grouped means for x, y, and z
  summarize(dat_by_xbar_s, xmean = mean(x, na.rm=TRUE), ymean = mean(y, na.rm=TRUE), zmean = mean(z, na.rm=TRUE))
## `summarise()` has grouped output by 'xbar'. You can override using the `.groups` argument.
## # A tibble: 50 x 5
## # Groups:   xbar [10]
##     xbar     s  xmean   ymean    zmean
##    <int> <int>  <dbl>   <dbl>    <dbl>
##  1    -4     1 -3.94   0.0627  0.0627 
##  2    -4     2 -3.83   0.166   0.0829 
##  3    -4     3 -3.38   0.621   0.207  
##  4    -4     4 -4.39  -0.388  -0.0970 
##  5    -4     5 -3.99   0.0102  0.00204
##  6    -3     1 -3.05  -0.0504 -0.0504 
##  7    -3     2 -2.69   0.315   0.157  
##  8    -3     3 -2.90   0.0985  0.0328 
##  9    -3     4 -0.540  2.46    0.615  
## 10    -3     5 -1.79   1.21    0.241  
## # ... with 40 more rows
  ### Compute grouped sds for x, y, and z
  summarize(dat_by_xbar_s, xs = sd(x, na.rm=TRUE), ys = sd(y, na.rm=TRUE), zs = sd(z, na.rm=TRUE))
## `summarise()` has grouped output by 'xbar'. You can override using the `.groups` argument.
## # A tibble: 50 x 5
## # Groups:   xbar [10]
##     xbar     s    xs    ys    zs
##    <int> <int> <dbl> <dbl> <dbl>
##  1    -4     1 1.13  1.13  1.13 
##  2    -4     2 1.62  1.62  0.809
##  3    -4     3 3.22  3.22  1.07 
##  4    -4     4 4.61  4.61  1.15 
##  5    -4     5 3.77  3.77  0.754
##  6    -3     1 0.775 0.775 0.775
##  7    -3     2 2.45  2.45  1.22 
##  8    -3     3 3.44  3.44  1.15 
##  9    -3     4 3.26  3.26  0.815
## 10    -3     5 4.46  4.46  0.892
## # ... with 40 more rows
  ### Now look at things for "fun"
  dat_by_xbar <- group_by(dat, xbar)
  summarize(dat_by_xbar, xmean = mean(x, na.rm=TRUE), ymean = mean(y, na.rm=TRUE), zmean = mean(z, na.rm=TRUE))
## # A tibble: 10 x 4
##     xbar  xmean    ymean   zmean
##  * <int>  <dbl>    <dbl>   <dbl>
##  1    -4 -3.91   0.0942   0.0515
##  2    -3 -2.19   0.806    0.199 
##  3    -2 -1.66   0.336    0.135 
##  4    -1 -1.04  -0.0416  -0.0200
##  5     0  0.222  0.222    0.0283
##  6     1  0.665 -0.335   -0.0885
##  7     2  1.91  -0.0940  -0.0335
##  8     3  3.14   0.141    0.0649
##  9     4  4.01   0.00955  0.0111
## 10     5  4.45  -0.554   -0.0685
  summarize(dat_by_xbar, xs = sd(x, na.rm=TRUE), ys = sd(y, na.rm=TRUE), zs = sd(z, na.rm=TRUE))
## # A tibble: 10 x 4
##     xbar    xs    ys    zs
##  * <int> <dbl> <dbl> <dbl>
##  1    -4  3.11  3.11 0.983
##  2    -3  3.21  3.21 0.995
##  3    -2  3.27  3.27 1.04 
##  4    -1  3.62  3.62 1.12 
##  5     0  3.68  3.68 0.982
##  6     1  3.76  3.76 1.06 
##  7     2  3.08  3.08 1.03 
##  8     3  3.87  3.87 1.13 
##  9     4  3.42  3.42 1.01 
## 10     5  3.89  3.89 1.13
  dat_by_s <- group_by(dat, s)
  summarize(dat_by_s, xmean = mean(x, na.rm=TRUE), ymean = mean(y, na.rm=TRUE), zmean = mean(z, na.rm=TRUE))
## # A tibble: 5 x 4
##       s   xmean   ymean   zmean
## * <int>   <dbl>   <dbl>   <dbl>
## 1     1  0.436  -0.0639 -0.0639
## 2     2  0.830   0.330   0.165 
## 3     3  0.555   0.0548  0.0183
## 4     4  1.02    0.522   0.131 
## 5     5 -0.0510 -0.551  -0.110
  summarize(dat_by_s, xs = sd(x, na.rm=TRUE), ys = sd(y, na.rm=TRUE), zs = sd(z, na.rm=TRUE)) 
## # A tibble: 5 x 4
##       s    xs    ys    zs
## * <int> <dbl> <dbl> <dbl>
## 1     1  2.97 0.975 0.975
## 2     2  3.75 2.19  1.09 
## 3     3  4.20 3.11  1.04 
## 4     4  5.03 4.25  1.06 
## 5     5  5.49 5.25  1.05
  summarize(dat_by_xbar_s, n=n())
## `summarise()` has grouped output by 'xbar'. You can override using the `.groups` argument.
## # A tibble: 50 x 3
## # Groups:   xbar [10]
##     xbar     s     n
##    <int> <int> <int>
##  1    -4     1    20
##  2    -4     2    20
##  3    -4     3    20
##  4    -4     4    20
##  5    -4     5    20
##  6    -3     1    20
##  7    -3     2    20
##  8    -3     3    20
##  9    -3     4    20
## 10    -3     5    20
## # ... with 40 more rows
  rm(dat, dat_by_xbar_s, dat_by_xbar, dat_by_s) 

Piping — how to combine multiple operations

Because piping is best demonstrated on a computation whose flow has some complexity, I am going to use the book’s example. So, we will look at flight delays.

Without piping, we could look at the relationship between delay and dest using the following code.

  by_dest <- group_by(flights, dest)
  delay <- summarize(by_dest, 
                      count = n(),
                      dist = mean(distance, na.rm=TRUE),
                      delay = mean(arr_delay, na.rm=TRUE)
                   )
  (delay <- filter(delay, count>20, dest != "HNL"))
## # A tibble: 96 x 4
##    dest  count  dist delay
##    <chr> <int> <dbl> <dbl>
##  1 ABQ     254 1826   4.38
##  2 ACK     265  199   4.85
##  3 ALB     439  143  14.4 
##  4 ATL   17215  757. 11.3 
##  5 AUS    2439 1514.  6.02
##  6 AVL     275  584.  8.00
##  7 BDL     443  116   7.05
##  8 BGR     375  378   8.03
##  9 BHM     297  866. 16.9 
## 10 BNA    6333  758. 11.8 
## # ... with 86 more rows

Noting that delays seem to decrease as flight distances increase, we might decide to plot the data — the graph is an easier way to look for trends than scanning the table.

  ggplot(data=delay, mapping=aes(x=dist, y=delay)) +
    geom_point(aes(size=count), alpha=1/3) +  # alpha controls transparency
    geom_smooth(se=TRUE) +
    theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Okay, we missed that the shorter flights do pretty well too.

The book notes that there are three basic steps to the above process:

  1. Group observations by destination.
  2. Compute average distance, average delay, and number of flights.
  3. Drop Honolulu (HNL), which does not have the same traffic as a “normal” airport, and infrequent destinations (20 or fewer flights).

While the approach of using intermediate tibbles/data frames is effective, the code needed to generate the final result is not as easy to read as it could be. This is where piping comes in.

Using dplyr we can pipe (%>%) intermediate results to the next operation. The three steps from above now look like:

  ### Check for the intermediate data frames from previous computations
  ls()
## [1] "by_dest" "delay"
  rm(delay, by_dest)
  ls()
## character(0)
  delay <- flights %>%       # Start with the flights data frame
    group_by(dest) %>%          # 1. Group by dest
    summarize(               # 2. Compute avg distance and delay, num flights
      count = n(),
      dist = mean(distance, na.rm=TRUE),
      delay = mean(arr_delay, na.rm=TRUE)
    ) %>%
    filter(count > 20, dest != "HNL")     # 3. Drop "HNL" and infrequent flights
  
  ls()
## [1] "delay"

Note that delay and by_dest were still sitting around from our previous computations. We remove them and then recompute using pipes, which leaves no intermediate data frame (no by_dest).

The new code has a flow to it that is easy to read. Using the text’s suggestion of replacing “%>%” with “then” makes the code read like a (run-on) sentence. “Let delay equal flights, then group by dest, then summarize…, then filter…”

Plotting could be done using ggplot2 on delay in the same way as before, or we can use the new ggvis package. Note the use of the pipe (%>%) with ggvis instead of the plus sign (+) used with ggplot2. And, since ggvis uses pipes, there are no residual tibbles sitting around.

  ### Install or load ggvis
  if (!require(ggvis)){      # Install/load ggvis
    install.packages("ggvis", dependencies=TRUE)
    require(ggvis)
  }
## Loading required package: ggvis
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'ggvis'
## Installing package into 'C:/Users/School/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
  ls()
## [1] "delay"
  rm(delay)
  ls()
## character(0)
  flights %>%       # Start with the flights data frame
    group_by(dest) %>%          # 1. Group by dest
    summarize(                  # 2. Compute avg distance and delay, num flights
      count = n(),
      dist = mean(distance, na.rm=TRUE),
      delay = mean(arr_delay, na.rm=TRUE)
    ) %>%
    filter(count > 20, dest != "HNL") %>%     # 3. Drop "HNL" and infrequent flights
    ggvis(~dist, ~delay) %>%    # Use ggvis to plot the piped data
      layer_points(size = ~count, 
                   opacity := 1/3  # Set opacity statically or use
                   #, opacity := input_slider(0, 1, value = 1/3, label = "Opacity")
                   ) %>%
      layer_smooths(se = TRUE, stroke := "skyblue") %>%
      add_legend("size", 
                 title = "Number of Arrivals", 
                 orient = "right"
                )
## Error in add_legend(., "size", title = "Number of Arrivals", orient = "right"): could not find function "add_legend"
  ls()
## character(0)

For more on ggvis see https://ggvis.rstudio.com/, http://jimhester.github.io/ggplot2ToGgvis/ and https://rpubs.com/jacobdickey_2016/223060. Remember that ggvis is currently in beta. At this point you are probably better off using ggplot2.

Missing Values

When we gather data from various sources and combine these data, we often generate a lot of missing values. While it is possible to impute these values, we usually ignore them — by tossing the entire observation.

We now take a look at the na.rm option. Reusing the data from our earlier discussion of NA we can look at how aggregation functions treat missing values.

  ### Build a test tibble
  x <- c(1, NA, 4, 3)
  y <- c("b", "a", NA, "d")
  (dat <- tibble(x,y))
## # A tibble: 4 x 2
##       x y    
##   <dbl> <chr>
## 1     1 b    
## 2    NA a    
## 3     4 <NA> 
## 4     3 d
  rm(x, y)
  
  ### Compute the mean of the numeric variable, x
  mean(dat$x, na.rm=FALSE)
## [1] NA
  mean(dat$x, na.rm=TRUE)
## [1] 2.666667
  rm(dat)

Note that computing with NA’s does not generate an error. Having seen that arithmetic on NA’s generates NA’s, we should not be surprised that mean returns NA when the variable contains NA’s. It is nice that na.rm=TRUE makes it possible for mean to return a value. However, as the book points out, we need to be careful about the number of observations that remain.
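
A quick sketch of checking the effective sample size before trusting a na.rm=TRUE result:

  ### Count the values that actually enter the computation
  x <- c(1, NA, 4, 3)
  sum(!is.na(x))         # effective sample size: 3 of 4
  mean(x, na.rm=TRUE)    # the mean of the 3 non-missing values
  rm(x)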

Now that we understand how na.rm affects aggregation functions, we can move on to its use in summarize.

  flights %>%
    group_by(year, month, day) %>%
    summarize(mean = mean(dep_delay))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1    NA
##  2  2013     1     2    NA
##  3  2013     1     3    NA
##  4  2013     1     4    NA
##  5  2013     1     5    NA
##  6  2013     1     6    NA
##  7  2013     1     7    NA
##  8  2013     1     8    NA
##  9  2013     1     9    NA
## 10  2013     1    10    NA
## # ... with 355 more rows

Since the default is na.rm=FALSE, we end up with a lot of NA’s. Use of na.rm=TRUE gives what appear to be more meaningful results.

  flights %>%
    group_by(year, month, day) %>%
    summarize(mean = mean(dep_delay, na.rm=TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ... with 355 more rows

So, which observations were used in the computation of the grouped means? Use of the is.na function can help us figure this out.

  ### Subset the flights to those that were not cancelled and save for later
  not_cancelled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))

  ### Note that the means suggest that these are the same flights as were used above
  not_cancelled %>%
    group_by(year, month, day) %>%
    summarize(mean = mean(dep_delay))  # Note that the variable "mean" is being created
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.4 
##  2  2013     1     2 13.7 
##  3  2013     1     3 10.9 
##  4  2013     1     4  8.97
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.56
##  9  2013     1     9  2.30
## 10  2013     1    10  2.84
## # ... with 355 more rows
  ### Figure out how many observations there are in each data frame
  dim(flights)
## [1] 336776     19
  dim(not_cancelled)
## [1] 327346     19

Having removed all of the NA’s, the default na.rm=FALSE is no longer a problem. Note that we have dropped 9430 observations because of missing values. We might wonder how these missing values are distributed.
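
One quick way to see where the missing values fall is to count the dropped (presumably cancelled) flights by day:

  ### Count the dropped (presumably cancelled) flights by day
  flights %>%
    filter(is.na(dep_delay) | is.na(arr_delay)) %>%
    group_by(year, month, day) %>%
    summarize(n_dropped = n())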

Counts

The use of na.rm=TRUE or filtering using !is.na removes all of the observations with NA’s. If there is something systematic (non-random) in the way the NA’s were generated, then removal can greatly change our interpretation of the data. A first step in figuring out what is going on is looking at the number of missing and nonmissing values in the data set.

We continue looking at the example of not cancelled flights that we started above.

  ### Subset the flights to those that were not cancelled and save for later
  not_cancelled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))

  delays <- not_cancelled %>%
    group_by(tailnum) %>%
    summarize(delay = mean(dep_delay)) # Note that the variable "delay" is being created
  
  ggplot(data = delays, mapping = aes(x = delay)) +
    geom_freqpoly(binwidth = 10) + 
    theme_bw()

While there are not many, there are a few planes (tail numbers) that have an average delay of over two hours. The reason for this is more logical than we might originally think. How would a plane with a single flight that is delayed for three hours show up on the plot? Single-flight failures can screw up our interpretation. As noted in the text, a plot of mean delay against the number of flights is more informative.

  delays <- not_cancelled %>%
    group_by(tailnum) %>%
    summarize(
      delay = mean(dep_delay), # The variable "delay" is created
      n = n()                  # n is the number of flights
    )
  
  ggplot(data = delays, mapping = aes(x = n, y = delay)) +
    geom_point(alpha = 1/10) +
    theme_bw()

We note the clear heteroscedasticity (lack of constant variability) across n. In fact, as n increases, variability decreases. This is something that is covered in most introductory statistics courses, where we learn that \(\widehat{\sigma_{\bar{X}}} = s / \sqrt{n}\). So, if we resample larger samples, our sample means will differ from each other less than if we resample smaller samples.

The text suggests that we toss out planes that do not fly on a regular basis — say, fewer than 25 flights in the year. If we replot the above graph we get the following.

  delays %>%
    filter(n > 25) %>%
    ggplot(mapping = aes(x = n, y = delay)) +
      geom_point(alpha = 1/10) +
      theme_bw()

The problem of decreasing variability is still apparent, but less drastic than before. What do the negative mean delays suggest?

Lahman data

The text’s next example is an excuse to use the Lahman baseball data. The Lahman package gives access to seasonal statistics for MLB players and teams. For more information on analyzing MLB data see Jim Albert’s Teaching Statistics Using Baseball or Marchi and Albert’s Analyzing Baseball Data with R.

We start by looking at the Batting data frame and creating its tibble equivalent.

  p_load(Lahman)
  
  is.data.frame(Lahman::Batting)
## [1] TRUE
  head(Lahman::Batting)
##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB SO
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4  0
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2  5
## 4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0  2
## 5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2  1
## 6 armstbo01   1871     1    FW1   NA 12  49  9 11   2   1  0   5  0  1  0  1
##   IBB HBP SH SF GIDP
## 1  NA  NA NA NA    0
## 2  NA  NA NA NA    0
## 3  NA  NA NA NA    1
## 4  NA  NA NA NA    0
## 5  NA  NA NA NA    0
## 6  NA  NA NA NA    0
  (batting <- as_tibble(Lahman::Batting))
## # A tibble: 107,429 x 22
##    playerID yearID stint teamID lgID      G    AB     R     H   X2B   X3B    HR
##    <chr>     <int> <int> <fct>  <fct> <int> <int> <int> <int> <int> <int> <int>
##  1 abercda~   1871     1 TRO    NA        1     4     0     0     0     0     0
##  2 addybo01   1871     1 RC1    NA       25   118    30    32     6     0     0
##  3 allisar~   1871     1 CL1    NA       29   137    28    40     4     5     0
##  4 allisdo~   1871     1 WS3    NA       27   133    28    44    10     2     2
##  5 ansonca~   1871     1 RC1    NA       25   120    29    39    11     3     0
##  6 armstbo~   1871     1 FW1    NA       12    49     9    11     2     1     0
##  7 barkeal~   1871     1 RC1    NA        1     4     0     1     0     0     0
##  8 barnero~   1871     1 BS1    NA       31   157    66    63    10     9     0
##  9 barrebi~   1871     1 FW1    NA        1     5     1     1     1     0     0
## 10 barrofr~   1871     1 BS1    NA       18    86    13    13     2     1     0
## # ... with 107,419 more rows, and 10 more variables: RBI <int>, SB <int>,
## #   CS <int>, BB <int>, SO <int>, IBB <int>, HBP <int>, SH <int>, SF <int>,
## #   GIDP <int>

Since we are interested in the batters themselves, we compute individual career batting averages. Note that playerID is a unique identifier for every person who has played in MLB.

  batters <- batting %>%
    group_by(playerID) %>%
    summarize(
      ba = sum(H, na.rm=TRUE)/sum(AB, na.rm=TRUE), # H is the number of hits
      ab = sum(AB, na.rm=TRUE)                     # AB is the number of at bats
    )

  ### Plot those batters with more than 100 career at bats
  batters %>%
    filter (ab > 100) %>%
    ggplot(mapping = aes(x = ab, y = ba)) + 
      geom_point(alpha = 1/10) + 
      geom_smooth(se = TRUE) + # Look at the reliability of the loess smooth
      theme_bw()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

  rm(batters, batting)

The text points out the “obvious” heteroscedasticity and the positive association between ba and ab. By including the standard error of the smooth, we see that the reliability of the smooth decreases where observations are scarce — that is, among players with the highest ab values.

Useful Summary Functions

This section lists a number of summary functions that the authors deem “useful.” Included are measures of location, spread, rank, position, count, and counts and proportions of logical values. The latter is important enough that we review the book’s example below.
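
A quick sketch of a few of these functions, using the not_cancelled flights from above (the summary names are mine):

  ### A few of the "useful" summary functions in action
  not_cancelled %>%
    group_by(dest) %>%
    summarize(
      first_dep = min(dep_time),        # position
      last_dep  = max(dep_time),
      med_delay = median(arr_delay),    # a robust measure of location
      carriers  = n_distinct(carrier),  # count of distinct values
      prop_late = mean(arr_delay > 0)   # proportion of a logical
    )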

Note that the mad function discussed under spread computes the (scaled) median absolute deviation, \(\mbox{mad} = c \cdot \mbox{median}\left(\left|x_i - \tilde{x}\right|\right)\), where \(\tilde{x}\) is the sample median and the default scale factor \(c \approx 1.4826\) is chosen so that mad estimates \(\sigma\) for normal data. It is not the mean absolute deviation \(\sum_{i=1}^n \left|x_i - \bar{x}\right| / n\). The functions in this section are actually useful and you should read the subsections carefully.
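
A quick demonstration of why mad is considered a robust measure of spread (the data are made up):

  ### mad resists outliers where sd does not
  set.seed(47)
  w <- c(rnorm(100), 50)   # 100 standard normals plus one wild value
  sd(w)                    # inflated by the outlier
  mad(w)                   # stays near 1, the std dev of the bulk of the data
  rm(w)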

Grouping by Multiple Variables

In working with the flight data we have grouped by dest, by tailnum, and by year, month, and day. The first two of these groupings are univariate. The third groups by three variables. This section is devoted to the latter, multivariate grouping method.

The book’s example of daily flights is interesting in that it allows “roll ups”. A closer look at what is going on is warranted.

  daily <- not_cancelled %>%
    group_by(year, month, day)
  dim(daily)
## [1] 327346     19
  ### Look at the number of flights per day
  (per_day <- summarize(daily, nflights = n()))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day nflights
##    <int> <int> <int>    <int>
##  1  2013     1     1      831
##  2  2013     1     2      928
##  3  2013     1     3      900
##  4  2013     1     4      908
##  5  2013     1     5      717
##  6  2013     1     6      829
##  7  2013     1     7      930
##  8  2013     1     8      892
##  9  2013     1     9      893
## 10  2013     1    10      929
## # ... with 355 more rows
  sum(per_day$nflights)
## [1] 327346
  ### Look at the number of flights per month --- a roll up of per day by month
  (per_month <- summarize(per_day, nflights = sum(nflights)))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
## # A tibble: 12 x 3
## # Groups:   year [1]
##     year month nflights
##    <int> <int>    <int>
##  1  2013     1    26398
##  2  2013     2    23611
##  3  2013     3    27902
##  4  2013     4    27564
##  5  2013     5    28128
##  6  2013     6    27075
##  7  2013     7    28293
##  8  2013     8    28756
##  9  2013     9    27010
## 10  2013    10    28618
## 11  2013    11    26971
## 12  2013    12    27020
  sum(per_month$nflights)
## [1] 327346
  ### Look at the number of flights per year --- a roll up of per month by year
  (per_year <- summarize(per_month, nflights = sum(nflights)))
## # A tibble: 1 x 2
##    year nflights
## * <int>    <int>
## 1  2013   327346
  sum(per_year$nflights)
## [1] 327346

Note that a roll up can only occur in the reverse order of the group_by. That is, since we grouped by year, month, day — or day within month within year — we can get counts by day, then month, then year. The order of nesting is thus very important to our ability to roll up.

The book makes reference to a problem in rolling up summaries based on other functions. For example, because the denominator used in finding a mean is the group size, a roll up of unbalanced data (unequal group sizes) can provide surprising results.

  daily <- not_cancelled %>%
    group_by(year, month, day)

  ### Look at the mean flight delay by day
  (per_day <- summarize(daily, xbar = mean(arr_delay)))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day   xbar
##    <int> <int> <int>  <dbl>
##  1  2013     1     1 12.7  
##  2  2013     1     2 12.7  
##  3  2013     1     3  5.73 
##  4  2013     1     4 -1.93 
##  5  2013     1     5 -1.53 
##  6  2013     1     6  4.24 
##  7  2013     1     7 -4.95 
##  8  2013     1     8 -3.23 
##  9  2013     1     9 -0.264
## 10  2013     1    10 -5.90 
## # ... with 355 more rows
  mean(per_day$xbar)
## [1] 7.005464
  dim(per_day)
## [1] 365   4
  ### A naive attempt at monthly means: sum the day averages and divide by 365
  (per_month <- summarize(per_day, xbar = sum(xbar)/365))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
## # A tibble: 12 x 3
## # Groups:   year [1]
##     year month    xbar
##    <int> <int>   <dbl>
##  1  2013     1  0.508 
##  2  2013     2  0.457 
##  3  2013     3  0.524 
##  4  2013     4  0.910 
##  5  2013     5  0.328 
##  6  2013     6  1.35  
##  7  2013     7  1.43  
##  8  2013     8  0.500 
##  9  2013     9 -0.315 
## 10  2013    10 -0.0268
## 11  2013    11  0.0133
## 12  2013    12  1.32
  mean(per_month$xbar)
## [1] 0.5837887
  ### Look at the mean flight delay by month --- a roll up of per day by month
  (per_month <- summarize(per_day, xbar = mean(xbar)))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
## # A tibble: 12 x 3
## # Groups:   year [1]
##     year month   xbar
##    <int> <int>  <dbl>
##  1  2013     1  5.98 
##  2  2013     2  5.96 
##  3  2013     3  6.17 
##  4  2013     4 11.1  
##  5  2013     5  3.86 
##  6  2013     6 16.5  
##  7  2013     7 16.8  
##  8  2013     8  5.89 
##  9  2013     9 -3.83 
## 10  2013    10 -0.316
## 11  2013    11  0.162
## 12  2013    12 15.6
  mean(per_month$xbar)
## [1] 6.985884
  ### Look at the number of flights per year --- a roll up of per month by year
  (per_year <- summarize(per_month, xbar = mean(xbar)))
## # A tibble: 1 x 2
##    year  xbar
## * <int> <dbl>
## 1  2013  6.99
  sum(per_year$xbar)
## [1] 6.985884
  ### Overall mean
  mean(daily$arr_delay)
## [1] 6.895377

While 6.9858845 is close to 6.8953768, they are not the same. Nor did the two methods of computing the monthly means agree with each other. The unweighted mean of daily means ignores the fact that different days have different numbers of flights. This is just a reminder to be careful about the use of roll ups.
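
If we want the roll up to recover the overall mean exactly, the trick is to carry sums and counts up the hierarchy instead of means, dividing only at the end. A sketch (the object names are mine):

  ### Roll up sums and counts, not means, to recover the overall mean
  per_day2   <- summarize(daily, total = sum(arr_delay), n = n())
  per_month2 <- summarize(per_day2, total = sum(total), n = sum(n))
  per_year2  <- summarize(per_month2, xbar = sum(total)/sum(n))
  per_year2$xbar   # should match mean(daily$arr_delay)
  rm(per_day2, per_month2, per_year2)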

Ungroup

ungroup is helpful when you screw up and lose your original, ungrouped tibble. If you are careful, you should not need to ungroup.

  dim(not_cancelled)
## [1] 327346     19
  dim(daily %>% ungroup())
## [1] 327346     19
  rm(daily, not_cancelled, delays, per_day, per_month, per_year)

Grouped Mutates (and Filters)

The text discusses the use of grouped mutates and filters to describe subsets of the dataset. I have seen SQL programmers write code to create tables of means of subgroups within a large sample. They then use these tables to “describe” patterns in the sample — via listing means that differ from the others. This method is not efficient, and the modeling tools that we will discuss do a far better job of describing datasets.

Remember, group_by, mutate, and filter are intended to be used with piping. The idea is to make reading code more intuitive. The use of mutates within subsets of data created using grouping and filtering can become difficult to follow/interpret. Sometimes it is better to use old school programming, or to break up the piping, so as to make code more readable.
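
A small sketch in the book’s spirit: a grouped filter that finds the nine worst arrival delays within each day.

  ### Grouped filter: the nine worst arrival delays on each day
  flights %>%
    group_by(year, month, day) %>%
    filter(rank(desc(arr_delay)) < 10) %>%
    select(year, month, day, carrier, flight, arr_delay)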

The “by hand” comparison of descriptives of subgroups is tedious and generally ineffective; models describe general trends in data far better. However, descriptives of subgroups can serve as useful bases for graphical displays and testing. We will discuss post hoc techniques that effectively supplement modeling.