Chapter 10

In this chapter the authors give us a glimpse into relational databases. As we have discussed in class, “big” data requires careful use of storage. While storage may be cheap, it is not infinite. The use of related tables that are associated through key values is a common approach to storing data efficiently.

Other applications require obtaining data from multiple sources. When we are lucky, the data from these sources can be combined by matching them on the values of a selection of variables. Essentially, we are using key values to build a unified database.

The test follows a SQL approach to table management. The authors suggest that we become comfortable with “three families of verbs designed to work with relational data:”

Prerequisites

As in earlier chapters, the authors use the nycflights package to demonstrate concepts. Since Wickham wrote dplyr, it is not a surprise that the authors choose to use it here.

  ### Use pacman to install/load nycflights13 and the tidyverse (dplyr)
  p_load(nycflights13)
  p_load(tidyverse)

nycflights13 package

The authors’ choice to use the nycflights package throughout their text may be base, in large part, upon the fact that it is a five tibble realtional database. Along with the flights tibble we have already played with, there are:

  • airlines which contains a listing of full airline names and their abreviated codes.

  • airports which provides information about each FAA coded airport.

  • planes which contains information about each aircraft as identified by its tailnum.

  • weather which contains hourly information for each of the three NYC airports.

The text has a diagram on page 174 that shows the relationships betweeen the various tibbles. Note the variables that share identifying information. The following tables provide information that is equivalent to that presented in the figure.

flights airports
origin faa
dest faa
flights planes
tailnum tailnum
flights airlines
carrier carrier
flights weather
year year
month month
day day
hour hour
origin origin

Keys

Key variables are variables whose values uniquely identify observations. The operative word here is “uniquely”. There are two types of key variables:

  • primary – uniquely identifies an observation in its own table. The book notes that planes$tailnum is a primary key for the unique identification of a plane within the planes table.

  • foreign – uniquely identifies an observation in another table. Again, the book notes that flights$tailnum is a foreign key within the flights table for the identification of planes within the planes table.

Some keys act as both primary and foreign keys. In many cases, the “key” is actually a set of multiple variables/keys.

Again, the idea behind keys is that they uniquely identify observations. In many cases, we are matching a single observation within a table (e.g. weather) with multiple observations in another table (e.g. flights). As the book points out, it is a good idea to check the uniqueness of primary keys so that this one-to-many matching can take place.

As an example, we can look at the uniqueness of the airlines and weather tables’ keys.

  airlines %>%
    count(carrier) %>%
    filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: carrier <chr>, n <int>
  weather %>%
    count(year, month, day, hour, origin) %>%
    filter(n > 1)
## # A tibble: 3 x 6
##    year month   day  hour origin     n
##   <int> <int> <int> <int> <chr>  <int>
## 1  2013    11     3     1 EWR        2
## 2  2013    11     3     1 JFK        2
## 3  2013    11     3     1 LGA        2
  weather %>%
    filter(year==2013, month==11, day==3, hour==1)
## # A tibble: 6 x 15
##   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
##   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
## 1 EWR     2013    11     3     1  52.0  39.0  61.2      310       6.90        NA
## 2 EWR     2013    11     3     1  50    39.0  65.8      290       5.75        NA
## 3 JFK     2013    11     3     1  54.0  37.9  54.5      320       9.21        NA
## 4 JFK     2013    11     3     1  52.0  37.9  58.6      310       6.90        NA
## 5 LGA     2013    11     3     1  55.0  39.0  54.7      330       9.21        NA
## 6 LGA     2013    11     3     1  54.0  39.9  58.9      310       8.06        NA
## # ... with 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>

The 0 nonunique rows returned by airline$carrier is good news. However, the hourly weather data shows that, unlike in the text, the keys for 11/3/2013 at 01:00 are not unique for all three origins. This is problematic.

The book suggests creating surrogate keys when multiple keys are used to identify observations. While this may be convenient, it buries the process that is used to identify the observations. If you are the only programmer, this is note a problem. However, if others need to read your code, be very careful to make sure that the derivation of the surrogate is very, very well documented.

Mutating Joins

The mutate function binds new variables to the right (higher indexed) side of the resultant matrix/database/table. For databases/tables, this makes it difficult to see the new variables. The authors get around this problem by selecting a few variables from the flights database.

  ### Select a few variables that are useful for joining
  flights2 <- flights %>%
    select(year, month, day, hour, origin, dest, tailnum, carrier)

  flights2
## # A tibble: 336,776 x 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1     5 EWR    IAH   N14228  UA     
##  2  2013     1     1     5 LGA    IAH   N24211  UA     
##  3  2013     1     1     5 JFK    MIA   N619AA  AA     
##  4  2013     1     1     5 JFK    BQN   N804JB  B6     
##  5  2013     1     1     6 LGA    ATL   N668DN  DL     
##  6  2013     1     1     5 EWR    ORD   N39463  UA     
##  7  2013     1     1     6 EWR    FLL   N516JB  B6     
##  8  2013     1     1     6 LGA    IAD   N829AS  EV     
##  9  2013     1     1     6 JFK    MCO   N593JB  B6     
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
## # ... with 336,766 more rows

To add the destination airport names to the table, we join the flights2 and airports tables by equating dest and faa.

  flights2 %>%
    select(-tailnum, -carrier) %>%
    left_join(airports, by = c("dest" = "faa"))
## # A tibble: 336,776 x 13
##     year month   day  hour origin dest  name    lat   lon   alt    tz dst  
##    <int> <int> <int> <dbl> <chr>  <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
##  1  2013     1     1     5 EWR    IAH   Geor~  30.0 -95.3    97    -6 A    
##  2  2013     1     1     5 LGA    IAH   Geor~  30.0 -95.3    97    -6 A    
##  3  2013     1     1     5 JFK    MIA   Miam~  25.8 -80.3     8    -5 A    
##  4  2013     1     1     5 JFK    BQN   <NA>   NA    NA      NA    NA <NA> 
##  5  2013     1     1     6 LGA    ATL   Hart~  33.6 -84.4  1026    -5 A    
##  6  2013     1     1     5 EWR    ORD   Chic~  42.0 -87.9   668    -6 A    
##  7  2013     1     1     6 EWR    FLL   Fort~  26.1 -80.2     9    -5 A    
##  8  2013     1     1     6 LGA    IAD   Wash~  38.9 -77.5   313    -5 A    
##  9  2013     1     1     6 JFK    MCO   Orla~  28.4 -81.3    96    -5 A    
## 10  2013     1     1     6 LGA    ORD   Chic~  42.0 -87.9   668    -6 A    
## # ... with 336,766 more rows, and 1 more variable: tzone <chr>

We" note that “BQN” is an unknown airport. This may be because it is Rafael Hernandez International Airport in Aguadilla, Puerto Rico which is not in the contiguous US. The syntax for “joins” will be discussed more fully below.

The following sections discuss the differences between a number of join types.

Understanding Joins

The authors use diagrams to show the results of different joins. They also present code to show the results. We can extend the examples to look at the effects of various key values.

  ### Declare x
  x <- tribble(
         ~key, ~val_x, ~val_xx,
            1, "x1",   "xx1",
            2, "x2",   "xx2",
            3, "x3",   "xx3",
           NA, "xNA",  "xxNA",
            4,   NA,   "xx4",
            5, "x5",   NA
  )

  ### Declare y
  y <- tribble(
         ~key, ~val_y, ~val_yy,
            1, "y1",   "yy1",
            2, "y2",   "yy2",
            3, "y3",   "yy3",
            4, "y4",   NA,
           NA, "yNA",  "yyNA",
            5, NA,     "yy5",
            6, "y6",   "y6"
  )
  
  ### Ask knitr to show x and y side-by-side.  There are prettier ways.
  knitr::kable(list(x, y))
key val_x val_xx
1 x1 xx1
2 x2 xx2
3 x3 xx3
NA xNA xxNA
4 NA xx4
5 x5 NA
key val_y val_yy
1 y1 yy1
2 y2 yy2
3 y3 yy3
4 y4 NA
NA yNA yyNA
5 NA yy5
6 y6 y6

Inner Joins

Inner joins match pairs of observations whenever their keys are equal. The result is a new table/data frame containing the key and all x and y values. dplyr uses by to declare the key variable.

  ### y inner_joined to x
  x %>% 
    inner_join(y, by = "key")
## # A tibble: 6 x 5
##     key val_x val_xx val_y val_yy
##   <dbl> <chr> <chr>  <chr> <chr> 
## 1     1 x1    xx1    y1    yy1   
## 2     2 x2    xx2    y2    yy2   
## 3     3 x3    xx3    y3    yy3   
## 4    NA xNA   xxNA   yNA   yyNA  
## 5     4 <NA>  xx4    y4    <NA>  
## 6     5 x5    <NA>   <NA>  yy5
  ### x inner_joined to y
  y %>%
    inner_join(x, by = "key")
## # A tibble: 6 x 5
##     key val_y val_yy val_x val_xx
##   <dbl> <chr> <chr>  <chr> <chr> 
## 1     1 y1    yy1    x1    xx1   
## 2     2 y2    yy2    x2    xx2   
## 3     3 y3    yy3    x3    xx3   
## 4     4 y4    <NA>   <NA>  xx4   
## 5    NA yNA   yyNA   xNA   xxNA  
## 6     5 <NA>  yy5    x5    <NA>

Note that the NA values in the key were matched. Also, the tables are equivalent except for the order of variables and the order of the rows. The location of the NA row is determined by the location within the “left” table.

Most importantly, the key value of 6 found in y is not matched within x and the observation is dropped.

Outer Joins

Whereas inner joins keep observations that are common to both tables, outer joins keep observations that appear in at least one table. The three types of outer joins are:

  • left joins which keep all observations appearing in the left table

  • right joins which keep all observations in the right table

  • full joins which keep all observations appearing in either of the left or right tables

Observations which are not matched have the values for the variables from the missing table filled with NA.

Left Joins
  ### left join of left table x to right table y 
  x %>% 
    left_join(y, by = "key")
## # A tibble: 6 x 5
##     key val_x val_xx val_y val_yy
##   <dbl> <chr> <chr>  <chr> <chr> 
## 1     1 x1    xx1    y1    yy1   
## 2     2 x2    xx2    y2    yy2   
## 3     3 x3    xx3    y3    yy3   
## 4    NA xNA   xxNA   yNA   yyNA  
## 5     4 <NA>  xx4    y4    <NA>  
## 6     5 x5    <NA>   <NA>  yy5
  ### left join of left table y to right table x
  y %>%
    left_join(x, by = "key")
## # A tibble: 7 x 5
##     key val_y val_yy val_x val_xx
##   <dbl> <chr> <chr>  <chr> <chr> 
## 1     1 y1    yy1    x1    xx1   
## 2     2 y2    yy2    x2    xx2   
## 3     3 y3    yy3    x3    xx3   
## 4     4 y4    <NA>   <NA>  xx4   
## 5    NA yNA   yyNA   xNA   xxNA  
## 6     5 <NA>  yy5    x5    <NA>  
## 7     6 y6    y6     <NA>  <NA>

Note that the second left join contains an observation for key = 6. This is because the observation exists in table y which is the primary, left table.

Right Joins
  ### right join of left table x to right table y 
  x %>% 
    right_join(y, by = "key")
## # A tibble: 7 x 5
##     key val_x val_xx val_y val_yy
##   <dbl> <chr> <chr>  <chr> <chr> 
## 1     1 x1    xx1    y1    yy1   
## 2     2 x2    xx2    y2    yy2   
## 3     3 x3    xx3    y3    yy3   
## 4    NA xNA   xxNA   yNA   yyNA  
## 5     4 <NA>  xx4    y4    <NA>  
## 6     5 x5    <NA>   <NA>  yy5   
## 7     6 <NA>  <NA>   y6    y6
  ### left join of left table y to right table x
  y %>%
    right_join(x, by = "key")
## # A tibble: 6 x 5
##     key val_y val_yy val_x val_xx
##   <dbl> <chr> <chr>  <chr> <chr> 
## 1     1 y1    yy1    x1    xx1   
## 2     2 y2    yy2    x2    xx2   
## 3     3 y3    yy3    x3    xx3   
## 4     4 y4    <NA>   <NA>  xx4   
## 5    NA yNA   yyNA   xNA   xxNA  
## 6     5 <NA>  yy5    x5    <NA>

In this case the first right join contains an observation for key = 6. This is because the observation exists in table y which is the primary, right table.

Full Joins
  ### full join of left table x to right table y 
  x %>% 
    full_join(y, by = "key")
## # A tibble: 7 x 5
##     key val_x val_xx val_y val_yy
##   <dbl> <chr> <chr>  <chr> <chr> 
## 1     1 x1    xx1    y1    yy1   
## 2     2 x2    xx2    y2    yy2   
## 3     3 x3    xx3    y3    yy3   
## 4    NA xNA   xxNA   yNA   yyNA  
## 5     4 <NA>  xx4    y4    <NA>  
## 6     5 x5    <NA>   <NA>  yy5   
## 7     6 <NA>  <NA>   y6    y6
  ### full join of left table y to right table x
  y %>%
    full_join(x, by = "key")
## # A tibble: 7 x 5
##     key val_y val_yy val_x val_xx
##   <dbl> <chr> <chr>  <chr> <chr> 
## 1     1 y1    yy1    x1    xx1   
## 2     2 y2    yy2    x2    xx2   
## 3     3 y3    yy3    x3    xx3   
## 4     4 y4    <NA>   <NA>  xx4   
## 5    NA yNA   yyNA   xNA   xxNA  
## 6     5 <NA>  yy5    x5    <NA>  
## 7     6 y6    y6     <NA>  <NA>

As with the inner join, both joins produce the same table. Because the left/right ordering of the tables is not important, both tables include an observation for key = 6.

Duplicat Keys

To this point we have hoped/assumed that key values are unique within tables. As the uniqueness check of the weather table shows, this may not always be the case.

When one table has unique observations that are useful for updating many observations in a second table, we may perform a one-to-many join.

  ### Declare x
  x <- tribble(
         ~key, ~val_x, ~val_xx,
            1, "x1a",   "xx1a",
            2, "x2a",   "xx2a",
            1, "x1b",   "xx1b",
           NA, "xNA",  "xxNA",
            2,   NA,   "xx2b"
  )

  ### Declare y
  y <- tribble(
         ~key, ~val_y, ~val_yy,
            1, "y1",   "yy1",
            2, "y2",   "yy2",
           NA, "yNA",  "yyNA",
            3, "y3",   "yy3"
  )
  
  ### left join of left table x to right table y 
  x %>% 
    left_join(y, by = "key")
## # A tibble: 5 x 5
##     key val_x val_xx val_y val_yy
##   <dbl> <chr> <chr>  <chr> <chr> 
## 1     1 x1a   xx1a   y1    yy1   
## 2     2 x2a   xx2a   y2    yy2   
## 3     1 x1b   xx1b   y1    yy1   
## 4    NA xNA   xxNA   yNA   yyNA  
## 5     2 <NA>  xx2b   y2    yy2
  ### left join of left table y to right table x
  y %>%
    left_join(x, by = "key")
## # A tibble: 6 x 5
##     key val_y val_yy val_x val_xx
##   <dbl> <chr> <chr>  <chr> <chr> 
## 1     1 y1    yy1    x1a   xx1a  
## 2     1 y1    yy1    x1b   xx1b  
## 3     2 y2    yy2    x2a   xx2a  
## 4     2 y2    yy2    <NA>  xx2b  
## 5    NA yNA   yyNA   xNA   xxNA  
## 6     3 y3    yy3    <NA>  <NA>

As in the left and right joins with unique values, we get “joined” observations for all observations in the foreign table. Where a matching key is not found in the primary table, NA’s are inserted for all primary variables.

When both tables contain duplicate keys, joins return the Cartesian product — all possible combinations.

  ### Declare x
  x <- tribble(
         ~key, ~val_x, ~val_xx,
            1, "x1a",   "xx1a",
            2, "x2a",   "xx2a",
            1, "x1b",   "xx1b",
           NA, "xNA",  "xxNA",
            2,   NA,   "xx2b"
  )

  ### Declare y
  y <- tribble(
         ~key, ~val_y, ~val_yy,
            1, "y1a",   "yy1a",
            1, "y1b",   "yy1b",
            2, "y2a",   "yy2a",
            2, "y2b",   "yy2b", 
            NA, "yNA",  "yyNA",
            3, "y3",   "yy3"
  )
  
  ### left join of left table x to right table y 
  x %>% 
    left_join(y, by = "key")
## # A tibble: 9 x 5
##     key val_x val_xx val_y val_yy
##   <dbl> <chr> <chr>  <chr> <chr> 
## 1     1 x1a   xx1a   y1a   yy1a  
## 2     1 x1a   xx1a   y1b   yy1b  
## 3     2 x2a   xx2a   y2a   yy2a  
## 4     2 x2a   xx2a   y2b   yy2b  
## 5     1 x1b   xx1b   y1a   yy1a  
## 6     1 x1b   xx1b   y1b   yy1b  
## 7    NA xNA   xxNA   yNA   yyNA  
## 8     2 <NA>  xx2b   y2a   yy2a  
## 9     2 <NA>  xx2b   y2b   yy2b
  ### left join of left table y to right table x
  y %>%
    left_join(x, by = "key")
## # A tibble: 10 x 5
##      key val_y val_yy val_x val_xx
##    <dbl> <chr> <chr>  <chr> <chr> 
##  1     1 y1a   yy1a   x1a   xx1a  
##  2     1 y1a   yy1a   x1b   xx1b  
##  3     1 y1b   yy1b   x1a   xx1a  
##  4     1 y1b   yy1b   x1b   xx1b  
##  5     2 y2a   yy2a   x2a   xx2a  
##  6     2 y2a   yy2a   <NA>  xx2b  
##  7     2 y2b   yy2b   x2a   xx2a  
##  8     2 y2b   yy2b   <NA>  xx2b  
##  9    NA yNA   yyNA   xNA   xxNA  
## 10     3 y3    yy3    <NA>  <NA>

In this example, the extra observation with key = 6 is treated as above. However, the repeated key values of 1 and 2 are combined for both the a and b versions giving us four instead of two complete observations.

Joins where the key values are duplicated in both tables are rarely intended. The results of such joins are usually unfortunate and hard to debug. This is why the authors suggest checking keys for uniqueness prior to using them.

Defining the Key Columns

To this point, we have used a key that was a single variable with the same name in both tables. When the name differs between tables, or we wish to use multiple key variables, we use by to inform dplyr of the relationship(s):

  • by = NULL (the default) uses all variables that appear in both tables. This is called a natural join. Using this approach, a join of weather and flights2 can be written as
  ### Use intersect (covered in a later section) to identify common variables
  intersect(names(weather),  names(flights2))
## [1] "origin" "year"   "month"  "day"    "hour"
  ### Natural left join
  flights2 %>%
    select(-tailnum, -carrier) %>%
    left_join(weather, by = NULL)
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 16
##     year month   day  hour origin dest   temp  dewp humid wind_dir wind_speed
##    <int> <int> <int> <dbl> <chr>  <chr> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1  2013     1     1     5 EWR    IAH    39.0  28.0  64.4      260       12.7
##  2  2013     1     1     5 LGA    IAH    39.9  25.0  54.8      250       15.0
##  3  2013     1     1     5 JFK    MIA    39.0  27.0  61.6      260       15.0
##  4  2013     1     1     5 JFK    BQN    39.0  27.0  61.6      260       15.0
##  5  2013     1     1     6 LGA    ATL    39.9  25.0  54.8      260       16.1
##  6  2013     1     1     5 EWR    ORD    39.0  28.0  64.4      260       12.7
##  7  2013     1     1     6 EWR    FLL    37.9  28.0  67.2      240       11.5
##  8  2013     1     1     6 LGA    IAD    39.9  25.0  54.8      260       16.1
##  9  2013     1     1     6 JFK    MCO    37.9  27.0  64.3      260       13.8
## 10  2013     1     1     6 LGA    ORD    39.9  25.0  54.8      260       16.1
## # ... with 336,766 more rows, and 5 more variables: wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
  ### Short form since by = NULL is default
  flights2 %>%
    select(-tailnum, -carrier) %>%
    left_join(weather)
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 16
##     year month   day  hour origin dest   temp  dewp humid wind_dir wind_speed
##    <int> <int> <int> <dbl> <chr>  <chr> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1  2013     1     1     5 EWR    IAH    39.0  28.0  64.4      260       12.7
##  2  2013     1     1     5 LGA    IAH    39.9  25.0  54.8      250       15.0
##  3  2013     1     1     5 JFK    MIA    39.0  27.0  61.6      260       15.0
##  4  2013     1     1     5 JFK    BQN    39.0  27.0  61.6      260       15.0
##  5  2013     1     1     6 LGA    ATL    39.9  25.0  54.8      260       16.1
##  6  2013     1     1     5 EWR    ORD    39.0  28.0  64.4      260       12.7
##  7  2013     1     1     6 EWR    FLL    37.9  28.0  67.2      240       11.5
##  8  2013     1     1     6 LGA    IAD    39.9  25.0  54.8      260       16.1
##  9  2013     1     1     6 JFK    MCO    37.9  27.0  64.3      260       13.8
## 10  2013     1     1     6 LGA    ORD    39.9  25.0  54.8      260       16.1
## # ... with 336,766 more rows, and 5 more variables: wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
  • A character vector, e.g by = “varname”. Like the natural join, we use a vector of commonly named variables to define the key relationship. The list of common variables need not be exhaustive.
  ### Determine common variables
  intersect(names(flights2), names(planes))
## [1] "year"    "tailnum"
  ### Check tailnum uniqueness
  planes %>%
    count(tailnum) %>%
    filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: tailnum <chr>, n <int>
  ### Join on tailnum and ignore year
  flights2 %>%
    left_join(planes, by = "tailnum")
## # A tibble: 336,776 x 16
##    year.x month   day  hour origin dest  tailnum carrier year.y type 
##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>
##  1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixe~
##  2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixe~
##  3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixe~
##  4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixe~
##  5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixe~
##  6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixe~
##  7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixe~
##  8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixe~
##  9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixe~
## 10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA> 
## # ... with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
  ### Recreate by = NULL join from above using explicit variable naming
  flights2 %>%
    select(-tailnum, -carrier) %>%
    left_join(weather, by = c("year", "month", "day", "hour", "origin"))
## # A tibble: 336,776 x 16
##     year month   day  hour origin dest   temp  dewp humid wind_dir wind_speed
##    <int> <int> <int> <dbl> <chr>  <chr> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1  2013     1     1     5 EWR    IAH    39.0  28.0  64.4      260       12.7
##  2  2013     1     1     5 LGA    IAH    39.9  25.0  54.8      250       15.0
##  3  2013     1     1     5 JFK    MIA    39.0  27.0  61.6      260       15.0
##  4  2013     1     1     5 JFK    BQN    39.0  27.0  61.6      260       15.0
##  5  2013     1     1     6 LGA    ATL    39.9  25.0  54.8      260       16.1
##  6  2013     1     1     5 EWR    ORD    39.0  28.0  64.4      260       12.7
##  7  2013     1     1     6 EWR    FLL    37.9  28.0  67.2      240       11.5
##  8  2013     1     1     6 LGA    IAD    39.9  25.0  54.8      260       16.1
##  9  2013     1     1     6 JFK    MCO    37.9  27.0  64.3      260       13.8
## 10  2013     1     1     6 LGA    ORD    39.9  25.0  54.8      260       16.1
## # ... with 336,766 more rows, and 5 more variables: wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>

As noted in the text, the variable year appears to be common to both the planes and flights2 tables. However, closer inspection shows that the variable represents the year of the flight in the flights2 table and the year the plane was entered into service in the planes table. These two variables are displayed as year.x and year.y in the joined table.

Use of a declared vector of matched key variables, while not as convenient, makes code easier to read. * A named character vector, e.g. by = c(“x.var” = “y.var”). This method allows us to match differently named variables from the two tables. The variable name from the left table is used in the joined table. This is the approach that was used in the airports** example presented below. For convenience, the example is reproduced below along with its origin counterpart.

  ### Merge airport information based upon destination code
  flights2 %>%
    select(-tailnum, -carrier) %>%
    left_join(airports, by = c("dest" = "faa"))
## # A tibble: 336,776 x 13
##     year month   day  hour origin dest  name    lat   lon   alt    tz dst  
##    <int> <int> <int> <dbl> <chr>  <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
##  1  2013     1     1     5 EWR    IAH   Geor~  30.0 -95.3    97    -6 A    
##  2  2013     1     1     5 LGA    IAH   Geor~  30.0 -95.3    97    -6 A    
##  3  2013     1     1     5 JFK    MIA   Miam~  25.8 -80.3     8    -5 A    
##  4  2013     1     1     5 JFK    BQN   <NA>   NA    NA      NA    NA <NA> 
##  5  2013     1     1     6 LGA    ATL   Hart~  33.6 -84.4  1026    -5 A    
##  6  2013     1     1     5 EWR    ORD   Chic~  42.0 -87.9   668    -6 A    
##  7  2013     1     1     6 EWR    FLL   Fort~  26.1 -80.2     9    -5 A    
##  8  2013     1     1     6 LGA    IAD   Wash~  38.9 -77.5   313    -5 A    
##  9  2013     1     1     6 JFK    MCO   Orla~  28.4 -81.3    96    -5 A    
## 10  2013     1     1     6 LGA    ORD   Chic~  42.0 -87.9   668    -6 A    
## # ... with 336,766 more rows, and 1 more variable: tzone <chr>
  ### Merge airport information based upon origin code
  flights2 %>%
    select(-tailnum, -carrier) %>%
    left_join(airports, by = c("origin" = "faa"))
## # A tibble: 336,776 x 13
##     year month   day  hour origin dest  name    lat   lon   alt    tz dst  
##    <int> <int> <int> <dbl> <chr>  <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
##  1  2013     1     1     5 EWR    IAH   Newa~  40.7 -74.2    18    -5 A    
##  2  2013     1     1     5 LGA    IAH   La G~  40.8 -73.9    22    -5 A    
##  3  2013     1     1     5 JFK    MIA   John~  40.6 -73.8    13    -5 A    
##  4  2013     1     1     5 JFK    BQN   John~  40.6 -73.8    13    -5 A    
##  5  2013     1     1     6 LGA    ATL   La G~  40.8 -73.9    22    -5 A    
##  6  2013     1     1     5 EWR    ORD   Newa~  40.7 -74.2    18    -5 A    
##  7  2013     1     1     6 EWR    FLL   Newa~  40.7 -74.2    18    -5 A    
##  8  2013     1     1     6 LGA    IAD   La G~  40.8 -73.9    22    -5 A    
##  9  2013     1     1     6 JFK    MCO   John~  40.6 -73.8    13    -5 A    
## 10  2013     1     1     6 LGA    ORD   La G~  40.8 -73.9    22    -5 A    
## # ... with 336,766 more rows, and 1 more variable: tzone <chr>

Filtering Joins

Filtering joins filter observations and do not change nor add variables. The two types of filtering joins are:

  • semi_join(x, y) that keeps all of the observations in x that have a match in y.

  • anti_join(x, y) that drops all of the observations in x that have a match in y.

*Semi-joins

Semi-joins are typically used to match summary information back to the original rows. An example would be finding flights to destinations with fewer than 52 flights per year — fewer than one per week.

  ### Determine destinations with too few flights
  too_few <- flights %>%
               group_by(dest) %>%
               summarize(n_flights = n(), na.rm = TRUE) %>% # Beware of cancelled flights
               filter(n_flights < 52)

  too_few
## # A tibble: 10 x 3
##    dest  n_flights na.rm
##    <chr>     <int> <lgl>
##  1 ANC           8 TRUE 
##  2 BZN          36 TRUE 
##  3 EYW          17 TRUE 
##  4 HDN          15 TRUE 
##  5 JAC          25 TRUE 
##  6 LEX           1 TRUE 
##  7 LGA           1 TRUE 
##  8 MTJ          15 TRUE 
##  9 PSP          19 TRUE 
## 10 SBN          10 TRUE
  ### Join with original flights data.  This will exclude flights to other dests
  flights %>%
    semi_join(too_few) %>%
    arrange(dest, year, month, day, dep_time)
## Joining, by = "dest"
## # A tibble: 147 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     6     1629           1615        14     1954           1953
##  2  2013     7    13     1618           1615         3     1955           1953
##  3  2013     7    20     1618           1615         3     2003           1953
##  4  2013     7    27     1617           1615         2     1906           1953
##  5  2013     8     3     1615           1615         0     2003           1953
##  6  2013     8    10     1613           1615        -2     1922           1953
##  7  2013     8    17     1740           1625        75     2042           2003
##  8  2013     8    24     1633           1625         8     1959           2003
##  9  2013     1     5      850            825        25     1132           1123
## 10  2013     1    12      825            825         0     1115           1123
## # ... with 137 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
  names(too_few)  
## [1] "dest"      "n_flights" "na.rm"
  names(airports)
## [1] "faa"   "name"  "lat"   "lon"   "alt"   "tz"    "dst"   "tzone"
  ### Use a semi_join to keep only those airports that have too few flights
  ### and plot their locations and FAA codes
  airports %>%
    semi_join(too_few, by = c("faa" = "dest")) %>% 
    ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point() +
    coord_quickmap() +
    geom_text(aes(label=faa))

Use of summarize instead of count is intended to help you deal with other summary statistics. This will be necessary if you choose to look at standard deviations, medians, etc.

When matched back to the flights data, we see that many of the destinations seem to have had limited runs (few months) within 2013. The single flight to LGA may have been an emergency or a “repositioning”.

The plot is nice, but not really that informative. Note that Palm Springs is one of the low usage airports.

Join Problems

The test points out a few problems that you may run into when using joins. Be sure to read what the authors have to say.

Set Operations

Since some of you have not dealt with sets, we will keep this section short. Note that the set operators work on multiple variables simultaneously.

First, some simple set stuff.

  A <- c("a", "b", "c")
  B <- c("c", "d", "d", "f")
  S <- c("a", "b", "c", "d", "e", "f", "g")
  
  intersect(A, B)
## [1] "c"
  union(A, B)
## [1] "a" "b" "c" "d" "f"
  setdiff(A, B)
## [1] "a" "b"
  setdiff(B, A)
## [1] "d" "f"
  intersect(A, S)
## [1] "a" "b" "c"
  union(S, A)
## [1] "a" "b" "c" "d" "e" "f" "g"
  setdiff(S, A)
## [1] "d" "e" "f" "g"
  setdiff(A, S)
## character(0)

The book gives as an example data frames that contain multiple variables. A similar example would be

  A <- tribble(
         ~x, ~y,
          1, 1,
          1, 2,
          2, 2,
          1, 2
       )

  B <- tribble(
         ~x, ~y,
          1, 1,
          2, 1,
          2, 1
       )


  intersect(A, B)
## # A tibble: 1 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     1
  union(A, B)
## # A tibble: 4 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     1
## 2     1     2
## 3     2     2
## 4     2     1
  setdiff(B, A)
## # A tibble: 1 x 2
##       x     y
##   <dbl> <dbl>
## 1     2     1
  setdiff(A, B)
## # A tibble: 2 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     2
## 2     2     2

Note that duplicated observations are not counted as is proper with sets.