In this chapter the authors give us a glimpse into relational databases. As we have discussed in class, “big” data requires careful use of storage. While storage may be cheap, it is not infinite. The use of related tables that are associated through key values is a common approach to storing data efficiently.
Other applications require obtaining data from multiple sources. When we are lucky, the data from these sources can be combined by matching them on the values of a selection of variables. Essentially, we are using key values to build a unified database.
The test follows a SQL approach to table management. The authors suggest that we become comfortable with “three families of verbs designed to work with relational data:”
Mutating joins, add variables to an existing database/table by taking them from another database/table. The values of the variables are determined by matching observations based on the values of other variables.
Filtering joins, filter observations from one database/table based upon the existense (or lack thereof) of a matching observation in another database/table.
Set operations, which are used to treat observations in a database/table as if they are elements of a set.
As in earlier chapters, the authors use the nycflights package to demonstrate concepts. Since Wickham wrote dplyr, it is not a surprise that the authors choose to use it here.
### Use pacman to install/load nycflights13 and the tidyverse (dplyr)
p_load(nycflights13)
p_load(tidyverse)
The authors’ choice to use the nycflights package throughout their text may be base, in large part, upon the fact that it is a five tibble realtional database. Along with the flights tibble we have already played with, there are:
airlines which contains a listing of full airline names and their abreviated codes.
airports which provides information about each FAA coded airport.
planes which contains information about each aircraft as identified by its tailnum.
weather which contains hourly information for each of the three NYC airports.
The text has a diagram on page 174 that shows the relationships betweeen the various tibbles. Note the variables that share identifying information. The following tables provide information that is equivalent to that presented in the figure.
flights | airports |
---|---|
origin | faa |
dest | faa |
flights | planes |
---|---|
tailnum | tailnum |
flights | airlines |
---|---|
carrier | carrier |
flights | weather |
---|---|
year | year |
month | month |
day | day |
hour | hour |
origin | origin |
Key variables are variables whose values uniquely identify observations. The operative word here is “uniquely”. There are two types of key variables:
primary – uniquely identifies an observation in its own table. The book notes that planes$tailnum is a primary key for the unique identification of a plane within the planes table.
foreign – uniquely identifies an observation in another table. Again, the book notes that flights$tailnum is a foreign key within the flights table for the identification of planes within the planes table.
Some keys act as both primary and foreign keys. In many cases, the “key” is actually a set of multiple variables/keys.
Again, the idea behind keys is that they uniquely identify observations. In many cases, we are matching a single observation within a table (e.g. weather) with multiple observations in another table (e.g. flights). As the book points out, it is a good idea to check the uniqueness of primary keys so that this one-to-many matching can take place.
As an example, we can look at the uniqueness of the airlines and weather tables’ keys.
airlines %>%
count(carrier) %>%
filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: carrier <chr>, n <int>
weather %>%
count(year, month, day, hour, origin) %>%
filter(n > 1)
## # A tibble: 3 x 6
## year month day hour origin n
## <int> <int> <int> <int> <chr> <int>
## 1 2013 11 3 1 EWR 2
## 2 2013 11 3 1 JFK 2
## 3 2013 11 3 1 LGA 2
weather %>%
filter(year==2013, month==11, day==3, hour==1)
## # A tibble: 6 x 15
## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 11 3 1 52.0 39.0 61.2 310 6.90 NA
## 2 EWR 2013 11 3 1 50 39.0 65.8 290 5.75 NA
## 3 JFK 2013 11 3 1 54.0 37.9 54.5 320 9.21 NA
## 4 JFK 2013 11 3 1 52.0 37.9 58.6 310 6.90 NA
## 5 LGA 2013 11 3 1 55.0 39.0 54.7 330 9.21 NA
## 6 LGA 2013 11 3 1 54.0 39.9 58.9 310 8.06 NA
## # ... with 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
## # time_hour <dttm>
The 0 nonunique rows returned by airline$carrier is good news. However, the hourly weather data shows that, unlike in the text, the keys for 11/3/2013 at 01:00 are not unique for all three origins. This is problematic.
The book suggests creating surrogate keys when multiple keys are used to identify observations. While this may be convenient, it buries the process that is used to identify the observations. If you are the only programmer, this is note a problem. However, if others need to read your code, be very careful to make sure that the derivation of the surrogate is very, very well documented.
The mutate function binds new variables to the right (higher indexed) side of the resultant matrix/database/table. For databases/tables, this makes it difficult to see the new variables. The authors get around this problem by selecting a few variables from the flights database.
### Select a few variables that are useful for joining
flights2 <- flights %>%
select(year, month, day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 x 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # ... with 336,766 more rows
To add the destination airport names to the table, we join the flights2 and airports tables by equating dest and faa.
flights2 %>%
select(-tailnum, -carrier) %>%
left_join(airports, by = c("dest" = "faa"))
## # A tibble: 336,776 x 13
## year month day hour origin dest name lat lon alt tz dst
## <int> <int> <int> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2013 1 1 5 EWR IAH Geor~ 30.0 -95.3 97 -6 A
## 2 2013 1 1 5 LGA IAH Geor~ 30.0 -95.3 97 -6 A
## 3 2013 1 1 5 JFK MIA Miam~ 25.8 -80.3 8 -5 A
## 4 2013 1 1 5 JFK BQN <NA> NA NA NA NA <NA>
## 5 2013 1 1 6 LGA ATL Hart~ 33.6 -84.4 1026 -5 A
## 6 2013 1 1 5 EWR ORD Chic~ 42.0 -87.9 668 -6 A
## 7 2013 1 1 6 EWR FLL Fort~ 26.1 -80.2 9 -5 A
## 8 2013 1 1 6 LGA IAD Wash~ 38.9 -77.5 313 -5 A
## 9 2013 1 1 6 JFK MCO Orla~ 28.4 -81.3 96 -5 A
## 10 2013 1 1 6 LGA ORD Chic~ 42.0 -87.9 668 -6 A
## # ... with 336,766 more rows, and 1 more variable: tzone <chr>
We" note that “BQN” is an unknown airport. This may be because it is Rafael Hernandez International Airport in Aguadilla, Puerto Rico which is not in the contiguous US. The syntax for “joins” will be discussed more fully below.
The following sections discuss the differences between a number of join types.
The authors use diagrams to show the results of different joins. They also present code to show the results. We can extend the examples to look at the effects of various key values.
### Declare x
x <- tribble(
~key, ~val_x, ~val_xx,
1, "x1", "xx1",
2, "x2", "xx2",
3, "x3", "xx3",
NA, "xNA", "xxNA",
4, NA, "xx4",
5, "x5", NA
)
### Declare y
y <- tribble(
~key, ~val_y, ~val_yy,
1, "y1", "yy1",
2, "y2", "yy2",
3, "y3", "yy3",
4, "y4", NA,
NA, "yNA", "yyNA",
5, NA, "yy5",
6, "y6", "y6"
)
### Ask knitr to show x and y side-by-side. There are prettier ways.
knitr::kable(list(x, y))
|
|
Inner joins match pairs of observations whenever their keys are equal. The result is a new table/data frame containing the key and all x and y values. dplyr uses by to declare the key variable.
### y inner_joined to x
x %>%
inner_join(y, by = "key")
## # A tibble: 6 x 5
## key val_x val_xx val_y val_yy
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 x1 xx1 y1 yy1
## 2 2 x2 xx2 y2 yy2
## 3 3 x3 xx3 y3 yy3
## 4 NA xNA xxNA yNA yyNA
## 5 4 <NA> xx4 y4 <NA>
## 6 5 x5 <NA> <NA> yy5
### x inner_joined to y
y %>%
inner_join(x, by = "key")
## # A tibble: 6 x 5
## key val_y val_yy val_x val_xx
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 y1 yy1 x1 xx1
## 2 2 y2 yy2 x2 xx2
## 3 3 y3 yy3 x3 xx3
## 4 4 y4 <NA> <NA> xx4
## 5 NA yNA yyNA xNA xxNA
## 6 5 <NA> yy5 x5 <NA>
Note that the NA values in the key were matched. Also, the tables are equivalent except for the order of variables and the order of the rows. The location of the NA row is determined by the location within the “left” table.
Most importantly, the key value of 6 found in y is not matched within x and the observation is dropped.
Whereas inner joins keep observations that are common to both tables, outer joins keep observations that appear in at least one table. The three types of outer joins are:
left joins which keep all observations appearing in the left table
right joins which keep all observations in the right table
full joins which keep all observations appearing in either of the left or right tables
Observations which are not matched have the values for the variables from the missing table filled with NA.
### left join of left table x to right table y
x %>%
left_join(y, by = "key")
## # A tibble: 6 x 5
## key val_x val_xx val_y val_yy
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 x1 xx1 y1 yy1
## 2 2 x2 xx2 y2 yy2
## 3 3 x3 xx3 y3 yy3
## 4 NA xNA xxNA yNA yyNA
## 5 4 <NA> xx4 y4 <NA>
## 6 5 x5 <NA> <NA> yy5
### left join of left table y to right table x
y %>%
left_join(x, by = "key")
## # A tibble: 7 x 5
## key val_y val_yy val_x val_xx
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 y1 yy1 x1 xx1
## 2 2 y2 yy2 x2 xx2
## 3 3 y3 yy3 x3 xx3
## 4 4 y4 <NA> <NA> xx4
## 5 NA yNA yyNA xNA xxNA
## 6 5 <NA> yy5 x5 <NA>
## 7 6 y6 y6 <NA> <NA>
Note that the second left join contains an observation for key = 6. This is because the observation exists in table y which is the primary, left table.
### right join of left table x to right table y
x %>%
right_join(y, by = "key")
## # A tibble: 7 x 5
## key val_x val_xx val_y val_yy
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 x1 xx1 y1 yy1
## 2 2 x2 xx2 y2 yy2
## 3 3 x3 xx3 y3 yy3
## 4 NA xNA xxNA yNA yyNA
## 5 4 <NA> xx4 y4 <NA>
## 6 5 x5 <NA> <NA> yy5
## 7 6 <NA> <NA> y6 y6
### left join of left table y to right table x
y %>%
right_join(x, by = "key")
## # A tibble: 6 x 5
## key val_y val_yy val_x val_xx
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 y1 yy1 x1 xx1
## 2 2 y2 yy2 x2 xx2
## 3 3 y3 yy3 x3 xx3
## 4 4 y4 <NA> <NA> xx4
## 5 NA yNA yyNA xNA xxNA
## 6 5 <NA> yy5 x5 <NA>
In this case the first right join contains an observation for key = 6. This is because the observation exists in table y which is the primary, right table.
### full join of left table x to right table y
x %>%
full_join(y, by = "key")
## # A tibble: 7 x 5
## key val_x val_xx val_y val_yy
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 x1 xx1 y1 yy1
## 2 2 x2 xx2 y2 yy2
## 3 3 x3 xx3 y3 yy3
## 4 NA xNA xxNA yNA yyNA
## 5 4 <NA> xx4 y4 <NA>
## 6 5 x5 <NA> <NA> yy5
## 7 6 <NA> <NA> y6 y6
### full join of left table y to right table x
y %>%
full_join(x, by = "key")
## # A tibble: 7 x 5
## key val_y val_yy val_x val_xx
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 y1 yy1 x1 xx1
## 2 2 y2 yy2 x2 xx2
## 3 3 y3 yy3 x3 xx3
## 4 4 y4 <NA> <NA> xx4
## 5 NA yNA yyNA xNA xxNA
## 6 5 <NA> yy5 x5 <NA>
## 7 6 y6 y6 <NA> <NA>
As with the inner join, both joins produce the same table. Because the left/right ordering of the tables is not important, both tables include an observation for key = 6.
To this point we have hoped/assumed that key values are unique within tables. As the uniqueness check of the weather table shows, this may not always be the case.
When one table has unique observations that are useful for updating many observations in a second table, we may perform a one-to-many join.
### Declare x
x <- tribble(
~key, ~val_x, ~val_xx,
1, "x1a", "xx1a",
2, "x2a", "xx2a",
1, "x1b", "xx1b",
NA, "xNA", "xxNA",
2, NA, "xx2b"
)
### Declare y
y <- tribble(
~key, ~val_y, ~val_yy,
1, "y1", "yy1",
2, "y2", "yy2",
NA, "yNA", "yyNA",
3, "y3", "yy3"
)
### left join of left table x to right table y
x %>%
left_join(y, by = "key")
## # A tibble: 5 x 5
## key val_x val_xx val_y val_yy
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 x1a xx1a y1 yy1
## 2 2 x2a xx2a y2 yy2
## 3 1 x1b xx1b y1 yy1
## 4 NA xNA xxNA yNA yyNA
## 5 2 <NA> xx2b y2 yy2
### left join of left table y to right table x
y %>%
left_join(x, by = "key")
## # A tibble: 6 x 5
## key val_y val_yy val_x val_xx
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 y1 yy1 x1a xx1a
## 2 1 y1 yy1 x1b xx1b
## 3 2 y2 yy2 x2a xx2a
## 4 2 y2 yy2 <NA> xx2b
## 5 NA yNA yyNA xNA xxNA
## 6 3 y3 yy3 <NA> <NA>
As in the left and right joins with unique values, we get “joined” observations for all observations in the foreign table. Where a matching key is not found in the primary table, NA’s are inserted for all primary variables.
When both tables contain duplicate keys, joins return the Cartesian product — all possible combinations.
### Declare x
x <- tribble(
~key, ~val_x, ~val_xx,
1, "x1a", "xx1a",
2, "x2a", "xx2a",
1, "x1b", "xx1b",
NA, "xNA", "xxNA",
2, NA, "xx2b"
)
### Declare y
y <- tribble(
~key, ~val_y, ~val_yy,
1, "y1a", "yy1a",
1, "y1b", "yy1b",
2, "y2a", "yy2a",
2, "y2b", "yy2b",
NA, "yNA", "yyNA",
3, "y3", "yy3"
)
### left join of left table x to right table y
x %>%
left_join(y, by = "key")
## # A tibble: 9 x 5
## key val_x val_xx val_y val_yy
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 x1a xx1a y1a yy1a
## 2 1 x1a xx1a y1b yy1b
## 3 2 x2a xx2a y2a yy2a
## 4 2 x2a xx2a y2b yy2b
## 5 1 x1b xx1b y1a yy1a
## 6 1 x1b xx1b y1b yy1b
## 7 NA xNA xxNA yNA yyNA
## 8 2 <NA> xx2b y2a yy2a
## 9 2 <NA> xx2b y2b yy2b
### left join of left table y to right table x
y %>%
left_join(x, by = "key")
## # A tibble: 10 x 5
## key val_y val_yy val_x val_xx
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 y1a yy1a x1a xx1a
## 2 1 y1a yy1a x1b xx1b
## 3 1 y1b yy1b x1a xx1a
## 4 1 y1b yy1b x1b xx1b
## 5 2 y2a yy2a x2a xx2a
## 6 2 y2a yy2a <NA> xx2b
## 7 2 y2b yy2b x2a xx2a
## 8 2 y2b yy2b <NA> xx2b
## 9 NA yNA yyNA xNA xxNA
## 10 3 y3 yy3 <NA> <NA>
In this example, the extra observation with key = 6 is treated as above. However, the repeated key values of 1 and 2 are combined for both the a and b versions giving us four instead of two complete observations.
Joins where the key values are duplicated in both tables are rarely intended. The results of such joins are usually unfortunate and hard to debug. This is why the authors suggest checking keys for uniqueness prior to using them.
To this point, we have used a key that was a single variable with the same name in both tables. When the name differs between tables, or we wish to use multiple key variables, we use by to inform dplyr of the relationship(s):
### Use intersect (covered in a later section) to identify common variables
intersect(names(weather), names(flights2))
## [1] "origin" "year" "month" "day" "hour"
### Natural left join
flights2 %>%
select(-tailnum, -carrier) %>%
left_join(weather, by = NULL)
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 16
## year month day hour origin dest temp dewp humid wind_dir wind_speed
## <int> <int> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH 39.0 28.0 64.4 260 12.7
## 2 2013 1 1 5 LGA IAH 39.9 25.0 54.8 250 15.0
## 3 2013 1 1 5 JFK MIA 39.0 27.0 61.6 260 15.0
## 4 2013 1 1 5 JFK BQN 39.0 27.0 61.6 260 15.0
## 5 2013 1 1 6 LGA ATL 39.9 25.0 54.8 260 16.1
## 6 2013 1 1 5 EWR ORD 39.0 28.0 64.4 260 12.7
## 7 2013 1 1 6 EWR FLL 37.9 28.0 67.2 240 11.5
## 8 2013 1 1 6 LGA IAD 39.9 25.0 54.8 260 16.1
## 9 2013 1 1 6 JFK MCO 37.9 27.0 64.3 260 13.8
## 10 2013 1 1 6 LGA ORD 39.9 25.0 54.8 260 16.1
## # ... with 336,766 more rows, and 5 more variables: wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
### Short form since by = NULL is default
flights2 %>%
select(-tailnum, -carrier) %>%
left_join(weather)
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 16
## year month day hour origin dest temp dewp humid wind_dir wind_speed
## <int> <int> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH 39.0 28.0 64.4 260 12.7
## 2 2013 1 1 5 LGA IAH 39.9 25.0 54.8 250 15.0
## 3 2013 1 1 5 JFK MIA 39.0 27.0 61.6 260 15.0
## 4 2013 1 1 5 JFK BQN 39.0 27.0 61.6 260 15.0
## 5 2013 1 1 6 LGA ATL 39.9 25.0 54.8 260 16.1
## 6 2013 1 1 5 EWR ORD 39.0 28.0 64.4 260 12.7
## 7 2013 1 1 6 EWR FLL 37.9 28.0 67.2 240 11.5
## 8 2013 1 1 6 LGA IAD 39.9 25.0 54.8 260 16.1
## 9 2013 1 1 6 JFK MCO 37.9 27.0 64.3 260 13.8
## 10 2013 1 1 6 LGA ORD 39.9 25.0 54.8 260 16.1
## # ... with 336,766 more rows, and 5 more variables: wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
### Determine common variables
intersect(names(flights2), names(planes))
## [1] "year" "tailnum"
### Check tailnum uniqueness
planes %>%
count(tailnum) %>%
filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: tailnum <chr>, n <int>
### Join on tailnum and ignore year
flights2 %>%
left_join(planes, by = "tailnum")
## # A tibble: 336,776 x 16
## year.x month day hour origin dest tailnum carrier year.y type
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixe~
## 2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixe~
## 3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixe~
## 4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixe~
## 5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixe~
## 6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixe~
## 7 2013 1 1 6 EWR FLL N516JB B6 2000 Fixe~
## 8 2013 1 1 6 LGA IAD N829AS EV 1998 Fixe~
## 9 2013 1 1 6 JFK MCO N593JB B6 2004 Fixe~
## 10 2013 1 1 6 LGA ORD N3ALAA AA NA <NA>
## # ... with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## # model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
### Recreate by = NULL join from above using explicit variable naming
flights2 %>%
select(-tailnum, -carrier) %>%
left_join(weather, by = c("year", "month", "day", "hour", "origin"))
## # A tibble: 336,776 x 16
## year month day hour origin dest temp dewp humid wind_dir wind_speed
## <int> <int> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH 39.0 28.0 64.4 260 12.7
## 2 2013 1 1 5 LGA IAH 39.9 25.0 54.8 250 15.0
## 3 2013 1 1 5 JFK MIA 39.0 27.0 61.6 260 15.0
## 4 2013 1 1 5 JFK BQN 39.0 27.0 61.6 260 15.0
## 5 2013 1 1 6 LGA ATL 39.9 25.0 54.8 260 16.1
## 6 2013 1 1 5 EWR ORD 39.0 28.0 64.4 260 12.7
## 7 2013 1 1 6 EWR FLL 37.9 28.0 67.2 240 11.5
## 8 2013 1 1 6 LGA IAD 39.9 25.0 54.8 260 16.1
## 9 2013 1 1 6 JFK MCO 37.9 27.0 64.3 260 13.8
## 10 2013 1 1 6 LGA ORD 39.9 25.0 54.8 260 16.1
## # ... with 336,766 more rows, and 5 more variables: wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
As noted in the text, the variable year appears to be common to both the planes and flights2 tables. However, closer inspection shows that the variable represents the year of the flight in the flights2 table and the year the plane was entered into service in the planes table. These two variables are displayed as year.x and year.y in the joined table.
Use of a declared vector of matched key variables, while not as convenient, makes code easier to read. * A named character vector, e.g. by = c(“x.var” = “y.var”). This method allows us to match differently named variables from the two tables. The variable name from the left table is used in the joined table. This is the approach that was used in the airports** example presented below. For convenience, the example is reproduced below along with its origin counterpart.
### Merge airport information based upon destination code
flights2 %>%
select(-tailnum, -carrier) %>%
left_join(airports, by = c("dest" = "faa"))
## # A tibble: 336,776 x 13
## year month day hour origin dest name lat lon alt tz dst
## <int> <int> <int> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2013 1 1 5 EWR IAH Geor~ 30.0 -95.3 97 -6 A
## 2 2013 1 1 5 LGA IAH Geor~ 30.0 -95.3 97 -6 A
## 3 2013 1 1 5 JFK MIA Miam~ 25.8 -80.3 8 -5 A
## 4 2013 1 1 5 JFK BQN <NA> NA NA NA NA <NA>
## 5 2013 1 1 6 LGA ATL Hart~ 33.6 -84.4 1026 -5 A
## 6 2013 1 1 5 EWR ORD Chic~ 42.0 -87.9 668 -6 A
## 7 2013 1 1 6 EWR FLL Fort~ 26.1 -80.2 9 -5 A
## 8 2013 1 1 6 LGA IAD Wash~ 38.9 -77.5 313 -5 A
## 9 2013 1 1 6 JFK MCO Orla~ 28.4 -81.3 96 -5 A
## 10 2013 1 1 6 LGA ORD Chic~ 42.0 -87.9 668 -6 A
## # ... with 336,766 more rows, and 1 more variable: tzone <chr>
### Merge airport information based upon origin code
flights2 %>%
select(-tailnum, -carrier) %>%
left_join(airports, by = c("origin" = "faa"))
## # A tibble: 336,776 x 13
## year month day hour origin dest name lat lon alt tz dst
## <int> <int> <int> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2013 1 1 5 EWR IAH Newa~ 40.7 -74.2 18 -5 A
## 2 2013 1 1 5 LGA IAH La G~ 40.8 -73.9 22 -5 A
## 3 2013 1 1 5 JFK MIA John~ 40.6 -73.8 13 -5 A
## 4 2013 1 1 5 JFK BQN John~ 40.6 -73.8 13 -5 A
## 5 2013 1 1 6 LGA ATL La G~ 40.8 -73.9 22 -5 A
## 6 2013 1 1 5 EWR ORD Newa~ 40.7 -74.2 18 -5 A
## 7 2013 1 1 6 EWR FLL Newa~ 40.7 -74.2 18 -5 A
## 8 2013 1 1 6 LGA IAD La G~ 40.8 -73.9 22 -5 A
## 9 2013 1 1 6 JFK MCO John~ 40.6 -73.8 13 -5 A
## 10 2013 1 1 6 LGA ORD La G~ 40.8 -73.9 22 -5 A
## # ... with 336,766 more rows, and 1 more variable: tzone <chr>
Filtering joins filter observations and do not change nor add variables. The two types of filtering joins are:
semi_join(x, y) that keeps all of the observations in x that have a match in y.
anti_join(x, y) that drops all of the observations in x that have a match in y.
Semi-joins are typically used to match summary information back to the original rows. An example would be finding flights to destinations with fewer than 52 flights per year — fewer than one per week.
### Determine destinations with too few flights
too_few <- flights %>%
group_by(dest) %>%
summarize(n_flights = n(), na.rm = TRUE) %>% # Beware of cancelled flights
filter(n_flights < 52)
too_few
## # A tibble: 10 x 3
## dest n_flights na.rm
## <chr> <int> <lgl>
## 1 ANC 8 TRUE
## 2 BZN 36 TRUE
## 3 EYW 17 TRUE
## 4 HDN 15 TRUE
## 5 JAC 25 TRUE
## 6 LEX 1 TRUE
## 7 LGA 1 TRUE
## 8 MTJ 15 TRUE
## 9 PSP 19 TRUE
## 10 SBN 10 TRUE
### Join with original flights data. This will exclude flights to other dests
flights %>%
semi_join(too_few) %>%
arrange(dest, year, month, day, dep_time)
## Joining, by = "dest"
## # A tibble: 147 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 6 1629 1615 14 1954 1953
## 2 2013 7 13 1618 1615 3 1955 1953
## 3 2013 7 20 1618 1615 3 2003 1953
## 4 2013 7 27 1617 1615 2 1906 1953
## 5 2013 8 3 1615 1615 0 2003 1953
## 6 2013 8 10 1613 1615 -2 1922 1953
## 7 2013 8 17 1740 1625 75 2042 2003
## 8 2013 8 24 1633 1625 8 1959 2003
## 9 2013 1 5 850 825 25 1132 1123
## 10 2013 1 12 825 825 0 1115 1123
## # ... with 137 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
names(too_few)
## [1] "dest" "n_flights" "na.rm"
names(airports)
## [1] "faa" "name" "lat" "lon" "alt" "tz" "dst" "tzone"
### Use a semi_join to keep only those airports that have too few flights
### and plot their locations and FAA codes
airports %>%
semi_join(too_few, by = c("faa" = "dest")) %>%
ggplot(aes(lon, lat)) +
borders("state") +
geom_point() +
coord_quickmap() +
geom_text(aes(label=faa))
Use of summarize instead of count is intended to help you deal with other summary statistics. This will be necessary if you choose to look at standard deviations, medians, etc.
When matched back to the flights data, we see that many of the destinations seem to have had limited runs (few months) within 2013. The single flight to LGA may have been an emergency or a “repositioning”.
The plot is nice, but not really that informative. Note that Palm Springs is one of the low usage airports.
The test points out a few problems that you may run into when using joins. Be sure to read what the authors have to say.
Since some of you have not dealt with sets, we will keep this section short. Note that the set operators work on multiple variables simultaneously.
First, some simple set stuff.
A <- c("a", "b", "c")
B <- c("c", "d", "d", "f")
S <- c("a", "b", "c", "d", "e", "f", "g")
intersect(A, B)
## [1] "c"
union(A, B)
## [1] "a" "b" "c" "d" "f"
setdiff(A, B)
## [1] "a" "b"
setdiff(B, A)
## [1] "d" "f"
intersect(A, S)
## [1] "a" "b" "c"
union(S, A)
## [1] "a" "b" "c" "d" "e" "f" "g"
setdiff(S, A)
## [1] "d" "e" "f" "g"
setdiff(A, S)
## character(0)
The book gives as an example data frames that contain multiple variables. A similar example would be
A <- tribble(
~x, ~y,
1, 1,
1, 2,
2, 2,
1, 2
)
B <- tribble(
~x, ~y,
1, 1,
2, 1,
2, 1
)
intersect(A, B)
## # A tibble: 1 x 2
## x y
## <dbl> <dbl>
## 1 1 1
union(A, B)
## # A tibble: 4 x 2
## x y
## <dbl> <dbl>
## 1 1 1
## 2 1 2
## 3 2 2
## 4 2 1
setdiff(B, A)
## # A tibble: 1 x 2
## x y
## <dbl> <dbl>
## 1 2 1
setdiff(A, B)
## # A tibble: 2 x 2
## x y
## <dbl> <dbl>
## 1 1 2
## 2 2 2
Note that duplicated observations are not counted as is proper with sets.