Chapter 5

Chapter 5 covers Exploratory Data Analysis (EDA). The authors refer back to previous chapters and use the techniques of data transformation and visualization covered there to explore data.

EDA is only one area of statistics. It falls within the broader area of descriptive statistics. While EDA may be used for prediction, the methods used for data collection typically do not support inferential applications. Beware of data scientists bearing p-values.

Introduction

The authors suggest that EDA is an iterative process:

  1. Generate questions about your data.
  2. Search for answers by visualizing, transforming, and modeling your data.
  3. Use what you learn to refine your questions and/or generate new questions.

While EDA may be iterative, their first step is flawed. The selection of data is dictated by the questions one wishes to answer. One would not expect data on genes and cancer to provide a model that is predictive of stock market performance.

The first step in any project is to determine the basic question(s) that we wish to answer. These lead to the data that we would like to collect. Lack of availability of data may require that we redefine our questions. This iterative process begins before we start to explore our data.

An article by Mikhail Popov, “Hiring a Data Scientist”, points out that successful data science requires knowledge of three things: computer science, stat/math, and the area of application. Without a reasonable understanding of the topic that you are trying to describe, you will never be able to define questions, collect appropriate data, or recognize patterns in the data.

Prerequisites

As noted above, in Chapter 5 the authors use the techniques learned in previous chapters to explore data. To do so we need to load the packages that make up the tidyverse.

  library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Now that we have access to packages like ggplot2 and dplyr, we are ready to explore some data.

Questions

The quotes that appear in the text are by two very big names in statistics. In particular, Tukey is seen by many as the father of EDA.

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” —John Tukey

His statement reflects his experience of working in the field of statistics for many years.

The authors’ stated questions,

  1. What type of variation occurs within my variables?
  2. What type of covariation occurs between my variables?

are technically correct. However, they are not the questions that typically start a study. The questions, as stated, assume that the data have already been collected. For now, we will assume that we do, in fact, have a data set that contains some variables.

Quick Stat Terminolgy Review

Definition: The group/set of things/objects/individuals we are interested in is called the population.

Definition: A sample is the subset of the population that we actually observe/examine.

Definition: We collect data from/on objects. We call these objects cases, units, or (in the case of humans) subjects.

Definition: A variable represents the possible, varied outcomes, or states, of a case in an experiment or study.

Definition: A value is the actual, observed outcome.

Definition: An observation is a set of values obtained from an individual case. A case may provide multiple observations.

Definition: Ordinal data are data that can be ordered but do not take on specific numerical values.

Definition: Data are on a nominal scale if different values can be classified into categories, but the categories have no specific order.

Definition: Data that are ordinal or nominal are called categorical or qualitative — also known as factor in R.

Definition: Cardinal or quantitative data are data that are on a scale where it is meaningful to measure the distance between possible data values.

Definition: Quantitative data that take on a countable number of possible values are called discrete.

Defintion: Quantitative data that take on an uncountable number of possible outcomes are called continuous.

Definition: A response, outcome, or dependent variable contains information about the result of a study.

Definition: An explanatory or independent variable is used to try to explain the outcome variable.

Definition: A parameter is an unknown value which is descriptive of the population.

Definition: A statistic is a function of the sample data that does not depend upon any parameters. Statistics are often used to estimate parameters.

The books definition of tidy data is what is often refered to as flat data. It is what we hope to receive, but rarely do. Getting data into a tidy form requires wrangling (see Chapter 9).

Variation

Life would be much simpler if values for a variable did not vary. To determine the height of college students, we could measure a single student and be done. To understand the heights of students we need to understand how the values for different individuals/observations vary — their variability. A description of the distribution of the values can often be obtained graphically.

Visualizing Distributions

In R, categorical variables are stored as factor variables. The nice thing about this is that most of R’s functions recognize the factor data type and treat it appropriately.

We can create a variable, height, with values “tall”, “average”, and “short” for seven individuals. R views the height variable as a character variable.

  height <- c("tall", "short", "average", "average", "short", "average", "tall")
  is.character(height)
## [1] TRUE
  height
## [1] "tall"    "short"   "average" "average" "short"   "average" "tall"
  attributes(height)
## NULL

The variable can be converted to a factor variable using the as.factor function.

  ht <- as.factor(height)
  attributes(ht)
## $levels
## [1] "average" "short"   "tall"   
## 
## $class
## [1] "factor"
  ht
## [1] tall    short   average average short   average tall   
## Levels: average short tall

Note that R has ordered the factor levels alphabetically. If this is not the order we want, a couple of methods can be used to change the ordering.

  ### Reorder levels of a factor
  ht <- ordered(ht, levels=c("short", "average", "tall"))
  attributes(ht)
## $levels
## [1] "short"   "average" "tall"   
## 
## $class
## [1] "ordered" "factor"
  ht
## [1] tall    short   average average short   average tall   
## Levels: short < average < tall
  ### Convert to proper order the first time
  ht <- ordered(height, levels=c("short","average","tall"))
  attributes(ht)
## $levels
## [1] "short"   "average" "tall"   
## 
## $class
## [1] "ordered" "factor"
  ht
## [1] tall    short   average average short   average tall   
## Levels: short < average < tall

Again, R is smart. We should not use arithmetic on categorical data. R will not let us do this. Suppose that we code an ordinal variable as “1” through “3”. Many packages will improperly compute means and standard deviations. R will recognize the variable as a factor.

  x <- c(1, 3, 2, 2, 3, 1, 1, 1)  
  y <- ordered(x, levels=c(1, 2, 3))
  mean(x)
## [1] 1.75
  mean(y)
## Warning in mean.default(y): argument is not numeric or logical: returning NA
## [1] NA

The text takes a look at the diamonds data frame. This data frame contains information on the 4C’s (cut, carat, color, clarity) and the price of diamonds. The variable cut describes the quality of the way the diamond was cut.

The proper plot for showing the number, proportion, or percentage of observations that fall within each of the levels of a categorical variable is the bar chart.

  attributes(diamonds$cut)
## $class
## [1] "ordered" "factor" 
## 
## $levels
## [1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"
  ggplot(data = diamonds) + 
    geom_bar(mapping = aes(x = cut))

  ### Get the counts in each bar
  diamonds %>%
    count(cut)
## # A tibble: 5 × 2
##   cut           n
##   <ord>     <int>
## 1 Fair       1610
## 2 Good       4906
## 3 Very Good 12082
## 4 Premium   13791
## 5 Ideal     21551

Note that cut is an ordinal variable. The variable carat is quantitative. A reasonable method of displaying the distribution of the values of a quantitative variable is the histogram. We create a histogram of carat thusly,

  attributes(diamonds$carat)
## NULL
  ggplot(data = diamonds) + 
    geom_histogram(mapping = aes(x = carat))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  ### There are too many bins so try custom bins 
  ggplot(data = diamonds) + 
    geom_histogram(mapping = aes(x = carat), breaks=seq(0, 5, by=0.25))

  ggplot(data = diamonds) + 
    geom_dotplot(mapping = aes(x = carat))
## Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.

Tables of the number of observations in each bin can be easily computed.

  diamonds %>%
    count(cut_width(carat, width=0.25, boundary=0.0))
## # A tibble: 19 × 2
##    `cut_width(carat, width = 0.25, boundary = 0)`     n
##    <fct>                                          <int>
##  1 [0,0.25]                                         785
##  2 (0.25,0.5]                                     18147
##  3 (0.5,0.75]                                     11351
##  4 (0.75,1]                                        6155
##  5 (1,1.25]                                        9822
##  6 (1.25,1.5]                                      2238
##  7 (1.5,1.75]                                      3075
##  8 (1.75,2]                                         478
##  9 (2,2.25]                                        1524
## 10 (2.25,2.5]                                       239
## 11 (2.5,2.75]                                        83
## 12 (2.75,3]                                          11
## 13 (3,3.25]                                          21
## 14 (3.25,3.5]                                         2
## 15 (3.5,3.75]                                         3
## 16 (3.75,4]                                           1
## 17 (4,4.25]                                           3
## 18 (4.25,4.5]                                         1
## 19 (5,5.25]                                           1

Diamonds larger than three carats are not the norm. We can exclude them and replot.

  diamonds.sub3 <- diamonds %>%
    filter(carat < 3)
  ggplot(data = diamonds.sub3) +
      geom_histogram(mapping = aes(x = carat), binwidth = 0.1)

  diamonds %>%
    filter(carat < 3) %>%
  ggplot() +
      geom_histogram(mapping = aes(x = carat), binwidth = 0.1)

Note that there appear to be peaks at multiples of 1/2 carat. We might be interested in whether the distribution of diamonds by size is the same across cut. Plotting multiple bars on the same plot makes interpretation difficult. The text suggests using frequency polygraphs instead.

  ggplot(data = diamonds.sub3, mapping = aes(x = carat, color = cut)) +
    geom_freqpoly(binwidth = 0.1)

The bumps that we saw before are apparent in the frequency polygons. The patterns appear to be similar across cut. Because of the number of observations (`r nrow(diamonds)’), the authors suggestion of using a lot of narrow bins makes sense.

  ggplot(data = diamonds.sub3) +
    geom_histogram(aes(x=carat), binwidth = 0.01)

Many introductory stat texts use dotplots instead of overly narrow width histograms. These are easy to produce using ggplot2.

  ggplot(data=diamonds, aes(x = carat)) + 
    geom_dotplot(binwidth=0.05, method='histodot')

Does the y-axis make sense? A quick look at the documentation and comments out on the web will tell you that the y-axis problem is a “feature.” Apparently there is a problem with defining dot sizes post counting.

Unusual Values

Definition: An outlier is a point that does not fit the patterns in the rest of the data.

Often, outliers are seen at the extremes of the observed values of a variable. Typically, they are offset by “gaps.” The text indicates that finding outliers in large data sets can be difficult. The authors show how to zoom in on low frequency values using coord_cartesian.

  p <- ggplot(diamonds) +
         geom_histogram(mapping = aes(x = y), binwidth = 0.5)
  p

  p + coord_cartesian(ylim = c(0, 50))

What is meant by “x = y”? It turns out that there is a “y” variable in the diamonds data frame. The code uses the “y” variable as the x variable in the histogram.

If you enter “?diamonds” or “help(diamonds)” in the console, you can get a description of the data frame. Within this description you will find that “y” is the “width in mm (0–58.9)” of a diamond. Does this information make the “unusual” values more understandable?

The text shows how to look at the observations that contain the unusual values.

  (diamonds %>%
     filter(y < 3 | y > 20) %>%
     arrange(y)
  )
## # A tibble: 9 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1    Very Good H     VS2      63.3    53  5139  0      0    0   
## 2  1.14 Fair      G     VS1      57.5    67  6381  0      0    0   
## 3  1.56 Ideal     G     VS2      62.2    54 12800  0      0    0   
## 4  1.2  Premium   D     VVS1     62.1    59 15686  0      0    0   
## 5  2.25 Premium   H     SI2      62.8    59 18034  0      0    0   
## 6  0.71 Good      F     SI2      64.1    60  2130  0      0    0   
## 7  0.71 Good      F     SI2      64.1    60  2130  0      0    0   
## 8  0.51 Ideal     E     VS1      61.8    55  2075  5.15  31.8  5.12
## 9  2    Premium   H     SI2      58.9    57 12210  8.09  58.9  8.06

The book suggests running analyses with and without the unusual observations to see how “influential” they are — how much they change the outcome. The authors then suggest dropping a value — replacing the value with NA — when the analysis is basically unchanged. While this may be standard practice for these authors, many regulatory agencies frown upon the practice.

Missing Values

The book suggests two methods for dealing with unusual values:

  1. Delete the observation — remove the row from the data frame.
  2. Replace the unusual value with NA.

Removal of the row can be done using filter. This method was covered in earlier sections.

  (diamonds2 <- diamonds %>%
       filter(between(y, 3, 20)))
## # A tibble: 53,931 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,921 more rows
## # ℹ Use `print(n = ...)` to see more rows

Note that diamonds has 53940 observations and diamonds2 has 53931.

Replacing the unusual values requires the use of ifelse. You can learn more about ifelse by entering “?ifelse” into the console. Alternatively, you can try “args(ifelse)”.

  (diamonds2 <- diamonds %>%
     mutate(y = ifelse(y < 3 | y > 20, NA, y))
  )
## # A tibble: 53,940 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows
## # ℹ Use `print(n = ...)` to see more rows

As a reminder, diamonds has 53940 observations. Because we have replaced those values that are less than 3 or greater than 20 with NA, diamonds2 retains 53940 observations.

ggplot2 will warn you if you have missing values in your data. Removing the NAs using filter as above or through the use of na.rm = TRUE will supress the warning.

Remember that missing values may be missing for a reason. When asking people for information, a non-reponse is a response. The patterns in non-response are often as informative as the information that is collected from respondents. Always ask why a value is missing.

  ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
    geom_point()
## Warning: Removed 9 rows containing missing values (geom_point).

Note that ggplot has indicated that nine missing values were removed. To supress the warning, use na.rm.

  ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
    geom_point(na.rm = TRUE)

We get the same plot without the warning message.

Covariation

Covariation describes the relationship between two variables. In the plot above we see that x and y are positively associated — larger values of x are associated with larger values of y. Not only that, but the points seem to be linearly associated. While the scatterplot above works well for two quantitative variables, other plots are needed when looking at other types of variables.

A Categorical and Continuous Variable

Looking at the distribution of a quantitative variable for different levels of a categorical variable is a common task. When we lack balance across the categories — that is, the levels of the categorical variable contain different numbers of observations — comparison by counts can be difficult. In cases where the data are unbalanced, histograms and frequency polygons can fail.

We now turn to the text’s example of diamond price (quantitative) as a function of cut.

  ggplot(data = diamonds, mapping = aes(x = price)) +
    geom_freqpoly(mapping = aes(color = cut), binwidth = 500)

The default plot created above is counts based. Because there are a lot more “Ideal” cut than “Fair” cut, comparisons between cut levels are difficult. The authors suggest using density instead of count as the density essentially balances the data.

For those with a calculus background, a density is rescaled so that the integral over possible (continuous) data values will be one. For discrete data we use proportions which will sum to one.

The rescaled, density plots look like:

  ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
    geom_freqpoly(mapping = aes(color = cut), binwidth = 500)

While patterns across price are now easier to see — “Fair” has a peak at $2500 that other cuts cannot beat — we have thrown away the number of observations that support these “trends.”

A much more useful plot is the boxplot. The boxplot is based upon the five-number-summary of a quantitative variable.

  summary(diamonds$price)  # R base
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18823
  fivenum(diamonds$price)  # R stats
## [1]   326.0   950.0  2401.0  5324.5 18823.0
  ggplot(data = diamonds) +
    geom_boxplot(mapping = aes(y = price))

We see that the features of the boxplot match up with the five number summary. However, the x-axis is a bit confusing. We remove the scale by plotting against nothing.

  ggplot(data = diamonds) +
    geom_boxplot(mapping = aes(x = "", y = price)) + xlab("") + coord_flip()

Now that we recognize what the boxplot is trying to show us, it is time to plot individual boxplots for each of the levels of cut side by side. Use of coord_flip makes reading the “long” cut values easier.

  ggplot(data = diamonds) +
    geom_boxplot(mapping = aes(x = cut, y = price)) + 
    coord_flip()

It is strange that the price of “Ideal” cut diamonds is lower (on median) than other cuts. We also see that “Fair” cut have the highest median. Setting the price of a diamond must be more complex than can be explained by cut alone.

The text notes that cut is ordinal and that plotting in the “natural” order shows a pattern — although it is not the one we expected. In some situations we may wish to use the values of the response variable to order the values of the explanatory variable. The example of highway gas mileage as a function of vehicle class shows how to rearrange the boxplots by increasing response median. Using this technique on the diamond data may provide additional insight into the pricing.

  ggplot(data = diamonds) +
    geom_boxplot(mapping = aes(x = reorder(cut, price, FUN = median), 
                               y = price
                              )
                ) + 
    coord_flip()

A variation of the boxplot, violin plots use density to provide a better idea of where points are within the quartiles.

  ggplot(data = diamonds) +
    geom_violin(mapping = aes(x = reorder(cut, price, FUN = median), 
                               y = price
                              ),
                draw_quantiles = c(0.25, 0.5, 0.75)
                ) +
    xlab("Cut") + 
    ylab("Price") +
    coord_flip()

Overlaying violin plots and boxplots is the best of both worlds.

  ggplot(data = diamonds) +
    geom_violin(mapping = aes(x = cut,  
                               y = price
                              )
                ) +
    geom_boxplot(mapping = aes(x = cut,  
                               y = price
                              ),
                 width = 0.05, color="skyblue"
                ) +  
    xlab("Cut") + 
    ylab("Price ($US)") #+ coord_flip()

Two Gategorical Variables

Analysis of categorical data is based upons counts. Whether we have one or two variables, we are still stuck looking at counts. The text suggests that we use geom_count. As in the text, we will continue to look at the cut and color variables from the diamonds data.

  ggplot(data = diamonds) + 
    geom_count(mapping = aes(x = cut, y = color))

The authors use of the word “correlation” when looking for patterns in categorical data is incorrect. Correlation describes the strength and direction of a linear association between two numeric variables.

Circle sizes make it possible to look for patterns that exist in categorical data. However, there are more meaningful ways of looking for these relationships. Perhaps the most simple, and also most informative, is a table of counts or proportions.

  ### Counts by dplyr
  diamonds %>%
    count(color, cut)
## # A tibble: 35 × 3
##    color cut           n
##    <ord> <ord>     <int>
##  1 D     Fair        163
##  2 D     Good        662
##  3 D     Very Good  1513
##  4 D     Premium    1603
##  5 D     Ideal      2834
##  6 E     Fair        224
##  7 E     Good        933
##  8 E     Very Good  2400
##  9 E     Premium    2337
## 10 E     Ideal      3903
## # … with 25 more rows
## # ℹ Use `print(n = ...)` to see more rows
  # or
  diamonds %>%
    group_by(color, cut)%>%
    summarize(n=n())
## `summarise()` has grouped output by 'color'. You can override using the
## `.groups` argument.
## # A tibble: 35 × 3
## # Groups:   color [7]
##    color cut           n
##    <ord> <ord>     <int>
##  1 D     Fair        163
##  2 D     Good        662
##  3 D     Very Good  1513
##  4 D     Premium    1603
##  5 D     Ideal      2834
##  6 E     Fair        224
##  7 E     Good        933
##  8 E     Very Good  2400
##  9 E     Premium    2337
## 10 E     Ideal      3903
## # … with 25 more rows
## # ℹ Use `print(n = ...)` to see more rows
  # or, maybe easier to read
  diamonds %>%
    group_by(color, cut) %>%
    summarise(n=n()) %>%
    spread(cut, n) 
## `summarise()` has grouped output by 'color'. You can override using the
## `.groups` argument.
## # A tibble: 7 × 6
## # Groups:   color [7]
##   color  Fair  Good `Very Good` Premium Ideal
##   <ord> <int> <int>       <int>   <int> <int>
## 1 D       163   662        1513    1603  2834
## 2 E       224   933        2400    2337  3903
## 3 F       312   909        2164    2331  3826
## 4 G       314   871        2299    2924  4884
## 5 H       303   702        1824    2360  3115
## 6 I       175   522        1204    1428  2093
## 7 J       119   307         678     808   896
  ### Counts by base R
  table(diamonds[,c("color", "cut")])
##      cut
## color Fair Good Very Good Premium Ideal
##     D  163  662      1513    1603  2834
##     E  224  933      2400    2337  3903
##     F  312  909      2164    2331  3826
##     G  314  871      2299    2924  4884
##     H  303  702      1824    2360  3115
##     I  175  522      1204    1428  2093
##     J  119  307       678     808   896
  ### Proportion by dplyr
  (diam.dplyrprop <- diamonds%>%
    group_by(cut, color)%>%
    summarize(n=n())%>%
    mutate(prop=n/sum(n))%>%
    subset(select=c("cut","color","prop"))%>%   #drop the frequency value
    spread(cut, prop)
   )
## `summarise()` has grouped output by 'cut'. You can override using the `.groups`
## argument.
## # A tibble: 7 × 6
##   color   Fair   Good `Very Good` Premium  Ideal
##   <ord>  <dbl>  <dbl>       <dbl>   <dbl>  <dbl>
## 1 D     0.101  0.135       0.125   0.116  0.132 
## 2 E     0.139  0.190       0.199   0.169  0.181 
## 3 F     0.194  0.185       0.179   0.169  0.178 
## 4 G     0.195  0.178       0.190   0.212  0.227 
## 5 H     0.188  0.143       0.151   0.171  0.145 
## 6 I     0.109  0.106       0.0997  0.104  0.0971
## 7 J     0.0739 0.0626      0.0561  0.0586 0.0416
  ### Proportion by hand using base R
  print(round(diam.tableprop <- table(diamonds[,c("color", "cut")])/nrow(diamonds), digits=3))
##      cut
## color  Fair  Good Very Good Premium Ideal
##     D 0.003 0.012     0.028   0.030 0.053
##     E 0.004 0.017     0.044   0.043 0.072
##     F 0.006 0.017     0.040   0.043 0.071
##     G 0.006 0.016     0.043   0.054 0.091
##     H 0.006 0.013     0.034   0.044 0.058
##     I 0.003 0.010     0.022   0.026 0.039
##     J 0.002 0.006     0.013   0.015 0.017

Note that the proportions should sum to one. Looking at the proportions in the first row of the dplyr table, we see that the sum is greater than one. For the table table, the sum of the entire table is one.

  ### Table sum of dplyr proportions
  sum(diam.dplyrprop[-1]) # drop the label column (col 1)
## [1] 5
  ### Table sum of "table" proportions
  sum(diam.tableprop)
## [1] 1
  ### Row sums of dplyr proportions
  apply(diam.dplyrprop[-1], 1, sum) 
## [1] 0.6091439 0.8785120 0.9047372 1.0014994 0.7979242 0.5154126 0.2927707
  ### Row sums of "table" proportions
  apply(diam.tableprop, 1, sum)
##          D          E          F          G          H          I          J 
## 0.12560252 0.18162773 0.17690026 0.20934372 0.15394883 0.10051910 0.05205784
  ### Column sums of dplyr proportions
  apply(diam.dplyrprop[-1], 2, sum) 
##      Fair      Good Very Good   Premium     Ideal 
##         1         1         1         1         1
  ### Column sums of "table" proportions
  apply(diam.tableprop, 2, sum)
##       Fair       Good  Very Good    Premium      Ideal 
## 0.02984798 0.09095291 0.22398962 0.25567297 0.39953652
  ### Clean up
  rm("diam.tableprop", "diam.dplyprop")
## Warning in rm("diam.tableprop", "diam.dplyprop"): object 'diam.dplyprop' not
## found

We see that the proportions generated by hand using table are table proportions. Those generated using dplyr are column proportions. Apparently, we need to be careful how we compute our proportions. Thankfully, table and a few associated functions make this easy.

  ### Generate counts table
  (diam.table <- table(diamonds$cut, diamonds$color)
  )
##            
##                D    E    F    G    H    I    J
##   Fair       163  224  312  314  303  175  119
##   Good       662  933  909  871  702  522  307
##   Very Good 1513 2400 2164 2299 1824 1204  678
##   Premium   1603 2337 2331 2924 2360 1428  808
##   Ideal     2834 3903 3826 4884 3115 2093  896
  # or use ftable on the table
  ftable(diam.table)
##               D    E    F    G    H    I    J
##                                              
## Fair        163  224  312  314  303  175  119
## Good        662  933  909  871  702  522  307
## Very Good  1513 2400 2164 2299 1824 1204  678
## Premium    1603 2337 2331 2924 2360 1428  808
## Ideal      2834 3903 3826 4884 3115 2093  896
  ### Get marginal distribution of cut (rows = dim 1)
  margin.table(diam.table, 1)
## 
##      Fair      Good Very Good   Premium     Ideal 
##      1610      4906     12082     13791     21551
  ### Get marginal distributon of color (columns = dim 2)
  margin.table(diam.table, 2)
## 
##     D     E     F     G     H     I     J 
##  6775  9797  9542 11292  8304  5422  2808
  ### Get table proportions
  (temp <- prop.table(diam.table))
##            
##                       D           E           F           G           H
##   Fair      0.003021876 0.004152762 0.005784205 0.005821283 0.005617353
##   Good      0.012272896 0.017296997 0.016852058 0.016147571 0.013014461
##   Very Good 0.028049685 0.044493882 0.040118650 0.042621431 0.033815350
##   Premium   0.029718205 0.043325918 0.043214683 0.054208380 0.043752317
##   Ideal     0.052539859 0.072358176 0.070930664 0.090545050 0.057749351
##            
##                       I           J
##   Fair      0.003244346 0.002206155
##   Good      0.009677419 0.005691509
##   Very Good 0.022321098 0.012569522
##   Premium   0.026473860 0.014979607
##   Ideal     0.038802373 0.016611049
  sum(temp)
## [1] 1
  ### Get row proportions
  (temp <- prop.table(diam.table, 1))
##            
##                      D          E          F          G          H          I
##   Fair      0.10124224 0.13913043 0.19378882 0.19503106 0.18819876 0.10869565
##   Good      0.13493681 0.19017530 0.18528333 0.17753771 0.14309009 0.10640033
##   Very Good 0.12522761 0.19864261 0.17910942 0.19028307 0.15096838 0.09965238
##   Premium   0.11623523 0.16945834 0.16902328 0.21202233 0.17112610 0.10354579
##   Ideal     0.13150202 0.18110529 0.17753237 0.22662521 0.14454086 0.09711846
##            
##                      J
##   Fair      0.07391304
##   Good      0.06257644
##   Very Good 0.05611654
##   Premium   0.05858893
##   Ideal     0.04157580
  apply(temp, 1, sum)
##      Fair      Good Very Good   Premium     Ideal 
##         1         1         1         1         1
  ### Get column proportions
  (temp <- prop.table(diam.table, 2))
##            
##                      D          E          F          G          H          I
##   Fair      0.02405904 0.02286414 0.03269755 0.02780730 0.03648844 0.03227591
##   Good      0.09771218 0.09523323 0.09526305 0.07713425 0.08453757 0.09627444
##   Very Good 0.22332103 0.24497295 0.22678684 0.20359547 0.21965318 0.22205828
##   Premium   0.23660517 0.23854241 0.24428841 0.25894439 0.28420039 0.26337145
##   Ideal     0.41830258 0.39838726 0.40096416 0.43251860 0.37512042 0.38601992
##            
##                      J
##   Fair      0.04237892
##   Good      0.10933048
##   Very Good 0.24145299
##   Premium   0.28774929
##   Ideal     0.31908832
  apply(temp, 2, sum)
## D E F G H I J 
## 1 1 1 1 1 1 1
  ### Use CrossTable to get it all at once
  if (!require(gmodels)){
    install.packages("gmodels", dependencies=TRUE)
    load(gmodels)
  }
## Loading required package: gmodels
  CrossTable(diamonds$cut, diamonds$color, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  53940 
## 
##  
##              | diamonds$color 
## diamonds$cut |         D |         E |         F |         G |         H |         I |         J | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##         Fair |       163 |       224 |       312 |       314 |       303 |       175 |       119 |      1610 | 
##              |     0.101 |     0.139 |     0.194 |     0.195 |     0.188 |     0.109 |     0.074 |     0.030 | 
##              |     0.024 |     0.023 |     0.033 |     0.028 |     0.036 |     0.032 |     0.042 |           | 
##              |     0.003 |     0.004 |     0.006 |     0.006 |     0.006 |     0.003 |     0.002 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##         Good |       662 |       933 |       909 |       871 |       702 |       522 |       307 |      4906 | 
##              |     0.135 |     0.190 |     0.185 |     0.178 |     0.143 |     0.106 |     0.063 |     0.091 | 
##              |     0.098 |     0.095 |     0.095 |     0.077 |     0.085 |     0.096 |     0.109 |           | 
##              |     0.012 |     0.017 |     0.017 |     0.016 |     0.013 |     0.010 |     0.006 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##    Very Good |      1513 |      2400 |      2164 |      2299 |      1824 |      1204 |       678 |     12082 | 
##              |     0.125 |     0.199 |     0.179 |     0.190 |     0.151 |     0.100 |     0.056 |     0.224 | 
##              |     0.223 |     0.245 |     0.227 |     0.204 |     0.220 |     0.222 |     0.241 |           | 
##              |     0.028 |     0.044 |     0.040 |     0.043 |     0.034 |     0.022 |     0.013 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##      Premium |      1603 |      2337 |      2331 |      2924 |      2360 |      1428 |       808 |     13791 | 
##              |     0.116 |     0.169 |     0.169 |     0.212 |     0.171 |     0.104 |     0.059 |     0.256 | 
##              |     0.237 |     0.239 |     0.244 |     0.259 |     0.284 |     0.263 |     0.288 |           | 
##              |     0.030 |     0.043 |     0.043 |     0.054 |     0.044 |     0.026 |     0.015 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##        Ideal |      2834 |      3903 |      3826 |      4884 |      3115 |      2093 |       896 |     21551 | 
##              |     0.132 |     0.181 |     0.178 |     0.227 |     0.145 |     0.097 |     0.042 |     0.400 | 
##              |     0.418 |     0.398 |     0.401 |     0.433 |     0.375 |     0.386 |     0.319 |           | 
##              |     0.053 |     0.072 |     0.071 |     0.091 |     0.058 |     0.039 |     0.017 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total |      6775 |      9797 |      9542 |     11292 |      8304 |      5422 |      2808 |     53940 | 
##              |     0.126 |     0.182 |     0.177 |     0.209 |     0.154 |     0.101 |     0.052 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 
  ### Cleanup
  rm("temp", "diam.table")

We now have counts and proportions that describe different relationships between two categorical variables. The book shows how to plot the frequencies using a “heat map.”

  diamonds %>%
    count(color, cut) %>%
    ggplot(mapping = aes(x = color, y = cut)) +
      geom_tile(mapping = aes(fill = n))

The mosaic package provides a more flexible method for dealing with categorical data. The package is capable of looking at more than two variables at a time.

  ### Set up colors to reflect that yellow diamonds are not prized
  ramp <- colorRamp(c("white", "light yellow"))
  diamond.colors <- rgb( ramp(seq(0, 1, length = 8)), max = 255)
  mosaicplot(table(diamonds[,c("cut","color")]), 
             main="Diamonds",
             col = diamond.colors
  )

  ### Set up colors to reflect that inclusions are not prized
  ramp <- colorRamp(c("black", "white"))
  diamond.colors <- rgb( ramp(seq(0, 1, length = 5)), max = 255)
  
  mosaicplot(table(diamonds$cut, 
                   diamonds$color, 
                   factor(substr(diamonds$clarity, 1,2),      
                           levels=c("I1","SI","VS","VV","IF")
                   )
             ), 
             main="Diamonds",
             col = diamond.colors
  )

A word of warning. For very large data sets, table based methods can be slower than dplyr based approaches. Of course, the table methods are more predictable in their proportions and, once the base table is created, marginal distributions are quick.

Two Continuous Variables

The base plot for looking at the relationship between two continuous variables is the scatterplot. Continuing the diamond price analysis, we can check to see if size matters.

  ggplot(data = diamonds) +
    geom_point(mapping = aes(x = carat, y = price),
               alpha = 0.01
              )

One method for reducing data is the use of binning. Two methods — one based on rectangular bins and the other on hexagonal bins — are shown. Output from the methods are presented below.

The text suggests categorizing carat and plotting using boxplots. Hidden in the text is a suggestion to use varwidth=TRUE to compensate for different numbers of observations withing each of the bins.

  ggplot(data=diamonds2) +
    geom_bin2d(mapping = aes(x = carat, y = price),
               bins = 50)

  ### Install or load hexbin
  if (!require(hexbin)){
       install.packages("hexbin", dependencies = TRUE)
       require(hexbin)
     }
## Loading required package: hexbin
  ggplot(data=diamonds2) +
    geom_hex(mapping = aes(x = carat, y = price),
             bins=50
             )

The text suggests categorizing carat and plotting using boxplots. Hidden in the text is a suggestion to use varwidth=TRUE to visualize the different number of observations within each of the bins.

  ### Equal width with variable number of observations
  ggplot(data = diamonds2, mapping = aes(x = carat, y = price)) +
    geom_boxplot(mapping = aes(group = cut_width(carat, 0.2)),
                 varwidth = TRUE
                )

  ### Equal number of observations with variable width
  ###    Use 25 intervals
  ggplot(data = diamonds2, mapping = aes(x = carat, y = price)) +
    geom_boxplot(mapping = aes(group = cut_number(carat, 25)),
                 varwidth = TRUE
                )

Patterns and Models

As noted in the text, EDA is about looking for patterns which are indicative of relationships. The authors list a number of questions that they suggest you ask yourself should you find a pattern:

  • Is the pattern due to random chance?
  • What is the relationship described by the pattern?
  • How strong is the relationship?
  • Which other variables might affect the relationship?
  • Is the relationship different within subgroups?

Some of these questions suggest that the relationship is causal. As noted earlier, causal inference requires random sampling from the population of interest. One should always be careful that the method of collection has not induced bias. Serial autocorrelation is often a “pattern” observed in data that are collected longitudinally — that is, over time.

One of the examples that the text provides is an attempt to explain the cost of diamonds. The authors choose to start by fitting a line through the scatterplot of price as a function of size that we viewed above. Fitting regression lines without a careful analysis of residuals is a dangerous practice.

  p <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
       geom_point(alpha = 0.01)
  p

The raw points do not appear to be linearly associated. We can add a couple of smooths to check this.

  p + geom_smooth(method = "lm", color = "blue") +
      geom_smooth(method = "lm", formula = y ~ exp(x), color = "red") +
      geom_smooth(method = "lm", formula = y ~ poly(x, 5), color = "green") +
      coord_cartesian(ylim=c(0,40000))
## `geom_smooth()` using formula 'y ~ x'

Because we can, let us take a cursory look at the quality of the linear regression model. The easiest way to do this is to plot the residuals.

  ### Install or load the modelr package
  if (!require(modelr)){
    install.packages("modelr", dependencies = TRUE)
    require(modelr)
  }
## Loading required package: modelr
  ### Fit the linear model of price as a function of carat
  price.lm.carat <- lm(price ~ carat, data = diamonds)
  price.lm.carat # blue line above
## 
## Call:
## lm(formula = price ~ carat, data = diamonds)
## 
## Coefficients:
## (Intercept)        carat  
##       -2256         7756
  ### Create new tibble with residuals added
  diamonds2 <- diamonds %>%
    add_residuals(price.lm.carat)
  
  ### Plot the residuals
  ggplot(data = diamonds2, mapping = aes(x = carat, y = resid)) +
    geom_point() +
    geom_hline(yintercept =  0, color="lightblue")

It is pretty clear that the simple linear model is not a good fit for this data. The decreasing pattern that is apparent indicates that carat is not describing all of a diamond’s price.

Before fitting a line to the data, the authors transform both variables by taking logarithms. The new plot looks like.

  p <- ggplot(data = diamonds, mapping = aes(x = log(carat), y = log(price))) +
       geom_point(alpha = 0.01)
  p

While the plot appears to be more linear than that of the non-transformed data, it maintains some banding and other “patterns” that need to be accounted for. Fitting a line to the log data produces the plot below.

  p + geom_smooth(method = "lm", color = "blue") 
## `geom_smooth()` using formula 'y ~ x'

The residuals from this model look like:

  ### Install or load the modelr package
  if (!require(modelr)){
    install.packages("modelr", dependencies = TRUE)
    require(modelr)
  }

  ### Fit the linear model of price as a function of carat
  (logprice.lm.logcarat <- lm(log(price) ~ log(carat), data = diamonds))
## 
## Call:
## lm(formula = log(price) ~ log(carat), data = diamonds)
## 
## Coefficients:
## (Intercept)   log(carat)  
##       8.449        1.676
  ### Create new tibble with residuals added
  diamonds2 <- diamonds %>%
    add_residuals(logprice.lm.logcarat)
  
  ### Plot the residuals
  ggplot(data = diamonds2, mapping = aes(x = log(carat), y = resid)) +
    geom_point() +
    geom_hline(yintercept =  0, color="lightblue")

The log-log model still seems to have some patterns. There appears to be some clumping and the variability seems to increase as log(carat) increases. Note that the authors have plotted the residuals using their natural scale — \(e^{\ln(y)} = y\) and they use carat and not log(carat) on the x-axis.

  ### Create a transformed variable
  diamonds2 <- diamonds2 %>%
    mutate(expresid = exp(resid))

  ### Plot the transformed residuals
  ggplot(data = diamonds2, mapping = aes(x = carat, y = expresid)) +
    geom_point() +
    geom_hline(yintercept =  0, color="lightblue")

As the text notes, the transformed residuals suggest that smaller diamonds are being overpriced. The reasons for this may be held in other variables. The authors use boxplots to check the relationship between cut and price having corrected for carat . We can also look at color and clairty using the same approach.

  p <- ggplot(diamonds2) 

  ### Look at cut having corrected for carat
  p + geom_boxplot(aes(x = cut, y = expresid))

  ### Look at color having corrected for carat
  p + geom_boxplot(aes(x = color, y = expresid))

  ### Look at clarity having corrected for carat
  p + geom_boxplot(aes(x = clarity, y = expresid))

All three variables seem to indicate higher prices associated with better quality once we have corrected for size.

On a programming note, as the book indicates we do not have to include argument names with the arguments — provided you enter the arguments in the correct order. To determine proper order you can enter “?function” or “args(function)” in the console.

  args(ggplot)
## function (data = NULL, mapping = aes(), ..., environment = parent.frame()) 
## NULL
  args(geom_point)
## function (mapping = NULL, data = NULL, stat = "identity", position = "identity", 
##     ..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) 
## NULL

Remember that someone else may have to read your code, so the inclusion of argument names may not be such a bad thing.

Another plot may help us see two variables at a time in a single plot.

  p <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
    geom_point(aes(shape = cut), alpha=0.1) +
    geom_smooth(method = "lm")
  p
## Warning: Using shapes for an ordinal variable is not advised
## `geom_smooth()` using formula 'y ~ x'

ggplot2 uses facets to maintain the lattice and trellis plots that were first used in R and S/Plus (respectively). With facets we can include additional variables in our plots.

  p <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
    geom_point(alpha=0.1) +
    geom_smooth(method = "lm")

  p + facet_grid(rows = vars(cut)) + theme(aspect.ratio = 1)
## `geom_smooth()` using formula 'y ~ x'

  p + facet_grid(cols = vars(color)) + theme(aspect.ratio = 1)
## `geom_smooth()` using formula 'y ~ x'

  p + facet_grid(rows = vars(cut), cols = vars(color)) + theme(aspect.ratio = 1)
## `geom_smooth()` using formula 'y ~ x'
## Warning in qt((1 - level)/2, df): NaNs produced

## Warning in qt((1 - level)/2, df): NaNs produced
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

  p + facet_grid(rows = vars(cut, color %in% c("D","E","F"))) + theme(aspect.ratio = 0.75)
## `geom_smooth()` using formula 'y ~ x'
## Warning in qt((1 - level)/2, df): NaNs produced

## Warning in qt((1 - level)/2, df): no non-missing arguments to max; returning
## -Inf

We will look at fitting models in the future.