Chapter 5 covers Exploratory Data Analysis (EDA). The authors refer back to previous chapters and use the techniques of data transformation and visualization covered there to explore data.
EDA is only one area of statistics. It falls within the broader area of descriptive statistics. While EDA may be used for prediction, the methods used for data collection typically do not support inferential applications. Beware of data scientists bearing p-values.
The authors suggest that EDA is an iterative process: generate questions about your data; search for answers by visualizing, transforming, and modeling the data; and use what you learn to refine your questions or generate new ones.
While EDA may be iterative, their first step is flawed. The selection of data is dictated by the questions one wishes to answer. One would not expect data on genes and cancer to provide a model that is predictive of stock market performance.
The first step in any project is to determine the basic question(s) that we wish to answer. These lead to the data that we would like to collect. Lack of availability of data may require that we redefine our questions. This iterative process begins before we start to explore our data.
An article by Mikhail Popov, “Hiring a Data Scientist”, points out that successful data science requires knowledge of three things: computer science, stat/math, and the area of application. Without a reasonable understanding of the topic that you are trying to describe, you will never be able to define questions, collect appropriate data, or recognize patterns in the data.
As noted above, in Chapter 5 the authors use the techniques learned in previous chapters to explore data. To do so we need to load the packages that make up the tidyverse.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Now that we have access to packages like ggplot2 and dplyr, we are ready to explore some data.
The quotes that appear in the text are by two very big names in statistics. In particular, Tukey is seen by many as the father of EDA.
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” —John Tukey
His statement reflects his experience of working in the field of statistics for many years.
The authors’ stated questions (what type of variation occurs within my variables? what type of covariation occurs between my variables?) are technically correct. However, they are not the questions that typically start a study. The questions, as stated, assume that the data have already been collected. For now, we will assume that we do, in fact, have a data set that contains some variables.
Definition: The group/set of things/objects/individuals we are interested in is called the population.
Definition: A sample is the subset of the population that we actually observe/examine.
Definition: We collect data from/on objects. We call these objects cases, units, or (in the case of humans) subjects.
Definition: A variable represents the possible, varied outcomes, or states, of a case in an experiment or study.
Definition: A value is the actual, observed outcome.
Definition: An observation is a set of values obtained from an individual case. A case may provide multiple observations.
Definition: Ordinal data are data that can be ordered but do not take on specific numerical values.
Definition: Data are on a nominal scale if different values can be classified into categories, but the categories have no specific order.
Definition: Data that are ordinal or nominal are called categorical or qualitative — also known as factor in R.
Definition: Cardinal or quantitative data are data that are on a scale where it is meaningful to measure the distance between possible data values.
Definition: Quantitative data that take on a countable number of possible values are called discrete.
Definition: Quantitative data that take on an uncountable number of possible outcomes are called continuous.
Definition: A response, outcome, or dependent variable contains information about the result of a study.
Definition: An explanatory or independent variable is used to try to explain the outcome variable.
Definition: A parameter is an unknown value which is descriptive of the population.
Definition: A statistic is a function of the sample data that does not depend upon any parameters. Statistics are often used to estimate parameters.
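A small simulated example may help distinguish the two. The population mean below is a parameter (knowable here only because we generated the population ourselves); the sample mean is a statistic that estimates it. The numbers are, of course, hypothetical.
### A toy sketch: parameter versus statistic (simulated data)
set.seed(123)
population <- rnorm(1e6, mean = 100, sd = 15)  # the full population (rarely observable)
mean(population)  # a parameter: descriptive of the population
obs <- sample(population, size = 50)  # the sample we actually observe
mean(obs)  # a statistic: computed from the sample, estimates the parameter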
The book’s definition of tidy data is what is often referred to as flat data. It is what we hope to receive, but rarely do. Getting data into a tidy form requires wrangling (see Chapter 9).
Life would be much simpler if values for a variable did not vary. To determine the height of college students, we could measure a single student and be done. To understand the heights of students we need to understand how the values for different individuals/observations vary — their variability. A description of the distribution of the values can often be obtained graphically.
In R, categorical variables are stored as factor variables. The nice thing about this is that most of R’s functions recognize the factor data type and treat it appropriately.
We can create a variable, height, with values “tall”, “average”, and “short” for seven individuals. R views the height variable as a character variable.
height <- c("tall", "short", "average", "average", "short", "average", "tall")
is.character(height)
## [1] TRUE
height
## [1] "tall" "short" "average" "average" "short" "average" "tall"
attributes(height)
## NULL
The variable can be converted to a factor variable using the as.factor function.
ht <- as.factor(height)
attributes(ht)
## $levels
## [1] "average" "short" "tall"
##
## $class
## [1] "factor"
ht
## [1] tall short average average short average tall
## Levels: average short tall
Note that R has ordered the factor levels alphabetically. If this is not the order we want, a couple of methods can be used to change the ordering.
### Reorder levels of a factor
ht <- ordered(ht, levels=c("short", "average", "tall"))
attributes(ht)
## $levels
## [1] "short" "average" "tall"
##
## $class
## [1] "ordered" "factor"
ht
## [1] tall short average average short average tall
## Levels: short < average < tall
### Convert to proper order the first time
ht <- ordered(height, levels=c("short","average","tall"))
attributes(ht)
## $levels
## [1] "short" "average" "tall"
##
## $class
## [1] "ordered" "factor"
ht
## [1] tall short average average short average tall
## Levels: short < average < tall
Again, R is smart. We should not use arithmetic on categorical data, and R will not let us. Suppose that we code an ordinal variable as “1” through “3”. Many packages will happily, and improperly, compute means and standard deviations; R, recognizing the variable as a factor, will refuse.
x <- c(1, 3, 2, 2, 3, 1, 1, 1)
y <- ordered(x, levels=c(1, 2, 3))
mean(x)
## [1] 1.75
mean(y)
## Warning in mean.default(y): argument is not numeric or logical: returning NA
## [1] NA
The text takes a look at the diamonds data frame. This data frame contains information on the 4C’s (cut, carat, color, clarity) and the price of diamonds. The variable cut describes the quality of the way the diamond was cut.
The proper plot for showing the number, proportion, or percentage of observations that fall within each of the levels of a categorical variable is the bar chart.
attributes(diamonds$cut)
## $class
## [1] "ordered" "factor"
##
## $levels
## [1] "Fair" "Good" "Very Good" "Premium" "Ideal"
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
### Get the counts in each bar
diamonds %>%
count(cut)
## # A tibble: 5 × 2
## cut n
## <ord> <int>
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
Note that cut is an ordinal variable. The variable carat is quantitative. A reasonable method of displaying the distribution of the values of a quantitative variable is the histogram. We create a histogram of carat as follows:
attributes(diamonds$carat)
## NULL
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
### There are too many bins so try custom bins
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), breaks=seq(0, 5, by=0.25))
ggplot(data = diamonds) +
geom_dotplot(mapping = aes(x = carat))
## Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.
Tables of the number of observations in each bin can be easily computed.
diamonds %>%
count(cut_width(carat, width=0.25, boundary=0.0))
## # A tibble: 19 × 2
## `cut_width(carat, width = 0.25, boundary = 0)` n
## <fct> <int>
## 1 [0,0.25] 785
## 2 (0.25,0.5] 18147
## 3 (0.5,0.75] 11351
## 4 (0.75,1] 6155
## 5 (1,1.25] 9822
## 6 (1.25,1.5] 2238
## 7 (1.5,1.75] 3075
## 8 (1.75,2] 478
## 9 (2,2.25] 1524
## 10 (2.25,2.5] 239
## 11 (2.5,2.75] 83
## 12 (2.75,3] 11
## 13 (3,3.25] 21
## 14 (3.25,3.5] 2
## 15 (3.5,3.75] 3
## 16 (3.75,4] 1
## 17 (4,4.25] 3
## 18 (4.25,4.5] 1
## 19 (5,5.25] 1
Diamonds larger than three carats are not the norm. We can exclude them and replot.
diamonds.sub3 <- diamonds %>%
filter(carat < 3)
ggplot(data = diamonds.sub3) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.1)
diamonds %>%
filter(carat < 3) %>%
ggplot() +
geom_histogram(mapping = aes(x = carat), binwidth = 0.1)
Note that there appear to be peaks at multiples of 1/2 carat. We might be interested in whether the distribution of diamonds by size is the same across cut. Plotting multiple bars on the same plot makes interpretation difficult. The text suggests using frequency polygons instead.
ggplot(data = diamonds.sub3, mapping = aes(x = carat, color = cut)) +
geom_freqpoly(binwidth = 0.1)
The bumps that we saw before are apparent in the frequency polygons. The patterns appear to be similar across cut. Because of the number of observations (53,940), the authors’ suggestion of using a lot of narrow bins makes sense.
ggplot(data = diamonds.sub3) +
geom_histogram(aes(x=carat), binwidth = 0.01)
Many introductory stat texts use dotplots instead of histograms with very narrow bins. These are easy to produce using ggplot2.
ggplot(data=diamonds, aes(x = carat)) +
geom_dotplot(binwidth=0.05, method='histodot')
Does the y-axis make sense? A quick look at the documentation and at comments on the web will tell you that the y-axis problem is a “feature”: dot sizes are determined after the counts are binned, so ggplot2 cannot construct a meaningful y scale.
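One workaround, mentioned in the geom_dotplot documentation, is simply to hide the uninterpretable axis; a sketch:
### Hide the meaningless dotplot y-axis
ggplot(data = diamonds, aes(x = carat)) +
  geom_dotplot(binwidth = 0.05, method = 'histodot') +
  scale_y_continuous(NULL, breaks = NULL)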
Definition: An outlier is a point that does not fit the patterns in the rest of the data.
Often, outliers are seen at the extremes of the observed values of a variable. Typically, they are offset by “gaps.” The text indicates that finding outliers in large data sets can be difficult. The authors show how to zoom in on low frequency values using coord_cartesian.
p <- ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5)
p
p + coord_cartesian(ylim = c(0, 50))
What is meant by “x = y”? It turns out that there is a “y” variable in the diamonds data frame. The code uses the “y” variable as the x variable in the histogram.
If you enter “?diamonds” or “help(diamonds)” in the console, you can get a description of the data frame. Within this description you will find that “y” is the “width in mm (0–58.9)” of a diamond. Does this information make the “unusual” values more understandable?
The text shows how to look at the observations that contain the unusual values.
(diamonds %>%
filter(y < 3 | y > 20) %>%
arrange(y)
)
## # A tibble: 9 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 Very Good H VS2 63.3 53 5139 0 0 0
## 2 1.14 Fair G VS1 57.5 67 6381 0 0 0
## 3 1.56 Ideal G VS2 62.2 54 12800 0 0 0
## 4 1.2 Premium D VVS1 62.1 59 15686 0 0 0
## 5 2.25 Premium H SI2 62.8 59 18034 0 0 0
## 6 0.71 Good F SI2 64.1 60 2130 0 0 0
## 7 0.71 Good F SI2 64.1 60 2130 0 0 0
## 8 0.51 Ideal E VS1 61.8 55 2075 5.15 31.8 5.12
## 9 2 Premium H SI2 58.9 57 12210 8.09 58.9 8.06
The book suggests running analyses with and without the unusual observations to see how “influential” they are — how much they change the outcome. The authors then suggest dropping a value — replacing the value with NA — when the analysis is basically unchanged. While this may be standard practice for these authors, many regulatory agencies frown upon the practice.
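A minimal sketch of such an influence check (our code, not the text’s): fit the price-carat line with and without the unusual rows flagged above and compare the coefficients.
### Influence check: coefficients with and without the unusual rows
coef(lm(price ~ carat, data = diamonds))
coef(lm(price ~ carat, data = filter(diamonds, between(y, 3, 20))))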
The book suggests two methods for dealing with unusual values: drop the offending rows entirely, or replace the unusual values with missing values (NA).
Removal of rows can be done using filter. This method was covered in earlier sections.
(diamonds2 <- diamonds %>%
filter(between(y, 3, 20)))
## # A tibble: 53,931 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,921 more rows
## # ℹ Use `print(n = ...)` to see more rows
Note that diamonds has 53940 observations and diamonds2 has 53931.
Replacing the unusual values requires the use of ifelse. You can learn more about ifelse by entering “?ifelse” into the console. Alternatively, you can try “args(ifelse)”.
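For example,
args(ifelse)
## function (test, yes, no) 
## NULL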
(diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
)
## # A tibble: 53,940 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
## # ℹ Use `print(n = ...)` to see more rows
As a reminder, diamonds has 53940 observations. Because we have replaced those values that are less than 3 or greater than 20 with NA, diamonds2 retains 53940 observations.
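We can confirm that exactly nine values, matching the nine unusual rows found above, are now missing.
### Count the values that were replaced with NA
sum(is.na(diamonds2$y))
## [1] 9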
ggplot2 will warn you if you have missing values in your data. Removing the NAs using filter as above, or through the use of na.rm = TRUE, will suppress the warning.
Remember that missing values may be missing for a reason. When asking people for information, a non-response is a response. The patterns in non-response are often as informative as the information that is collected from respondents. Always ask why a value is missing.
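The text makes this point with cancelled flights in the nycflights13 data, comparing scheduled departure times of cancelled and non-cancelled flights. A sketch along those lines (assuming the nycflights13 package is installed):
library(nycflights13)
flights %>%
  mutate(cancelled = is.na(dep_time),
         sched_hour = sched_dep_time %/% 100,
         sched_min = sched_dep_time %% 100,
         sched_dep = sched_hour + sched_min / 60) %>%
  ggplot(mapping = aes(x = sched_dep)) +
  geom_freqpoly(mapping = aes(color = cancelled), binwidth = 1/4)
Returning to the diamonds data: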
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point()
## Warning: Removed 9 rows containing missing values (geom_point).
Note that ggplot has indicated that nine missing values were removed. To suppress the warning, use na.rm.
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point(na.rm = TRUE)
We get the same plot without the warning message.
Covariation describes the relationship between two variables. In the plot above we see that x and y are positively associated — larger values of x are associated with larger values of y. Not only that, but the points seem to be linearly associated. While the scatterplot above works well for two quantitative variables, other plots are needed when looking at other types of variables.
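Since x and y are both quantitative, we can quantify the strength of the linear association just described with the correlation coefficient (our addition, not the text’s):
### Pearson correlation of x and y, ignoring the NAs we created
cor(diamonds2$x, diamonds2$y, use = "complete.obs")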
Looking at the distribution of a quantitative variable for different levels of a categorical variable is a common task. When we lack balance across the categories — that is, the levels of the categorical variable contain different numbers of observations — comparison by counts can be difficult. In cases where the data are unbalanced, histograms and frequency polygons can fail.
We now turn to the text’s example of diamond price (quantitative) as a function of cut.
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_freqpoly(mapping = aes(color = cut), binwidth = 500)
The default plot created above is count based. Because there are many more “Ideal” cut diamonds than “Fair” cut diamonds, comparisons between cut levels are difficult. The authors suggest using density instead of count, as the density essentially balances the data.
For those with a calculus background, a density is rescaled so that the integral over possible (continuous) data values will be one. For discrete data we use proportions which will sum to one.
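For the discrete case, we can ask geom_bar for proportions using the same dot-dot notation that the text uses for densities; a quick sketch:
### Bar chart of proportions (rather than counts)
ggplot(data = diamonds, mapping = aes(x = cut, y = ..prop.., group = 1)) +
  geom_bar()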
The rescaled density plots look like this:
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
geom_freqpoly(mapping = aes(color = cut), binwidth = 500)
While patterns across price are now easier to see — “Fair” has a peak at $2500 that other cuts cannot beat — we have thrown away the number of observations that support these “trends.”
A much more useful plot is the boxplot. The boxplot is based upon the five-number summary of a quantitative variable.
summary(diamonds$price) # R base
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 950 2401 3933 5324 18823
fivenum(diamonds$price) # R stats
## [1] 326.0 950.0 2401.0 5324.5 18823.0
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(y = price))
We see that the features of the boxplot match up with the five-number summary. However, the x-axis is a bit confusing. We remove the scale by plotting against nothing.
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = "", y = price)) + xlab("") + coord_flip()
Now that we recognize what the boxplot is trying to show us, it is time to plot individual boxplots for each of the levels of cut side by side. Use of coord_flip makes reading the “long” cut values easier.
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = cut, y = price)) +
coord_flip()
It is strange that the median price of “Ideal” cut diamonds is lower than that of other cuts. We also see that “Fair” cut diamonds have the highest median. Setting the price of a diamond must be more complex than can be explained by cut alone.
The text notes that cut is ordinal and that plotting in the “natural” order shows a pattern — although it is not the one we expected. In some situations we may wish to use the values of the response variable to order the values of the explanatory variable. The example of highway gas mileage as a function of vehicle class shows how to rearrange the boxplots by increasing response median. Using this technique on the diamond data may provide additional insight into the pricing.
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = reorder(cut, price, FUN = median),
y = price
)
) +
coord_flip()
A variation of the boxplot, violin plots use density to provide a better idea of where points are within the quartiles.
ggplot(data = diamonds) +
geom_violin(mapping = aes(x = reorder(cut, price, FUN = median),
y = price
),
draw_quantiles = c(0.25, 0.5, 0.75)
) +
xlab("Cut") +
ylab("Price") +
coord_flip()
Overlaying violin plots and boxplots is the best of both worlds.
ggplot(data = diamonds) +
geom_violin(mapping = aes(x = cut,
y = price
)
) +
geom_boxplot(mapping = aes(x = cut,
y = price
),
width = 0.05, color="skyblue"
) +
xlab("Cut") +
ylab("Price ($US)") #+ coord_flip()
Analysis of categorical data is based upon counts. Whether we have one or two variables, we are still stuck looking at counts. The text suggests that we use geom_count. As in the text, we will continue to look at the cut and color variables from the diamonds data.
ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color))
The authors’ use of the word “correlation” when looking for patterns in categorical data is incorrect. Correlation describes the strength and direction of a linear association between two numeric variables.
Circle sizes make it possible to look for patterns that exist in categorical data. However, there are more meaningful ways of looking for these relationships. Perhaps the simplest, and also the most informative, is a table of counts or proportions.
### Counts by dplyr
diamonds %>%
count(color, cut)
## # A tibble: 35 × 3
## color cut n
## <ord> <ord> <int>
## 1 D Fair 163
## 2 D Good 662
## 3 D Very Good 1513
## 4 D Premium 1603
## 5 D Ideal 2834
## 6 E Fair 224
## 7 E Good 933
## 8 E Very Good 2400
## 9 E Premium 2337
## 10 E Ideal 3903
## # … with 25 more rows
## # ℹ Use `print(n = ...)` to see more rows
# or
diamonds %>%
group_by(color, cut)%>%
summarize(n=n())
## `summarise()` has grouped output by 'color'. You can override using the
## `.groups` argument.
## # A tibble: 35 × 3
## # Groups: color [7]
## color cut n
## <ord> <ord> <int>
## 1 D Fair 163
## 2 D Good 662
## 3 D Very Good 1513
## 4 D Premium 1603
## 5 D Ideal 2834
## 6 E Fair 224
## 7 E Good 933
## 8 E Very Good 2400
## 9 E Premium 2337
## 10 E Ideal 3903
## # … with 25 more rows
## # ℹ Use `print(n = ...)` to see more rows
# or, maybe easier to read
diamonds %>%
group_by(color, cut) %>%
summarise(n=n()) %>%
spread(cut, n)
## `summarise()` has grouped output by 'color'. You can override using the
## `.groups` argument.
## # A tibble: 7 × 6
## # Groups: color [7]
## color Fair Good `Very Good` Premium Ideal
## <ord> <int> <int> <int> <int> <int>
## 1 D 163 662 1513 1603 2834
## 2 E 224 933 2400 2337 3903
## 3 F 312 909 2164 2331 3826
## 4 G 314 871 2299 2924 4884
## 5 H 303 702 1824 2360 3115
## 6 I 175 522 1204 1428 2093
## 7 J 119 307 678 808 896
### Counts by base R
table(diamonds[,c("color", "cut")])
## cut
## color Fair Good Very Good Premium Ideal
## D 163 662 1513 1603 2834
## E 224 933 2400 2337 3903
## F 312 909 2164 2331 3826
## G 314 871 2299 2924 4884
## H 303 702 1824 2360 3115
## I 175 522 1204 1428 2093
## J 119 307 678 808 896
### Proportion by dplyr
(diam.dplyrprop <- diamonds%>%
group_by(cut, color)%>%
summarize(n=n())%>%
mutate(prop=n/sum(n))%>%
subset(select=c("cut","color","prop"))%>% #drop the frequency value
spread(cut, prop)
)
## `summarise()` has grouped output by 'cut'. You can override using the `.groups`
## argument.
## # A tibble: 7 × 6
## color Fair Good `Very Good` Premium Ideal
## <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 D 0.101 0.135 0.125 0.116 0.132
## 2 E 0.139 0.190 0.199 0.169 0.181
## 3 F 0.194 0.185 0.179 0.169 0.178
## 4 G 0.195 0.178 0.190 0.212 0.227
## 5 H 0.188 0.143 0.151 0.171 0.145
## 6 I 0.109 0.106 0.0997 0.104 0.0971
## 7 J 0.0739 0.0626 0.0561 0.0586 0.0416
### Proportion by hand using base R
print(round(diam.tableprop <- table(diamonds[,c("color", "cut")])/nrow(diamonds), digits=3))
## cut
## color Fair Good Very Good Premium Ideal
## D 0.003 0.012 0.028 0.030 0.053
## E 0.004 0.017 0.044 0.043 0.072
## F 0.006 0.017 0.040 0.043 0.071
## G 0.006 0.016 0.043 0.054 0.091
## H 0.006 0.013 0.034 0.044 0.058
## I 0.003 0.010 0.022 0.026 0.039
## J 0.002 0.006 0.013 0.015 0.017
Note that the proportions should sum to one. In the dplyr table, the proportions were computed within cut, so each column sums to one, the row sums are not one, and the whole table sums to the number of columns. For the table produced by table, the sum over the entire table is one. We verify this below.
### Table sum of dplyr proportions
sum(diam.dplyrprop[-1]) # drop the label column (col 1)
## [1] 5
### Table sum of "table" proportions
sum(diam.tableprop)
## [1] 1
### Row sums of dplyr proportions
apply(diam.dplyrprop[-1], 1, sum)
## [1] 0.6091439 0.8785120 0.9047372 1.0014994 0.7979242 0.5154126 0.2927707
### Row sums of "table" proportions
apply(diam.tableprop, 1, sum)
## D E F G H I J
## 0.12560252 0.18162773 0.17690026 0.20934372 0.15394883 0.10051910 0.05205784
### Column sums of dplyr proportions
apply(diam.dplyrprop[-1], 2, sum)
## Fair Good Very Good Premium Ideal
## 1 1 1 1 1
### Column sums of "table" proportions
apply(diam.tableprop, 2, sum)
## Fair Good Very Good Premium Ideal
## 0.02984798 0.09095291 0.22398962 0.25567297 0.39953652
### Clean up
rm("diam.tableprop", "diam.dplyrprop")
We see that the proportions generated by hand using table are table proportions. Those generated using dplyr are column proportions. Apparently, we need to be careful how we compute our proportions. Thankfully, table and a few associated functions make this easy.
### Generate counts table
(diam.table <- table(diamonds$cut, diamonds$color)
)
##
## D E F G H I J
## Fair 163 224 312 314 303 175 119
## Good 662 933 909 871 702 522 307
## Very Good 1513 2400 2164 2299 1824 1204 678
## Premium 1603 2337 2331 2924 2360 1428 808
## Ideal 2834 3903 3826 4884 3115 2093 896
# or use ftable on the table
ftable(diam.table)
## D E F G H I J
##
## Fair 163 224 312 314 303 175 119
## Good 662 933 909 871 702 522 307
## Very Good 1513 2400 2164 2299 1824 1204 678
## Premium 1603 2337 2331 2924 2360 1428 808
## Ideal 2834 3903 3826 4884 3115 2093 896
### Get marginal distribution of cut (rows = dim 1)
margin.table(diam.table, 1)
##
## Fair Good Very Good Premium Ideal
## 1610 4906 12082 13791 21551
### Get marginal distribution of color (columns = dim 2)
margin.table(diam.table, 2)
##
## D E F G H I J
## 6775 9797 9542 11292 8304 5422 2808
### Get table proportions
(temp <- prop.table(diam.table))
##
## D E F G H
## Fair 0.003021876 0.004152762 0.005784205 0.005821283 0.005617353
## Good 0.012272896 0.017296997 0.016852058 0.016147571 0.013014461
## Very Good 0.028049685 0.044493882 0.040118650 0.042621431 0.033815350
## Premium 0.029718205 0.043325918 0.043214683 0.054208380 0.043752317
## Ideal 0.052539859 0.072358176 0.070930664 0.090545050 0.057749351
##
## I J
## Fair 0.003244346 0.002206155
## Good 0.009677419 0.005691509
## Very Good 0.022321098 0.012569522
## Premium 0.026473860 0.014979607
## Ideal 0.038802373 0.016611049
sum(temp)
## [1] 1
### Get row proportions
(temp <- prop.table(diam.table, 1))
##
## D E F G H I
## Fair 0.10124224 0.13913043 0.19378882 0.19503106 0.18819876 0.10869565
## Good 0.13493681 0.19017530 0.18528333 0.17753771 0.14309009 0.10640033
## Very Good 0.12522761 0.19864261 0.17910942 0.19028307 0.15096838 0.09965238
## Premium 0.11623523 0.16945834 0.16902328 0.21202233 0.17112610 0.10354579
## Ideal 0.13150202 0.18110529 0.17753237 0.22662521 0.14454086 0.09711846
##
## J
## Fair 0.07391304
## Good 0.06257644
## Very Good 0.05611654
## Premium 0.05858893
## Ideal 0.04157580
apply(temp, 1, sum)
## Fair Good Very Good Premium Ideal
## 1 1 1 1 1
### Get column proportions
(temp <- prop.table(diam.table, 2))
##
## D E F G H I
## Fair 0.02405904 0.02286414 0.03269755 0.02780730 0.03648844 0.03227591
## Good 0.09771218 0.09523323 0.09526305 0.07713425 0.08453757 0.09627444
## Very Good 0.22332103 0.24497295 0.22678684 0.20359547 0.21965318 0.22205828
## Premium 0.23660517 0.23854241 0.24428841 0.25894439 0.28420039 0.26337145
## Ideal 0.41830258 0.39838726 0.40096416 0.43251860 0.37512042 0.38601992
##
## J
## Fair 0.04237892
## Good 0.10933048
## Very Good 0.24145299
## Premium 0.28774929
## Ideal 0.31908832
apply(temp, 2, sum)
## D E F G H I J
## 1 1 1 1 1 1 1
### Use CrossTable to get it all at once
if (!require(gmodels)){
  install.packages("gmodels", dependencies=TRUE)
  library(gmodels)
}
## Loading required package: gmodels
CrossTable(diamonds$cut, diamonds$color, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 53940
##
##
## | diamonds$color
## diamonds$cut | D | E | F | G | H | I | J | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Fair | 163 | 224 | 312 | 314 | 303 | 175 | 119 | 1610 |
## | 0.101 | 0.139 | 0.194 | 0.195 | 0.188 | 0.109 | 0.074 | 0.030 |
## | 0.024 | 0.023 | 0.033 | 0.028 | 0.036 | 0.032 | 0.042 | |
## | 0.003 | 0.004 | 0.006 | 0.006 | 0.006 | 0.003 | 0.002 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Good | 662 | 933 | 909 | 871 | 702 | 522 | 307 | 4906 |
## | 0.135 | 0.190 | 0.185 | 0.178 | 0.143 | 0.106 | 0.063 | 0.091 |
## | 0.098 | 0.095 | 0.095 | 0.077 | 0.085 | 0.096 | 0.109 | |
## | 0.012 | 0.017 | 0.017 | 0.016 | 0.013 | 0.010 | 0.006 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Very Good | 1513 | 2400 | 2164 | 2299 | 1824 | 1204 | 678 | 12082 |
## | 0.125 | 0.199 | 0.179 | 0.190 | 0.151 | 0.100 | 0.056 | 0.224 |
## | 0.223 | 0.245 | 0.227 | 0.204 | 0.220 | 0.222 | 0.241 | |
## | 0.028 | 0.044 | 0.040 | 0.043 | 0.034 | 0.022 | 0.013 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Premium | 1603 | 2337 | 2331 | 2924 | 2360 | 1428 | 808 | 13791 |
## | 0.116 | 0.169 | 0.169 | 0.212 | 0.171 | 0.104 | 0.059 | 0.256 |
## | 0.237 | 0.239 | 0.244 | 0.259 | 0.284 | 0.263 | 0.288 | |
## | 0.030 | 0.043 | 0.043 | 0.054 | 0.044 | 0.026 | 0.015 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Ideal | 2834 | 3903 | 3826 | 4884 | 3115 | 2093 | 896 | 21551 |
## | 0.132 | 0.181 | 0.178 | 0.227 | 0.145 | 0.097 | 0.042 | 0.400 |
## | 0.418 | 0.398 | 0.401 | 0.433 | 0.375 | 0.386 | 0.319 | |
## | 0.053 | 0.072 | 0.071 | 0.091 | 0.058 | 0.039 | 0.017 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 6775 | 9797 | 9542 | 11292 | 8304 | 5422 | 2808 | 53940 |
## | 0.126 | 0.182 | 0.177 | 0.209 | 0.154 | 0.101 | 0.052 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
### Cleanup
rm("temp", "diam.table")
We now have counts and proportions that describe different relationships between two categorical variables. The book shows how to plot the frequencies using a “heat map.”
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
The base R mosaicplot function (from the graphics package) provides a more flexible method for dealing with categorical data. It is capable of displaying more than two variables at a time.
### Set up colors to reflect that yellow diamonds are not prized
ramp <- colorRamp(c("white", "light yellow"))
diamond.colors <- rgb( ramp(seq(0, 1, length = 8)), max = 255)
mosaicplot(table(diamonds[,c("cut","color")]),
main="Diamonds",
col = diamond.colors
)
### Set up colors to reflect that inclusions are not prized
ramp <- colorRamp(c("black", "white"))
diamond.colors <- rgb( ramp(seq(0, 1, length = 5)), max = 255)
mosaicplot(table(diamonds$cut,
diamonds$color,
factor(substr(diamonds$clarity, 1,2),
levels=c("I1","SI","VS","VV","IF")
)
),
main="Diamonds",
col = diamond.colors
)
A word of warning. For very large data sets, table based methods can be slower than dplyr based approaches. Of course, the table methods are more predictable in their proportions and, once the base table is created, marginal distributions are quick.
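An informal way to check this on your own data is to time both approaches. The diamonds data are small enough that both will be nearly instantaneous, so this is for illustration only.
### Rough timing comparison (results vary by machine)
system.time(table(diamonds$cut, diamonds$color))
system.time(diamonds %>% count(cut, color))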
The base plot for looking at the relationship between two continuous variables is the scatterplot. Continuing the diamond price analysis, we can check to see if size matters.
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price),
alpha = 0.01
)
One method for reducing data is binning. Two methods, one based on rectangular bins (geom_bin2d) and the other on hexagonal bins (geom_hex), are shown. Output from the two methods is presented below.
ggplot(data=diamonds2) +
geom_bin2d(mapping = aes(x = carat, y = price),
bins = 50)
### Install or load hexbin
if (!require(hexbin)){
install.packages("hexbin", dependencies = TRUE)
require(hexbin)
}
## Loading required package: hexbin
ggplot(data=diamonds2) +
geom_hex(mapping = aes(x = carat, y = price),
bins=50
)
The text suggests categorizing carat and plotting using boxplots. Hidden in the text is a suggestion to use varwidth=TRUE to visualize the different number of observations within each of the bins.
### Equal width with variable number of observations
ggplot(data = diamonds2, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.2)),
varwidth = TRUE
)
### Equal number of observations with variable width
### Use 25 intervals
ggplot(data = diamonds2, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_number(carat, 25)),
varwidth = TRUE
)
As noted in the text, EDA is about looking for patterns which are indicative of relationships. The authors list a number of questions that they suggest you ask yourself should you find a pattern: Could the pattern be due to coincidence? How can you describe the relationship implied by the pattern? How strong is that relationship? What other variables might affect it? Does the relationship change if you look at individual subgroups of the data?
Some of these questions suggest that the relationship is causal. As noted earlier, causal claims require randomization: random assignment of treatments to support causal inference, and random sampling from the population of interest to support generalization. One should always be careful that the method of collection has not induced bias. Serial autocorrelation is often a “pattern” observed in data that are collected longitudinally, that is, over time.
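A small simulated sketch: base R’s acf function plots the autocorrelation function of a time-ordered series, and spikes outside the dashed bands suggest serial correlation. The AR(1) model below is hypothetical.
### Check a time-ordered series for serial autocorrelation
set.seed(1)
z <- arima.sim(model = list(ar = 0.7), n = 200)  # simulated autocorrelated series
acf(z)  # spikes beyond the dashed bands indicate serial correlation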
One of the examples that the text provides is an attempt to explain the cost of diamonds. The authors choose to start by fitting a line through the scatterplot of price as a function of size that we viewed above. Fitting regression lines without a careful analysis of residuals is a dangerous practice.
p <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
geom_point(alpha = 0.01)
p
The raw points do not appear to be linearly associated. We can add a few smooths to check this.
p + geom_smooth(method = "lm", color = "blue") +
geom_smooth(method = "lm", formula = y ~ exp(x), color = "red") +
geom_smooth(method = "lm", formula = y ~ poly(x, 5), color = "green") +
coord_cartesian(ylim=c(0,40000))
## `geom_smooth()` using formula 'y ~ x'
Because we can, let us take a cursory look at the quality of the linear regression model. The easiest way to do this is to plot the residuals.
### Install or load the modelr package
if (!require(modelr)){
install.packages("modelr", dependencies = TRUE)
require(modelr)
}
## Loading required package: modelr
### Fit the linear model of price as a function of carat
price.lm.carat <- lm(price ~ carat, data = diamonds)
price.lm.carat # blue line above
##
## Call:
## lm(formula = price ~ carat, data = diamonds)
##
## Coefficients:
## (Intercept) carat
## -2256 7756
### Create new tibble with residuals added
diamonds2 <- diamonds %>%
add_residuals(price.lm.carat)
### Plot the residuals
ggplot(data = diamonds2, mapping = aes(x = carat, y = resid)) +
geom_point() +
geom_hline(yintercept = 0, color="lightblue")
It is pretty clear that the simple linear model is not a good fit for these data. The pattern that remains in the residuals indicates that carat is not describing all of a diamond’s price.
Before fitting a line to the data, the authors transform both variables by taking logarithms. The new plot looks like this:
p <- ggplot(data = diamonds, mapping = aes(x = log(carat), y = log(price))) +
geom_point(alpha = 0.01)
p
While the plot appears to be more linear than that of the non-transformed data, it maintains some banding and other “patterns” that need to be accounted for. Fitting a line to the log data produces the plot below.
p + geom_smooth(method = "lm", color = "blue")
## `geom_smooth()` using formula 'y ~ x'
The residuals from this model look like:
### Fit the linear model of log(price) as a function of log(carat)
(logprice.lm.logcarat <- lm(log(price) ~ log(carat), data = diamonds))
##
## Call:
## lm(formula = log(price) ~ log(carat), data = diamonds)
##
## Coefficients:
## (Intercept) log(carat)
## 8.449 1.676
### Create new tibble with residuals added
diamonds2 <- diamonds %>%
add_residuals(logprice.lm.logcarat)
### Plot the residuals
ggplot(data = diamonds2, mapping = aes(x = log(carat), y = resid)) +
geom_point() +
geom_hline(yintercept = 0, color="lightblue")
The log-log model still seems to have some patterns. There appears to be some clumping, and the variability seems to increase as log(carat) increases. Note that the authors plot the residuals back-transformed to the natural scale (since \(e^{\ln(y)} = y\)) and use carat, not log(carat), on the x-axis.
### Create a transformed variable
diamonds2 <- diamonds2 %>%
mutate(expresid = exp(resid))
### Plot the transformed residuals
ggplot(data = diamonds2, mapping = aes(x = carat, y = expresid)) +
geom_point() +
geom_hline(yintercept = 1, color="lightblue") # a residual of 0 back-transforms to exp(0) = 1
As the text notes, the transformed residuals suggest that smaller diamonds are being overpriced. The reasons for this may be held in other variables. The authors use boxplots to check the relationship between cut and price having corrected for carat. We can also look at color and clarity using the same approach.
p <- ggplot(diamonds2)
### Look at cut having corrected for carat
p + geom_boxplot(aes(x = cut, y = expresid))
### Look at color having corrected for carat
p + geom_boxplot(aes(x = color, y = expresid))
### Look at clarity having corrected for carat
p + geom_boxplot(aes(x = clarity, y = expresid))
All three variables seem to indicate higher prices associated with better quality once we have corrected for size.
On a programming note, as the book indicates, we do not have to include argument names with the arguments, provided we enter the arguments in the correct order. To determine the proper order, enter "?function" or "args(function)" in the console.
args(ggplot)
## function (data = NULL, mapping = aes(), ..., environment = parent.frame())
## NULL
args(geom_point)
## function (mapping = NULL, data = NULL, stat = "identity", position = "identity",
## ..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
## NULL
Remember that someone else may have to read your code, so the inclusion of argument names may not be such a bad thing.
Another plot may help us see two variables at a time in a single plot.
p <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
geom_point(aes(shape = cut), alpha=0.1) +
geom_smooth(method = "lm")
p
## Warning: Using shapes for an ordinal variable is not advised
## `geom_smooth()` using formula 'y ~ x'
ggplot2 uses facets to provide the functionality of the lattice and trellis plots that first appeared in R and S-PLUS (respectively). With facets we can include additional variables in our plots.
p <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
geom_point(alpha=0.1) +
geom_smooth(method = "lm")
p + facet_grid(rows = vars(cut)) + theme(aspect.ratio = 1)
## `geom_smooth()` using formula 'y ~ x'
p + facet_grid(cols = vars(color)) + theme(aspect.ratio = 1)
## `geom_smooth()` using formula 'y ~ x'
p + facet_grid(rows = vars(cut), cols = vars(color)) + theme(aspect.ratio = 1)
## `geom_smooth()` using formula 'y ~ x'
## Warning in qt((1 - level)/2, df): NaNs produced
## Warning in qt((1 - level)/2, df): NaNs produced
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
p + facet_grid(rows = vars(cut, color %in% c("D","E","F"))) + theme(aspect.ratio = 0.75)
## `geom_smooth()` using formula 'y ~ x'
## Warning in qt((1 - level)/2, df): NaNs produced
## Warning in qt((1 - level)/2, df): no non-missing arguments to max; returning
## -Inf
We will look at fitting models in the future.