Chapter 1

Chapter 1 is about visualizing patterns in data. The authors choose to use ggplot2 for their graphing. Since Wickham designed ggplot2, this is not a surprise. However, there are graphics in base R and in the lattice package that are also useful — and sometimes easier to use. Where possible, we will look at all three of these approaches.

Data

The text makes use of the mpg data that comes with ggplot2. We will also make use of a couple of other data sets that need to be imported. For now the method of importing the data is not really important.

    ### ggplot2 loads the mpg data.  We can look at the first few observations.
    head(mpg, 7)
## # A tibble: 7 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
## 7 audi         a4      3.1  2008     6 auto(av)   f        18    27 p     compa…
    mpg$class <- factor(mpg$class)
    summary(mpg$class)
##    2seater    compact    midsize    minivan     pickup subcompact        suv 
##          5         47         41         11         33         35         62
    ### Read dataframes from files on a server
    ###     Import the height and weight data for 20 individuals
    htwt = read.csv("http://facweb1.redlands.edu/fac/jim_bentley/Data/Math_260/Data/HtWt.csv")
    ###     Change numeric Group to factor group
    htwt$grp <- factor(htwt$Group, labels=c("Male","Female"))
    ###     Look at the first few observations
    head(htwt)
##   Height Weight Group    grp
## 1     64    159     1   Male
## 2     63    155     2 Female
## 3     67    157     2 Female
## 4     60    125     1   Male
## 5     52    103     2 Female
## 6     58    122     2 Female
    ###     Import the Titanic data
    Titanic = read.csv("http://facweb1.redlands.edu/fac/jim_bentley/Data/Math_260/Data/titanic.csv")
    ###     Change integers to factor variables
    Titanic <- within(Titanic, {Survival <- factor(SURVIVED, labels=c("Died","Survived"))})
    Titanic <- within(Titanic, {Age.Group<-factor(AGE,labels=c("Child","Adult"))})
    Titanic <- within(Titanic,{Sex <- factor(SEX, labels=c("Female","Male"))})
    Titanic <- within(Titanic,{Class<-factor(CLASS,labels=c("Crew","First","Second","Steerage"))})
    ###     Look at the first few observations
  head(Titanic)
##   CLASS AGE SEX SURVIVED Survival Age.Group  Sex Class
## 1     1   1   1        1 Survived     Adult Male First
## 2     1   1   1        1 Survived     Adult Male First
## 3     1   1   1        1 Survived     Adult Male First
## 4     1   1   1        1 Survived     Adult Male First
## 5     1   1   1        1 Survived     Adult Male First
## 6     1   1   1        1 Survived     Adult Male First
  summary(Titanic)
##      CLASS            AGE              SEX            SURVIVED    
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.000  
##  Median :1.000   Median :1.0000   Median :1.0000   Median :0.000  
##  Mean   :1.369   Mean   :0.9505   Mean   :0.7865   Mean   :0.323  
##  3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.000  
##  Max.   :3.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.000  
##      Survival    Age.Group        Sex            Class    
##  Died    :1490   Child: 109   Female: 470   Crew    :885  
##  Survived: 711   Adult:2092   Male  :1731   First   :325  
##                                             Second  :285  
##                                             Steerage:706  
##                                                           
## 

Scatterplots

The text starts by having us look at scatterplots of the mpg data. While we could jump into ggplot2, a quick look at traditional and lattice graphics may be informative. The variables available !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Base R

To generate a simple plot of highway mpg as a function of displacement we use

  plot(x=mpg$displ, y=mpg$hwy, ylab="Highway MPG", xlab="Engine Displacement", xlim=c(0,10))

This plot is very plain. We can add information about the class of the vehicle by using plot character color, size, or shape.

  plot(x=mpg$displ, y=mpg$hwy, col=as.factor(mpg$class))

  plot(x=mpg$displ, y=mpg$hwy, cex=as.numeric(as.factor(mpg$class))*0.25)

  plot(x=mpg$displ, y=mpg$hwy, pch=as.numeric(as.factor(mpg$class)))

lattice

Syntax for most of the lattice plot functions is very model like. For the scatterplot we use xyplot.

  xyplot(hwy ~ displ, data=mpg)

Again, the plot is pretty plain. As before, we can add a variable by modifying color, size, or character of the plotted points.

  summary(mpg$class)
##    2seater    compact    midsize    minivan     pickup subcompact        suv 
##          5         47         41         11         33         35         62
  xyplot(hwy ~ displ, data=mpg, group=class, auto.key=TRUE)

  xyplot(hwy ~ displ, data=mpg, group=class, cex=(1:7)*0.25, pch=1:7, col=1:7, 
         key = list(column=3, text = list(as.character(unique(mpg$class))), 
                   points = list(pch = 1:7, col = 1:7, cex=(1:7)*0.25), 
                   pch = 1:7, col = 1:7))

While adding a “group” variable is simple, modifying the plot character and getting a key takes a bit of work.

ggplot2

The simple plot uses the mpg data declared in the ggplot call. A call to geom_point adds points to the plot.

  ggplot(data = mpg) + ### Note that the + is on the line to be continued
     geom_point(mapping = aes(x = displ, y = hwy))

Again, the plot is pretty boring. So, we modify the plotted points.

  ### Use color to convey class
  ggplot(data = mpg) + 
     geom_point(mapping = aes(x = displ, y = hwy, color=class)) 

  ### Use size to convey class
  ggplot(data = mpg) + 
     geom_point(mapping = aes(x = displ, y = hwy, size=class), alpha=0.25) 
## Warning: Using size for a discrete variable is not advised.

The text messes with alpha and shape as well.

Note that the placement of statements within and outside of the aes aesthetic changes the behavior. Compare the color call above to the one below.

  ggplot(data = mpg) + 
     geom_point(mapping = aes(x = displ, y = hwy), color="maroon") 

We use only a color name here since it applies to all points rather than distinguishing the class.

Lattices or Facets

The authors indicate that the use of aesthetics is not always the best method for adding additional variables to a plot. They introduce facets as a method that is particularly well suited for use with categorical (factor) variables. They fail to indicate that lattice plots existed long before ggplot2 was created.

lattice

The original trellis (lattice or facet) plots were created by a group at ATT Bell labs. The treliis package was included with S and later S-Plus and its cousin lattice is available for R.

Returning to the mpg data, we can use latti to break up the plots of hwy as a function of displ by the class of vehicle.

  xyplot(hwy ~ displ | class, data = mpg)

  xyplot(hwy ~ displ | cyl * drv, data = mpg)

  xyplot(hwy ~ displ | as.factor(cyl) * drv, data = mpg, subset = (mpg$cyl != 5))

While the above are helpful when compared to the facet plots presented below, the plot below is historically more interesting. This plot looks at data that were studied by thousands of statistics students in their linear models classes. Few noted the anomaly that is readily apparent in the plot. See if you can find the “problem.”

dotplot(variety ~ yield | site, data = barley, groups = year,
        key = simpleKey(levels(barley$year), space = "right"),
        xlab = "Barley Yield (bushels/acre) ",
        aspect=.5, layout = c(1,6), ylab=NULL)

Note that lattice is smart enough to order the barley strains from lowest to highest yield along the y-axis. Note, too, that the sites are ordered from lowest to highest yield as we go up the lattices (latti?). What about year?

facets

The text shows how to use ggplot to generate the mpg plots presented above. They are included below for your convenience.

  p <- ggplot(data = mpg) + geom_point(aes(x=displ, y=hwy))
  p

  p + facet_wrap(~ class, nrow = 3)

  p + facet_wrap(~ cyl)

  p + facet_grid(. ~ cyl)

  p + facet_wrap(drv ~ cyl)

  p + facet_grid(drv ~ cyl)

Note the diference in the layout of the output generated by facet_wrap() and facet_grid(). We can also use ggplot to reproduce the barley dotplot. Beware, ggplot has a geom_dotplot() that corresponds to the more modern usage of “dotplot” that is closer to a barchart.

  p <- ggplot(barley) + geom_point(aes(x = yield, y  = variety, color = year))
  p

  p + facet_wrap(~ barley$site, ncol=1) + theme(aspect.ratio = 0.2)

Geometric Objects

The text next looks at adding geometric objects to a plot. It is generally easiest to use ggplot2 to incoporate lines and splines to a plot. However, for some geometric descriptors, we can use base and lattice plots. Interestingly, the text uses “loess” (locally weighted scatterplot smoothing) as an example. Loess is an extremely useful data smoother that uses locally weighted linear regressions to predict a response for each explanatory value.

Base R

Base R contains a lines() function that adds lines to a plot(). (There is an analogous text() function that adds text.) We can use lines to add a loess “line” of estimated values.

  lw1 <- loess(hwy ~ displ, data = mpg, span = 1/3)
  lw2 <- loess(hwy ~ displ, data = mpg, span = 2/3)
  plot(hwy ~ displ, data=mpg, pch=19, cex=0.5)
  j <- order(mpg$displ)
  lines(mpg$displ[j], lw1$fitted[j], col="red", lwd=2)
    lines(mpg$displ[j], lw2$fitted[j], col="blue", lwd=2)

This is nice, but it requires some work to generate. And, it isn’t particularly flexible. Maybe lattice is easier.

lattice

To add a loess smooth to lattice plots we need to make use of a panel. The nice thing about panels is that they work within lattices.

  xyplot(hwy ~ displ | drv, data = mpg, lwd=2, span = 2/3,
    panel=function(x, y, ...){
      panel.xyplot(x, y, ...)
      panel.loess(x, y, ...)
      panel.loess(x, y, span = 1/3, col = "red")
    }
  )
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## pseudoinverse used at 1.8
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## neighborhood radius 0.2
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## pseudoinverse used at 1.8
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## neighborhood radius 0.2
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## pseudoinverse used at 5.3
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## neighborhood radius 0.1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## There are other near singularities as well. 0.01
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## pseudoinverse used at 5.3
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## neighborhood radius 0.1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## There are other near singularities as well. 0.01
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## pseudoinverse used at 5.3
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## neighborhood radius 0.1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## There are other near singularities as well. 0.01
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## pseudoinverse used at 5.3
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## neighborhood radius 0.1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## There are other near singularities as well. 0.01
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## pseudoinverse used at 5.3
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## neighborhood radius 0.1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## There are other near singularities as well. 0.01

ggplot2

For adding lines and smooths, ggplot2 is probably the easiest of the three approaches. Additional features (like confidence bounds) are also easy to add.

    (p <- ggplot(data = mpg) +  geom_point(mapping = aes(x = displ, y = hwy, color=class)) + facet_wrap(~ drv))

    p + geom_smooth(aes(x = displ, y = hwy), method = "loess", se = FALSE, span = 1/3, color = "red")
## `geom_smooth()` using formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 2
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.2
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.3751e-17
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.04
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 3.784
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.816
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 9.8371e-17
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.01

     p + geom_smooth(aes(x = displ, y = hwy), method = "loess", se = FALSE, span = 1/3, color = "red") + geom_smooth(aes(x = displ, y = hwy), method = "loess", se = FALSE, span = 2/3, color = "black")
## `geom_smooth()` using formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 2
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.2
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.3751e-17
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.04
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 3.784
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.816
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 9.8371e-17
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.01
## `geom_smooth()` using formula 'y ~ x'

Note that a warning is thrown because of the lack of data caused by subsetting with span = 1/3. We can ask for confidence bounds (default) and set aesthetics to generate a number of different plots.

    p <- ggplot(data = mpg, aes(x = displ, y = hwy, color = drv)) +  geom_point()
    p + geom_smooth(span = 2/3)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

    p  + geom_smooth(method = "lm") + geom_smooth()
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Statistical Transformations

While base R and lattice plots can be coerced into using the ggplot2 idea of stat_X, if you really want to plot some function of the data, using ggplot2 is usually the simplest approach. We create a number of plots for comparison.

Base R

Making barcharts is fairly easy.

  tbl.drv <- table(mpg$drv)
  tbl.drv
## 
##   4   f   r 
## 103 106  25
  n <- sum(tbl.drv)
  n
## [1] 234
  barplot(tbl.drv, ylab = "Count")

  barplot(tbl.drv/n, ylab = "Proportion")

  barplot(tbl.drv/n*100, ylab = "Percent")

  barplot(table(mpg$cyl))

  mosaicplot(drv ~ cyl, data = mpg)

lattice

The lattice package also creates barplots. These plots are not the number of occurrences of the levels of a factor variable in the data frame. A “sneaky” way to get a traditional barchart is to make a histogram of a factor variable.

  histogram(~ as.factor(drv), data = mpg, type = "density", xlab = "Drive Type")

  barchart(tbl.drv/n, xlab = "Proption")

  tbl.drv.cyl <- table(mpg$drv, mpg$cyl)
  tbl.drv.cyl
##    
##      4  5  6  8
##   4 23  0 32 48
##   f 58  4 43  1
##   r  0  0  4 21
  barchart(tbl.drv.cyl, ylab = "Drive Type", auto.key = TRUE)

  ptbl.drv.cyl <- prop.table(tbl.drv.cyl, margin = 1)
  ptbl.drv.cyl
##    
##               4           5           6           8
##   4 0.223300971 0.000000000 0.310679612 0.466019417
##   f 0.547169811 0.037735849 0.405660377 0.009433962
##   r 0.000000000 0.000000000 0.160000000 0.840000000
  barchart(ptbl.drv.cyl*100, ylab="Drive Type", xlab = "Percent", auto.key = TRUE)

ggplot2

So, now we will see that ggplot2 makes plotting strange plots with unusual statistics easier than base R or lattice. A few of the plots from the text will demonstrate the flexibility of ggplot2.

  ggplot(data = mpg) + geom_bar(aes(x = drv))

  ggplot(data = mpg) + stat_count(aes(x = drv))

  ggplot(data = mpg) + geom_bar(aes(x = drv, y = ..prop.., group = 1))

  ggplot(data = mpg) + geom_bar(aes(x = drv, y = (..prop..)*100, group = 1)) + ylab("Percent")

  fun.q1 <- function(x){quantile(x, 0.25)}
  fun.q3 <- function(x){quantile(x, 0.75)}
  fun.ci95lo <- function(x){mean(x)-1.96*sd(x)/sqrt(length(x))}
  fun.ci95hi <- function(x){mean(x)+1.96*sd(x)/sqrt(length(x))}
  
  ggplot(data = mpg) + geom_point(aes(x = drv, y = hwy)) + stat_summary(aes(x = drv, y = hwy),
                                     fun.ymin = fun.ci95lo,
                                     fun.ymax = fun.ci95hi,
                                     fun.y = median, 
                                     col = "red",
                                     pch = 3,
                                     lwd = 1.25
                                    )
## Warning: `fun.y` is deprecated. Use `fun` instead.
## Warning: `fun.ymin` is deprecated. Use `fun.min` instead.
## Warning: `fun.ymax` is deprecated. Use `fun.max` instead.

Position Adjustments

Wickham seems to find position adjustments to be novel. ggplot2 uses a position adjustment to stack barcharts. It also uses a position adjustment to “jitter” scatterplots.

Base R

Base R is happy to stack bars. Be careful about which proportions are being used. There is a huge conceptual difference between a table proportion and a row or column proportion.

  n <- length(mpg$drv)
  tbl.drv.cyl <- table(mpg$drv, mpg$cyl)
  addmargins(tbl.drv.cyl)
##      
##         4   5   6   8 Sum
##   4    23   0  32  48 103
##   f    58   4  43   1 106
##   r     0   0   4  21  25
##   Sum  81   4  79  70 234
  barplot(tbl.drv.cyl, ylab="Freq")

  barplot(tbl.drv.cyl/n, ylab="Proportion")

  barplot(prop.table(tbl.drv.cyl, margin = 2), ylab="Proportion")

Base R contains a jitter function that allows us to “jitter” any scatter plot. Note that the displ values in the mpg data frame are duplicated. We may wish to jitter these values.

  plot(x=jitter(mpg$displ,factor=2), y=mpg$hwy)

lattice

Stacked barcharts can be created using lattice. The data manipulations needed to get the data into a shape that is conducive to stacking are something that we will address later. So, we will postpone using lattice for stacked barcharts until later.

jitter can be used in xyplot.

  xyplot(hwy ~ displ, data=mpg)

  xyplot(hwy ~ jitter(displ, 1.5), data=mpg)

ggplot2

The text gives examples of stacking using the diamonds data. We will use our old mpg data.

  p <- ggplot(data = mpg)
  p + geom_bar(aes(x = drv, color = drv))

  p + geom_bar(aes(x = drv, fill = drv))

  p + geom_bar(aes(x = drv, fill = cyl))

  p + geom_bar(aes(x = drv, fill = as.factor(cyl)))

  p + geom_bar(aes(x = drv, fill = as.factor(cyl)), position = "fill")

  p + geom_bar(aes(x = drv, fill = as.factor(cyl)), position = "dodge")

Note that the authors’ approach to side-by-side barcharts fails if there are not observations with every combination of the x and fill variables.

Jittering works similarly to base R and lattice.

  ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) 

   ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), position = "jitter") 

   ggplot(data = mpg) + geom_jitter(mapping = aes(x = displ, y = hwy), width = 0.5) 

Coordinate Systems

The text shows how ggplot2 makes changing coordinate systems easy. Within base R and lattice graphics, coordinate changes are done by data transformation. A couple quick examples will make apparent the power of the ability to change coordinate systems on the fly.

ggplot2

  p <- ggplot(data = mpg) + geom_bar(aes(x = drv, fill = as.factor(cyl)), position = "fill")
  p 

  p + coord_flip()

  p <- ggplot(data = mpg) + geom_boxplot(aes(x = drv, y = hwy))
  p                                      

  p + coord_flip() + geom_hline(yintercept = c(mean(mpg$hwy),median(mpg$hwy)), lty = 2:3)

The Layered Grammar of Graphics

This section reflects Wickham’s view of graphics. If you really want to understand ggplot2, you should read the text.