Preface

The text helps with the installation process. It also discusses the authors’ learning philosophy.

RStudio Server

RStudio Server is available on campus at http://rstudio.redlands.edu. You will need to be very patient the first time you log in as the server will have to initialize your account. Following your login, you will need to install packages — see below.

Installing R/Studio on Your Machine

Installing R and RStudio on your own machine will make it possible for you to work offline and off campus. Many of you will find this more convenient than trying to use RStudio Server.

The book outlines the steps required to install “everything” needed to reproduce the examples and solve the homework problems. The notes below add a couple of packages that help with some non-book items.

R

You need to install R before you install RStudio. As noted in the text, you can pick it up from the CRAN (http://cran.r-project.org). Binaries for Windows, OS X, and Linux are available.

Windows

Installation under Windows is trivial. Download the .EXE file and run it.

OS X

As noted on the CRAN, installation under OS X is very version dependent. More recent versions of OS X do not include XQuartz (https://www.xquartz.org/). Among other things, XQuartz is needed for tcltk which is used by some of the packages that we will see. XQuartz needs to be installed before you install R.

Linux

The installation instructions on the CRAN are pretty good. If you have questions, I’ll do my best to answer them.

RStudio

Once you have installed R, you will need to install RStudio. RStudio Desktop can be downloaded for free from https://www.rstudio.com/products/rstudio/download/. Linux users who are extra motivated can install the free Server version.

Running R

The text runs R through RStudio. While R code can be submitted from the command line or from an R session, the use of RStudio provides a few advantages. We will address these advantages later. For now we will follow the text’s suggestion of using RStudio to submit R commands through the console (the lower left pane).

Packages

Base R has had a number of packages written for it that extend its capabilities. These packages need to be installed and then loaded to make them available.

tidyverse

Wickham has made available on the CRAN a set of packages collectively known as the tidyverse. These packages aid in collecting, cleaning, and analyzing data.

As indicated in the text, the tidyverse package can be installed using the install.packages function.

  ### What are the arguments to install.packages?
  args(install.packages)
## function (pkgs, lib, repos = getOption("repos"), contriburl = contrib.url(repos, 
##     type), method, available = NULL, destdir = NULL, dependencies = NA, 
##     type = getOption("pkgType"), configure.args = getOption("configure.args"), 
##     configure.vars = getOption("configure.vars"), clean = FALSE, 
##     Ncpus = getOption("Ncpus", 1L), verbose = getOption("verbose"), 
##     libs_only = FALSE, INSTALL_opts, quiet = FALSE, keep_outputs = FALSE, 
##     ...) 
## NULL
  ### Help me
  # ?install.packages
  ### Just install the package(s) already.  Installing dependencies is helpful if
  ### a needed package has not been installed.  The conditional checks to see if
  ### the package has already been installed.
  if(!require(tidyverse)) install.packages("tidyverse",repos = "http://cran.us.r-project.org", dependencies = TRUE)
## Loading required package: tidyverse
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.5     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Once the package has been installed it need not be reinstalled unless you need to update it. However, to use the package it must be loaded in each new session prior to calling its functions.

  ### Make available the previously installed tidyverse packages
  library(tidyverse)

Note that install.packages takes the name of the package as a string. The library function accepts the name either unquoted, as above, or as a string. In some cases, loading the tidyverse attaches functions that mask functions of the same name from packages loaded earlier. The conflicts listed in the startup message above are an example: dplyr::filter() masks stats::filter().
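As a small sketch of both points, the lines below use only packages that ship with R, so nothing needs to be installed. The function named filter defined here is a hypothetical stand-in for a masking function such as dplyr::filter().

```r
### library() accepts the package name unquoted or as a string
library(stats)
library("stats")

### When the name is stored in a variable, character.only = TRUE is needed
pkg <- "stats"
library(pkg, character.only = TRUE)

### Hypothetical stand-in for a masking function: defining filter in the
### global environment hides stats::filter, much as dplyr::filter does
filter <- function(x) x[x > 2]
filter(1:5)

### The masked function is still reachable through the :: operator
ma <- stats::filter(1:5, rep(1/2, 2))
ma
```

Writing the package name in front of the function, as in stats::filter, removes any doubt about which version is being called.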

pacman

The pacman package is a PACkage MANager that makes installing and loading packages easy. The p_load function in pacman installs any packages that are missing before loading them. The function is also smart enough not to reload a package that has already been loaded. Remember this if you are creating your own packages and are attempting to load an updated package.

   ### Install and load the pacman package for later use
   if(!require(pacman)) install.packages("pacman",repos = "http://cran.us.r-project.org", dependencies = TRUE)
## Loading required package: pacman
   library(pacman)
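
The install-if-missing idea behind p_load can be sketched in base R. The helper below, ensure_packages, is a hypothetical name used for illustration and is not part of pacman. It is a rough approximation of the behavior described above, not pacman's actual implementation. The repository URL is the one used earlier in these notes.

```r
### Hypothetical helper: install any missing packages, then attach them all
ensure_packages <- function(pkgs, repos = "http://cran.us.r-project.org") {
  for (pkg in pkgs) {
    ### requireNamespace returns FALSE when the package is not installed
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg, repos = repos, dependencies = TRUE)
    }
    ### character.only = TRUE because pkg holds the name as a string
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

### These packages ship with R, so nothing is downloaded here
loaded <- ensure_packages(c("stats", "utils"))
```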

Other Packages

The text makes use of three data sets that are not included in the tidyverse package. All three of these data sets are under constant revision and their maintenance is carried out by individuals/groups other than Wickham. We can use p_load to install and load these packages.

  ### Use pacman's p_load to install and load data sets.
  ### The c function creates a character vector of package names.
  p_load(char = c("nycflights13","gapminder","Lahman"))
  ### Check to see if the packages were loaded.
  search()
##  [1] ".GlobalEnv"           "package:Lahman"       "package:gapminder"   
##  [4] "package:nycflights13" "package:pacman"       "package:forcats"     
##  [7] "package:stringr"      "package:dplyr"        "package:purrr"       
## [10] "package:readr"        "package:tidyr"        "package:tibble"      
## [13] "package:ggplot2"      "package:tidyverse"    "package:stats"       
## [16] "package:graphics"     "package:grDevices"    "package:utils"       
## [19] "package:datasets"     "package:methods"      "Autoloads"           
## [22] "package:base"
  ### An equivalent approach is to create a variable that contains the vector
  ### and then pass this to the p_load function
  pacs <- c("nycflights13", "gapminder", "Lahman")
  pacs
## [1] "nycflights13" "gapminder"    "Lahman"
  p_load(char=pacs)

By the way, if you find you are running out of RAM, unloading a package or two can sometimes help. The p_unload function makes this easy.

  ### Free RAM by unloading the Lahman baseball package.
  p_unload(Lahman)
## The following packages have been unloaded:
## Lahman
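
For comparison, base R can do something similar with the detach function. The sketch below uses splines, a package that ships with R, as a stand-in for a larger package so that it runs without any installation.

```r
### Attach a package that ships with R and confirm it is on the search path
library(splines)
attached_before <- "package:splines" %in% search()
attached_before

### Detach it and ask R to unload its namespace as well
detach("package:splines", unload = TRUE)
attached_after <- "package:splines" %in% search()
attached_after
```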

R Programming

R is a vectorized, object-oriented programming language. The focus of this course is on data science and not programming. Because of this, we will be ignoring many of the interesting features of R. However, the class will provide a reasonable foundation in R programming for those who wish to learn more about the language.

Before we move on there are a couple of things to note about R. First, it is an interpreted and not a compiled language. Because code is not checked by a compiler ahead of time, errors surface only when a line is run, and the error messages are sometimes not the most helpful.
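
A small illustration of an error that appears only at run time follows. The message describes the operation that failed but not where the bad value came from. The function add_one is a hypothetical example.

```r
### Hypothetical function; nothing checks the argument type ahead of time
add_one <- function(value) value + 1

### The mistake is only discovered when the call is actually run.
### tryCatch captures the error so that its message can be inspected.
msg <- tryCatch(
  add_one("a"),
  error = function(e) conditionMessage(e)
)
msg
```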

Second, because R is not compiled, its speed depends on the proper use of machine-optimized, vectorized functions and on vectorization instead of looping. Functions like sum and apply are preferred to explicit for and while loops. Matrix operations are also at times preferred.

  ### Use pacman to load the microbenchmark package
  p_load(microbenchmark)
 
  ### Create a boring data vector, x
  x <- matrix(0:10000, ncol=1, byrow=TRUE)
  
  ### Create a mysum function to add the elements of a vector
  mysum <- function(x){
    y <- 0
    for(i in 1:nrow(x)){
      y <- y + x[i]
    }
    return(y)
  }
  
  ### Compare performance using the microbenchmark
  microbenchmark(
  
    sum(x),
    t(x) %*% rep(1,nrow(x)),
    mysum(x)
  )
## Unit: microseconds
##                      expr   min     lq    mean median    uq    max neval
##                    sum(x)   6.5   6.75   7.639   7.00   7.4   15.5   100
##  t(x) %*% rep(1, nrow(x)) 117.4 138.35 152.364 144.35 157.1  253.0   100
##                  mysum(x) 372.6 378.70 449.942 384.05 394.4 4140.2   100
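
Before comparing speed, it is worth confirming that the three approaches agree. The following sketch repeats the definitions above on a smaller vector so that it can be run on its own. The vectorized arithmetic at the end is an additional illustration.

```r
### A smaller one-column matrix than the one benchmarked above
x <- matrix(0:100, ncol = 1)

### Loop-based sum, as in mysum above
mysum <- function(x) {
  y <- 0
  for (i in seq_len(nrow(x))) {
    y <- y + x[i]
  }
  y
}

### The three totals; drop() turns the 1 x 1 matrix product into a number
totals <- c(
  builtin = sum(x),
  matrix  = drop(t(x) %*% rep(1, nrow(x))),
  loop    = mysum(x)
)
totals

### Vectorized arithmetic works on every element at once, with no loop
doubled <- x * 2
```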

Google is a good source for information on optimizing your R code. Sites like StackExchange and StackOverflow also provide useful information on R programming. Be sure to read the “Getting Help and Learning More” section before submitting your question or you will be reminded of proper etiquette by the regulars. Hadley Wickham’s “Advanced R” text site, http://adv-r.had.co.nz/Performance.html, has a very good discussion of why R is not the best choice for fast simulation.

R Markdown (RMD) Files

To this point we have followed the text in using the console. For now this is acceptable. However, at some point we will want to create documents or presentations. RStudio makes this relatively easy through the use of R Markdown files. Preface.Rmd is an example of an R Markdown file. It can be “knitted” to generate HTML or DOCX files.

A base R Markdown file (RMD) can be created within RStudio by clicking on “File” then “New File” followed by “R Markdown”. This file can be modified and then “knitted.” By the way, the toolbar button beneath the “File” menu is a shortcut for this process.
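
Knitting can also be started from the console with the render function in the rmarkdown package. The sketch below writes a tiny, hypothetical R Markdown file to a temporary location and renders it to HTML. It assumes that rmarkdown and pandoc are available (RStudio bundles pandoc) and skips the rendering step otherwise.

```r
### Three backticks, built with strrep so the chunk fence can sit in a string
fence <- strrep("`", 3)

### A minimal, hypothetical R Markdown document
rmd_lines <- c(
  "---",
  "title: \"Example\"",
  "output: html_document",
  "---",
  "",
  "Some text, followed by an R chunk.",
  "",
  paste0(fence, "{r}"),
  "sum(1:10)",
  fence
)
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

### Render only if rmarkdown and pandoc can be found
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  html_file
}
```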