Notes for R for Data Sience — Wickham & Grolemund

Chapter 8

In this chapter, the authors cover importing data using the readr package. The focus of the chapter is on reading “flat” text files.

Getting Started

“Flat” text files are set up with separate rows for each observation and separate columns for different variables. The columns are separated by different “separators.” Typically, we see comma, space, tab, and semicolons as separators.

Base R

The most common form of data file is the CSV, or comma separated values file. Base R provides a number of ways of reading CSV files. The easiest to use of these is the read.csv function. This function can be used to read a file from your local drive or from the web. Below we read data on the heights and weights of individuals using both methods.

  ### Read a local file
  htwt.local = read.csv("Data/htwt.csv")
  head(htwt.local)

##   Height Weight Group
## 1     64    159     1
## 2     63    155     2
## 3     67    157     2
## 4     60    125     1
## 5     52    103     2
## 6     58    122     2

  attributes(htwt.local)

## $names
## [1] "Height" "Weight" "Group" 
## 
## $class
## [1] "data.frame"
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

  ### Read a web file
  htwt.web = read.csv("http://facweb1.redlands.edu/fac/jim_bentley/Data/Math_260/Data/HtWt.csv")
  head(htwt.web)

##   Height Weight Group
## 1     64    159     1
## 2     63    155     2
## 3     67    157     2
## 4     60    125     1
## 5     52    103     2
## 6     58    122     2

  attributes(htwt.web)

## $names
## [1] "Height" "Weight" "Group" 
## 
## $class
## [1] "data.frame"
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

  htwt.names <- names(htwt.web)

R also reads data with different separators/delimiters. Files without headers can also be imported. Below we read a tab delimited file that does not have a header. The generated variable names (V1, V2, V3) are then replaced with the names used above (Height, Weight, Group)

  args(read.csv)

## function (file, header = TRUE, sep = ",", quote = "\"", dec = ".", 
##     fill = TRUE, comment.char = "", ...) 
## NULL

  ### Read a tab delimited file with no header
  htwt.tab = read.csv("http://facweb1.redlands.edu/fac/jim_bentley/Data/MATH_260/Data/HtWtNoHeader.txt", sep = "\t", header = FALSE)
  head(htwt.tab)

##   V1  V2 V3
## 1 64 159  1
## 2 63 155  2
## 3 67 157  2
## 4 60 125  1
## 5 52 103  2
## 6 58 122  2

  names(htwt.tab) <- htwt.names
  head(htwt.tab)

##   Height Weight Group
## 1     64    159     1
## 2     63    155     2
## 3     67    157     2
## 4     60    125     1
## 5     52    103     2
## 6     58    122     2

  attributes(htwt.web)

## $names
## [1] "Height" "Weight" "Group" 
## 
## $class
## [1] "data.frame"
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Exporting CSV

The export version of read.csv is write.csv. It provides similar options to the import function and generates local files.

As an example, we create a European ready version of the htwt data that use a semicolon as the delimiter.

  #?write.csv2
  
  ### Make a copy of htwt with some decimal numbers and a factor version of group
  htwt <- htwt.local
  htwt$group <- factor(htwt$Group, labels=c("Male","Female"))
  htwt$weight <- round(htwt$Weight + rnorm(length(htwt$Weight)), 1)
  htwt$height <- round(htwt$Height + rnorm(length(htwt$Height)), 1)
  
  ### Save the file in Euro format --- semicolons and commas
  write.csv2(htwt, "Data/htwtEuro.txt", row.names=FALSE)
  
  ### Read in non-Euro
  htwt.euro <- read.csv("Data/htwtEuro.txt")

## Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names

  ### Read in Euro
  htwt.euro <- read.csv2("Data/htwtEuro.txt")
  head(htwt.euro)

##   Height Weight Group  group weight height
## 1     64    159     1   Male  161.4   62.7
## 2     63    155     2 Female  155.1   62.9
## 3     67    157     2 Female  158.6   66.0
## 4     60    125     1   Male  124.8   60.3
## 5     52    103     2 Female  101.5   52.2
## 6     58    122     2 Female  119.6   57.7

Take a look at the file Data/htwtEuro.txt to see why read.csv() failed. Something about too few commas? The moral of the story is that you need to get to know your data file before you read it.

Importing XLSX

  p_load(readxl)

  ### Do a simple read of a file with a single sheet and first line header
  noaa.ont.daily <- read_excel("Data/LennoxData/NOAA010113to071018.xlsx", guess_max = 2000)
  dim(noaa.ont.daily)

## [1] 2017   42

  ### Read a multi-sheet file with some useless information above the header
  ### on the sheet we want.
  noaa.ont.hourly <- read_excel("Data/LennoxData/NOAA_LCD_1526334.xlsx",
                      sheet = #2,  ### Could use sheet number but sheets might move
                      "NOAA_LCD_1526334", ### Use sheet name instead
                      skip = 3,  ### Skip header
                      guess_max = 25000
                      )
  dim(noaa.ont.hourly)

## [1] 78269    90

  ### Grab more simple files with interesting information from SCE
  SCE010117to010918 <- read_excel("Data/LennoxData/SCE010117to010918.xlsx", skip = 13)
  dim(SCE010117to010918)

## [1] 10096     3

  SCE060116to072517 <- read_excel("Data/LennoxData/SCE060116to072517.xlsx", skip = 13)
  dim(SCE060116to072517)

## [1] 1483    3

  SCE090117to092618 <- read_excel("Data/LennoxData/SCE090117to092618.xlsx", skip = 13)
  dim(SCE090117to092618)

## [1] 10554     3

The SCE files overlap. We will see more efficient ways of merging these files in future sections. For now, let’s stack them and then get rid of the replicates.

  ### Bind rows, i.e. stack them
  sce <- rbind(SCE010117to010918, SCE060116to072517)
  sce <- rbind(sce, SCE090117to092618)
  
  ### Check to make sure that we haven't lost observations
  dim(sce)

## [1] 22133     3

  ### Rename the variables to something easy to type
  names(sce) <- c("Time_Period", "Usage_kwh", "Read_Quality")
  
  ### Take only the first of duplicated observations
  sce.duped <- sce[duplicated(sce),]
  sce.unique <- sce[!duplicated(sce),]
  sce <- unique(sce)
  dim(sce)

## [1] 15850     3

Take a graphical look at my usage over time.

  plot(sce$Usage_kwh, type="l", ylab = "kwh", xlab = "Hour")

## Warning in xy.coords(x, y, xlabel, ylabel, log): NAs introduced by coercion

There are ways of making the times more meaningful. We will learn about working with time in the future.

Importing Using readr

The tidyverse package provides its own versions of functions for reading CSV files. These are supposed to be more efficient than the base R versions.

  args(read_csv)

## function (file, col_names = TRUE, col_types = NULL, col_select = NULL, 
##     id = NULL, locale = default_locale(), na = c("", "NA"), quoted_na = TRUE, 
##     quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, 
##     guess_max = min(1000, n_max), name_repair = "unique", num_threads = readr_threads(), 
##     progress = show_progress(), show_col_types = should_show_types(), 
##     skip_empty_rows = TRUE, lazy = should_read_lazy()) 
## NULL

  ### Read a local file
  htwt.local = read_csv("Data/htwt.csv")

## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

  head(htwt.local)

## # A tibble: 6 × 3
##   Height Weight Group
##    <dbl>  <dbl> <dbl>
## 1     64    159     1
## 2     63    155     2
## 3     67    157     2
## 4     60    125     1
## 5     52    103     2
## 6     58    122     2

  attributes(htwt.local)

## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
## 
## $names
## [1] "Height" "Weight" "Group" 
## 
## $spec
## cols(
##   Height = col_double(),
##   Weight = col_double(),
##   Group = col_double()
## )
## 
## $problems
## <pointer: 0x0000014369ab6b40>
## 
## $class
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

  ### Read a tab delimited file with no header
  htwt.tab = read_tsv("http://facweb1.redlands.edu/fac/jim_bentley/Data/MATH_260/Data/HtWtNoHeader.txt", col_names = FALSE)

## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (3): X1, X2, X3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

  head(htwt.tab)

## # A tibble: 6 × 3
##      X1    X2    X3
##   <dbl> <dbl> <dbl>
## 1    64   159     1
## 2    63   155     2
## 3    67   157     2
## 4    60   125     1
## 5    52   103     2
## 6    58   122     2

  names(htwt.tab) <- htwt.names
  head(htwt.tab)

## # A tibble: 6 × 3
##   Height Weight Group
##    <dbl>  <dbl> <dbl>
## 1     64    159     1
## 2     63    155     2
## 3     67    157     2
## 4     60    125     1
## 5     52    103     2
## 6     58    122     2

  attributes(htwt.web)

## $names
## [1] "Height" "Weight" "Group" 
## 
## $class
## [1] "data.frame"
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

  ### Read a Euro file non-Euro
  htwt.euro <- read_csv("Data/htwtEuro.txt")

## Warning: One or more parsing issues, see `problems()` for details

## Rows: 20 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Height;Weight;Group;group;weight;height
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

  head(htwt.euro)

## # A tibble: 6 × 1
##   `Height;Weight;Group;group;weight;height`
##   <chr>                                    
## 1 "64;159;1;\"Male\";161,4;62,7"           
## 2 "63;155;2;\"Female\";155,1;62,9"         
## 3 "67;157;2;\"Female\";158,6;66"           
## 4 "60;125;1;\"Male\";124,8;60,3"           
## 5 "52;103;2;\"Female\";101,5;52,2"         
## 6 "58;122;2;\"Female\";119,6;57,7"

  ### Read it Euro
  htwt.euro <- read_csv2("Data/htwtEuro.txt")

## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 20 Columns: 6── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (1): group
## dbl (5): Height, Weight, Group, weight, height
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

  head(htwt.euro)

## # A tibble: 6 × 6
##   Height Weight Group group  weight height
##    <dbl>  <dbl> <dbl> <chr>   <dbl>  <dbl>
## 1     64    159     1 Male     161.   62.7
## 2     63    155     2 Female   155.   62.9
## 3     67    157     2 Female   159.   66  
## 4     60    125     1 Male     125.   60.3
## 5     52    103     2 Female   102.   52.2
## 6     58    122     2 Female   120.   57.7

We can check the speed of the two packages by using microbenchmark.

  p_load(microbenchmark)
  microbenchmark(
    x <- read.csv("Data/htwt.csv"),
    y <- read_csv("Data/htwt.csv"),
    y <- read.csv("Data/htwt.csv"),
    x <- read_csv("Data/htwt.csv"),
    times = 5
  )

## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## Unit: microseconds
##                            expr     min      lq     mean  median      uq
##  x <- read.csv("Data/htwt.csv")   415.5   433.7   454.60   434.9   494.4
##  y <- read_csv("Data/htwt.csv") 32544.5 32575.1 34818.48 32974.4 35148.4
##  y <- read.csv("Data/htwt.csv")   421.6   442.3   494.80   446.8   505.1
##  x <- read_csv("Data/htwt.csv") 32042.9 32060.3 33489.22 33325.5 34514.2
##      max neval cld
##    494.5     5  a 
##  40850.0     5   b
##    658.2     5  a 
##  35503.2     5   b

Maybe the tidyverse versions are better for larger files.

  p_load(microbenchmark)
  write_csv(noaa.ont.hourly, "Data/LennoxData/NOAA_LCD_1526334.csv")
  microbenchmark(
    x <- read.csv("Data/LennoxData/NOAA_LCD_1526334.csv"),
    y <- read_csv("Data/LennoxData/NOAA_LCD_1526334.csv"),
    times = 2
  )

## Warning: One or more parsing issues, see `problems()` for details

## Rows: 78269 Columns: 90
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (17): STATION, STATION_NAME, REPORTTPYE, HOURLYSKYCONDITIONS, HOURLYPRS...
## dbl  (48): ELEVATION, LATITUDE, LONGITUDE, HOURLYVISIBILITY, HOURLYWindSpeed...
## lgl  (24): DAILYAverageRelativeHumidity, DAILYAverageDewPointTemp, DAILYAver...
## dttm  (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## Warning: One or more parsing issues, see `problems()` for details

## Rows: 78269 Columns: 90
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (17): STATION, STATION_NAME, REPORTTPYE, HOURLYSKYCONDITIONS, HOURLYPRS...
## dbl  (48): ELEVATION, LATITUDE, LONGITUDE, HOURLYVISIBILITY, HOURLYWindSpeed...
## lgl  (24): DAILYAverageRelativeHumidity, DAILYAverageDewPointTemp, DAILYAver...
## dttm  (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## Unit: milliseconds
##                                                   expr       min        lq
##  x <- read.csv("Data/LennoxData/NOAA_LCD_1526334.csv") 1234.4156 1234.4156
##  y <- read_csv("Data/LennoxData/NOAA_LCD_1526334.csv")  528.7638  528.7638
##       mean    median        uq       max neval cld
##  1431.2957 1431.2957 1628.1758 1628.1758     2   b
##   542.3958  542.3958  556.0278  556.0278     2  a