In this chapter, the authors cover importing data using the readr package. The focus of the chapter is on reading “flat” text files.
“Flat” text files are set up with separate rows for each observation and separate columns for different variables. The columns are separated by different “separators.” Typically, we see comma, space, tab, and semicolons as separators.
The most common form of data file is the CSV, or comma separated values file. Base R provides a number of ways of reading CSV files. The easiest to use of these is the read.csv function. This function can be used to read a file from your local drive or from the web. Below we read data on the heights and weights of individuals using both methods.
### Read a local file
htwt.local = read.csv("Data/htwt.csv")
head(htwt.local)
## Height Weight Group
## 1 64 159 1
## 2 63 155 2
## 3 67 157 2
## 4 60 125 1
## 5 52 103 2
## 6 58 122 2
attributes(htwt.local)
## $names
## [1] "Height" "Weight" "Group"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
### Read a web file
htwt.web = read.csv("http://facweb1.redlands.edu/fac/jim_bentley/Data/Math_260/Data/HtWt.csv")
head(htwt.web)
## Height Weight Group
## 1 64 159 1
## 2 63 155 2
## 3 67 157 2
## 4 60 125 1
## 5 52 103 2
## 6 58 122 2
attributes(htwt.web)
## $names
## [1] "Height" "Weight" "Group"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
htwt.names <- names(htwt.web)
R also reads data with different separators/delimiters. Files without headers can also be imported. Below we read a tab delimited file that does not have a header. The generated variable names (V1, V2, V3) are then replaced with the names used above (Height, Weight, Group)
args(read.csv)
## function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
## fill = TRUE, comment.char = "", ...)
## NULL
### Read a tab delimited file with no header
htwt.tab = read.csv("http://facweb1.redlands.edu/fac/jim_bentley/Data/MATH_260/Data/HtWtNoHeader.txt", sep = "\t", header = FALSE)
head(htwt.tab)
## V1 V2 V3
## 1 64 159 1
## 2 63 155 2
## 3 67 157 2
## 4 60 125 1
## 5 52 103 2
## 6 58 122 2
names(htwt.tab) <- htwt.names
head(htwt.tab)
## Height Weight Group
## 1 64 159 1
## 2 63 155 2
## 3 67 157 2
## 4 60 125 1
## 5 52 103 2
## 6 58 122 2
attributes(htwt.web)
## $names
## [1] "Height" "Weight" "Group"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
The export version of read.csv is write.csv. It provides similar options to the import function and generates local files.
As an example, we create a European ready version of the htwt data that use a semicolon as the delimiter.
#?write.csv2
### Make a copy of htwt with some decimal numbers and a factor version of group
htwt <- htwt.local
htwt$group <- factor(htwt$Group, labels=c("Male","Female"))
htwt$weight <- round(htwt$Weight + rnorm(length(htwt$Weight)), 1)
htwt$height <- round(htwt$Height + rnorm(length(htwt$Height)), 1)
### Save the file in Euro format --- semicolons and commas
write.csv2(htwt, "Data/htwtEuro.txt", row.names=FALSE)
### Read in non-Euro
htwt.euro <- read.csv("Data/htwtEuro.txt")
## Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
### Read in Euro
htwt.euro <- read.csv2("Data/htwtEuro.txt")
head(htwt.euro)
## Height Weight Group group weight height
## 1 64 159 1 Male 161.4 62.7
## 2 63 155 2 Female 155.1 62.9
## 3 67 157 2 Female 158.6 66.0
## 4 60 125 1 Male 124.8 60.3
## 5 52 103 2 Female 101.5 52.2
## 6 58 122 2 Female 119.6 57.7
Take a look at the file Data/htwtEuro.txt to see why read.csv() failed. Something about too few commas? The moral of the story is that you need to get to know your data file before you read it.
p_load(readxl)
### Do a simple read of a file with a single sheet and first line header
noaa.ont.daily <- read_excel("Data/LennoxData/NOAA010113to071018.xlsx", guess_max = 2000)
dim(noaa.ont.daily)
## [1] 2017 42
### Read a multi-sheet file with some useless information above the header
### on the sheet we want.
noaa.ont.hourly <- read_excel("Data/LennoxData/NOAA_LCD_1526334.xlsx",
sheet = #2, ### Could use sheet number but sheets might move
"NOAA_LCD_1526334", ### Use sheet name instead
skip = 3, ### Skip header
guess_max = 25000
)
dim(noaa.ont.hourly)
## [1] 78269 90
### Grab more simple files with interesting information from SCE
SCE010117to010918 <- read_excel("Data/LennoxData/SCE010117to010918.xlsx", skip = 13)
dim(SCE010117to010918)
## [1] 10096 3
SCE060116to072517 <- read_excel("Data/LennoxData/SCE060116to072517.xlsx", skip = 13)
dim(SCE060116to072517)
## [1] 1483 3
SCE090117to092618 <- read_excel("Data/LennoxData/SCE090117to092618.xlsx", skip = 13)
dim(SCE090117to092618)
## [1] 10554 3
The SCE files overlap. We will see more efficient ways of merging these files in future sections. For now, let’s stack them and then get rid of the replicates.
### Bind rows, i.e. stack them
sce <- rbind(SCE010117to010918, SCE060116to072517)
sce <- rbind(sce, SCE090117to092618)
### Check to make sure that we haven't lost observations
dim(sce)
## [1] 22133 3
### Rename the variables to something easy to type
names(sce) <- c("Time_Period", "Usage_kwh", "Read_Quality")
### Take only the first of duplicated observations
sce.duped <- sce[duplicated(sce),]
sce.unique <- sce[!duplicated(sce),]
sce <- unique(sce)
dim(sce)
## [1] 15850 3
Take a graphical look at my usage over time.
plot(sce$Usage_kwh, type="l", ylab = "kwh", xlab = "Hour")
## Warning in xy.coords(x, y, xlabel, ylabel, log): NAs introduced by coercion
There are ways of making the times more meaningful. We will learn about working with time in the future.
The tidyverse package provides its own versions of functions for reading CSV files. These are supposed to be more efficient than the base R versions.
args(read_csv)
## function (file, col_names = TRUE, col_types = NULL, col_select = NULL,
## id = NULL, locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
## quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf,
## guess_max = min(1000, n_max), name_repair = "unique", num_threads = readr_threads(),
## progress = show_progress(), show_col_types = should_show_types(),
## skip_empty_rows = TRUE, lazy = should_read_lazy())
## NULL
### Read a local file
htwt.local = read_csv("Data/htwt.csv")
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(htwt.local)
## # A tibble: 6 × 3
## Height Weight Group
## <dbl> <dbl> <dbl>
## 1 64 159 1
## 2 63 155 2
## 3 67 157 2
## 4 60 125 1
## 5 52 103 2
## 6 58 122 2
attributes(htwt.local)
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##
## $names
## [1] "Height" "Weight" "Group"
##
## $spec
## cols(
## Height = col_double(),
## Weight = col_double(),
## Group = col_double()
## )
##
## $problems
## <pointer: 0x0000014369ab6b40>
##
## $class
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
### Read a tab delimited file with no header
htwt.tab = read_tsv("http://facweb1.redlands.edu/fac/jim_bentley/Data/MATH_260/Data/HtWtNoHeader.txt", col_names = FALSE)
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (3): X1, X2, X3
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(htwt.tab)
## # A tibble: 6 × 3
## X1 X2 X3
## <dbl> <dbl> <dbl>
## 1 64 159 1
## 2 63 155 2
## 3 67 157 2
## 4 60 125 1
## 5 52 103 2
## 6 58 122 2
names(htwt.tab) <- htwt.names
head(htwt.tab)
## # A tibble: 6 × 3
## Height Weight Group
## <dbl> <dbl> <dbl>
## 1 64 159 1
## 2 63 155 2
## 3 67 157 2
## 4 60 125 1
## 5 52 103 2
## 6 58 122 2
attributes(htwt.web)
## $names
## [1] "Height" "Weight" "Group"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
### Read a Euro file non-Euro
htwt.euro <- read_csv("Data/htwtEuro.txt")
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 20 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Height;Weight;Group;group;weight;height
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(htwt.euro)
## # A tibble: 6 × 1
## `Height;Weight;Group;group;weight;height`
## <chr>
## 1 "64;159;1;\"Male\";161,4;62,7"
## 2 "63;155;2;\"Female\";155,1;62,9"
## 3 "67;157;2;\"Female\";158,6;66"
## 4 "60;125;1;\"Male\";124,8;60,3"
## 5 "52;103;2;\"Female\";101,5;52,2"
## 6 "58;122;2;\"Female\";119,6;57,7"
### Read it Euro
htwt.euro <- read_csv2("Data/htwtEuro.txt")
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 20 Columns: 6── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (1): group
## dbl (5): Height, Weight, Group, weight, height
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(htwt.euro)
## # A tibble: 6 × 6
## Height Weight Group group weight height
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 64 159 1 Male 161. 62.7
## 2 63 155 2 Female 155. 62.9
## 3 67 157 2 Female 159. 66
## 4 60 125 1 Male 125. 60.3
## 5 52 103 2 Female 102. 52.2
## 6 58 122 2 Female 120. 57.7
We can check the speed of the two packages by using microbenchmark.
p_load(microbenchmark)
microbenchmark(
x <- read.csv("Data/htwt.csv"),
y <- read_csv("Data/htwt.csv"),
y <- read.csv("Data/htwt.csv"),
x <- read_csv("Data/htwt.csv"),
times = 5
)
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Height, Weight, Group
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Unit: microseconds
## expr min lq mean median uq
## x <- read.csv("Data/htwt.csv") 415.5 433.7 454.60 434.9 494.4
## y <- read_csv("Data/htwt.csv") 32544.5 32575.1 34818.48 32974.4 35148.4
## y <- read.csv("Data/htwt.csv") 421.6 442.3 494.80 446.8 505.1
## x <- read_csv("Data/htwt.csv") 32042.9 32060.3 33489.22 33325.5 34514.2
## max neval cld
## 494.5 5 a
## 40850.0 5 b
## 658.2 5 a
## 35503.2 5 b
Maybe the tidyverse versions are better for larger files.
p_load(microbenchmark)
write_csv(noaa.ont.hourly, "Data/LennoxData/NOAA_LCD_1526334.csv")
microbenchmark(
x <- read.csv("Data/LennoxData/NOAA_LCD_1526334.csv"),
y <- read_csv("Data/LennoxData/NOAA_LCD_1526334.csv"),
times = 2
)
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 78269 Columns: 90
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): STATION, STATION_NAME, REPORTTPYE, HOURLYSKYCONDITIONS, HOURLYPRS...
## dbl (48): ELEVATION, LATITUDE, LONGITUDE, HOURLYVISIBILITY, HOURLYWindSpeed...
## lgl (24): DAILYAverageRelativeHumidity, DAILYAverageDewPointTemp, DAILYAver...
## dttm (1): DATE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 78269 Columns: 90
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): STATION, STATION_NAME, REPORTTPYE, HOURLYSKYCONDITIONS, HOURLYPRS...
## dbl (48): ELEVATION, LATITUDE, LONGITUDE, HOURLYVISIBILITY, HOURLYWindSpeed...
## lgl (24): DAILYAverageRelativeHumidity, DAILYAverageDewPointTemp, DAILYAver...
## dttm (1): DATE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Unit: milliseconds
## expr min lq
## x <- read.csv("Data/LennoxData/NOAA_LCD_1526334.csv") 1234.4156 1234.4156
## y <- read_csv("Data/LennoxData/NOAA_LCD_1526334.csv") 528.7638 528.7638
## mean median uq max neval cld
## 1431.2957 1431.2957 1628.1758 1628.1758 2 b
## 542.3958 542.3958 556.0278 556.0278 2 a